This video introduces a skill that enables AI agents to watch and analyze videos by combining Gemini 2.5 Flash for scene detection with FFmpeg for frame extraction, creating AVT (Agent Video Transcript) files that capture both visual content (slides, diagrams, code) and audio transcripts, overcoming the limitation that transcripts alone miss approximately half of video content.
Deep Dive
Claude Code can now INSTANTLY watch any video. Here's How.
You're probably pasting video transcripts into Claude Code thinking that's enough. But that transcript misses everything visual: the diagrams, the code on the screen, the demos, the slides. So Claude only ever gets half the content, because half of what's in a video isn't said, it's shown. After building an AI agent startup for enterprise, and now shipping full production apps with agentic coding tools and using them to automate most of my content production, I am giving you a skill today that I built for myself and wish I'd had before: a way for Claude Code to actually watch any video, whether that's a YouTube video, a Loom recording, or anything else posted online. In the next few minutes, I'll walk you through the smart way in which it works, the three use cases that completely change how I consume video with Claude, and how you can get it set up in just a few minutes. Let's get straight into it. Before I show you the behind-the-scenes on how this works, I think it might be wise to show you this in action. So, we have one of these videos. This is a lecture on how to start a startup by Sam Altman and Dustin, and you can see already that it is 43 minutes long.
There are a lot of parts where Sam is just talking, and although this might be valuable, we can grab that context from a transcript, right? There's probably still a lot of value in what he's saying, but right now we cannot extract the value from the frames. He might be referring to something in his presentation that might be interesting.
So for example, this, right? Maybe he doesn't say the whole thing and we might want to see what's on the screen. He might be talking about it, and we can see that in the transcript. Maybe he's saying things like, you know, "based on this slide it's better to do this and that," but we can't see the slide. We don't have any context on the slide. So obviously we're missing some context. Take this graph as well, for example: the number of users and the intensity you like. He might be describing the graph, but it might be more valuable to see what everybody's watching in the lecture. Okay. So this is an example, and I ran this already. All I have to do is type in the analyze command: I type /analyze, press space, and then give it the video URL or path.
I've already done this in this first example. What it does is first give me the timestamps and the slide content. And you see how it's not breaking it up frame by frame or every few seconds. That is because we have a Gemini model in the background that automatically detects any changes in scene. So it's actually watching the entire video and grabbing the places where Sam or Dustin is talking and then we swap from that onto a slide. We are able to understand when a lot of the pixels are changing and therefore there's a complete change in scene. This allows Gemini to understand, okay, this is a frame that we need to grab: the person is not just talking, he's actually showing something on the screen, and therefore we can grab that. So for example, we can grab a quote over here from Phil Libin.
"Everyone else is your boss." And this is not mentioned explicitly in what we see in the transcript. So this is very smart. We can also grab the best moments, which could be a mixture of the transcript and what's shown on the screen. So this is really, really cool. With this skill, Claude Code can actually watch the slides, or any key moments that are not described in the transcript. But obviously I might want to be able to check it out myself. I want to check these slides and these key moments on the screen as if I were an actual person, right? So think about yourself actually screenshotting these things and putting them into Claude.
Well, we are automating that with the skill. Over here, if you see below, let me try to zoom in, we see that the frames for every slide and moment are in a folder that Claude Code created. So I want to pull all the frames that show a slide. We have already captured this smartly because, as I mentioned, part of the workflow is calling Google Gemini, Gemini 2.5 Flash, to understand in a smart way where the scenes change from the person talking, so a talking head, onto a slide or some sort of visual. This comes to 26 slide frames, and it's going to display them all for me right now, which is really, really smart.
It saves me a lot of time. And this is perfect when I want to watch a video which is full of informational value on the visual side. I cannot extract all that value just from the transcript. I would have to go into the actual video and extract these slides by taking screenshots, and this saves me from having to do that. It just does it smartly already. As you can see over here, I can actually check the frames themselves, because I have access to the folder, which I'll show you right now. So here are all the frames.
We have somewhere around 46 frames. If we were to ask Claude to just take a frame every second, well, the video is 43 minutes. I wouldn't say that is very efficient. So we are taking a pretty smart approach when it comes to grabbing the frames that matter. There might be tools out there that grab a frame every few seconds. I would say that is only useful if you know there's a scene within the video that you want to extract everything from. Hey, just wanted to pause the video for a second to let you know that you can access all of the resources in this video and all the other videos I make on this channel inside my free community. So, I will leave a link to that below this video.
You also get access to a large network of businesses and professionals building and learning everything there is about the AI space. So yeah, go ahead and join my free community. See you there. And this is good. So I still added it as part of the skill, but only if you want to do it intentionally. If not, it's going to do it in the smart way. And so you'll see here that we have a couple of frames that are just Sam talking. Imagine if we were to capture a frame every 3 or 4 seconds: we would have a lot of pictures of Sam talking, which provide absolutely no value at all. By having a video model behind it, we can filter out all the noise, right? So let's have a look. For frame number six, we get the idea screenshot. Great. For frame number 15, we get the "why now".
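To put the fixed-interval comparison above in numbers, here is a minimal sketch using only the figures quoted in the video (a 43-minute lecture and roughly 46 scene-based frames). The function name is made up for illustration.

```python
def frames_at_interval(duration_minutes: int, interval_seconds: int) -> int:
    """Frames captured when sampling one frame every `interval_seconds`."""
    return duration_minutes * 60 // interval_seconds

video_minutes = 43   # lecture length quoted in the video
scene_frames = 46    # approximate frame count quoted for the scene-based run

for interval in (1, 3, 4):
    n = frames_at_interval(video_minutes, interval)
    ratio = n / scene_frames
    print(f"one frame every {interval}s: {n} frames, about {ratio:.0f}x the scene-based count")
```

Even at one frame every four seconds, fixed-interval sampling produces several hundred images, most of them near-duplicates of a talking head.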
Okay. And this is amazing because we can just skip to the images that have the slides. The best part is that we can match these images to the timestamps as well. So imagine all the crazy things that we could do with this. I want you to see how it works on a separate video. Let's go on to lecture six, which is about growth: how to grow a startup that already has users and some sort of traction but is stuck at the growth stage. So let's have a look at this.
Let's just skim through it. We see that we have an important retention graph over here. I think I watched this lecture already. Yeah, it's pretty important, I would say. Then we have other stuff as well. We've got a lot of talking head. We've got this, which has a URL, for example. Might be valuable to see, right? Same thing, same style. This will be important to understand as well, but I don't want to take a screenshot of every single thing that I see. It can be a bit tedious. And then maybe I also want to match what's on the slide with what the guy says in the transcript, or match it for some other use case. Anyway, what we're going to do now is grab this video URL and put it here. That's all I have to say. I'm just going to go ahead and analyze it. All right. So, we're back.
It actually happened relatively quickly. It took just a few minutes. And these are the results. So, we got 22 segments, 22 frames extracted, and, just to remind you again, we are working on segments. We're not screenshotting every frame.
We're capturing segments where we're talking about a specific concept. A talking head, just a person talking, is one segment, and then the next segment would be, for example, when he switches onto a slide; that's a full segment in itself, right? So we're smartly cropping out the key moments. It's not just an equal cut. And we have the output location, so I'm just going to pull that up here. You see that it creates this folder called lecture 6 Alex Schultz, and we have the frames, cut at specific moments, key frames, and then we have something called an AVT file, but I'll come back to this file shortly.
Let's just ask it, because I have a couple of questions. I want to know what's happening with the retention side of things, so: can you explain the retention graph that is shown on the slide? We're not going to tell it where it is, which is why this is so powerful. Let's just say that I actually don't remember. It already has an understanding of all the images, taken at smart intervals, so it should understand where this was taken. We also have a description of all of the slides, hopefully, and we see that the third frame, frame 09, talks about this. So I want to find out what time this was. Perhaps it can tell us the timestamp.
All right. You see? It tells us it's at 2:46. So, 2:46, right? Pretty spot on, very accurate. And I have over here the costs for doing one of those videos. This is from the first video, right?
Which is around 43 minutes. This is what it cost to send to Gemini. It cost us 5.5 cents, or rather 5.5 pence, because we're talking about pounds sterling over here. So for 43 minutes of video, it only cost us this.
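As a rough back-of-the-envelope figure, the quoted cost can be turned into a per-minute rate. This sketch uses only the two numbers from the video and assumes, purely for illustration, that cost scales linearly with video length; actual pricing depends on the tokens processed.

```python
cost_pence = 5.5   # cost quoted in the video for the Gemini call
minutes = 43       # length of the analyzed lecture

per_minute = cost_pence / minutes
per_hour = per_minute * 60  # assumes linear scaling, which is a simplification

print(f"{per_minute:.3f} pence per minute, about {per_hour:.1f} pence per hour of video")
```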
This is essentially mimicking, or nearly replacing, the capabilities of a human watching a video, right? So think about that, because this is not expensive at all for the power it has. And now let's go through how this skill works under the hood, because I think it's pretty interesting. We start by using something called yt-dlp, which is a video downloader. It's well known and it supports more than 1,400 sites. So we're downloading the video and we are also extracting captions from it; that's the second step. If it's a YouTube video, well, YouTube allows us to extract transcripts for free. They already generate these automatically. If not, we're going to use Groq Whisper, which is a transcription model. It turns audio into text. Then we pass the downloaded video on to Gemini 2.5 Flash.
It's a pretty lightweight video understanding model. It's able to understand video natively, unlike Claude.
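The download-and-captions step described above could be scripted along these lines. This is a minimal sketch, not the skill's actual code: the function name, output directory, and placeholder URL are assumptions, while the yt-dlp options used (`--write-subs`, `--write-auto-subs`, `--sub-langs`, `-o`) are standard ones.

```python
import shlex

def build_download_command(url: str, out_dir: str = "downloads") -> list[str]:
    """Build a yt-dlp command that downloads a video plus any available captions."""
    return [
        "yt-dlp",
        "--write-subs",        # uploader-provided subtitles, if any
        "--write-auto-subs",   # auto-generated captions, e.g. on YouTube
        "--sub-langs", "en",   # assumption: English captions are wanted
        "-o", f"{out_dir}/%(id)s.%(ext)s",
        url,
    ]

# Placeholder URL; VIDEO_ID is not a real video.
cmd = build_download_command("https://www.youtube.com/watch?v=VIDEO_ID")
print(shlex.join(cmd))
# To actually run it (requires yt-dlp to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

If no caption file appears after the download, that is the case where a transcription model such as Whisper would be needed.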
Claude cannot understand video, and this allows us to understand those changes in scene, so when we have a talking head versus when it changes to a slide. We're able to capture key frames and key moments that are not bound by just a transcript or a timestamp; we're able to actually see where to cut the relevant parts of the video. And based on this, we use something called FFmpeg.
Again, it's a widely known tool for manipulating video. Basically every video editor that you've tried uses FFmpeg in the background. Based on how Gemini 2.5 Flash smartly classifies the different scenes and parts of the video, we're able to extract the specific frames, the images, and then understand what's happening in those frames. This means we don't have to extract a frame every four seconds.
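One way to picture the classification step: the video model is asked to return a list of scenes, and that list is parsed into segments. The sketch below only covers the parsing side. The JSON schema, the `kind` labels, and the sample response are all made up for illustration; the skill's real prompt and response format are not shown in the video, and the call to the Gemini API itself is left out.

```python
import json
from dataclasses import dataclass

@dataclass
class Segment:
    start: float       # seconds from the beginning of the video
    end: float
    kind: str          # hypothetical labels: "talking_head" or "slide"
    description: str   # short visual description of the scene

def parse_segments(response_text: str) -> list[Segment]:
    """Parse a JSON array of scenes (assumed schema) into sorted Segment objects."""
    raw = json.loads(response_text)
    segments = [
        Segment(float(r["start"]), float(r["end"]), r["kind"], r["description"])
        for r in raw
    ]
    return sorted(segments, key=lambda s: s.start)

# Made-up response text standing in for what a video model might return.
sample_response = """
[
  {"start": 166, "end": 200, "kind": "slide", "description": "Slide showing a retention graph"},
  {"start": 0, "end": 166, "kind": "talking_head", "description": "Speaker at a lectern"},
  {"start": 200, "end": 260, "kind": "talking_head", "description": "Speaker at a lectern"}
]
"""

segments = parse_segments(sample_response)
slides = [s for s in segments if s.kind == "slide"]
print(f"{len(segments)} segments, {len(slides)} of them slides")
```

Working with a handful of labeled segments like this, rather than thousands of raw frames, is what lets the later steps grab one image per scene.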
Okay, we only extract frames whenever there's a change from talking head to slide. So we know this is a slide specifically. So we're going to grab an image from here. We're going to grab an image from the talking head. And in that way we don't have 400 images of a dude talking and 400 images of a slide, right? We are removing that redundancy.
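The one-frame-per-scene idea can be sketched as building one FFmpeg command per segment. This is an illustration under assumptions, not the skill's implementation: the segment times, file names, and the choice of the segment midpoint as the capture point are all hypothetical. The FFmpeg options used (`-ss`, `-i`, `-frames:v`, `-q:v`, `-y`) are standard.

```python
import shlex

def build_frame_command(video_path: str, timestamp: float, out_path: str) -> list[str]:
    """Build an FFmpeg command that saves a single frame at `timestamp` seconds."""
    return [
        "ffmpeg", "-y",             # overwrite the output file if it exists
        "-ss", f"{timestamp:.2f}",  # seek to the timestamp before decoding
        "-i", video_path,
        "-frames:v", "1",           # write exactly one video frame
        "-q:v", "2",                # high JPEG quality
        out_path,
    ]

# Hypothetical (start, end) pairs in seconds, one per detected scene.
segments = [(0.0, 166.0), (166.0, 200.0)]

commands = []
for i, (start, end) in enumerate(segments, start=1):
    midpoint = (start + end) / 2  # assumption: the middle of a scene is representative
    commands.append(build_frame_command("lecture.mp4", midpoint, f"frames/frame_{i:02d}.jpg"))

for cmd in commands:
    print(shlex.join(cmd))
```

Two scenes produce two images, regardless of how long each scene lasts, which is where the redundancy disappears.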
So this is very important. And then at the end we create this AVT file, which stands for Agent Video Transcript. And this is it. This is what you see here.
So you see that this file has a visual description, it has the audio, so the actual transcript in this case, and it also has the frame relating to that particular scene. This is a standard format that I have created because I want there to be a standard for how agents handle videos. There's no standard for this now, so this is my attempt at creating a file format for it. My thesis is that we are going to be working with agents extensively in the future.
And the issue is that for agents that want to work on video-heavy tasks, video is very dense when it comes to data. It consumes a lot of energy to process these videos. So I want a file format that's very similar to the SRT or VTT formats. If you don't know what these are: whenever you download a video, you're able to also download a transcript with it. So instead of just downloading a transcript, we download this type of file, where we also get a visual description of what's going on in the video. What this allows us to do is send it to some sort of video editing AI agent that will be able to edit and manipulate the video more easily, because now we have a visual layer on top of the traditional text layer. And when I say text, I just mean the transcript, right? There's someone talking in the video, but we don't have any visuals on top of it. So we also add a visual layer: we describe what's happening in the actual video. And since we have this, let's split it out by scenes, so we can catch when there's a change from one scene to another. Let's say we have a transition from a talking head to a presentation, as I explained before, or to a motion graphic, or a B-roll comes up. So this is my best attempt at defining some sort of standard file format for agents to use more efficiently in the future, so we can easily manipulate videos with AI agents.
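To make the idea of a subtitle-like file with a visual layer concrete, here is a sketch of what one scene entry could look like. The layout below is hypothetical and is not the actual AVT specification, which is not spelled out in the video; it simply borrows the numbered, time-ranged cue shape of SRT and adds the three things the file is said to contain (visual description, transcript audio, and frame reference). All field names, paths, and sample values are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    index: int
    start: float   # seconds
    end: float     # seconds
    visual: str    # description of what is on screen
    audio: str     # transcript text spoken during the scene
    frame: str     # path to the extracted frame image

def fmt_time(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm, the timestamp style used by VTT."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def render(scenes: list[Scene]) -> str:
    """Render scenes as blank-line-separated cues (hypothetical layout)."""
    blocks = []
    for sc in scenes:
        blocks.append("\n".join([
            str(sc.index),
            f"{fmt_time(sc.start)} --> {fmt_time(sc.end)}",
            f"VISUAL: {sc.visual}",
            f"AUDIO: {sc.audio}",
            f"FRAME: {sc.frame}",
        ]))
    return "\n\n".join(blocks) + "\n"

# Placeholder scene; the end time, text, and path are invented for the example.
example = [Scene(1, 166.0, 200.0, "Slide showing a retention graph",
                 "(transcript text for this scene)", "frames/frame_09.jpg")]
print(render(example))
```

Because each cue is plain text, an agent can search or compare many such files without re-running any video analysis.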
And so this is the repo, and I will leave it in the description. This is essentially the skill that you can go ahead and download. You can take any video URL from 1,400-plus sites, and you can install it as a global skill. As I said, someone suggested that I turn this into an MCP. So if you guys want this as an MCP so you can use it in Claude Desktop, let me know. I can spin it up really quickly, or you can do it yourselves by just downloading the repo. Again, this is open source.
You can do whatever you want with this.
And yeah, you can either install it via the plug-in marketplace or you can just manually call this, whatever you prefer. And to use it, you just run /analyze and put in the video.
You can also add a prompt after the URL.
You can point it to a local video as well. Make sure to add your Gemini API key. And if you're not using a YouTube video, you're using something else, make sure to add your Groq API key. Anyway, you can ask Claude Code about this. It will tell you the exact steps, like where to go on Groq to get the API key.
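The key requirements above can be expressed as a small pre-flight check. This is a sketch under assumptions: the environment variable names `GEMINI_API_KEY` and `GROQ_API_KEY` are guesses rather than names confirmed by the repo, and treating every YouTube URL as having captions is a simplification of the "Whisper only if there are no captions" rule.

```python
from urllib.parse import urlparse

def required_keys(source: str) -> list[str]:
    """Return the API keys needed for a video URL or local path (assumed names)."""
    host = (urlparse(source).hostname or "").lower()
    is_youtube = host in ("youtu.be", "youtube.com") or host.endswith(".youtube.com")
    keys = ["GEMINI_API_KEY"]        # always needed for scene analysis
    if not is_youtube:
        keys.append("GROQ_API_KEY")  # transcription fallback when no captions exist
    return keys

def missing_keys(source: str, env: dict[str, str]) -> list[str]:
    """List the required keys that are absent or empty in `env`."""
    return [k for k in required_keys(source) if not env.get(k)]

# Example with a fake environment; pass os.environ in real use.
fake_env = {"GEMINI_API_KEY": "placeholder"}
print(missing_keys("https://www.youtube.com/watch?v=VIDEO_ID", fake_env))
print(missing_keys("recording.mp4", fake_env))
```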
This is what you get. This is an example of the AVT file. And these are some other use cases. So, as I said, content research: break down the hook. What happens in the first 10 seconds? How did they structure this video? What sections did they use? You can also see the visuals, the transitions, the motion graphics. Explain the key concepts.
Right? You can learn from videos.
Production analysis: find every time they show a diagram and show me the screenshots. What's the production setup? Break down the intro visually. I added these start and end options if you want to look at a specific part of the video. If that's the case, then you can extract frame by frame, if you really need that amount of granularity. Debug screen recordings. You can compare multiple videos as well: you can ask Claude to find patterns across the AVT files, which is another reason the AVT files are so nice. You can use them for comparison and analysis later on with Claude without having to call a video analysis tool again. What action items were discussed? Give me the structured notes from this meeting. A whole ton of things. You can also see that it has some limits. Obviously, this is not a polished version of the skill, so if you want to contribute to it, that would be great. Let me know about this in the community. I'll be happy to help out with whatever is needed. So, Gemini 2.5 Flash, and Whisper only if there are no captions.
Most YouTube videos have captions, so Whisper is rarely needed if you just use it for YouTube, where most of the videos are anyway. But yeah, there you go. This is it. We have other CLI options as well, so you can just read through them. Anyway, if there are any doubts, just ask Claude Code. All it needs to do is skim through the repository and it should figure out what it needs to do based on your requirements.
That's it. I hope this was clear. I hope this is useful. I'm using it myself and I'm going to be using it for my apps as well. I think this is a very powerful skill if you use it correctly for content production, content analysis, and learning from videos, right? You want to create a course, you want to build a presentation from videos? You can now extract slides at smart intervals. So yeah, I can't wait to hear what you think about this and what you're planning to use it for. And if you have any other comments about this video, let me know in the comment section. As I said, I'll leave this in the description below. If you enjoyed the video, please give it a thumbs up. And if you didn't, let me know why in the comments below. Any feedback is good feedback. Thank you so much, and I'll see you in the next one.