Install our extension to search inside any video instantly.

Qwen 3.7 Plus for Free: The AI Agent That Can Actually See Your Screen
Added: 2026-06-03

2,870 views1029:57AIStackEngineerOriginal Release: 2026-06-02

Qwen 3.7 Plus is a multimodal AI agent model that combines visual perception with text-based reasoning, operating in both GUI (graphical user interface) and CLI (command line interface) modes to perform tasks like screen reading, form filling, terminal commands, and code generation. It excels in vision benchmarks (BabyVision: 64.7, ScreenSpot Pro: 79.0, RealWorldQA: 86.9) but trails behind specialized models like Claude in pure reasoning and software engineering tasks. The model represents a new trend in AI development where labs are releasing specialized models for different tasks rather than single general-purpose models, with pricing ranging from $0.40 to $4.80 per million tokens.

[00:00:00]Less than 2 weeks after Qwen 3.7 Max went official, Alibaba followed up with a second model in the same family. And this one takes a different angle. It's called Qwen 3.7 Plus, and instead of pure text reasoning like Max, this one can see. It reads images. It looks at your screen, finds buttons, fills forms, runs terminal commands, writes code, and chains all of that together as one connected workflow. The line they put on the launch announcement is one model sees, thinks, codes, acts, which is honestly a pretty clean way to put it.

[00:00:37]So, let me walk through what 3.7 Plus actually does, where it's strong, where it isn't, what we know about open weights, and a quick test you can run yourself in about 5 minutes.

[00:00:48]The model runs in two modes at the same time. A graphical mode they call GUI, and a command line mode they call CLI.

[00:00:57]That's the hybrid part of what Alibaba is calling a multimodal hybrid agent.

[00:01:02]When you want it to do something on a screen, like fill out a form or click through a settings menu, it actually looks at the page, finds the elements visually, and uses them. When you want it to write code or run commands or move files around on a server, it switches over and handles the text side. There's also search augmented question answering built in. So, when you ask it about an image or a screenshot, it can pull in current information from the web to ground its answer instead of just leaning on what's baked into the model.

[00:01:33]And like Max, it's scaffold agnostic, which means whether you plug it into Claude code, Qwen's own setup, or whatever stack you already built, you get the same behavior out of it. Most agent models so far have only been good at one or two of these things. Having visual screen control, terminal work, web search, and code generation in one model that's also competitive on coding benchmarks is the whole pitch here. On the vision side of the benchmarks Alibaba published, Plus is the clear leader. On baby vision, which tests visual understanding, it hits 64.7 while the next best is Gemini 3.1 Pro at 55.9 with GPT 5.4 at 53.1 and Claude Opus 4.6 way down at 12.6. On screen spot Pro, which is the test that matters for GUI agents because it measures how well a model can find and point to elements on a screen, Plus scores 79.0 while everyone else sits in the 67 to 68 range and Claude drops to 49.5.

[00:02:39]On real world QA, the real world visual question answering test, Plus pulls 86.9 against Claude's 73.9.

[00:02:48]And on MMBC, the multimodal benchmark, Plus and Claude tie at 46.3 with everyone else stuck down in the 18 to 28 range. So, on anything involving screens and images, the gap is not close. The picture flips on traditional coding and pure reasoning though, and I have to be honest with you about that part because most of the launch coverage won't be. On SWE-Bench Pro, the Agentic software engineering test, Plus scores 57.6 but Kimiko 2.6 actually wins at 59.5 with DeepSeek V4 Pro at 59.0 and GLM 5.1 at 58.8 also ahead of it. On SWE-Bench Multilingual, it lands at 75.8 with Claude on top at 77.5.

[00:03:38]On Humanity's last exam, the hardest reasoning test out there, Claude takes it 40.0 to 34.7 and DeepSeek beats Plus there, too. On the Claw Eval real world agent test, Claude wins clearly at 70.4 versus 62.7. And on tool calling, the BFCLV4 test, Claude is at 76.7 with Plus at 72.9. The pattern is pretty clear once you stop reading the headline. If your work needs deep text only reasoning or heavy repository level software engineering, Claude and the open competition like Kimmy and DeepSeek still have the edge. If your work needs a model that can see and act on what it sees, Plus is in a different league.

[00:04:24]Those Claude numbers come with a big caveat though. Everything Alibaba published is against Claude Opus 4.6, but Anthropic shipped Opus 4.8 since then. And 4.8 is currently the top of their stack. So, when you look at these comparisons, just keep in mind they're measured against a Claude version that isn't the latest one.

[00:04:45]We don't have head-to-head numbers against 4.8 yet. It probably doesn't change the vision story much because the lead on something like ScreenSpot Pro is wide enough that even a stronger Claude is unlikely to close it. But on the coding and reasoning tests, where Plus is already behind 4.6, Opus 4.8 could stretch that gap further. So, treat the published numbers as a useful floor on where Plus stands against the Claude family, not the final word. The way Plus fits next to Max is worth being clear about, too. Max is the bigger model in this generation. Text only, built for the hardest reasoning and the longest autonomous runs. It's the one that ran 35 hours straight on a single kernel optimization task with over 1,000 tool calls and nobody touching it. Plus is the more balanced one. It's smaller, cheaper to run, but it's the one with eyes and the one built for that hybrid GUI plus CLI work. Same generation, two completely different jobs. If you're building an agent that has to read screens and click through real apps, Plus is your pick. If you're doing repository-wide refactors or anything where the model has to grind on one hard problem for hours, Max is the better call. They're not really competing with each other. They're covering different lanes in the same family. Qwen has built up a lot of goodwill on open weights over the years, so it's worth tracking the pattern here. The 3.5 generation had open weight versions show up a few weeks after the flagship preview. 3.6 did the same thing with smaller open weight models like 3.6-27B landing after the Max and Plus previews went out. But 3.7 Max is still preview only with no open weights, and 3.7 Plus is in that same spot since it only launched today. If the pattern from the last two generations holds, smaller open weight versions of 3.7 should land on Hugging Face somewhere in the 4-8 week range, which puts us around late June or July. I want to be straight about this though. That's based on the past pattern only. Alibaba has not promised any release window, and there's a real chance the flagship models this time stay closed because the agent and vision pieces are a lot more commercially valuable than the stuff they open-sourced before. So don't build a plan around open weights that may not come. For testing it right now, the easiest path is chat.qwen.ai.

[00:07:17]Sign in, pick the 3.7 Plus model from the drop-down, and throw something at it that actually pushes the multimodal side. Here's the test I'd run. Ask it to build a fully interactive 3D Rubik's Cube in plain HTML and JavaScript with mouse drag to rotate the camera around the cube, click and drag on a face to twist it, a shuffle button, a reset button, and a move counter. That's a real test because it has to handle 3D geometry, two different kinds of user input, state tracking for 26 separate cube pieces, and the animation timing on each turn. Then feed that exact same prompt to whatever you normally use, Claude or GPT, and open both results in two browser tabs side by side. Try to solve the cube in each one. You'll find out fast which model got the rotation math right and which one falls apart you twist a second face. That tells you more than any benchmark chart. Pricing on 3.7 plus is also attractive and it's tiered.

[00:08:19]Input runs from 40 cents up to $1.20 per million tokens and output runs from $1.60 up to $4.80 depending on how much context you feed it. So at the low end it's dirt cheap and even at the top end it's well under what Claude Opus 4.8 or GPT-5 charge.

[00:08:41]That price gap has been Alibaba's real weapon this whole release cycle. Close to the quality of the expensive Western models, especially on vision, at a fraction of the bill.

[00:08:51]If I'm being straight with you, the pattern this generation tells you most of what you need to know about where this is all heading. Two models from one lab in basically the same stretch of days. One tuned for deep text reasoning and one tuned for multimodal agent work.

[00:09:07]That's the new shape of the competition.

[00:09:10]Labs aren't shipping one model that does everything anymore. They're shipping a fleet, each one sharpened for a different kind of task. And the pace makes it almost pointless to stay loyal to any single tool. By the time you get comfortable and build your whole setup around one model, there's a new option that does your exact job better or cheaper. The smart move is to keep testing. Keep your prompts portable so you can move them between models in a minute and don't get attached to any of them. Pick the right tool for the job in front of you and let the leaderboard fight it out. Go give plus a real test on something from your own work. All right, so that's it from the video and I hope you enjoyed it. If you did, please like this video and subscribe to the channel and I'll see you in the next video.

#Qwen 3.7 Plus #Qwen 3.7 Plus review #Qwen 3.7 #multimodal AI #vision AI model

Related Videos

Artificial Intelligence

OpenHuman VS Hermes AI: Who Wins?

JulianGoldieSEO

285 views•2026-05-29

Artificial Intelligence

Long-Running Agents — Build an Agent That Never Forgets with Google ADK

suryakunju

142 views•2026-05-30

Artificial Intelligence

5 Mind Blowing Omni Uses Cases

PaulJLipsky

1K views•2026-06-02

Artificial Intelligence

This computer is made from real human brain cells. And you can buy it.

Talktmsmedia

3K views•2026-05-28

Artificial Intelligence

BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2

aimmediahouse

122 views•2026-06-03

Artificial Intelligence

I Made the Same Anime Fight Scene in Every AI Video Generator

NobleGooseAnime

295 views•2026-05-30

Artificial Intelligence

Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S

cnnnews18

3K views•2026-06-01

Artificial Intelligence

I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)

AICodingDaily

298 views•2026-05-29

Trending

Revisiting The Cat Cafe For The Final Time

BenGtalks

3195K views•2026-05-29

Lil bro is a menace 🤣

NotAirJordan

2037K views•2026-05-31

Political Science

My response to the Police

RecklessBen

1496K views•2026-06-01

The Dancing Plague...

HoodieGuyStories

1730K views•2026-05-30