Microsoft Lens is a lightweight text-to-image model with only 3.8 billion parameters. It achieves high-quality image generation at 1440px resolution through efficient training on images paired with detailed prompts (the Lens-800M dataset) and support for a range of aspect ratios. It excels at realistic photography, complex scenes, and English text generation but struggles with Chinese text and precise object counting, which makes it suitable for users who want efficient AI image generation without a large model.
Deep dive
Microsoft Lens in ComfyUI: Tiny Model, Big Images | 5 Lens Tests: Realism, Text, Prompts & More
Today we’re going to introduce a very interesting model from Microsoft called Lens. This model is very small, with only 3.8 billion parameters, but it can generate images at 1440 resolution, including realistic photos and large-scale concept art. It also supports multilingual prompts. So how is this model different from the other text-to-image models we’ve talked about before?
How do we use it inside ComfyUI?
And after actually testing it, how good are the results?
Today we’ll use one video to explain all of that clearly. Alright, let’s answer the first question: what exactly is Lens?
Lens is a base text-to-image model released by Microsoft. As we just mentioned, it only has 3.8 billion parameters. So from the very beginning, this model was not designed to crush other models simply by having more parameters. What it really focuses on is training efficiency and the actual quality of the generated images. So why can Lens achieve this kind of result with only 3.8 billion parameters?
The first key point is the data. The official materials mention a dataset called Lens-800M. What makes this dataset special is that its images are not paired with short tags. Instead, they are paired with longer, more detailed text-to-image prompts.
The second point is resolution. Lens was not trained only on square images; it supports a very wide range of resolutions. In terms of aspect ratio, it supports everything from 1:2 to 2:1.
ComfyUI already supports this model. You can go to the Comfy-Org/Lens Hugging Face page, where all the model files are already prepared. There are roughly four categories of files.
The first category is the main model, which comes in two types: the normal main model and what we call the Turbo version. You can think of the Turbo version as a fast preview version. The normal version usually needs around 20 sampling steps to get a relatively good image, while the Turbo version only needs 4 steps.
Below that is the text encoder. Its name sounds a little intimidating: GPT-OSS. It is also in NVFP4 format, so I recommend using a newer version of ComfyUI. Below that is the VAE; here we use the FLUX2 VAE. Again, I recommend updating ComfyUI, because if you look carefully at the ComfyUI update log, you’ll see that the Lens model is only supported in the latest versions.
In this episode, our main test model will still be the normal version, because the model itself is not large. Even if you use the full model, it is completely fine for ComfyUI.
For easier demonstration, I also built the workflow on RunningHub. In the ComfyUI space, RunningHub is a really great online workflow platform, because whenever new models or new extensions appear, it usually follows up very quickly. RunningHub also provides an online community platform for ComfyUI creators, where you can find a lot of creative inspiration and example workflows.
The basic structure of this workflow is fairly simple. First, we load the main model with UNETLoader. Then, for the text encoder, we load GPT-OSS. For the VAE, we use the FLUX2 VAE we just mentioned.
There are several options you need to pay attention to. First is using UNETLoader to load the main model. The model you load may be either the normal version or the Turbo version, and you need to be very clear about which one you’re using, because the parameter settings later will be slightly different.
Next is the text encoder. Here we choose GPT-OSS. One thing to pay attention to is that the CLIP type must be set to Lens. If you choose something else, the encoding may have problems. Also, if your ComfyUI has not been updated to the latest version, Lens will not be available as an option. You simply won’t see it.
For resolution, I mainly used two settings. For square images, I used 1440 by 1440. For vertical images, I used 1024 by 1536.
The sampling part is also very standard: Euler sampler, simple scheduler. Since I’m using the full version of the model, I use 20 sampling steps. CFG is 5.0 and denoise is 1.0.
Next, there are two key points. The first is a node called ModelSamplingFlux. Inside it there is a group of parameters. You don’t need to memorize them; just choose the same settings as shown here. There is also a node called CFGNorm. Judging from the name, it should be normalizing CFG. Here we set strength to 1.0 and turn on pre_cfg. These two nodes affect the overall sampling and CFG stability.
Here I want to remind you of a problem that is very easy to overlook. If you change the resolution, intuitively it looks like you only changed EmptyLatentImage. But the width and height also affect ModelSamplingFlux. So in this workflow, I separated them into an independent node and connected that node to both places. This way you won’t run into problems.
Now let’s test the whole workflow. First, let’s look at realistic photography and complex detail rendering. The prompt describes a small independent watch repair shop at midnight. An old watchmaker wearing round glasses is adjusting tiny golden gears under a magnifying lamp. On the workbench, there are many opened mechanical watches placed on velvet trays. Outside the window, heavy rain is falling, and neon lights create red and blue reflections on the wooden workbench.
This prompt is actually quite difficult because it contains a lot of information. The character needs to be correct, the scene needs to be correct, the details of the small parts need to be correct, the rain outside the window and the neon reflections need to make sense, and it also needs to have a clear cinematic feeling. Even with many common models, problems can easily appear when you combine all of these elements.
But if we look at the image generated by Lens, one of its strengths is very high detail density. It does not just give you a rough atmosphere. On the table, we can indeed see a large number of mechanical parts and watch structures. The lighting is also fairly stable, with a clear contrast between the warm tungsten light and the cool neon outside the window. This shows that Lens pays a lot of attention to details inside long prompts, and the result is pretty good.
Second, let’s test Chinese prompts. I wrote a prompt directly in Chinese describing a cinematic photo: a rainy night in the mountain city of Chongqing; wet stairs extending upward; old residential buildings and street food stalls on both sides; red lanterns and neon signs reflected on the ground; a delivery rider in a raincoat pushing an electric scooter uphill; and ferry lights on the river in the distance.
This type of prompt is also quite challenging. First, it is not a generic street scene. The model needs to understand a lot of specific things, such as Chongqing as a mountain city, street food stalls, a delivery rider, an electric scooter, the river, and ferries.
Now let’s look at the generated result. I think it is usable. I wouldn’t say it restores the real Chongqing one hundred percent, but the overall feeling and atmosphere are right. The wet ground at night, the sloped road, the lights, and the sense of everyday life are all there. Of course, I want to remind you that when writing prompts, it is still better to use English whenever possible.
Now let’s look at text generation. Pay attention here: right now it can only generate English text well. Chinese text generation still has some issues. This example is the cover of a travel notebook. In the prompt, we clearly specify the title, and we also describe that the image should include a hand-drawn map, seashells, watercolor tape, and sunlight shining on the paper. The difficult part of this kind of image is that the text needs to be readable and placed in the right position. At the same time, the whole image should not turn into a generic poster template. I think Lens’s English text ability is very strong, and you can see that the overall result is on point.
The second text example is a perfume product image. This case is closer to commercial product photography. A black glass bottle, a minimal label, a reflective tabletop, soft spotlighting: these elements and the text label need to blend together very naturally. Again, Lens currently does not support Chinese text especially well, so don’t expect it to generate accurate Chinese text inside images.
For the fourth group, we’re mainly looking at large scenes and aspect ratios. Let’s start with the first one, a large-scale scene. It describes an ancient floating harbor, with some sci-fi and imaginative elements. Above the clouds, there are many wooden docks, sailing ships, huge mechanical cranes, travelers carrying packs, and distant airships flying like whales. What is the biggest problem with this kind of large fantasy scene?
The biggest fear is that it only has atmosphere but no structure. It may look grand at first glance, but if you look carefully, the buildings, ships, characters, and clouds all blend together, and the image does not feel clean or readable.
Another example is a futuristic data cathedral. It includes magnetic panels, glowing data pillars, transparent glass floors, engineers wearing black robes, floating holographic interfaces, and optical fibers hanging down from the ceiling. This kind of image tests the design of sci-fi atmosphere, architectural space, and glowing materials. It feels more like a concept art image.
Finally, let’s look at quantity and composition control. The prompt here describes a clean overhead product photograph. On a light gray desktop, there are exactly four pencils, two ceramic cups, one silver laptop, and one square sketchbook. All the objects have clear spacing between them, and there are no extra items. We can use this test to observe whether the model can generate accurate quantities and object relationships according to our requirements. My conclusion is that Lens can understand this kind of request, but it cannot guarantee perfect accuracy every time. When it comes to precise quantities especially, all text-to-image models can easily fail.
About aspect ratio, I want to remind you again: if you are making vertical images, don’t only change EmptyLatentImage. You also need to change ModelSamplingFlux.
Finally, let’s summarize. I still want to emphasize that Lens is not simply about having the best possible quality. What it is really trying to do is use a model that is as small as possible to generate images of as high a quality as possible. Its capabilities are also quite diverse: it supports many different kinds of prompts and many different resolutions. For large scenes and clear spatial relationships, it also performs pretty well.
That’s all for today. So what are you waiting for?
Go try it yourself. Follow me and become someone who truly understands AI.
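As a recap of the settings mentioned in the video, here is a minimal Python sketch of the "one width and height, fed to both nodes" idea, plus a check for the 1:2 to 2:1 aspect ratio range. The node names and values come from the transcript; the function names, the dictionary layout, and the "sampler" key are hypothetical and are not ComfyUI's actual workflow format. The video gives the sampler, scheduler, CFG, and denoise values only for the normal model, so the Turbo entry carries the step count alone.

```python
# Illustrative sketch only: this dict layout is NOT ComfyUI's workflow JSON.
# Node names and numeric values are taken from the video transcript.

MIN_RATIO = 1 / 2  # 1:2 (tall), per the video
MAX_RATIO = 2 / 1  # 2:1 (wide), per the video

# Step counts quoted in the video for each main-model variant.
STEPS = {"normal": 20, "turbo": 4}


def check_aspect_ratio(width: int, height: int) -> float:
    """Return width / height, raising if it is outside the 1:2 to 2:1 range."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    if not (MIN_RATIO <= ratio <= MAX_RATIO):
        raise ValueError(f"aspect ratio {ratio:.3f} is outside 1:2 to 2:1")
    return ratio


def build_settings(width: int, height: int, variant: str = "normal") -> dict:
    """Collect per-node settings so every size-dependent node shares one size."""
    check_aspect_ratio(width, height)
    size = {"width": width, "height": height}

    sampler = {"steps": STEPS[variant]}
    if variant == "normal":
        # Values stated in the video for the full (normal) model only.
        sampler.update(
            sampler_name="euler", scheduler="simple", cfg=5.0, denoise=1.0
        )

    return {
        "EmptyLatentImage": dict(size),
        # Other ModelSamplingFlux parameters are omitted here; the video says
        # to copy them from the example workflow.
        "ModelSamplingFlux": dict(size),
        "CFGNorm": {"strength": 1.0, "pre_cfg": True},
        "sampler": sampler,  # hypothetical key; the video does not name the node
    }


if __name__ == "__main__":
    # Vertical resolution used in the video.
    print(build_settings(1024, 1536))
```

Calling `build_settings(1024, 1536)` gives EmptyLatentImage and ModelSamplingFlux the same size, which is the point of routing both through a single shared value instead of editing each node by hand.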