Ideogram 4.0’s integration of JSON and bounding boxes finally bridges the gap between generative unpredictability and professional design precision. This move marks the transition of AI from a creative toy to a reliable, structured production tool.
Deep Dive
Ideogram X Comfy: Founders Live
Hello friends. Welcome back to the Comfy Org stream. I am, as always, Purz, and this is Rob. What's up, Rob?
>> Hey everyone. Glad to be here.
>> Yeah. So today we're going to hang out with the CEO of Ideogram and the CEO of Comfy Org once they arrive. But right now we are just going to hang out and talk about how cool this model is. I don't know if you guys have been playing with it yet. You probably have, since that's why you're here.
But I'll just let Rob gush about it, cuz he's been really stoked about it.
This is a very cool new open-source model that does bbox region-based prompting, already has LoRAs, and already has a bunch of open-source support. So yeah, man, what has your experience with the model been like?
>> Yeah, honestly there's so much we could talk about regarding this model, cuz it kind of opens up a new modality for working with image models, with this bbox structure and some custom nodes that have popped up.
You really get fine-grained, pixel-level control with the bbox.
And there are workflows popping up from the community for doing image detection on the input images to detect the bboxes, and then you can tweak and refine them to your liking. So yeah, I've been a huge fan. The model is also very stylistically diverse: you can really push texture, especially with graphics. I think that's where the model shines. We have really solid text rendering, so when it comes to graphic design, or graphics in general, the flexibility and diversity of fonts you can get out of the model, the texture, it's just really impressive, and it opens up so many new ways of creating.
>> Yeah, this is an example here on the left. What we're talking about, basically, is that with the bbox regional prompting I can say, I want this thing here, I want this thing here, I want this thing here, by literally painting boxes into the area. Like, I want the label down at the bottom right, I want the title up here at the top. So it just unlocks this really interesting way of working with the model, especially from an editorial and design point of view. But I see our guests have arrived, so let's bring them on.
Hello, everybody. How are we doing today?
>> Hello.
>> Hi, guys.
Glad to be here.
>> Welcome, welcome.
Yeah, we've just been gushing about your model.
What's the ride been like for you, Mohammad?
>> It's been awesome. We are so happy to see Reddit posts on using the model. It kind of went from everybody saying this model is garbage to, oh, actually this model is insane, and teaching each other how to do bounding box prompting and everything.
So it's been a wild ride, and we're excited to release the next version and work with the community.
>> That's great. It's funny how that works, right? The community always says it's garbage, and then they actually start digging around in it and they're like, "Actually, it's pretty cool." And then they dive in a little further. But yeah, before we get started, if you guys want to introduce yourselves, we can do a more formal introduction, just in case anyone doesn't know who you are. Welcome. You want to start, Yulin?
>> Yeah, I will start.
Yeah, just to add on to that point, I do think it's an example of how a really high-quality model eventually gets over whatever early difficulty or curve.
We've also seen a lot of models that claim to be this amazing thing, and then the community immediately realizes it's not great, and it just crashes down and no one talks about it afterwards. But anyway, going back, my name is Yulin. I'm the co-founder and CEO of Comfy Org.
Yeah, it's actually great to be here, because Mohammad and I chatted a very long time ago about integrating the Ideogram model into Comfy, and I remember back then we were kind of pitching releasing an open-weight version of the model. I do remember thinking, let me just plant a seed and hopefully one day make this happen, and I'm glad this indeed happened. Super excited on our end overall.
>> Big time. We love it when these companies go open.
And Mohammad, maybe you could tell people about your background and where you came from.
>> Yeah, my name is Mohammad. I'm co-founder and CEO at Ideogram.
We've been around over 3 years now.
When we launched our first model, people went crazy again. It was about getting text right in images.
And we've been continuously grinding, trying to improve the model.
And we think quality is the most important aspect for these models.
Taste is really important, texture is really important, so there's still work to do.
So we're still working hard. And I know editing is really important, too, and hopefully we're going to have something on editing.
And what's really interesting is I do believe curation plays an extremely important role,
really going the last mile of, "Okay, this image doesn't meet my bar, so I'm not going to include it in the training set,"
or, "Okay, I have to include this image, no matter what." And that is a very subjective decision that we don't want to make as a company. We make some decisions, but we would love to have the best artists, the best creatives, the best AI artists make these decisions and then share back with the community. And I think that's the real value of the open ecosystem, with LoRAs and everything else. We have done very little so far. We would love to embrace the community with editing, with LoRAs, and hopefully be the standard for the community to build upon.
>> That's a really refreshing take on curation in general. I feel like a lot of model training companies feel like they need to decide for you what is aesthetically pleasing.
So it's really cool to think that you're actually going by metrics and data and what people think. That's great.
So maybe you can tell us a bit more about Ideogram and Imagen. How did this all start? Where did you guys come from before you decided to switch to open, or embrace open source? What was the whole plan beforehand?
>> I guess maybe going back before Ideogram, I was at Google. I was there for 7 years.
And I was among the first people that saw the potential in text-to-image.
I started this project called Imagen.
The OpenAI team scooped us with DALL-E 2, and then we came back with Imagen, and it was really high quality at the time.
The problem was it was more of a research effort, not a product. We couldn't release the model to the community and we couldn't release an API.
That's kind of a different age in a way. This was before ChatGPT. Google was a lot more conservative. It's refreshing to see Google is also moving forward and releasing these models, pushing the quality forward, which is good for everybody.
But then I left and started my own company. I thought I could have a bigger impact. I was a creative child. I love creativity. I love the visual space.
And I know some of the large language model companies, some of the coding companies, make more money, but at the end of the day, for me it's more about my passion and what I want to be part of. The ambition for the company is to be at the forefront of art and culture and design.
And that's a very multi-faceted space, in that you have so many cultures, you have so many communities, and each one of them has its own unique aesthetics. So yeah, that's kind of the story of the company.
Really excited to be part of the ecosystem.
And before the open-weight model, when we started, the space was so open, in that we were the best model on the market.
Midjourney existed before us. Obviously, they got a lot of attention. But we thought we could build the model and the app and the API and everything.
But it turns out that's actually a lot of work. And a small company, we're fewer than 40 people, can't do everything.
And with this open-weight release, what we're saying is, "Hey, we would love to work with the Comfy guys, with you guys. We would love to work with inference providers. We would love to work with app makers and extend ourselves through partnerships with the open-source community, with the developer community, and with every startup or enterprise in this ecosystem, to build a coalition and propose an alternative to the closed-source big tech option."
So yeah, a lot of people like the OpenAI and Google models, but we think with the help of the community, we can actually build a better alternative, which is much more versatile, much more customizable.
And in this space, it's all about customization. There's so much that can be done.
And that's our vision for the future, which is slightly different from the initial vision, but at the end of the day, it's art and culture and design.
>> I do think this is a really interesting aspect. I think you mentioned before, Mohammad, that a lot of the major labs are focusing on a post-training, fine-tuning process that makes these models look very good on benchmarks and with simple types of prompts, whereas the Ideogram model is giving you full control over the JSON. Walk us through that philosophy a little bit: why give the entire spectrum of control, versus combining with an LLM system that lets you use simple prompts to get the result, or guesses what the user's intent is?
>> Right. So yes, this model is relatively small compared to some of the closed-source models or some of the newer models from companies in the image space.
And we've made some design decisions that are unique and different, and as a company we want to innovate. We think this space still requires a lot of innovation.
To answer your question, the reason we went for JSON prompting with these detailed descriptions, as opposed to short prompts, is twofold. One is control.
We think, again, most professionals, especially in design, want to control the output.
And language is often not enough to control the composition. It's a 2D output, and people are very specific: okay, I want my logo here, I want my text here, this is the composition I'm going after. And no matter how hard you try to describe the scene, the model often ends up not getting what you mean. So we do think the community wants control. This is our best idea as of now for how to give the community more control. It's refreshing to see the community adopting it, but I'm sure there's a lot more work to do in terms of giving the community even more control.
Editing is one thing there, but there are other alternatives. Maybe sketches are another approach. Maybe something else.
So yes, we are excited to figure out how to give the community more control. And another element is just from a practical AI training point of view, these models perform better the more specific you are.
And we do believe Google and OpenAI also accomplish this level of quality because under the hood they have very detailed prompts. What they do is they don't expose those detailed prompts to the user. They rewrite the prompt under the hood and then feed that into their image model. But that creates a lack of control for, again, professionals and the community, because they may want to just change one word. Sure, you can do editing, but that's a little different from knowing exactly what the prompting structure is for those models and how you can tweak it across many subjects and many different backgrounds.
Us being open weight and being part of the community, aspiring to be the standard for the community, we are very open in terms of what decisions we've made and what the language is.
And you can take any image and describe it with our own API. I know the community has come up with many different LLM system prompts to describe scenes, and that's how we train the model. We train the model on these very long, detailed JSONs with bounding box control.
That gives us better quality, and it gives our users a lot of control over the output. I must explain that the bounding box is not a requirement. That is, the model has been trained with and without bounding box coordinates. The elements are required. That is, the model expects a JSON structure with a list of elements, but not every element has to have bounding box coordinates. That's optional. So if somebody wants a cat and a dog dancing and wants to leave it to the model to decide whether the cat is on the left or on the right, you can leave that decision to the model, but you have the option to specify the bounding box as well.
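To make the "elements are required, bounding boxes are optional" point concrete, here is a minimal sketch in Python. The field names (`description`, `elements`, `type`, `bbox`) and the normalized `[x_min, y_min, x_max, y_max]` convention are assumptions for illustration only; the actual schema is whatever the Ideogram documentation specifies.

```python
import json

def make_element(kind, description, bbox=None):
    """Build one prompt element. kind is "text" or "object" (hypothetical values)."""
    element = {"type": kind, "description": description}
    if bbox is not None:
        # Hypothetical convention: [x_min, y_min, x_max, y_max], normalized to 0..1.
        element["bbox"] = list(bbox)
    return element

def make_prompt(summary, elements):
    """Serialize a prompt; the list of elements is the required part."""
    if not isinstance(elements, list):
        raise TypeError("elements must be a list")
    return json.dumps({"description": summary, "elements": elements}, indent=2)

prompt = make_prompt(
    "A cat and a dog dancing",
    [
        # No bbox: placement is left to the model.
        make_element("object", "a dog dancing"),
        # With bbox: the cat is pinned to the left side of the frame.
        make_element("object", "a cat dancing", bbox=(0.05, 0.2, 0.45, 0.9)),
    ],
)
print(prompt)
```

The same prompt can mix pinned and unpinned elements, which matches the cat-and-dog example from the conversation.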
>> Got you. I do think it might be helpful for some of the community members to understand that at the beginning there was some confusion between the safety filter and the JSON prompt training process. I guarantee you, to this day there's still some number of community members who are just like, "Oh, you know, they're gating this behind some kind of system that doesn't allow me to generate even a simple one-liner prompt."
But yeah, I'm sure you have more to say about the misunderstanding itself.
>> I should have used that gray image with images blocked by safety filter as my background.
>> That would be so funny.
>> That seems [laughter] to be the theme for the community. Somebody on Reddit was saying in every image generated we should have that as a requirement as the last image.
I also don't know if you guys wanted to show a demo, more visually, for the community, but just to keep it short: we did screw up, in that the safety image shows up when the prompt is not familiar to the model. When the language is not really familiar to the model, it goes into unknown territory and it just blocks the output.
Now, that is technically a bug, but it is a feature in that the community tried to understand and read the documentation:
"Okay, why am I getting this image? Maybe there's something that I'm doing wrong."
And we had it in the documentation: this model expects JSON format in this shape and form. You can use a large language model to rewrite your prompt with this system prompt. You can call our API endpoint for extending prompts for free.
And the full specification is in the docs. But then, not surprisingly, the model is scraped and shared across different repos, and people download it and don't read the documentation.
So that safety image kind of acted as a reminder that, oh, maybe there is a different prompting language that everybody needs to adopt. We would like to address it with a new version that doesn't have as much constraint in terms of the language and hopefully can work with more descriptive language without JSON, too.
And that's been feedback from the community. So we would love to release that model and then focus more on LoRA, because we're worried that if we release our LoRA training and work with you all on LoRA training nodes and everything, then there will be a lot of LoRAs on the current foundation. And then if we release the next base, it's going to be annoying for people to move. So the plan for us is to release a new base with some of the fixes and then go all in on LoRA.
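Since plain one-line prompts are what tripped the refusal image, a small pre-flight check before sending a prompt to the model could catch the problem early. This is only a sketch: the `elements` key is an assumed name based on the "list of elements" described in the conversation, and the function does not validate the full documented schema.

```python
import json

def looks_like_structured_prompt(prompt_text):
    """Return True if the text parses as a JSON object with a list of elements."""
    try:
        data = json.loads(prompt_text)
    except json.JSONDecodeError:
        # Plain natural-language prompts end up here.
        return False
    # "elements" is a hypothetical key name; check the real spec in the docs.
    return isinstance(data, dict) and isinstance(data.get("elements"), list)

for candidate in ["a cat and a dog dancing", '{"elements": []}']:
    if looks_like_structured_prompt(candidate):
        print("structured prompt, ready to send")
    else:
        print("plain text, rewrite it into JSON with an LLM first")
```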
>> That's super.
>> Very cool.
>> [snorts]
>> Right on. Well, let's jump over and we'll have Rob show us how some of this stuff works.
>> Let me pull that up right now.
>> While they pull that up, I guess, was there anything super surprising for you in releasing to the open-source community? What were some of the biggest surprises for the team when they saw the reaction from Reddit or the entire open-source community?
>> I guess it's exactly JSON prompting, because when we were thinking about this, we weren't convinced we should build our own user experience around bounding boxes, cuz we were like, this is too complicated. People will not do it, cuz it's very abstract. You just draw a box and call it a cat, and then you draw a box and call it whatever, background.
>> [clears throat]
>> It was really surprising to us that the community actually loves this feature and people want to control the composition. That wasn't obvious to us at all.
>> Yeah, and from my perspective, I think it scares creatives when you have to structure your thoughts like the JSON, but as you can see here in our template, we have that magic prompt, which has quite a detailed system prompt under the hood that can bring a brief or detailed description into that structure. So yeah, I guess we'll get started with this demo. I'm just going to quickly mute those nodes, but this is the default workflow in Comfy, and we're going to start with just enhancing this pretty simple prompt and getting it into that JSON.
Oh.
See, resolve them before running the workflow.
Oh, excuse me.
Live now.
>> Oh, yeah.
It's okay. This is This is what we're here for.
>> Oh.
>> I think someone just asked if Mohammad can confirm that bounding box order matters in the JSON.
In their tests it looks like it does,
but they just want to get official confirmation.
>> Order of elements does matter, if that's the question. And it's designed to be more like back to front.
That is, start from the smaller objects in the background and then come forward. The problem is, now you may have a lot of bounding boxes that are at the same depth.
And then there's also an order. I believe it's left to right, top to bottom, right?
And the text usually appears at the end.
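The ordering described above can be sketched as a sort key. This assumes each element carries a hypothetical `depth` value (larger means farther back) and a normalized `[x_min, y_min, x_max, y_max]` box; neither is confirmed as part of the real schema, and the reading-order tie-break is what the speaker recalled with some uncertainty.

```python
def order_elements(elements):
    """Sort back to front, then top to bottom and left to right; text goes last."""
    def sort_key(element):
        x_min, y_min = element["bbox"][0], element["bbox"][1]
        is_text = element["type"] == "text"
        # False sorts before True, so non-text elements come first.
        # Negating depth puts the farthest elements at the front of the list.
        return (is_text, -element["depth"], y_min, x_min)
    return sorted(elements, key=sort_key)

elements = [
    {"name": "title", "type": "text", "depth": 0, "bbox": [0.1, 0.0, 0.9, 0.2]},
    {"name": "dog", "type": "object", "depth": 0, "bbox": [0.6, 0.4, 0.9, 0.9]},
    {"name": "mountain", "type": "object", "depth": 2, "bbox": [0.0, 0.1, 1.0, 0.6]},
    {"name": "cat", "type": "object", "depth": 0, "bbox": [0.1, 0.4, 0.4, 0.9]},
]
print([element["name"] for element in order_elements(elements)])
# prints ['mountain', 'cat', 'dog', 'title']
```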
>> Yeah, so here we got a quick generation. You can really see, I mean, this image itself is quite nice. It follows the structure of the prompt pretty well, the high-level description, and then we start getting into the color palettes, which I think is such an amazing, you know,
>> So handy.
>> to this model.
You can get so granular. Once again, that control has just been so fun to play with, using it to really refine and tweak. But what I also love about this model is the fact that you trained it with very flexible sizes. So you can get really fun aspect ratios here, like a 4x1 or 1x4.
So yeah, that's an element of the model that is just so great.
>> Also, look how fast it is.
>> Yeah, this is on default. We can get even faster here with the turbo, but I also prepared a couple more workflows.
We can jump into those.
So, we do have somewhat of an image-to-image workflow where we're using an LLM to get those bounding boxes from this input image.
You could use Florence-2 and some open models, but here we're using Gemini 3.1.
But yeah, just on the model itself, I've been super impressed with the graphics and the texture that you can get. Like you mentioned, Mohammad, there's the cultural diversity that your team has, and I feel like you see that in the model in the number of aesthetic styles that you can push the model towards. So that's been so fun to play with, and I know the community has been praising that quite a bit.
>> Yeah, that's going to trickle down into the way LoRAs work too, I think. Um, but it's just really going to like amplify what's possible in the model by then like drilling down into specific styles.
I'm I'm really excited to see what happens with the LoRA community.
>> It looks really good. You should train a LoRA for you guys. Comfy brand LoRA.
>> It is very interesting. I think a core problem, a lot of times, when you're using our own software for brand purposes, for marketing material, is where you really see all of the enterprise customers coming in from. Sure, I can dump as much context as possible into an editing model and hope that will work, or I can just fine-tune it against the style and make sure that every time I get exactly what I want. And that's definitely something we're pretty excited about: being able to lean in with this type of model for all of our marketing campaigns, to just dump product or dump materials out there.
>> Yeah. And just back here, this node has been used by so many people. Shout out Kijai. This is kind of where I think the magic of the JSON prompting is. It's not in the actual text, it's in the visual editor, where you can very precisely control the bounding boxes.
And it's just been so much fun to iterate on an initial bounding box structure and really tweak things to your liking. So, you know,
>> That's actually one question I had for you, Mohammad.
Was that a decision based on wanting to have a model that's really good for design, for layouts, for people, or was it just something that kind of happened when you started doing the bbox stuff?
>> No, we we we care a lot about graphic design.
Uh, we think that's how we can differentiate as a company.
And, uh, when you think of image generation, like, obviously design is one of the most important use cases.
And we were hoping this model would give you editable text.
That was part of our goal for the model release.
But that's going to come back later.
But because of that ambition to make text editable and make the model really good for graphic design, we have very specific descriptions of text. We actually have a category of elements for text and object. We care about text rendering and composition. So yeah, that was very intentional.
>> So you can describe the type of fonts you want, things like that, and it'll actually do that. Cuz a lot of models will just kind of decide for you what they feel the vibe of the typography should be. But yeah, that's really interesting.
>> And we're hoping to actually release a prompt guide with more details, cuz we had documentation with only a high-level description of the prompt.
But yes, we are hoping to release a prompt guide with more details, and it's really wonderful to see the community read all of the documentation on prompting line by line, or have their agents read it line by line. So we're hoping to release even more in-depth documentation for the community.
>> It's amazing.
The other thing the community does a lot of times is go figure out random corners of these models' capabilities themselves, which is pretty surprising. They're just people running on their own machines, fine-tuning LoRAs on top of the model, and in the end coming back with results that are somewhat surprising a lot of the time, even for the model research teams. I don't know if you guys have had this experience yet, but with some of the other models that people released it was like, "Oh, I didn't know you can do it this way," or even low-level kinds of control to add all these components or boxes.
>> Yeah, and just on that, here's a JSON prompt that I actually got from the community where you can generate a character sheet. I found this on Reddit. We'll link the user in the description of the video. It does such a great job at, once again, that control: you can generate character sheets, get turnaround views, get very descriptive with the outfit of the character. And for an open model, it's just crazy to think that it's at this point where you can have such control, once again, over
>> Yeah, not only that, you can tell it exactly where you want this grid to live. With the text-based natural-language models, you're really just saying, I would like a grid, I hope it's the right shape. I can't really decide how I want to chop it up every single time, because the grid might not come out the same every single time. But in this case, you're actually sequestering the image out into sections, so you can say, here, here, here, here. So this really works well when we want to split aspect ratios and that sort of thing. You've got accessories, you've got a close-up, you've got four tall shots. You get a lot more data into every image than you would otherwise. You would have to have two character sheets to make that work.
So like that's that's really exciting to me.
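Carving a character sheet into fixed cells comes down to generating a grid of boxes. A minimal sketch, assuming normalized `[x_min, y_min, x_max, y_max]` coordinates (a hypothetical convention) and a helper name, `grid_boxes`, invented for this example:

```python
def grid_boxes(rows, cols, margin=0.02):
    """Return normalized [x_min, y_min, x_max, y_max] cells, row by row."""
    cell_w = 1.0 / cols
    cell_h = 1.0 / rows
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append([
                round(c * cell_w + margin, 4),
                round(r * cell_h + margin, 4),
                round((c + 1) * cell_w - margin, 4),
                round((r + 1) * cell_h - margin, 4),
            ])
    return boxes

# Four tall shots side by side, like the turnaround views on a character sheet.
tall_shots = grid_boxes(rows=1, cols=4, margin=0.0)
for label, box in zip(["front", "left", "back", "right"], tall_shots):
    print(label, box)
```

Because the cells are computed rather than described in words, the layout stays identical from one generation to the next.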
>> Yeah, this is really wonderful. And on that note, um hopefully we can add the link to that Reddit post that talks about character consistency by, you know, including the picture of the character on the left and getting the model to fill the image on the right.
That's actually how we trained our character consistency model six to nine months ago. I forget exactly when it was.
So we released our character model, and it was the best at the time. Rest in peace. I guess the space is moving so fast.
And Nano Banana launched after.
But that model was trained with that trick: on the left you have the image of the person, and on the right you do inpainting to recreate that person. We did a fine-tune with that left-right conditioning to get character consistency. And it was so shocking to us that, oh, the community figured this out. And I suppose a lot of these community members are just so passionate about what they do, and they may not have research training in AI.
So in terms of surprising results, that was one of the things I saw that blew my mind. It was like, oh, how did you know this trick? And yes, this is how we trained our model back then. We never talked about it publicly, cuz we were closed at the time.
But yeah, this community is just phenomenal in their skill and ability and really wonderful to be seeing all of the interaction with with what we do.
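The left-right trick can be sketched as a layout computation: the reference image fills the left half of a double-width canvas, and a mask marks the right half for inpainting. This only illustrates the geometry; the function names are invented here, and nothing below reflects Ideogram's actual training code.

```python
def side_by_side_layout(ref_width, ref_height):
    """Canvas twice as wide as the reference; the right half is the inpaint region."""
    canvas_size = (2 * ref_width, ref_height)
    reference_box = (0, 0, ref_width, ref_height)            # reference pasted as-is
    inpaint_box = (ref_width, 0, 2 * ref_width, ref_height)  # region the model fills
    return canvas_size, reference_box, inpaint_box

def build_mask(canvas_size, inpaint_box):
    """Binary mask as rows of 0/1, where 1 marks pixels the model should repaint."""
    width, height = canvas_size
    x0, y0, x1, y1 = inpaint_box
    return [
        [1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
        for y in range(height)
    ]

canvas, reference, hole = side_by_side_layout(ref_width=4, ref_height=2)
mask = build_mask(canvas, hole)
print(canvas, reference, hole)
print(mask[0])
```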
>> That's so cool. Well, the one thing our viewers love to hear about is any weird emergent properties that you found because of the training method you used. Was there anything that really surprised you when you were starting to do inference with this model, where you were just like, whoa, we didn't even train that, but it does do that? Were there any of those magical moments?
>> One thing I've seen is like people will specify the color palette on a per bounding box basis. Does that work actually?
Cuz we didn't train it that way.
>> Yeah, maybe. We'll have to have the audience let us know if they've tried that.
>> Cuz I saw that in many JSON prompts they do a color palette on a per bounding box basis.
We haven't trained it that way. So if that's working as well as the community wants, then that's an emergent property, for sure.
And one thing that does surprise us to some extent is just the stylistic variation that people talked about. I saw multiple people in the community comment on that, saying, "Hey, this is the best model for art."
We try to capture as many different styles as possible and keep the model more like a raw model that's not super opinionated and can capture many styles, but it was really nice to see the community confirm that and embrace that.
>> One thing that really strikes me is that a lot of models get very smooth and shiny, and they have a sort of glossed-over look, but there's a cinematic grain, or almost filmic quality, to a lot of the shots that I'm getting out of this model, and I really love it. And I'm wondering, was that a decision made in the training data, or did that just sort of happen? Cuz it feels like it might be from the training material, but yeah, I'm just curious.
>> Very, very intentional. This is something we care about day in and day out. We look at texture. We're like, "Okay, this is not acceptable. This is too AI-looking." I think we still have more work to do.
Sometimes photography starts to look a little more like illustration, like very hyper-realistic illustration. But yeah, this is kind of the ethos of the company: we want to combine our passion for AI with our passion for creativity and art. And as a result of that, if you join as an AI researcher, you've got to get your training on what is an AI look and what is a non-AI look, and what is texture.
We look at raw images to really be able to know, this is a real image and this is grain. So yes, that's been very intentional. I think we still have more work to do there.
>> We have a special guest at this moment.
It's the original creator of Comfy. He would like to come on and say hi and ask some questions.
>> Well, hello there.
>> Mohammad, meet Yannick. This is
>> [laughter]
>> the original creator of Comfy, as you can tell from the
>> [laughter]
>> of the unofficial
>> Good to see you.
>> This is, well, it's the official mascot.
>> Unofficial mascot.
>> Yeah, he likes to say it's his official, but yeah, [laughter] so basically, yeah.
The model is pretty good.
The only issue is I had a bad time prompting this character, cuz when I didn't use the JSON prompting, I got safety filtered. So yeah, first: what's the reason why you put the safety filter on the model? Cuz I think a lot of our community wants to know the reason.
>> Yeah, we have a little bit of a challenging time separating illegal use of these models from legal use.
And it's our first attempt at releasing an open-weight model. So we thought there was a way for us to build safety into the weights of the model, to at least make an attempt to prevent illegal use.
But I do think that was a little rushed, and it had other negative impacts on the model. We heard the community. There has been a lot of reaction and thoughtful feedback, and we're going to fix that in the next version of the model.
But that was the reasoning: just to make sure that the model doesn't allow for some of the inappropriate use.
>> Yeah, I think that's fine if you build safety into the model. Pretty much all models do this. It's just that they do it in a more friendly way, I find. If you prompt something that's unsafe, the model will generate something that kind of follows the prompt but without the unsafe part, instead of just giving you a refusal image.
So I think if you do it that way, you'll have a much better reaction from the community.
>> Feedback received. Yes.
>> And yeah, another thing that some people were asking is: why not publish higher-precision weights? Why only publish quantized weights?
>> That's a good question. Our thinking was about how we can release a very high-quality model that is also easier to serve on a consumer GPU, because obviously there's a RAM limitation, while allowing for deeper partnerships with select companies who want higher-precision weights.
Uh >> I see.
>> Yeah.
>> Yeah, the reason people were asking is that a lot of people do their own quantizations of the model, because certain quantizations run better on different GPU hardware. We have community members who do their own quantizations, and for the best results they prefer having the highest-quality weights available.
So, that's why people were asking that question.
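To illustrate why people who quantize for their own hardware prefer to start from the highest-precision weights, here is a minimal sketch of absmax int8 quantization. It is a generic textbook scheme chosen for illustration, not the format used by Ideogram or by any particular community quantization, and the function names are hypothetical. Each weight picks up a rounding error of up to half a quantization step, so quantizing again from already-quantized weights starts from values that already carry that error.

```python
def quantize_absmax_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale.

    Generic absmax scheme for illustration only (an assumption, not the
    released model's actual quantization format).
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


weights = [-1.0, 0.3, 0.0, 1.0]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
# Per-weight rounding error is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```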
All right. Another question is: why the dual-model architecture, with one model for the positive and one model for the negative?
I've seen this idea before, but this is the first time we see an open model that uses this.
>> Yeah, I mean, we've obviously been iterating on different variants of the model, and what we found is that, in order to fix some of the texture and color saturation issues that are associated with AI models, having more customized negative and positive models helps.
And so, we just released that as part of this model.
>> Yeah.
>> So, actually, the version of the model that doesn't have the negative model will just produce more plastic skin, more saturated color. I think some people in the community tried that. That's simply our secret sauce that made some of the older models we had high quality, and so we released that in an open-weight model. There might be other ideas. There might be ways of having a symmetric positive and negative model that achieves the same quality, but this is our best effort and the best idea we have so far.
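The stream does not spell out how the positive and negative predictions are combined. As a rough sketch, assuming the usual classifier-free-guidance rule, the two outputs could be mixed as below. The function name, the plain-list representation, and the guidance scale are all illustrative assumptions; the only difference from ordinary guidance is that the negative prediction would come from a separate model rather than the same model with empty conditioning.

```python
def guided_prediction(pos_pred, neg_pred, guidance_scale):
    """Combine two predictions with the standard classifier-free-guidance rule.

    pos_pred: output of the positive (conditioned) model, as a list of floats.
    neg_pred: output of the separate negative model, same length.
    Assumption: the released pipeline may combine them differently.
    """
    if len(pos_pred) != len(neg_pred):
        raise ValueError("predictions must have the same length")
    # Start at the negative prediction and move toward (and past) the positive.
    return [n + guidance_scale * (p - n) for p, n in zip(pos_pred, neg_pred)]


# A scale of 1.0 returns the positive prediction unchanged; larger scales
# push further away from what the negative model predicts.
example = guided_prediction([1.0, 2.0], [0.0, 1.0], 3.0)
```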
>> Yeah, but if someone were to train a LoRA, do they only train on the positive, or should they train it on both?
>> What we've done internally is only train it on the positive side.
>> Okay.
>> And that seems to work fine. I don't know if that's the experience from you or the rest of the community as well.
>> Yeah.
>> Yeah, I've seen that.
>> I was just going to say I've seen a lot of people plug the LoRA into both chains. If they're adding a LoRA, they make sure it's at, like, strength one on both chains.
Apparently that works better than just one. I haven't done enough testing to have an opinion on that.
>> Okay, I've got to go back to my team and gain a better understanding of which version is which, because there's some transformation of the weights as part of releasing the open-weight models, so I don't know exactly what amounts to what. I've got to check with the team. But that's very valuable feedback, and LoRA is an important priority for the team to focus on. Once we fix the base, LoRA will be the next high priority for us.
>> Very cool. Are there any upcoming LoRA-related releases that you guys are going to push out? I've also heard there are teases around editing capabilities on text and further control. I don't know if there's anything you can share right now.
>> Yeah, so there are three different efforts. One is the new base.
As part of that, we want to enable LoRA in a more native way and release our own code,
with all the recipes that we find optimal. Obviously, the community may have an even better recipe, and we would love to work with the community to finalize the recipe for LoRA.
And then image-to-image; obviously that's a very, very popular ask from the community. We think that's complementary to LoRA.
We'd love to work with the community to really build these bespoke editing capabilities. Imagine a picture-to-anime model, or a very high-quality product photography model
that is better than any of the closed-source models. I do believe that for editing in particular, curation is again very, very important, and we would love to work with the community to build these top-notch workflows for editing.
So that is two steps: the base for editing, and then the LoRA recipe for editing. Editable text is another important focus for us, but given the success of open weights, that's kind of taking a back seat; we want to release the editing first and then go back to editable text. A lot of our enterprise customers are interested in that for branding and graphic design.
That's very important. You want to keep using your font in images, and you want to make it editable so you can change the copy without changing the image. So that's the third category of work that we're very excited about.
>> Right on. Austris, one of our amazing community members, asked, "Was the unconditional model trained in a specific way, with a lower-quality dataset, or just with zeroed conditioning on the same dataset?"
>> Hey Austris, thanks so much for >> Let's bring him on.
>> [laughter] >> Thanks so much for all of the work you've been doing as part of the community. I saw Austris had the fast-model LoRA.
I'm curious; I guess that's a question from me for Austris: how did you train that LoRA? Anyway, for the negative pass, we often train with a lower-quality dataset to guide against bad generations.
So, your intuition is right.
>> Thanks. Appreciate that.
>> I guess, uh, should we go through some of the chat questions?
>> Yeah, there were a few more questions here. The most pressing one, I think, was: do you have plans to do any sort of ControlNet-like conditioning, you know, IC, that kind of thing?
>> We were thinking editing will potentially be a bigger, better focus, with a very good editing base that can do sketch-to-image, depth-to-image, regional editing, style transfer, lighting transfer. So our focus is to combine all of the editing capabilities into one model that can bring out those capabilities with text-based prompting.
We would love to hear if you guys think there's a better approach, and if ControlNet has some advantages over that approach, but that's been our plan.
>> Right on. Someone else, All Wave, asked: "I saw people doing LoRAs with multiple characters for Ideogram with good output. Any suggestion on how to achieve that, or any general advice on making LoRAs for Ideogram?"
>> I mean, one observation is yes, that was super fun to see. I think the bounding box control helps teach the model concepts faster and more effectively.
Because, okay, I always use this example. If you give me an image, let's say the image of us in this live stream, you can say there are four people on the live stream, one of them has a mascot, two of them are in one scene with a Golden Gate background, and so on. So you can describe the scene in an excruciating level of detail, but you can't easily point to patches in the image. Whereas with bounding boxes, you can point to different patches in the image and say exactly, "Okay, this region has a wooden-color background with texture," etc. And that helps the model train faster, because the model doesn't have to decode the description and the association between the description and different patches in the image. It can really use one image to learn more bits of information and learn faster. And we think that's probably very, very helpful for LoRA, because in LoRA you only have a few images of a character, and training too long often damages your LoRA. So, you want to train for as few steps as possible and with as few images as possible.
And so, bounding boxes are really important for LoRA.
My suggestion for LoRA is to train on bounding boxes, but drop the bounding box information with some probability, so the model doesn't have to take bounding boxes as input. Another suggestion is to try to mimic that language: kind of reverse-engineer the system prompt for your VLM that describes the image, and try to follow the JSON language as much as possible, with the style description and the high-level description.
You can call our describe endpoint.
That's a paid endpoint. It's pretty cheap to just get a sense of how the description works, and then you can mimic that on your local machine. I think that will likely give you better results, because the closer the language is to the language of the model, the fewer steps of training you need.
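The suggestion to drop bounding box information with some probability could be sketched as a small caption-preprocessing step like the one below. The caption layout (an `elements` list whose entries carry a `bbox` field) and the function name are hypothetical stand-ins, not Ideogram's actual JSON schema, and the 0.3 drop rate is an arbitrary illustrative value.

```python
import copy
import random


def maybe_drop_bboxes(caption, drop_prob, rng):
    """Return a copy of the caption, with every bbox removed with probability drop_prob.

    caption: dict with an "elements" list; each element may have a "bbox" key.
    (Hypothetical layout for illustration; the real JSON format may differ.)
    """
    out = copy.deepcopy(caption)  # never mutate the stored training caption
    if rng.random() < drop_prob:
        for element in out.get("elements", []):
            element.pop("bbox", None)
    return out


caption = {
    "description": "two characters in front of a bridge",
    "elements": [
        {"label": "character A", "bbox": [0.10, 0.20, 0.45, 0.90]},
        {"label": "character B", "bbox": [0.55, 0.20, 0.90, 0.90]},
    ],
}
rng = random.Random(0)
# During training, each sample would get its own independent draw.
training_caption = maybe_drop_bboxes(caption, drop_prob=0.3, rng=rng)
```

With this kind of dropout, the LoRA sees both box-annotated and box-free captions, so it can still be prompted without boxes at inference time.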
>> That makes sense. Then it may be good advice for anyone who wants to caption their dataset for a LoRA like this to actually do it through your >> [laughter] >> your endpoint, so it's getting the closest conditioning that it's going to get. And I love the idea of actually telling it where things are without it having to figure that out itself. So that's very cool.
>> Yeah, I think they should use our endpoint as a starting point.
The challenge is that if you have multiple characters and you want to name them, our describe captioning endpoint will not know the names of the characters, and that's the magic of LoRA training that members of the community know better than me. So they can start there, but they need to do their own adjustments on top of the model's output. And when it comes to small text, our describe endpoint may not be the most accurate. There may be other, more state-of-the-art visual language models that are better at text. So you can decipher the language into a system prompt and then use your LLM of choice, maybe even one that costs more than our describe endpoint, to really get the details of the image right.
>> Right on.
Issam had a question here. He's noticed that the model's generations show very strong text adherence, and he's curious about how prompt adherence and text accuracy were measured during training.
I'm actually curious about that, too.
What's your method of determining if it's doing good text or not? Do you point a VLM at it, or do you have other internal methods, like OCR?
We're just curious on a very nerdy level how you're doing the evaluation.
>> Yeah, evaluation is very, very difficult, unfortunately, with image models. At the end of the day, people have got to look at them and do side-by-sides to really see the texture.
Unfortunately, a lot of the public benchmarks also aren't designed to measure some of the aspects of the images that the community cares about.
But we do measure things like the loss function for training as well as OCR text accuracy. Because we care about text accuracy, those are some of the metrics that we use, and we are also in the process of developing better large-language-model-based metrics to make evaluation more automated. That is an evolving area, and if somebody has a very good evaluator or ranker, we'd love to use it in our training.
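How the OCR text accuracy is computed is not described in the stream. One common way to score it, sketched here as an assumption, is to run any OCR tool on the generated image and compare its output with the text requested in the prompt using character-level edit distance. The function names are hypothetical, and the OCR step itself is left out: the sketch starts from the string the OCR tool returned.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def text_accuracy(expected, ocr_output):
    """Score in [0, 1]: 1.0 means the OCR output matches the requested text exactly."""
    if not expected:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(expected, ocr_output)
    return max(0.0, 1.0 - errors / len(expected))


# One wrong character out of four gives 0.75.
score = text_accuracy("OPEN", "OPFN")
```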
>> On the benchmarks, I'm curious how you guys view it, because in general we see this huge gap where people on X are just like, "Oh my god, look at these numbers," but on the creative side, people who are actually doing work don't even know this is a thing. They don't really evaluate by going, "Oh my god, this is the top benchmark model. I must use it." They actually test it themselves. I don't know how you guys internally balance this.
On one hand, there's the rhetoric around the entire AI ecosystem. And on the other hand, there's professional usage.
>> So far, before this latest release, we've completely ignored all of the benchmarks and just done our internal evaluation. We look at texture, we look at text accuracy, things that we care about and that we think our users care about.
And then we let the benchmarking companies benchmark our models. I think for model development, you've got to have a strong opinion: okay, what do you care about?
And just measure it internally and move forward. Our hope is that the community sees the value, the brand gets some recognition, and people see the benefit of that. With this release, though, we started talking with different benchmarking companies, and we realized, oh, actually there's a market for private evaluation of these models: you can pay them for a private evaluation and they don't release the ranking. So I suspect there's a lot of private evaluation happening for frontier models, and they try to optimize for the benchmark because that's what the X community or social media community cares about.
But we didn't do any of that. We just had the final checkpoints and we were like, "Okay, could you please rate this checkpoint according to your mechanism?" And I think it's an evolving field again. There are companies who care about evaluation and want to improve their evaluation. But the challenge is that if you're evaluating language models and coding agents and all sorts of AI agents, as well as image and video models, it's very hard to make it perfect across the board. So hopefully there will be some companies that evolve in the creative space to measure the quality of image and video models, because I believe the techniques required are a little different from the techniques required for measuring programming agents or chatbots.
>> Yeah, 100%. I do find that being public or private about the evaluation is also really important. When you have something out in the open, everyone games it, and that evaluation system is no longer valid. So yeah, it makes sense for folks to have this kind of private alpha system. And also, the image space is so wide. There are models that do great on animation.
There are models that do great on relighting or upscaling. I don't think any of the benchmark systems are looking at these things much at all. So yeah, if you want to start a company around this, please do, because there are two clients on this same call who would be in the market for it.
>> If that's not some alpha, I don't know what is. Uh you should take that take that and run with it, chat.
We had one more question from the audience, too: would it be possible to have polygonal bounding boxes, with more than four corners? Would it be possible to draw a bounding box around an object, like a mask?
>> The model hasn't been trained that way.
So feel free to test it. I don't think the model will be able to get it.
I was thinking maybe rotated boxes are the first place to start, especially for text. If you want angled text, you don't want to put in a bigger bounding box and explain that it's angled.
But we would love to figure out with the community what makes sense, because a polygon then starts to become too many tokens. People are already complaining, "Oh, you use so many tokens for your bounding box coordinates." And like it's >> when it's five sides, but when it's 300 sides, it's different.
>> [laughter] >> Right. Yeah.
>> It's uh Ideogram 4.0. That's why there's four. If you are looking for the next version of 5.0, then you'll have one more corner over there.
>> How about 4.5?
>> [laughter] >> Well, this has been amazing. If Yulon and Yulon, you guys have any more questions, we can hit them now.
Otherwise, um it's been amazing.
>> Yeah, I really appreciate your time, Muhammad. And great launch here. Looking forward to collaborating more with you guys.
Last question is actually around this idea that you guys have your own interface, your own infra systems. How do you balance working on your own interface with your own audience and also collaborating with folks out there like ourselves, or any other tooling companies, in terms of broadening the reach of this model?
>> Yeah, we want to be the the friendliest partner to everybody in the ecosystem, including you guys.
And I think in the early days, as our community can attest, we had our own community chats on a weekly basis. We were trying to serve our own users directly.
Now we're more at the infra level. We want to enable the creative ecosystem and be a good partner to you guys and to the developer community, and hear feedback and act on the feedback. Feedback is always right, right? The community is incentivized to come and tell us when we screw up, what they expected us to do, and what they want us to do.
So, we would love to be a friendly partner.
At the same time, we are learning from the community. For example, with this bounding box control and prompt builder, we initially thought maybe the community is not super interested, so we didn't build it into our own first-party app. But now, given the community response, we want to go and build it back into our first-party app. We had some initial insights that customization is really important, so we had our own LoRA training product as part of the first-party Ideogram app.
That got some traction, but it was based on our previous foundation, so we are very excited to revamp that with bounding boxes and release a new version. At the same time, we also want to enable ComfyUI nodes for training, or other platforms that want to allow LoRA training or even local LoRA training, to have a standardized recipe that other people can build on top of.
So, we we are here to kind of provide options to the community. We think the community needs options, and that's the value of the open ecosystem that other people can build their own tools, and it's going to be awesome.
>> Amazing. Appreciate that. Um, yeah, that's great. Thanks everybody for coming out. Thanks Mohammed, thanks Yolan, thanks Yannick, thanks Rob.
Thank you yeah, we'll be back next week with more crazy stuff. Who knows what we'll be up to.
But yeah, hopefully we'll >> Nice meeting you guys and very nice meeting you Yannick and big congrats on everything you guys have done. It's really fantastic to see and we're just getting started.
>> Also, I forgot to mention Fennec Earl.
>> [laughter] >> Nice meeting you too.
>> [laughter] >> All right.
>> Thanks guys.
>> Thanks everybody. Have a great Have a great one.
>> Bye.
>> Thanks so much.