Provencher correctly identifies context hygiene as the real bottleneck for AI agents, shifting the focus from raw model power to structural precision. This is a pragmatic masterclass in using modular workflows to overcome the inherent reasoning limits of current LLMs.
Deep Dive
Smarter AI Agents with Repo Prompt - With Eric Provencher
Hello, Eric. Thank you for coming. I had Adam on recently, and I've been watching your podcast for a long time, so I guess that's a good place to start. It's a really nice podcast. I've tried Repo Prompt a bit. Could you please introduce yourself and what you do?
>> Yeah, cheers. So, I'm Eric. I'm the founder of Repo Prompt. I've been doing this for almost 2 years now; I actually started in June 2024.
So if you've been working on AI tools, I guess there's not that many people that have been chugging along on actual tools for that long.
And the space has changed enormously in that time, so it's been interesting to see everything progress. A little background: I'm a software developer. I was a staff engineer at Unity before doing this, so I have a lot of game dev experience. But yeah, I'm really deep in the AI tool space now.
>> How did you get nerd-sniped into all of this? Because you went pretty deep. I think you have a very good feel for one of the biggest problems, which is how you actually give them context and information in a proper way. But yeah, when did you get pulled into all this?
>> Yeah, so I started working with AI for coding about as early as the first GPT-4 release on ChatGPT. Initially, like everyone back then, I was just copy-pasting snippets back and forth. If you were working on code with models from that long ago, you'd be familiar with the problem: "Okay, now output the complete file, please." You'd say that to your model a lot, and it would not do it, and you'd have to struggle with that. Fast forward to Opus 3 releasing in early 2024. It was an actual competitor to GPT-4 in terms of intelligence, but it had 200k context. So I started using it more seriously to actually work on a game; I was working on a game for the Vision Pro at the time. I was plugging away at it, but then one day Sonnet 3.5 came out and the latency was good, and I could actually see myself building more stuff with it. I was like, "Okay, I need a better tool for this." I looked around and there were some other little tools to manage this context, but none of them really did it the way I wanted.
So I started just building my own thing, and it just kept going, and I started sharing it with people. I started getting feature requests, and I just kept digging in and understanding more about the problem, what my users needed, and what I needed. It just kept going from there.
>> And how has it evolved? Because the models have changed significantly. Like you said, they used to just output kind of a diff and tell you "just add this" instead of actually doing the work. I test a lot of models, and I think a lot of the smaller models have that tendency too, where they want to tell you what to do. How has your tool, and your mentality around building these tools, changed as these models have gotten more and more capable?
>> Yeah. So when I started, the best you could do with these tools was to tell the model, "Okay, your problem is to look at these files, this is the code I want, let's get it done." From there I could also add a prompt so that, instead of just answering with code, the model would output it in XML, and then I could parse that XML and apply the edits directly to your file. That's kind of where I was for a while, and that was great because it automated this flow. Even though I was copy-pasting, the number of clicks was quite minimal and I could get a lot done. What evolved, though, is that around early to mid 2025, tool use started becoming decent. Claude Code had just come out, MCP was starting to evolve, and then around the fall we started getting some pretty good Codex models. That's when I really started to realize that the work I was doing manually, organizing context, could be automated in a much more powerful way. So I started working on a tool called the context builder. Just to back up a little on what that is and why it matters: the thing I still find matters a lot when coding with models, even with the best reasoning models and even though you have coding agents, is that, like you said, context matters a lot. What you want to do is find what the model needs to know to solve a given problem and give it the maximum amount of space it can have to reason about your problem.
And so by automating the work of collecting this context and packaging it up in a way that's like dense, you're able to like really fill the model's context window with just what it needs like and as much of what it can need as possible and not letting it kind of drift and call tools and do this other stuff. And just that work gives you like you can get plans, you can get reviews, you can get all kinds of really interesting stuff that lets the model just give you the best work possible. And so now that's automated but like having that as part of the pipeline and I call that like Oracle and context builder. Um that's to me is the most important thing in getting like really good outcomes.
>> Yeah. To the people that are I think a lot of people are just now jumping into actually using them in this like agentic way. It was November so my GitHub I've been doing open source for a long time but my GitHub just exploded. I think it was like four or 5x the amount of commits. um like uh 3 or 4,000 now when it was like 700.
>> Um and yeah, because I could just do so much more and I can also just experiment so much more. Um but there's still this problem where whenever you want to tell it to do something, you either have to like write documentation and keep your project like super modular in a way that um is consistent where like the docs cannot drift away from the actual code.
Uh you have to do either that or you have to just like have it waste at least like 60k context every time it wants to go and search for these files and um you know usually they're not super efficient with it like they go and check things that are not important. So how are you autom is it is it like >> do you have like these steps where there's an agent >> collecting things? Um so like is it is it just by breaking down into multiple agent flows or are you doing stuff that's static? It's pretty much that it but it's that with like like very carefully curated prompts with like a curated set of tools um that are like stateful like I have a tool set where the model can go ahead it can select a file and it can select like a subset of a file and then if that file is being edited by another agent I have to like rebase the slice of like that file and keep it in track. So, so there's like a whole system around collecting the context, keeping it in sync, making sure that it makes sense. And then when the work of that agent is done, then I hand that off to a separate model to reason.
Um, and the agent is like calling this in. So like you you talk to the agent, the agent will then call the context builder, which then calls the oracle and it comes back with a plan and then all of a sudden the agent can just work. And now like where I am now is that actually like that plan is pushed to a file and then and then the file is referenced by sub aents that are doing the work and the main agent is just orchestrating this work. So you can go and like have like multi-hour long threads and you don't need anything like goal or anything like that like it just it just runs does all the work. The agent just waits on the sub agents and everything is kind of done in a really efficient way.
Uh, I mentioned that I listened to your podcast and you you guys always talk about like the latest models and how how you're working. If I remember this correctly, you're like a GPT or you know like you use GPT mostly that's your preference. Um, what is it about like that model specifically that you know attracts you?
Cuz I feel like you know there's a a bunch of people that like the handholdiness of Claude that want like that you know for me now I've kind of changed. I used to like the claw behavior. Now I uh now I actually try to focus on um goals because I think that's like the current meta. It's like the most efficient way to squeeze as much of as possible. Um but then the models they have these problems where they are like sneaky. Like they'll tell you they read a file but they only read half of it or they'll tell you they did something but they kind of like hardcode it just so that it looks done. um how how do you deal with these different types of problems?
>> Yeah. Okay. Well, so there's multiple questions there like like I guess like what what I like about the codeex models versus quad like I think they're both great great models to work with. I think the the main thing that I turned to GPT models and and it's it's part of like how I've been working with the models for the last couple years. I think is that is that like when 01 pro came out you know that was like the first like deep reason you wait for a long time the model thinks through the whole thing end to end and gives you an output like if you're doing this work of collecting context for the model and you just hand it to the model it's going to give you something very complete of like end to end here's everything that needs to be considered and that's like the basis of what I I need out of these models and once you have that you can give it to claude you can give it to GPT it it doesn't matter as much though I do trust GPT more a little bit to check its work and and follow instructions. Um, like if you have like a Claude MD file, I'm sure you've noticed Claude will probably skip half the things in there, but GPT will actually really adhere to it and it's just a lot more rigorous with that. And so to the point where sometimes you have to be really careful what you're putting in there because it's it's it's like a program. Um, so for me that's like really important. Um, now how I'm dealing with like the models reasoning like reading only half the files like I mean it comes back to the same thing like I have a flow where I organize what the model needs to read and because the model that's like picking what needs to be read it doesn't have to do all the reading itself like I've different tools like code maps to like let it like quickly navigate and know what lines are relevant put them together and give them to the the big model. The model has all of the files in its context right away off the jump. So it doesn't have to decide to read something or not read it.
It has to read it. It's it's there for it. Um so it has to consider it and and it will reason with that in mind.
>> I I like I like the strategy. I think the strategy of like reducing their freedom is is is always been helpful.
like if the less freedom that they have, the more like the structure um you know with the like what is it called the Ralph loop and all all this where you have like a thing launching something else instead of having one long uh chain which so looking at the harnesses right now I think most of them are focused on that long chain where you just have a prompt and it goes for hours um so loop and and and claude and goal and codeex um there's there's other ones um why have uh have you ever used root code.
>> Uh I've I've tried it. I and I talked to the the the founders of root code. Uh actually they got started uh a little bit like a few months after I started repop. Uh so I was talking to them. They were like upset with some things in in Klein. And the one of the things they did first was actually allow diff editing which was one of the key things that I was working on which is just like like if the issues at the time like the model would only output the full file.
So if you're using cursor they had a second model to do that work. If you're using uh client then then sonnet would pay tons of tokens to do that work. So you had to come up with ways to get them to reliably edit only subsets of the file and that was like a whole space of engineering that took a lot of work. So anyway just a little tangent yesterday >> completely completely like the problem doesn't exist any we don't even think about >> yeah yeah yeah different times. Um, so what what do you think is like the current meta in terms of models like what because you know for me I I use GPT 5.5 a lot. Um I it it's not great at front end. I mean it could be I guess but it just like out of the box it's not so great at it. Um it's not so great at like doing a bunch of things at the same time. They like they like one thing at a time. This is how I feel at least about them is compared to Claude. Claude is like very shallow. It likes making changes all over the place. it it's very good at computer use. Um you know what do you feel is like your meta in terms of like the tools that you're using uh the models that you're using? Uh >> well so so I described you know like like the flows I use but like the main thing that I I kind of find is the right way to work with it is is that like you know if you're thinking about context windows like what am I adding to the context window and what the model has to do like if I'm actively steering the agent and we're making and jumping around and making small changes like I I think you know like people people who do that they they tend to prefer Claude and and there's the reason for that is that like Claude will kind of be more nimble though I will say with like 47 like this probably like in my mind like Opus 47 is a lot more like codecs like older Codex models than than it like a previous sonnet model or claw opus model. 
Um and in some ways like GPT 5.5 is a little more opus like with you know a little bit worse frontend scale um in how nimble it is. Um, but I I think like in whatever model you use, the thing that you really have to think about is like, okay, what is the model seeing? And if I'm actively steering it, like how confused is it going to get from what I'm telling it? Um, and if you're going through long context windows and you're compacting, like you're just carrying all this junk with you through a task that that doesn't serve you. So like in my mind the best way to go is to have the model that is the one you're talking to delegate work to other models and that way you know you keep things clean.
Um, and if you're planning well and you're able to split up work and you're able to kind of just do this orchestration of delegating, you're able to kind of tack on work on the main model and it can do things in parallel if it needs to, but it's keeping things on track and um each window that's doing actual work is clean and isolated and it it it's just got what it needs and and it gets much more work done. It's much cleaner.
>> Yeah. Yeah. I I I cuz you know a lot of people they're just picking up like the harness that is available to them. They pay $100 or $50 whatever $200 a month.
They get this package and then this package is like essentially you you don't have much choice in the matter. I think you know they have plugins a lot of them have like skills and all this stuff but really like the building blocks and the the way that you uh try to interact with the model I think has been very simplified and it's like it really favors the this the like economies of scale of the labs where they have to just deal with like very large models very large um user bases um and it dilutes the the potential of of these things cuz uh especially okay when when we start talking about like open models which I this is something I'm super invested in. I'm trying to fi find out like how can I use something that is open source to uh the level of quality that I would get from open AI and really I I think it all boils down to a harness problem but then a people are not much incentivized to go look for like different harnesses cuz then like how okay you have to get this whatever uh key or you need to use like this ooth stuff with the CLI proxy like there's a lot of like um bad user experience at every single level um you know as somebody designing like a product like this, how are you thinking about uh this harness problem?
>> So, I actually don't build a harness.
What I built instead is an MCP server that launches harnesses. So, like if you if you have like codec set up, well, then I will use the app server and I will connect the app to via MCP and launch it in like a specific way so that it's like routed in the right way. So like basically I'm creating a harness out of an MCP server and that way you can use the official harnesses of whatever tool you want. Um and and with the MCP tools it's like a consistent experience for the models and you can make consistent prompting um and and create like that whole experience in that way which which is like a little weird and I think I pushed it a little further than most people have gone. Um, but uh I I really think like this is like a way to kind of circumvent the issue entirely and you still get the like benefit of the labs optimizing their harness for their model without having to like forgo custom tooling that can do good work for you.
I uh I like what so I you know this is not a sponsored like thing or anything.
I just really like Droid cuz I like the way that they've dealt with compaction.
And um I've actually so secret is that like you just talk to the model and like point it at anything on your computer and it like doesn't matter if it's compiled or it it'll it'll figure out what's actually inside of it.
So I tend to look inside all of the tools that I use every day. I don't share this like I don't share the actual results because uh I'm trying to respect people's privacy but like there are a lot of different strategies that people have used like sliding context windows like three points. So, you know, let's say you have like a context window. You have like middle, uh, end and start, and then you, you know, you have summaries for each and like there's a bunch of like little tricks that people do, um, >> that are smart. Like, how do you find this kind of information? Are you kind of riffing off the top of your head like whatever you feel like you need at the time or are there references you're looking at?
>> No. Well, so like there's a lot of tricks you can do to kind of keep a context window clean. And I do some tricks around that for like the Oracle that I have set up where it's basically like I nuke the context window. I want just what I want in the Oracle's context window for a given prompt. Even if there's a long thread, I like don't care about the cache for that so much because the oracle is a specific thing where I want the best possible results. But for agentic tool calling, you really need the cache. You need the model to hit every time. I'm not so worried about like the context window for that. Like I there's some gains to be had and Droid is definitely you doing really great work there. But in my opinion, if you're if you're being smart about delegation, then you don't need to worry about compaction very often. Like I can have like sessions that run for like four or five hours and the model hasn't compacted once because it's delegating all the work to sub agents. So because of that like you don't need to worry about it in the same way. And I think >> yeah, >> I'm sorry for interrupting you because like I had this question earlier. So I I used to try to do this because it's super effective where I would have like a like a main thread and within the main thread I'll either ask it to create like a like a sub agent for everything that it's trying to do. So whether the task is research or apply or whatever change like a button you like I try to force it to always make sub agents and that kind of works for a bit but they always tend to like devolve. So what I'm trying to think of and I think this is a problem that you've you've solved in your own way is how do I keep like my entire workday in a single session? if I could get my entire workday like the the the high level in a single session and then I can fit like that can index all of the work that I'm doing elsewhere.
>> Um >> but the models are just like unreliable as like a thing to you need code around it. I think >> I think no like I mean you it's a couple things like so for one like the one thing that I found initially working on this was that the tools for managing sub agents were generally pretty in most cases and like most harnesses just don't have good tools for this and I was like okay I need to solve this problem and like one of the key things I wanted to do was like okay well when you use like the codeex app for instance like what can you do you can like look at the threads so I can like list them I can make a start a thread I can pick the model I can steer the agent as it's working. I can see what it just was doing. So like, you know, the GP models often now they have little blurbs.
They're like, I'm looking at this file now or whatever. Well, if you take just those messages, you strip away the tool calls, you can get really efficient like summaries of like what the agent's doing. So then once you have that um and the ability to like wait on multiple agents and like be interrupted all this all these little details um the agents that are managing these sub agents they they tend to just use it more reliably because they can monitor what's happening they can steer it when there's extra requirements they don't need to cancel them as much um so like just having good observability into what the the sub agent is doing with efficient context usage like that just changes a lot and I think like that's like the area I would say needs the most attention because that's the thing that's causing the most issues. But even with that said, like you still need like really good workflow prompts and too much steering and too much deviation will cause the model to like you know move away from this workflow. But like I find it works quite reliably with this combination. Um and and the models, but you need like the newest models really to do it well. Like 55 is really good at it. Uh Opus 47 it I find it still struggles a little bit. I I needed to like really tweak my workflow to be really clear for it, but like it's a model that doesn't adhere as well um as as GPTs.
>> Oh, sorry, you're muted there.
>> I'm working like a caveman right now.
So, I have like my most common workflow is I'll have uh a single thread that's open on one side and then I'll open a second pane and then in the second like basically the first pane is giving me tickets. It's just writing me what to do and I copy that, paste it into the second one. Um, and then I will like start a new thread usually every time. I like to try to keep the threads shorter than like 100k like ideally. Um, doesn't always work. I get lazy as well.
Sometimes you just get lazy. You just want to get like you just want to move.
>> Yeah. It never works out well.
>> One of the problems of the current day.
So, we talked a little bit earlier uh you know like different tool call types. you some do like replacements, some do like uh additions and um what are the what is it called like where they swap out just like certain lines.
Um >> you know there were a bunch of issues at the like the the model will just tell you what to do instead of do the thing.
What are the current day problems that like the limits the rough edges?
>> Yeah. Um well I think so it's always context right that's always like the main issue is like what is the model like seeing what information does it need and like how reliable is it at following instructions and also like how reliable is it at like understanding your intent and like you know I I work with these models all day so I'm like pretty deliberate with how I prompt them but like I have users that aren't in that same boat and I run into issues where like I have workflows that are really trying to curate around like how I work, but also trying to be considerate of them and um you have to like be able to like pull out what the like user is saying and kind of interpret it and try and fit it into some way in like structured way. Um I I think like you know the the models are getting really good at this. Um, I think it's like the biggest thing is around like just tool efficiency. Um, and like just giving the right rails to people, like writing the right skills that keep them on the rails with the right tools to kind of accomplish something. Um, cuz I I think like that's something that's missing for most of these tools is like a button that says, "Okay, now you're going to orchestrate, do this work." Um, and and don't have to think about it too hard. Um, cuz cuz most tools are just too kind of bare metal, I think.
>> Mhm. Yeah. everything is like you you kind of have to talk to the model and and get it to do the thing instead of >> um and the interfaces as well like our interfaces are not good. So like the 2E >> right >> it works I guess it's it's super flexible but >> uh I just don't know how people are going to be able to scale that. The desktop apps are really nice. Um, but then they, you know, they have their own issues like performance-wise they're not so they I don't think any of our software was built for like the sheer no like volume of these things, right? Like they're writing megabytes every second if not more.
>> It's so much work. I I spent like a couple months just in my app like really optimizing the long thread experience because even in a native app like I think especially in a native app you have to just be so deliberate with like every view invalidation everything. It's just like a huge problem uh dealing with uh these long threads. Um but yeah, >> what is what is the future look like for you? So, you know, yeah, what are you seeing is the current trend that's like going on and how what is going to trickle down to like people later cuz you're definitely at the frontier here.
>> Yeah. Yeah. I I mean so I think the thing that I'm seeing um is that like you know there's different attempts at making a kind of Jira board or like a canban board like just like you you have like tasks that you're setting up and um I I could see myself talking to like my manager agent that that like is your persistent thread that knows like the full context as much as possible that like probably compacts a lot but it's not the one doing the work. So you talk to this guy um and you're like hey like let's set up some tasks. you you have some tasks and it will be the one or maybe it'll delegate that where like it keeps things on track and then that is like a second layer where we do this delegation that I'm kind of currently doing. Um cuz a lot of what I'm doing now is like managing blockers. But I think the thing to get there that is a bit that is really missing is like you like you really need your your project to be set up in a way that these agents can verify and test well. And so there's like a gap from here to there where like people are like changing how they think about working with agents in a way that like most of the software is kind of testable um kind of without the person there. Um and like a part of that is computer use clicking on buttons but a part of that is also like really deeply embedding like a lot of diagnostics data through your app that the agent can query like headlessly. Um, and it it's it's not it's not simple like like a lot of it is like a whole engineering challenge and you need to think about these problems and the agents are going to do it themselves. Uh, you really need to kind of push them to it. Um, but I think once you get there you can get like really good outcomes. But yeah, I I like a lot of things are going to kind of converge at the same time. Like there's the management layer, there's the like codebase standards and then there's like the models just getting smarter. Um, like all of it kind of at the same time.
>> The Yeah. Go go go go go again on uh Droid. They have this like readiness report feature where you just like hit it and it tells you like here are all the things that you could do to make this more agent friendly. I have found it really useful. I mean I could just like say what the like using NIP to find duplicate like they love just slapping more code onto things instead of modifying what exists. That's a huge problem I think that they still haven't figured out how to fix. Every time I open whenever I go too hard on just like not reading the code and I just let it do its thing, I come back and I see like four layers of the same feature, you know, or like four different abstractions. But you know that that specific problem that's like exactly why I do this work with context builder in Oracle because it like you give the right context to the model, the model sees at a bird's eye view the codebase in the parts of it that it needs. It sees these duplications of features right away when it's planning how to solve it. And when you when you output the plan, it it's it's kind of already addressed this. So you don't even have to deal with that problem. And then I have like another like automated in my pipeline. There's like so you have the planning and then the model will run and then it'll call Oracle on the files that's touched and it'll have that like file re code review done all automatically. And so then by the time the the the the orchestrator gets back with work, it it can just spot check, oh yeah, this adhered to the plan and move on. Um and I didn't have to go ahead and and do that manual review. Like I still do some manual review. Um, but like the amount I have to do now with this pipeline, like it's very it's like a lot less than it used to be. It's kind of crazy.
>> Yeah, I see. Like, okay. In in the Codeex app when you when you when you're using it, it'll start taking screenshots and like actually going through the app.
So, I set up um >> I set up like these like testing uh frameworks, front front end testing frameworks, whatever one the model decided on at the time. I think these days I'm using a lot of agent browser, but they like there's a whole bunch of them and uh it'll just like take screenshots of everything that it's doing and clicking through. Um I I would have these uh rules like basically don't use use effect.
>> Yeah, like that's a big one, right? You can just like set up these uh warnings and errors for it to tell it to to fix itself.
>> Yeah, but it's just so much work, right?
I think a lot of the issue I'm currently experiencing is like AI is great, but a lot of AI is AI building AI. Like it just seems like everybody's having this manic building AI like m like the last 3 months at least. And uh >> there's a lot of optimizing your workflow which is very fun but also takes away from the time that's supposedly saved.
>> Yeah. Yeah. Well, I mean, that's why I try to provide tools for other people to automate that work.
But yeah, if you're not spending your whole day optimizing your workflow, it's very hard to really appreciate what the frontier can offer. And if you're trying to do that, then you're not really getting anything done, just optimizing your workflow. So, you know, I think this is a temporary issue though, but, um
>> I think so, too.
>> Yeah, you know, things have changed so fast. We were just talking about how the state of the art a couple of years ago was copy-pasting, and now we have full feature automation, you know, the agents taking screenshots. That just wasn't the case even 6 months ago, really. So in a year things will be fully different again.
It's kind of nuts.
>> Well, for my personal life, even outside of work, I'm traveling a lot, so it's really hard keeping track of all the meetings when I'm in three different time zones. I'm just jumping around, and I can just have a conversation with the model and it can go into my email and my calendar and parse all of it, and it's gotten really accurate. This was not the case, like you said, at least 6 months ago. I remember I set up MCPs in the Claude desktop app. This was about a year ago, and it would go and do its thing and get the data, but there would be a lot of illogical behavior.
>> And now it's really consistent, even the open-weight models, which again, you know, this is something I go back to. I really want good open-weight stuff so we have open source. It doesn't make sense just for a few companies to own the models. But they've gotten so consistent, and I'm very excited. I think the future is bright and happy and great, you know, I really think so. What are you working on outside of Repo Prompt right now? Side projects, anything like that?
>> Yeah, I mean, I dabble on a few things, but, you know, this is my full-time thing. So I don't have a lot of time, and being a dad, you know, I try to optimize for family time as well.
But yeah, I'm always working on different things, and I don't have much to share at the moment though for that.
>> Yeah. How is it? These might be personal questions. We don't have to go into it. But for me, since I started getting into this a lot, it's been really hard to actually sit with my family for more than 2 hours. I get restless. I want to go and do more stuff, you know. It just feels like the world is full of opportunity for the first time and I just want to go and sit on my computer all the time. How has this change been for you, you know, having a new kid and running a business and all this?
>> I mean, it's tough. You know, you try to balance all of it, but there's just an endless stream of things to think about, to work on, to rework.
When you're in a place where you're building a tool that needs to be completely reinvented every few months, it's very hard to sit down and just sit on what you have. You have to constantly be thinking about what's next. And I think that's exhausting. And I think a lot of people are feeling that too. Even without building tools, they're just, you know, trying to stay up to date with what is happening, and there's a new model every month, and, oh, this new tool just came out, and, oh, you're still using that tool, oh, you're falling behind, you got to use this one. So I'm kind of excited for that phase to slow down a little bit, but it seems like that will be a while.
>> So I think it is. If you look at the release cycles of open-weight models, which I think is a great reference, like how often they're putting things out: it was two months, like Z.ai would publish something every two months, that's gone to three months, and for other companies it's four months. DeepSeek is a year. They've released these incremental changes, but for these major steps it's a year. So I think people are finding what the right balance is, like, this is exciting enough that it's worth adding on another. Cuz I think even with 5.5, we are probably experiencing at least five different checkpoints of the same model, right? We don't have the same model every day. They're changing it. They're doing these incremental checkpoints.
>> No, I don't think so. If 5.5 is the model that you're using, I don't think they're changing the checkpoint.
>> I have to Okay. I don't know.
>> I for sure don't know. But what I do know is that, for example with Google, Google will have multiple checkpoints of the same model version.
So how
>> But they call them different things if they do. They don't say it's the same model.
>> The only one who does that is, like, Grok.
Grok is the only one where they will change the checkpoint under the name.
>> You don't think Cursor does that as well? Cuz
>> Cursor, yeah. Cursor with Composer, they do RL. They're constantly changing the checkpoint. But if you think about Anthropic and OpenAI and, you know, Gemini as well, they're releasing a checkpoint and that's the checkpoint and they serve it and they don't change it.
>> Interesting. Interesting.
I would love to see that. But I think they're slowing down. My sense is that we've settled into some form of meta for a while.
>> I don't really agree. I don't think that's the case, unfortunately. I think, you know, GPT has been releasing models every month since the beginning of the year. I don't see them slowing down.
I get the open-weight stuff is maybe changing a little bit, but if you look at OpenAI and Anthropic, they're both accelerating, I think. And yeah, I don't think that's changing.
>> And have you used much of the open-weight stuff? Is it something that falls into your workflow?
>> Yeah, I mean, I try to make sure my stuff works with it. My experience, though, has been that a lot of it is just not nearly at the same level for tool-calling reliability and instruction following. I try to use the latest models, and there's a small test I do. So in my app, one of the first things the model has to do when it starts a thread is set the name of the chat. And you'd be surprised how many open-weight models just don't do that.
They just forget to set the name of the chat. But all the big top models do it reliably. And so it's, you know, just a small instruction-following test.
Did you see that part of the instruction? Did you adhere to it? Did you call the tool? It's not a tool that you were trained on. It's a simple test, and a lot of them fail. I've seen a lot of open-weight models where I give them my context builder prompt, which is quite large and very complex, and most of the models just completely flub it.
They just can't follow the workflow. They'll just output a bunch of stuff in the text response instead of calling tools like they're instructed to. I think there's still a gap there. Even if the models feel good and they can do certain things reliably, if you give them enough complexity in a workflow, the gap is quite large.
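A check like the chat-naming test described above could look roughly like this. It is a minimal sketch under assumptions: the tool name `set_chat_name` and the event shape (`type`, `name`) are made up for illustration, since the conversation only says the model must call a tool to name the chat.

```python
from typing import Any, Dict, List, Optional

# Hypothetical tool name; not taken from Repo Prompt.
RENAME_TOOL = "set_chat_name"

Turn = List[Dict[str, Any]]


def first_tool_name(turn: Turn) -> Optional[str]:
    """Return the name of the first tool call in a model turn, or None."""
    for event in turn:
        if event.get("type") == "tool_call":
            return event.get("name")
    return None


def passes_rename_check(turn: Turn) -> bool:
    """True only if the first tool the model called is the rename tool."""
    return first_tool_name(turn) == RENAME_TOOL


def pass_rate(turns: List[Turn]) -> float:
    """Fraction of turns that pass; 0.0 for an empty list."""
    if not turns:
        return 0.0
    return sum(passes_rename_check(t) for t in turns) / len(turns)


# One turn that calls the tool, one that only talks about naming the chat.
good_turn = [
    {"type": "tool_call", "name": "set_chat_name", "args": {"name": "Fix login bug"}},
    {"type": "text", "text": "Looking at the login flow now."},
]
bad_turn = [
    {"type": "text", "text": "I'll call this chat 'Fix login bug'."},
]

print(pass_rate([good_turn, bad_turn]))
```

The second turn fails because the model described the action in its text response instead of calling the tool, which is the failure mode mentioned in the conversation.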
>> Is there anything out there that stands out? Is there anyone that you feel is better than the rest?
>> I have a strong affinity for the Moonshot folks with their Kimi models.
I mean, they're also the biggest open-weight models that are around, so it kind of makes sense. The GLM folks are doing a good job as well, though I do find that they're very sensitive to the harness. I think the latest GLM 5 models are getting better. GLM 5 was a big model, but then I feel like their subsequent Turbo and 5.1s were a little smaller, and I could feel the difference there as well. But yeah, I think the Kimi models are standing out for me, and hopefully they're doing great. And also, I mean, I haven't spent much time with it, but the DeepSeek 4 Flash is very promising, and I like that there's a community building around just inferencing that one specific model, because I think it stands out. And I have my own bench, and on the bench Flash is on the Pareto frontier, even besting some of the Gemini Flash models, not the latest one, which is quite large and expensive, but the older ones. So I think there's a lot of promise there from DeepSeek on the smaller end.
>> Yeah, I mean, I've tried every single model out there. I think I've tried at least a hundred of these open-weight models, and DeepSeek, the Flash one, is the best. It's super logical.
>> It has behaviors that are really desirable. So it'll double-check its work two to three times. I'll see that it goes and actually reads the file and runs the test and then checks something else out, and it second-guesses itself in a way that isn't like thinking, because a lot of the models will overthink but they won't actually do anything. This one will go and actually check.
>> And, um
>> It just, yeah, it just tends to be really strong. I don't know. I've been very happy with this cuz it's the first one that is relatively, like, it's unaffordable for most people, but it could fit in someone's house without a mega server.
>> Yeah. Yeah. If you have like 128 gigs of RAM, like you can get a quantized version of it running.
>> Yeah, we're almost there. I think the biggest thing that has been holding these models back, and you can really feel a difference over the last year or two, is that the smaller the models are, the worse they are at large context. And I think DeepSeek has some special sauce around being able to stay coherent with larger contexts despite smaller models. But most of the companies making these models just haven't done that well. And even with the big labs making smaller models, you can really see they struggle a lot more. Opus versus Sonnet on large context is a huge difference. And Sonnet versus Haiku is a huge difference. Even the mini models of GPT, though the GPT mini models are quite strong contenders for what they are. But, you know, you feel the difference, and I think once we get over that baseline of quality at a certain context level, that's when you'll start to see much better outcomes with them.
>> So obviously you have Repo Prompt, and it's doing a lot of the things that you think are the right way to approach things. For people that are trying to adopt these better ways of doing things, what advice would you have? Cuz I just want to say this first: I speak to a lot of people, and the most common thing is, what tools are you using? What skills are you using? You know, people are just trying to min-max this thing. What are the fundamentals?
>> I think the biggest thing is, try and understand the tools you're using. Whatever harness you're using, understand what its tools are and what tools you're connecting to it: what plugins you have, what MCPs you're adding, what CLIs you're exposing. Try and do an audit of everything you're exposing to the model, and try to reduce that list to the bare minimum if you can. And then also try and get good at writing skills that properly utilize your different tools and prime the model to use them in a way that makes sense, in a repeatable format. If you can get those things together, you're going to have a much better time. And try not to use skills that are just off the shelf if you can. I mean, I make skills for people and I think people should use them. But try and make a skill, and then if you make a skill, don't be happy with the first draft. Iterate on it. Try and watch the traces, watch how the model's working with it.
Where does it go wrong? What can you do differently next time? And then improve your skill, make some changes, and reduce complexity. Oh, and there's one big thing that you have to always think about: where is there conflicting information in your prompt, or things that counter something else? So, if you have a system prompt that says do X and then you have a skill that says do Y, but Y is actually in contradiction to X, you have to be mindful of that. Or if you have 10 skills loaded and they kind of do different things in the same way, that's a problem, too. So try and reduce how many skills you have loaded.
Keep the bare minimum installed. Just do an audit. Clean your house. That's what I'd say.
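As a rough illustration of the "clean your house" audit, the sketch below flags tools exposed by more than one source and skills whose descriptions overlap heavily. Everything here is an assumption for illustration: the function names, the inventory format, and the word-overlap (Jaccard) threshold are hypothetical, and word overlap is only a crude proxy. Real contradictions between a system prompt and a skill still need a human or a model to read them.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def duplicate_tools(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each tool name exposed by more than one source to those sources."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for source, tools in sources.items():
        for tool in tools:
            owners[tool].append(source)
    return {tool: srcs for tool, srcs in owners.items() if len(srcs) > 1}


def _words(text: str) -> Set[str]:
    return set(text.lower().split())


def overlapping_skills(
    skills: Dict[str, str], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Return skill pairs whose descriptions share many words (Jaccard score)."""
    pairs = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(skills.items(), 2):
        words_a, words_b = _words(desc_a), _words(desc_b)
        union = words_a | words_b
        score = len(words_a & words_b) / len(union) if union else 0.0
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 2)))
    return pairs


# Hypothetical inventory of what a harness exposes to the model.
sources = {
    "harness": ["read_file", "search"],
    "mcp:files": ["read_file", "write_file"],
}
skills = {
    "review": "review code changes for bugs",
    "code-review": "review code changes for style",
    "deploy": "ship the build to production",
}

print(duplicate_tools(sources))
print(overlapping_skills(skills))
```

Anything this flags is a candidate for removal or merging, in line with the advice to keep the bare minimum installed.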
>> Clean your house is a good term.
I really like the idea of models as some form of state machine, where you have a predictable input and output and you string a bunch of these together to actually make a good thing.
>> But it's super hard to build that. I know that.
>> Of course.
>> Maybe conceptually, what are some challenges of actually building this AI?
>> Yeah, I mean, it comes with time. You don't want to do it all at once. You want to do it kind of in pieces. So I've been building this Rube Goldberg machine for a while, and it starts with certain pieces, and I understand how that piece fits in, and I really want to understand that primitive and make sure that it works reliably. Once I'm happy with it, then I can build on the next piece and the next piece. So don't do it all at once, and like I said, just be willing to scrap as well. Be willing to start over. Quiz the models as you're using them. Use them as user tests: what are you finding unintuitive with what I told you to do? What can we do better next time? And bring that feedback in, improve, and go. Cuz the thing that's interesting with models is that everyone's using the same model at the same time, right? So if you use Opus and Opus finds this confusing, well, chances are every person using Opus with this tool is going to find that confusing as well. So trusting its word that it's confusing is helpful. But you have to ask it the right way, and maybe ask a different session, ask a different model too, and incorporate different models' ideas into the feedback as well.
Yeah.
>> I wonder when we're going to have this, cuz it's definitely a skill that exists, but it's not formal as of now, as a way of communicating with these things efficiently. Cuz they just tend to, like,
>> if you ask them, a lot of the time they want to make you happy or serve you or be an assistant. So you cannot just take any response at face value. Also, they're a computer, and there's nothing in there.
But yeah, I mean, I don't know. I hope we can get a better view of how to communicate with these things efficiently. Is there anything that you're excited about that you want to share before we wrap up for today?
>> Yeah, I mean, I'm thinking about some open source projects. So just stay tuned for what I do in open source. I think there's a lot to be done in open source, a lot of ideas to share. I think it's an exciting time to be trying new things and experimenting, and just leveraging, you know, community and building things as a collective instead of just by yourself. So yeah, a lot of wisdom in a lot of people using the same thing.
>> I'm going to add all of the links to Repo Prompt, to the podcast, to your Twitter in the description. It was really nice talking to you. Yeah, I'm traveling, so it's been really hard to get any time. I'm very grateful for you taking the time to do this. It's, uh
>> Well, thanks for having me on.
>> Yeah, I'm actually going to set up Repo Prompt again today. I tried it six months ago, give or take.
>> It's a fully different experience. Yeah.
>> Yeah. I I can imagine. I can imagine.
>> Yeah. Yeah. Yeah.
>> Thank you for your time, man.
>> Yeah. Cheers.