
From 46% to 90%: Fine-Tuning Tiny LLMs for On-Device Agents — Cormac Brick, Google
Added: 2026-05-21

4,644 views · 197 · 21:00 · aiDotEngineer · Original Release: 2026-05-20

This talk demonstrates that data-centric fine-tuning is the "silver bullet" for making sub-1B models actually useful for real-world edge tasks. It’s a compelling case for prioritizing specialized efficiency over the brute-force scaling of massive LLMs.

[00:00:15]Yeah, so while we wait for it to come up, cuz I know we're short of time, I'm going to talk about agents on device. So I know whoever asked the question about skills and AI Core, we have an answer to that. We've built a simple skill harness on top of AI Core that you can run skills on. We'll be able to show that. Also going to talk about tiny LLMs, which is what we would call LLMs that are smaller than a billion parameters. They're small enough to build into your app if you want to have more customization or you want to do something that isn't already available for you in AI Core. So that's the gist. So, quick overview of AI Edge, how we think about small language models, tiny LLMs, and system gen AI.

[00:00:57]Then we're going to take a quick look at agent skills, which is something we can build on top of system gen AI or the new models that are coming down the pipe. And then we're going to take a quick look at tiny models.

[00:01:10]So that's that. Okay, so uh oh Yeah.

[00:01:15]Cool.

[00:01:15]Yeah, I'll Okay.

[00:01:18]Yeah, feel free.

[00:01:20]Okay, so AI Edge, SLMs and TLMs. So I think Ollie already covered this. We know it's great to do things on device: latency, privacy, offline use, reliability, or savings, depending on the thing. These are all motivations to do things locally.

[00:01:37]Me, by way of intro, I didn't really do this. I'm a software engineering tech lead working on the Google AI Edge stack. So we have MediaPipe, which is an asset some people may be familiar with. We have LiteRT-LM, which is an LM harness that you can integrate with your app, where you download the model and ship the model with your app. And then we also have LiteRT as a runtime that supports both LiteRT-LM and MediaPipe. It's formerly known as TensorFlow Lite, which is a kind of cross-framework runtime for running models. And all of that can run on CPU, GPU, or NPU, depending on the platform and depending what's best. And you as a developer get to choose.

[00:02:18]Uh, yeah, it's already trusted at scale.

[00:02:21]Like the LiteRT runtime, there's a version of that built into Android OS.

[00:02:25]Uh, lots of Android apps already use it.

[00:02:27]So, so it does uh, support over 2.7 billion devices, like lots and lots of daily invocations, and lots and lots of Android apps uh, leverage this.

[00:02:37]Uh, but also works far beyond Android as well. So, we support all of these platforms. Um, and uh, for example, Gemma is available on many of these platforms. Our team like uh, is giving another talk tomorrow, so you can hear more about Gemma performance on all of these types of platforms and are able to do uh, really useful things with the latest Gemma 4 models.

[00:02:58]But then, building on Ollie's and Florina's talk, the key idea is that we have system-level gen AI, which is something that will be pre-installed in the system. So, there's Gemini Nano via AI Core. This is an example of the summarization API. Apple also has something going on with their intelligence on iOS that I probably know a lot less about. But as a concept, right? As an app developer, when you go to build a mobile app, this is one choice: there will often be some form of intelligence built into the system that you can leverage, which is, you know, highly optimized, as Ollie and Florina covered, and that's available for use with your app.

[00:03:40]Um, then, so this is kind of typically like small language models, like for for Nano, it is the Gemma 4 E2B and E4B are the base models for for what we ship there.

[00:03:52]That's really capable, highly optimized, preloaded with device. If you can use those, it's great. Your app doesn't get any bigger. And if it meets your use case needs, it's a great place to start.

[00:04:01]If you have a more specific task that you want to do that's highly customized or something really boutique, you can use app gen AI. So, that's with the LiteRT-LM runtime.

[00:04:15]That can be loaded with your app or even your web page, right? And this offers a higher degree of customization and reach. Like, definitely more work. But yeah, you get access to smaller models that can run on lots of devices, and full customization.

[00:04:32]Though it's clearly a lot more work, but it's the other option that's available.

[00:04:36]Okay.

[00:04:37]So, uh rest of the talk uh 15 minutes going to cover two key ideas. One is um hey, how do you do skills on device cuz this is something new that we can do with Gemma 4 that came out last week. We've a few examples of that. This is one key idea. The other idea I want to cover is hey, for tiny models, what can you actually do with those types of models today because we've actually made a lot of progress in this in the last 6 to 12 months. So, I kind of just want to share what's the state of the art with tiny LLMs. And if you want to use one in your app, how would you go about that?

[00:05:06]Okay.

[00:05:08]So, this is wow, there's a lot on the screen. This is um uh an app that our team have developed that works on both iOS and Android for running LLMs locally.

[00:05:19]Um and here we show both really tiny LLMs so you can see what they can do.

[00:05:24]But also, cuz Gemma 4 just came out, we're also using this to showcase how Gemma 4 can work on Android and iOS as well. And this actually builds on AI Core. When AI Core is available on the device, it will use AI Core to provide the Gemma model for the app.

[00:05:41]So, skills is the thing I want to go into deeply today. But there's a bunch of other things in the app, like you can do AI Chat, you can do Ask Image, you can do Audio Scribe, and there's lots of example models, and the app also supports 3P models, like Qwen or Phi or these types of models, if you just want to load a model and get a feel for how it performs on device.

[00:06:02]And this app is also open source on Android, and it's built using LiteRT-LM. So, it's both a neat way for you to try things out, but also, if you're keen, you can dive into the code and see, "Hey, how does it all hang together?" And it works as an example for LiteRT-LM.

[00:06:17]All right, but we're going to dive into skills, cuz this is kind of a topic du jour.

[00:06:24]Okay, I'm not going to play this video cuz I don't have enough time. But yeah, this app is available Android, iOS, code available on GitHub as well.

[00:06:34]Um, let me just take a picture.

[00:06:39]Okay.

[00:06:39]Um, Okay, and the app is called Google AI Edge Gallery.

[00:06:44]So, this is the video you will watch cuz it's shorter and meets my time budget.

[00:06:49]And we don't Could we get sounds?

[00:06:53]Sorry, I'll go Please reply to me in English.

[00:07:00]So, this uses a restaurant related skill.

[00:07:03]And we'll see how that's built in a moment.

[00:07:06]Let's have a look at the restaurants, select one.

[00:07:08]Winner, right? So, that's an example of something neat that you can build with a simple agent harness on top of Gemma 4. It's pretty easy to do with a few lines of code or a few lines of the right vibe coding prompt, as you'll see in a minute.

[00:07:25]Okay.

[00:07:28]Okay, I don't know how that Ah, yeah.

[00:07:31]Okay.

[00:07:33]Uh, okay, I kind of got lost a little bit.

[00:07:37]Okay.

[00:07:38]All right, sorry. Back to where we were supposed to be. So, what's actually happening under the hood? So, like I was saying, this is built just using a prompt, right? We have our own system prompts in our app. Then we also put the skill descriptions into the prompt, so the model is aware of the types of skills that it can use, but it doesn't have to see all of the functions and details of the skill. That's only loaded on demand.

[00:08:06]We actually have a load skill tool call built into the model that then selectively loads it. So, if you say, "Hey, can you show us the location of the Google office?" it will then know, "Wow, I should use the maps skill." It then loads the skill for map navigation.

[00:08:22]Um the tool responds, and then it uses the show JS tool um to show you the location um in the app as well. So, one of the things that's neat about being in an app is you can put simple JavaScript into the skill that we then call as part of the skill. Uh so, this is how like I don't have the corresponding demo for this, but this would kind of pop up a nice um uh kind of like JavaScript UI of kind of Google Maps to kind of just show you uh in the app right there. Uh similar to the restaurant roulette, that was a custom JavaScript uh to do the rendering to do the uh roulette wheel piece.
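The on-demand loading described above can be sketched roughly as follows. This is a minimal illustration, not the Gallery app's actual harness: the `Skill` class, the `load_skill` and `show_js` names, and the two example skills are all assumptions made for the sketch. The point it shows is that only short descriptions go into the system prompt, while full skill instructions are returned by a tool call when the model asks for them.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str   # short text, always present in the system prompt
    instructions: str  # full details, returned only when the skill is loaded


# Hypothetical skills, loosely modeled on the demos mentioned in the talk.
SKILLS = {
    "maps": Skill(
        "maps",
        "Show a location on a map.",
        "Call show_js with the place name to render a map view.",
    ),
    "restaurant_roulette": Skill(
        "restaurant_roulette",
        "Pick a random restaurant.",
        "Call show_js to render a roulette wheel.",
    ),
}


def build_system_prompt(base_prompt: str, skills: dict[str, Skill]) -> str:
    """Put only skill names and descriptions into the prompt."""
    lines = [base_prompt, "Available skills (call load_skill(name) to use one):"]
    lines += [f"- {s.name}: {s.description}" for s in skills.values()]
    return "\n".join(lines)


def load_skill(name: str, skills: dict[str, Skill]) -> str:
    """Tool handler: return a skill's full instructions on demand."""
    if name not in skills:
        return f"Unknown skill: {name}"
    return skills[name].instructions


prompt = build_system_prompt("You are a helpful on-device assistant.", SKILLS)
print(prompt)
print(load_skill("maps", SKILLS))
```

Keeping the full instructions out of the prompt keeps the context short, which matters more on small models than on large hosted ones.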

[00:08:59]Okay, so you can create your own skill as well. The app supports this. I'm sorry. Oh, yeah. Instructions on GitHub. I don't know if I should pass this page too fast.

[00:09:09]But also, to just create your own skill, there's full instructions there if you want to handwrite it out.

[00:09:14]Uh this works really well, though. Uh so, you can use skills to write skills.

[00:09:18]So, we have um Gemini CLI or code coach like our team have done like about 80 skills and had a lot of fun with this.

[00:09:24]Um so, this is an example of a prompt which works really reliably.

[00:09:30]In Gemini CLI, we actually have an ADB skill as well that our team uses a lot. So, you can even debug and test by saying, "Hey, you have access to a device via ADB," and you can also ask it to test that. So, this type of thing actually works really, really well and it's fun. You can then create a skill, and then in the app there's a dot dot dot button and you can go to load your own skill from a URL if you publish your custom skill to your own GitHub. And it's really easy to do from within the app.

[00:10:00]Uh you can then also uh let us know in our discussion on GitHub um that you've created a skill and then other people can check out your skill and kind of use that as well. So, these are some things This has only been out like since last Thursday, but these are some example skills that the community have built.

[00:10:15]So, feel free to do it and tag it up here.

[00:10:18]Okay, so that's skills. So, 10 The last 10 minutes we are going to spend on TLMs um or probably more ideally maybe five or six of the minutes, so there's time for questions.

[00:10:28]Okay, so LiteRT-LM, this is the runtime that we use for running models. It runs models in the LiteRT-LM format, which is a single file that packages everything we need to know about the model in order to be able to run it. It's open source, it's fast, and it works on multiple platforms.

[00:10:47]And there is a Swift API and a JavaScript API coming soon. At the moment, if you go to the GitHub, you can see the C++ and Java versions. And when we publish the Swift version, we'll also open source the iOS app at that point in time. So, if you go to Gallery at the moment, you can only see the code for Android, but hopefully in the next few weeks we can get the Swift work finished, have a really good API, and then we'll be able to open source that as well.

[00:11:18]So, yep, and it supports Gemma 4 as well on all of these devices. It also supports loads of other models, but understandably Gemma 4 is our favorite.

[00:11:27]So, then to deploy a tiny model, what do you do? So, typically starting from Transformers, you then have a package called LiteRT Torch that can help you export the model, and then LiteRT-LM, there's actually a reference version of that that you can use on your desktop as well if you want to try out a model.

[00:11:45]You can either try it out on a desktop or you can load it into the Gallery and see it perform there. And then you can deploy with LiteRT-LM.

[00:11:53]It's worth noting, for smaller models, you would either pick a fixed-function model like a visual language model or a transcription model or something like this. So, there are some pre-built models available on our Transformers page that you can use. But something we also see that's really common is people fine-tuning models, because certainly once you get down to like 200 or 100 million parameters, for that model to work, it needs to have a very narrow and focused task, and we've had a lot of success deploying those models internally and in a different app that you're going to see in a minute.

[00:12:29]By doing kind of fine-tuning using synthetic data.

[00:12:32]So, this is what the export and inference flow looks like. On the left-hand side it's showing us exporting a Qwen 0.6B model and then running that on desktop with LiteRT-LM run, so you can just see how that behaves using a GPU, for example. The right-hand side is showing a different example, which is Apple's FastVLM.

[00:12:54]This is a visual language model that's only 500 million parameters, and this is optimized and running on the Qualcomm NPU that's also available through our stack's NPU optimization. So, this is an example of that happening end-to-end. And this is running really quick cuz it's using hardware acceleration. And that particular model is 500 million parameters, by way of example.

[00:13:19]Another example that we've spent a bunch of time with the DeepMind team on was publishing FunctionGemma, something we published last December. This was based on Gemma 3 technology.

[00:13:31]This is only 270 million parameters, but it's robust function calling when fine-tuned, typo there. And it's small and it's really fast even on legacy devices. So, if you go all the way back to a Pixel 7, it can still process almost 2,000 tokens per second prefill and 140 decode. So, it's really useful for lots of simple use cases, like you can do text to function calling or voice to function calling using this size model.
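To make those throughput figures concrete, here is a back-of-the-envelope latency estimate. It is a rough sketch only: the 2,000 and 140 tokens-per-second defaults are the approximate Pixel 7 numbers quoted above, while the prompt and output lengths are made-up values for illustration, and real latency also includes model load and other overheads not modeled here.

```python
def estimate_latency_s(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tps: float = 2000.0,  # approximate prefill rate quoted in the talk
    decode_tps: float = 140.0,    # approximate decode rate quoted in the talk
) -> float:
    """Estimate end-to-end time as prefill time plus decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical request: a 200-token prompt and a 28-token function call.
latency = estimate_latency_s(prompt_tokens=200, output_tokens=28)
print(f"{latency:.2f} s")
```

Under these assumed lengths the estimate is 0.1 s of prefill plus 0.2 s of decode, which is why short function-call outputs suit a model of this size.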

[00:14:00]And there is a whole YouTube video on this called FunctionGemma, if you want to find out lots more details about how to do it. We also, uh, Frank, have a FunctionGemma fine-tuning lab. So, if you search FunctionGemma fine-tuning lab, I don't have it here, this is available as a Hugging Face space. So, you can define functions, upload your own data, and see how they fine-tune FunctionGemma. And this is recommended for really robust function calling. So, we have an example in the app called app intents, where it'll do the thing you saw previously, like add calendar or add email.

[00:14:44]So, when we took FunctionGemma out of the box, our success rate in that was, I think, 46% or something like that.

[00:14:52]Then we put it through this fine-tuning flow where we're like, "Hey, we have these app functions." And instead of providing that via a system prompt, which is what you would do if you're using a larger model or if you're on a device with AI Core, for example, you instead need to synthetically create a dataset, right?

[00:15:10]That's typically the workflow. We use Flash to synthetically create a dataset, upload it to this type of tool, or we obviously have our own internal tools, but that then got that 46% to over 90% for eight of the 10 functions we were trying, and two of the functions were a bit lower in their accuracies.
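As a rough picture of what a synthetic function-calling dataset might look like, here is a small sketch. Note the substitution: the talk describes generating data with a larger model, whereas this example uses simple string templates as a stand-in so that it runs on its own. The function names, argument names, and the `input`/`output` record layout are all hypothetical and are not the format any particular fine-tuning tool expects.

```python
import json
import random

# Hypothetical app functions; the talk mentions calendar and email intents.
TEMPLATES = {
    "create_calendar_event": [
        "Schedule {title} on {day}",
        "Add {title} to my calendar for {day}",
    ],
    "send_email": [
        "Email {to} about {title}",
        "Send a message to {to} regarding {title}",
    ],
}
TITLES = ["team sync", "dentist appointment", "project review"]
DAYS = ["Monday", "Friday"]
RECIPIENTS = ["alex@example.com", "sam@example.com"]


def make_example(rng: random.Random) -> dict:
    """Build one (user text, expected function call) training pair."""
    name = rng.choice(sorted(TEMPLATES))
    args = {"title": rng.choice(TITLES)}
    if name == "create_calendar_event":
        args["day"] = rng.choice(DAYS)
    else:
        args["to"] = rng.choice(RECIPIENTS)
    text = rng.choice(TEMPLATES[name]).format(**args)
    return {"input": text, "output": {"name": name, "arguments": args}}


def make_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate n examples deterministically from a seed."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]


dataset = make_dataset(100)
jsonl = "\n".join(json.dumps(row) for row in dataset)
print(jsonl.splitlines()[0])
```

A model-generated dataset would have far more varied phrasing than these templates, which is the reason for using a larger model in the first place.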

[00:15:28]Um, so you can get really robust and reliable function calling using this fine tuning workflow. Yes, it's it's a bit more work than just prompting a larger model, um, but it does allow you to kind of ship something robust in your app at scale.
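The per-function numbers mentioned above suggest measuring accuracy separately for each function rather than as one overall score. A minimal sketch of that bookkeeping follows. The evaluation examples and `toy_predict` are hypothetical stand-ins (a real run would call the fine-tuned model), and exact match on name and arguments is an assumed scoring rule, not necessarily the one the team used.

```python
from collections import defaultdict
from typing import Callable


def per_function_accuracy(
    examples: list[dict], predict: Callable[[str], dict]
) -> dict[str, float]:
    """Exact-match accuracy on the predicted call, grouped by expected function."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for ex in examples:
        expected = ex["output"]
        total[expected["name"]] += 1
        if predict(ex["input"]) == expected:
            correct[expected["name"]] += 1
    return {name: correct[name] / total[name] for name in total}


# Hypothetical evaluation set.
EVAL_SET = [
    {"input": "Schedule lunch on Friday",
     "output": {"name": "create_calendar_event",
                "arguments": {"title": "lunch", "day": "Friday"}}},
    {"input": "Add standup to my calendar for Monday",
     "output": {"name": "create_calendar_event",
                "arguments": {"title": "standup", "day": "Monday"}}},
    {"input": "Email sam@example.com about lunch",
     "output": {"name": "send_email",
                "arguments": {"to": "sam@example.com", "title": "lunch"}}},
]


def toy_predict(text: str) -> dict:
    """Stand-in for the model: only understands 'Schedule X on Y'."""
    if text.startswith("Schedule ") and " on " in text:
        title, day = text[len("Schedule "):].rsplit(" on ", 1)
        return {"name": "create_calendar_event",
                "arguments": {"title": title, "day": day}}
    return {"name": "send_email", "arguments": {}}


accuracy = per_function_accuracy(EVAL_SET, toy_predict)
print(accuracy)
```

Breaking the score out this way is what makes it possible to notice that a couple of functions lag behind the rest and need more training data.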

[00:15:42]Oh, sorry, going the wrong direction.

[00:15:45]Oh, yeah. So, then pre-built uh, tiny models are here.

[00:15:50]Um, yeah, okay. I'm going to pull stop for questions. Okay. So, we have and I I don't want to go into too much detail.

[00:15:58]We have another app I'll I'll just speed run this for 1 minute. We also have another app called Eloquent, which is a transcription service. Um, but what's more interesting than the app was just an example of like how we built that.

[00:16:11]So, it also supports things like personalization. So, it does transcription with with your own favorite keywords. So, if you use a lot of like tech jargon or a lot of people's names, transcription services don't always get that correct. Um, sadly this is only available on iOS and not available in Europe yet, right? So, this will be increasingly available uh, soon.

[00:16:29]But the more interesting thing for the purpose of this conversation is, under the hood, this is something we've built using tiny LLMs ourselves. So, this uses an ASR engine that we have built based on Gemma 3 technology, and then it also has something we call a text polishing engine that we've also built with Gemma 3 technology. And each of these models is only a few hundred million parameters, but chained together they can create a really compelling offline transcription service that is able to leverage a personal dictionary, right?

[00:17:00]And also the polishing removes ums and ahs and that sort of stuff, right? Which is also a common gripe with offline transcription apps. But yeah, so not really available widely yet. It'll be available in but for the purpose of this conversation, it's more just like a proof of life example. So, like, this does work in production. Once you put in the effort to fine-tune a model, you can create pretty compelling things.
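The two-stage chain described above (speech recognition, then text polishing with a personal dictionary) can be sketched structurally like this. Everything here is a stand-in: in the app both stages are fine-tuned models, whereas this sketch uses a canned transcript for the ASR stage and regular expressions for the polishing stage, purely to show how the stages compose. The function names and the dictionary entry are assumptions.

```python
import re

# Matches standalone filler words such as "um", "uh", "ah".
FILLERS = re.compile(r"\b(?:um+|uh+|ah+)\b[,.]?\s*", flags=re.IGNORECASE)


def transcribe(audio: bytes) -> str:
    """Stand-in for the ASR model: returns a canned raw transcript."""
    return "um so we ship the model with light r t uh on device"


def polish(raw: str, personal_dictionary: dict[str, str]) -> str:
    """Stand-in for the polishing model: drop fillers, apply preferred terms."""
    text = FILLERS.sub("", raw)
    for heard, preferred in personal_dictionary.items():
        text = re.sub(re.escape(heard), preferred, text, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text).strip()
    return text[:1].upper() + text[1:]


def transcribe_and_polish(audio: bytes, personal_dictionary: dict[str, str]) -> str:
    """Chain the two stages: raw transcript in, cleaned text out."""
    return polish(transcribe(audio), personal_dictionary)


result = transcribe_and_polish(b"", {"light r t": "LiteRT"})
print(result)
```

Splitting the work into two narrow stages matches the earlier point that models of a few hundred million parameters need a tightly focused task each.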

[00:17:26]Okay, so that's not available in iOS in Europe so that's out of the question.

[00:17:29]So, yeah. Takeaways, system gen AI and app gen AI, that's the kind of the overall yeah, overall kind of message and wrap up. Happy to take questions. I have 3 whole minutes.

[00:17:39]Uh I think person there was first.

[00:17:52]Um yeah, we are still putting models on the clock there.

[00:17:55]So, we've literally been playing with the model for about 2 to 3 weeks now. It's been in public for about 1 week. So, certainly for the 4 billion parameter model, by default we enable about eight skills and it's able to choose between the eight skills reasonably well, right? Within a conversation you're able to say, "Hey, find me a fact using a Wikipedia skill. Then, oh wow, show me where that is on Google Maps." So, if you have a conversation that uses skill, skill, skill, that works really robustly. The thing we're still working on that's harder is, through a single interaction with the app, for the app to know to call multiple skills as part of a single answer. That works sometimes, right? And that's something we're still figuring out, how to make that more robust, right? But yeah, our engine harness thing is really simple. So, I'm sure we'll figure that out, but we're still discovering the limits of how far we can push the model.

[00:18:58]Uh yeah.

[00:19:10]Yeah.

[00:19:12]Yeah.

[00:19:14]Yeah.

[00:19:26]Yeah. The LiteRT-LM file format, just for stock LLMs, is effectively a replacement for the .task file. That's a transition we made last year. .task files are still useful for things like, a task file creates more things than just an LLM model, right? So there is like a face mesh task, and obviously that has a lot of other code as well.

[00:19:48]But for LLMs we want something dedicated, simpler that people can use with open developer tooling.

[00:19:55]Yeah. So yeah, so it bundles things like the tokenizer but it's just the LLM model. Okay.

[00:20:01]Yeah.

[00:20:04]On TPU?

[00:20:06]CPU? Yeah, so there is another talk tomorrow from some of my colleagues, including Wei, who's here in the second row, and that has lots of performance data on Gemma and various models. You can also check out our model card in the meantime if you search Gemma LiteRT-LM model card. We update that whenever we have new performance numbers on new platforms. So there's a lot of comprehensive data there as well.

[00:20:37]All right, I'm at zero seconds and it's flashing at me.

[00:20:40]Yeah.

[00:20:41]Yeah, I said 2:30. Sorry, Chinchin's here. Sorry Chinchin, didn't see you.
