Google’s push for on-device AI elegantly solves the privacy puzzle but creates a new class divide based on hardware specs. The necessity of cloud fallbacks proves that local processing is currently a luxury feature rather than a foundational shift.
Deep Dive
AI on Android: Ask me Anything — Florina Muntenescu & Oli Gaymond, Google DeepMind
[music] >> Hi everyone.
I'm Florina Muntenescu. I'm a developer relations engineer working on lots of things AI: both intelligent experiences, which is what we're going to talk about today, and also developer productivity. So if you want to talk about that, find me at the booth later on.
>> I'm Oli, PM for Android AI.
Our work spans a bit of everything. I've been working on this since before we called it AI, back when we were calling it just ML.
So we do a bit of everything, from helping build features, things that run directly in the OS with applied AI, like background tasks that optimize the screen brightness and system memory, through to developer tools and infrastructure that make it easier for people building features on Android, and right down to the lowest levels of hardware acceleration and OS optimization with many of our silicon partners. So we're happy to talk about any of that as well.
>> We want to make this a conversation where you're asking the questions and we're trying to answer as many of them as possible.
But I'm not sure how familiar you all are with what's available on Android for building intelligent experiences.
Would you raise your hand if you already know how to build intelligent experiences on Android with on-device, hybrid, or cloud?
Okay, we're going to do a short TL;DR, and then hopefully from there you might also get some ideas of what kind of questions you want to ask.
Okay, so if you want to build intelligent experiences on Android, you can use on-device models. You can use hybrid, so use on-device when it's available but otherwise do the inference in the cloud. Or you can just do cloud inference fully.
So when it comes to building on device, the prompts are processed directly on the device, with no data being sent to the server.
This is great for multiple reasons. It means that you're able to take advantage of local processing.
So sensitive data, like, I don't know, banking information, doesn't have to leave the device. You have the ability to work offline, and of course it means that there's no additional cost for the inference.
So use cases that involve sensitive data or personalization, or anything that requires maybe a shorter context window, like translation and so on, are things that you can do on device.
So there are two main ways you can do this on Android. You can use the ML Kit GenAI APIs, where you get access to Gemini Nano, which is our on-device model. Or, if you need something that's much more customizable, with your own custom models, then you can use LiteRT-LM. There's another talk after this about LiteRT-LM, so we're just going to talk about the ML Kit GenAI APIs.
So we said that these are the ones that give you access to Gemini Nano. This is our most efficient model for on-device tasks, and it's using the same architecture as Gemma 4, which you have probably heard about and which was launched, I think, last week. But Gemini Nano is optimized for Android devices.
So the way Gemini Nano comes onto your device is through the AI Core system service. This means that you only have one model on the device, and all of the apps are using that same model through AI Core.
Because it's Gemini Nano, it means that we're optimizing for the hardware, so you get lower latency and faster execution for AI tasks.
>> Yeah, so you can think about this as, you know, imagine you're using a cloud service, right? Everything is kind of provided for you. You don't have to worry about setting up the LLMs, running them on devices, getting your TPU inference, etc. You just focus on your feature and your prompt, and then the service provides everything. We're doing the same thing for on-device, right? So we get the models to the device, we make sure they're optimized, we take advantage of the specific hardware that's available on each device at runtime, and we sort of package that all up for you so you don't have to worry about any of that stuff.
And it also has privacy and safety considerations. So this means that your requests are not going to be, I don't know, mixed up with all of the other apps' requests. They run in isolation, and the input and the output data is not stored on device at all. So it's all as private and secure as possible.
Okay, so how do you access Gemini Nano? You do this using the ML Kit GenAI APIs. The GenAI APIs are actually part of the bigger ML Kit APIs, where you also get access to APIs and models for things like vision and natural language. For GenAI specifically, we have a bunch of APIs specialized for specific tasks like summarization, proofreading, rewriting and so on. But the most powerful of them is actually the Prompt API.
This means that you're able to send natural language requests to Gemini Nano.
For now it supports text and image as input and text only as output.
So you can easily use the Prompt API for things like image understanding, content assistance, content analysis, and entity extraction. I would pretty much say that whatever use case you have, the Prompt API is going to be able to help you.
Okay.
The Gemini Nano models are only available on that generation of devices like Pixel 9 and Pixel 10, and not just on Pixel devices but also on devices from other OEMs.
So what do you do if you want to be able to use AI on other devices? Well, you can do this by using the cloud when the local model is not available on that device. That increases the reach of your feature.
To do this you would use Firebase AI Logic. We launched hybrid inference a couple of weeks ago.
So this means that you're able to decide: if Gemini Nano is available on device, then you can run that inference on device; otherwise it can run in the cloud. And especially now, I think, with Gemma 4, when the next generation of Gemini Nano will be available, and I think we can already use it through the AI Core preview,
it means that you're able to have a similar experience using on-device with Gemini Nano 4 and also in the cloud with the Gemini Flash models.
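The hybrid routing described here can be pictured as a small fallback function. What follows is a minimal sketch in Python, not the Firebase AI Logic API: `run_hybrid`, `InferenceResult`, and the three callables are hypothetical names introduced only for illustration, and the real SDK may make this decision differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class InferenceResult:
    text: str
    backend: str  # "on_device" or "cloud"


def run_hybrid(
    prompt: str,
    on_device_available: Callable[[], bool],  # hypothetical availability check
    run_on_device: Callable[[str], str],      # hypothetical local inference call
    run_in_cloud: Callable[[str], str],       # hypothetical cloud inference call
) -> InferenceResult:
    """Prefer the local model when the device has one, otherwise use the cloud."""
    if on_device_available():
        return InferenceResult(run_on_device(prompt), "on_device")
    return InferenceResult(run_in_cloud(prompt), "cloud")
```

The point of the sketch is that the calling code stays the same on every device; only the backend that ends up serving the request changes.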
But if you want access to even more powerful models, like Pro, Flash, and Flash-Lite, and you want to run all of this in the cloud, you can also do this using Firebase AI Logic. This gives you access to both the Vertex AI Gemini API and the Gemini Developer API as providers.
>> Wow.
Okay. That was a whirlwind.
[clears throat] I'm sorry.
I think this is amazing.
We just wanted to kind of set the scene, because with Android we're really trying to make sure there is a comprehensive offering, right? So everything that you need, whether it's running things locally on device for low-latency responses and private inference, whether it's going to the cloud for the most powerful models, or whether it's something in between, we're building out a solution for you that covers each of those points, and we're trying to make those APIs as consistent as possible so that it's easy for you to blend what is needed. So we wanted to give you a high-level view of everything, but we're also happy to go super deep on any of these things. We're happy to talk about use cases, what we're seeing in the industry, what we think is interesting, and where we're going, but we really just wanted to set the scene a little bit with some of the different things we've got available today.
Florina, anything?
>> Yeah. What Oli said. So I think actually now we're handing it over to all of you.
>> Okay, my main question is: have you checked the amount of RAM usage or the battery consumption that you might have running Gemini Nano models or LiteRT-LM?
>> Yeah, absolutely. So as Florina mentioned, we've shrunk these models down as much as we could whilst trying to maintain all that capability. But that does mean they need flagship capabilities right now.
As for battery concerns, that's one of the reasons why we've produced AI Core, which is basically to optimize everything as much as we can, so that you know you can rely on it: if it's available, I'm going to be able to use it and I'm going to get the best performance out of it. Yes, there is battery impact. If you are running this non-stop, you are going to run down the battery fairly quickly.
But what we're finding is that the kinds of use cases where people are using it today are things like a user coming in and asking a question and getting a response, or manipulating some data, and it's happening at that point in the flow. Users maybe use it 10 or 20 times a day. That sort of usage is really not concerning for battery life. We're also seeing more batch use cases, where I have a bunch of stuff that I want to process. Often that's not latency sensitive, so people are running that in the background, perhaps when the device is charging overnight, and they can basically run it continuously until they've completed all their tasks.
But this is one of the reasons why we're trying to build this into the platform, so that we can do all those optimizations for you and you don't need to worry about it. If, however, you have very specific needs and you want to go with custom models, we have a bunch of tools for profiling and for trying to determine what the impact is going to be for you, but of course it does need you to do some of that work if you're going to go custom.
Sorry, can I just give you this hand mic since you're >> Just pass that around if you've got a question. Thank you very much.
>> Yeah, actually two questions. First, a quick one.
So, in the case of using ML Kit, which I'd say will be the most optimized offering, does that mean that we are using a model that is shared across apps on the device?
>> Yeah.
>> Okay. If yes, then how is it managing the scale? Imagine a user has 100 apps, and in two years' time they are all using that same model. How are you handling that to keep latency as good as possible?
>> So, that's exactly one of the reasons why we're doing this at a system level, right? If you imagine these models, the smallest ones need to be about 1 GB to be useful. The ones we're shipping are actually close to 3-4 GB in total.
It's not really very easy to ship that as an app developer yourself. It really requires you to have an absolutely killer feature to justify that to an end user. If we put this in the system, we do that once and everyone can share and benefit from it, so that cost is shared, right? That's one of the reasons why we're doing this. So we have no concerns about hundreds of apps using this, because we've centralized that cost.
In terms of lots and lots of apps using this to generate features, what we tend to find is that yes, it does have a battery impact. But by centralizing that, by queuing things, by making sure there is some overall system management, we can inform the user: "By the way, this app is using quite a lot of quota to do this. Do you want to do that?" Right? We don't literally ask that, but we do attribute the battery impact to the app. And we've seen this time and time again with things like GPS or Wi-Fi. If users feel they're getting value out of the app, they're very happy to use it. They're happy to spend their battery on the features they love. If, however, the app is doing something that doesn't provide a lot of value to the user, they might not want to use it, right? So we think it's really important to just make these capabilities available and make it easy for you to use them sensibly, and then developers will build amazing features and users will choose what they want to spend their battery on.
>> But from a developer perspective, you don't need to worry about that. You just do your inference and that's it, and then AI Core does all of the...
>> Correct. We handle the scheduling, we make sure that things are getting queued up appropriately, so you don't need to worry about that.
>> So probably the only thing to think about is that if there are other apps using that model, then our prompt will get in the queue, basically.
>> So, if you're trying to access it from the background, then yes, you'll be queued. If you're in the foreground, of course, you're going to have top priority. So whichever app is currently being used actively by the user is going to be prioritized by the system.
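The foreground-first queuing described in this answer can be illustrated with a small priority queue. This is only a sketch in Python under stated assumptions: the talk does not describe how AI Core actually schedules requests beyond "foreground gets top priority, background gets queued", so `InferenceQueue` and its methods are hypothetical, and first-come-first-served ordering within each class is an assumption.

```python
import heapq
import itertools

FOREGROUND, BACKGROUND = 0, 1  # lower number is served first


class InferenceQueue:
    """Toy scheduler: foreground requests first, FIFO within each class."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, app, prompt, foreground):
        priority = FOREGROUND if foreground else BACKGROUND
        heapq.heappush(self._heap, (priority, next(self._arrival), app, prompt))

    def next_request(self):
        """Return the (app, prompt) to run next, or None if nothing is waiting."""
        if not self._heap:
            return None
        _, _, app, prompt = heapq.heappop(self._heap)
        return app, prompt
```

With this shape, a background batch job never delays the app the user is actively looking at, which matches the behaviour described above.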
Uh the question in the back.
Android 12 on my phone, which is not a Pixel, right? It's an Asus, but it doesn't matter.
And I have used the assistant.
>> Hold it closer.
>> Can you hear me better now?
>> Yeah, great.
>> So, I have a couple of questions. The first one is: when I'm asking Google to solve something for me, is it doing that locally on my mobile or is it doing it remotely?
>> When you say "when I'm asking Google", do you mean like the Gemini app or...?
>> Whatever I have on my Android by default. I did not install anything.
It is there. I just say, "Hey Google, what's the temperature outside?" and it will give me an answer. Is it running this locally or is it running remotely?
>> So, that would be the Gemini app or the search app that has integration with the Gemini APIs. To be honest, I'm not sure. My expectation would be that it's running this remotely.
>> Right. So, then my next question makes a lot more sense.
How can I then install the things that you have presented here? And the second part to this: is it possible to write skills for it locally, to improve the kind of answers that it gives me? If I notice that it's consistently giving me the wrong answers for whatever reason, is it possible to do that?
>> So, what you're talking about is two different user journeys. Because if you're building your own app, you wouldn't interact with the Gemini app. That's a completely different app. Rather, you would interact with the model itself.
>> Yes.
>> Therefore, if you're using an on-device model, you don't need to care about the cloud. All of that inference will be on device.
>> Yes. So, I'm asking precisely about that. If I installed what you showed here locally, assuming it is possible, and I then write skills to improve the kind of responses it gives me?
>> I don't think you would write the skills.
>> No, so I mean, it depends on the kind of experience you're trying to build, right? But if you think about a skill, it's just something you're sticking into the prompt alongside the rest of your query. So, the tools we're providing here are more low-level and designed for you to be able to build on top of them, right? So if, for example, you went and installed something like Pocket Claw or Open Claw on the phone itself, that would then basically take those skills and compose them into a prompt, which it would run through this API.
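The idea that a skill is "just something you're sticking into the prompt" can be shown in a few lines. This is a minimal Python sketch, not how any particular agent framework formats its prompts: `compose_prompt` and the section headings it emits are hypothetical and chosen purely for illustration.

```python
def compose_prompt(skills, query):
    """Place each skill's instructions ahead of the user's query in one prompt string.

    `skills` maps a skill name to its instruction text (hypothetical format).
    """
    sections = [f"## Skill: {name}\n{text.strip()}" for name, text in skills.items()]
    sections.append(f"## User request\n{query.strip()}")
    return "\n\n".join(sections)
```

The resulting string is what a layer built on top would then send to the model as a single prompt.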
So, what we're focusing on here is building the lower layers to enable you or others to build those things on top.
Thank you.
5 minutes.
We started late.
Uh there was one question there.
Yeah. So, the Gemini Nano model itself is completely contained by AI Core, right? So, you don't need to worry about how to set that up and configure it. You use the APIs to access it and it's all handled by the system. In terms of building a RAG-like solution, yes, absolutely, you can do that with the Prompt API.
We are looking at extending some of the APIs we have, for example adding an embedding API soon, and that will make it easier to build these kinds of RAG-like solutions. But yes, it's totally possible to do that. In terms of the relationship between AI Edge Gallery and AI Core, AI Edge Gallery is a showcase to show you what's possible, right? You can use AI Core as the backing for that, or you can use custom models. We really want you to see the full breadth of things you can do.
AI Edge Gallery, I think, is a really good way of showing the frontier of what you can do, right? But it requires more uplift. There's more work you have to do to take advantage of those things. AI Core is really focused on making it easy for you to just focus on the prompt, focus on the specific feature rather than any of the setup. Build like production-level apps.
>> When you say system, do you mean like system prompts or files? Like, I have to take a lot of pictures today, and I want to then summarize everything together and make a note.
>> Yeah. So, the ML Kit GenAI API, the Prompt API, allows both text and image input. So all the things your app normally has access to, in terms of files and so on, you can pass through the API and run just like you would with server-based inference, right? In terms of creating the prompt and feeding the information in.
>> Is there any vectorizing embedding model also available for this API? Like, I want to vectorize some of my text notes so that I can get the similarity between them and some other notes, maybe. So, does this have that?
>> So, we don't yet have an embedding API. We will soon.
>> Cool.
>> So, if you're looking at things like the Gemini embedding model, that's what we're going to make available, so that you can use that directly from the API.
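The note-similarity use case in this question comes down to comparing vectors, typically with cosine similarity. Since the embedding API is not yet available according to the answer, the Python sketch below substitutes a toy word-count vector for a real embedding model; `toy_embed`, `cosine`, and `most_similar` are hypothetical helpers, and a real embedding would capture meaning rather than just shared words.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query, notes):
    """Return the note whose vector is closest to the query's vector."""
    q = toy_embed(query)
    return max(notes, key=lambda note: cosine(q, toy_embed(note)))
```

Swapping `toy_embed` for a real embedding call would leave the similarity and ranking code unchanged, which is why an embedding API makes RAG-like retrieval easier to build.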
You want a last question? Last question.
We're being kicked out. Thanks.
>> Yeah, there's a diversity of devices, you know, and there's also a diversity of models.
You showed the Prompt API, and embeddings are coming soon, but how do I think about that matrix of device versus model: what's capable on a given device, and what models can be used? There are more use cases than just LLMs and embedding models. Are there other models that also run on device?
And then will they be widespread across all devices, or do I need the latest flagships to use those?
>> So, as Florina showed, the ML Kit API covers a range of things, like text, OCR, vision, all sorts of stuff. Those models, the classical models, are much smaller. They can run on a super large range of devices, like a billion-plus devices, no problem.
The new ones, the GenAI APIs, currently require fairly flagship devices from the last couple of years.
What we're doing with AI Core is we package that up and say, "Hey, you can call this API and if it's available, you know it's going to run well."
If, however, that reach is not big enough for you and you want to do the work to test on a wider range of devices, LiteRT-LM, which Cormac is going to talk about in a second, can help you do that. Cormac hopefully will talk about some of the tools we have available to make that testing easier, but you will need to do that testing yourself. For AI Core, we cover all of that, so you know that if it's covered by AI Core, it's going to run well.
We're going to be around today and tomorrow, so please come and ask us questions. Thank you.
>> [music]