Large Language Models process prompts through a multi-stage pipeline: first, a tokenizer breaks text into tokens, using a vocabulary built by merging the most frequent character pairs in its training documents, and maps each token to a unique ID; second, the token IDs are turned into embeddings, and the transformer's attention mechanism makes them context-aware; third, the model generates the response token by token, assigning a probability to every token in its vocabulary at each step and, in the simple case described here, selecting the highest-probability token until it predicts a stop token or reaches the maximum length.
Deep Dive
How LLM works
Hi everyone. Today we'll discuss what happens whenever you give a prompt to an LLM: how it processes the prompt and how it generates the answer. Let's say we give the LLM a prompt such as "What is machine learning?" and we want an answer from it. The LLM could be ChatGPT, Anthropic, Grok, or something else.
So at first it will create some tokens. Let's say "what" is one token, "is" is another token, "machine" is another token, and "learning" is another token. But it is not done word by word. You cannot say that "what" is a word so it is a token, "is" is a word so it is another token, "machine" is a word so it is another token, and "learning" is a word so it is another token. It is not like that. It is based on the tokenizer: the tokenizer creates those tokens. So let's first see how the tokens are created. Let's say there is a word like "language", and we gave thousands of documents to train the tokenizer. Written character by character, "language" is l, a, n, g, u, a, g, e.
The tokenizer then forms pairs of adjacent characters: "la", "an", "ng", "gu", "ua", "ag", "ge". For each pair it counts how many times that combination of characters appears in those documents. Let's say "la" appears 50 times, "an" 60 times, "ng" 4 times, "gu" 7 times, "ua" 20 times, and so on for the remaining pairs. Which one is the highest? 60. If another pair also comes to 60, we take the first one that has the highest frequency. So "an" becomes one character combination.
So the word becomes l, an, g, u, a, g, e, with "a" and "n" combined. In that way we keep combining the characters that appear together most frequently in the documents.
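The counting and merging step described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tokenizer's implementation: the four-word corpus is a hypothetical stand-in for the "thousands of documents", and the helper names `count_pairs` and `merge_pair` are made up for this example.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Hypothetical toy corpus; every word starts out split into single characters.
corpus = ["language", "land", "plan", "angle"]
words = [list(word) for word in corpus]

# One training step: find the most frequent pair and merge it everywhere.
pairs = count_pairs(words)
best_pair, best_count = pairs.most_common(1)[0]
words = [merge_pair(word, best_pair) for word in words]

print(best_pair, best_count)
print(words[0])
```

In this toy corpus the pair ("a", "n") is the most frequent, so "language" becomes l, an, g, u, a, g, e. Repeating the step on the merged symbols grows longer tokens, which is the idea behind byte-pair-encoding style tokenizers.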
Let's say that, continuing in that way, we find that "lang" appears many times and "uage" appears many times. These become two tokens, and they are stored in the vocabulary. Each token also gets a token ID: let's say "lang" comes first, so its ID is 0, and "uage" gets 1. Those are the IDs of the tokens, and that is how the tokens are created. Now, when we get a prompt like "What is machine learning?", "what" is definitely present in the vocabulary, so "what" comes out as one token. "Machine", let's say, is not completely present in the vocabulary. Then, instead of "machine" becoming one token, it can be broken into two tokens, because those two pieces are present in the vocabulary and "machine" as a whole is not. In that way the tokens are created.
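To make the lookup concrete, here is a small Python sketch of splitting a word into pieces that exist in a vocabulary. It uses greedy longest-match, which is a simplification: byte-pair-encoding tokenizers normally replay their learned merges in order instead. The entries "lang" with ID 0 and "uage" with ID 1 follow the example above; the remaining entries, their IDs, and the function name `encode` are assumptions made for illustration.

```python
def encode(word, vocab):
    """Split `word` into the longest pieces found in `vocab`, scanning left to right."""
    pieces, ids, start = [], [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                ids.append(vocab[piece])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {word[start]!r}")
    return pieces, ids

# Hypothetical vocabulary: only "lang" and "uage" come from the example above.
vocab = {"lang": 0, "uage": 1, "what": 2, "mach": 3, "ine": 4}

print(encode("language", vocab))
print(encode("what", vocab))
print(encode("machine", vocab))
```

With this vocabulary, "what" stays a single token, while "machine" is not an entry on its own and is split into two pieces that are.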
These are some examples of tokenizers: the type of tokenizer GPT-4 uses, the type Llama uses, and the type the Hugging Face wrapper uses. One question can arise: why not use whole words? In "What is machine learning?", instead of breaking "machine" into multiple chunks, why not store the complete words "what", "is", "machine", and "learning" and just use them? The main problem is that one word can have multiple variations. Take "run": there is "running", there is "runner", and there is "rerun".
If we try to store all those words, the vocabulary becomes huge, which is inefficient to store and to search. So instead of storing every word, we try to find a generalized, balanced way to store these words and use them efficiently. Instead of storing word by word, we can store "run" as one token, "ing" as another token, "er" as another token, and "re" as another token. Then, whenever we get one of these words, it is built from pieces. "Running" is spelled r-u-n-n-i-n-g, so instead of storing "running" it becomes "run" plus "ning". "Runner" becomes "run" plus, let's say, "ner". "Rerun" becomes "re" plus "run". In that way the tokens are created, so that we can cover those words and use them efficiently.
Why not go character by character? For "run", we could store "r", "u", and "n". Why not? After all, there are the ASCII values.
We would only have to store the ASCII values. There is a fixed number of them, and we would store only those. But the problem is that the number of tokens becomes very high. If we break every word character by character, it creates a large number of tokens, and processing them is very heavy for the transformer. We'll discuss where the tokens go after they are created: they go to the transformer. So this is the first step: create the tokens with the help of the tokenizer.
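The trade-off between whole words, subwords, and single characters can be checked with a quick count. The sketch below is illustrative only: the stems "play", "read", and "build" are hypothetical stand-ins chosen because they split cleanly into a stem plus "ing", "er", or "re", and the splits are written out by hand and not produced by a trained tokenizer.

```python
stems = ["play", "read", "build"]

# Hand-written subword splits for each stem and its variations.
splits = {}
for stem in stems:
    splits[stem] = [stem]
    splits[stem + "ing"] = [stem, "ing"]
    splits[stem + "er"] = [stem, "er"]
    splits["re" + stem] = ["re", stem]
words = list(splits)

# Word-level: every variation needs its own vocabulary entry.
word_vocab = set(words)
word_tokens = len(words)

# Subword-level: a few shared pieces cover all the variations.
subword_vocab = {piece for pieces in splits.values() for piece in pieces}
subword_tokens = sum(len(pieces) for pieces in splits.values())

# Character-level: a small alphabet, but one token per character.
char_vocab = {ch for word in words for ch in word}
char_tokens = sum(len(word) for word in words)

print("word-level:     ", len(word_vocab), "entries,", word_tokens, "tokens")
print("subword-level:  ", len(subword_vocab), "entries,", subword_tokens, "tokens")
print("character-level:", len(char_vocab), "entries,", char_tokens, "tokens")
```

Even on twelve words the pattern shows up: the word-level vocabulary grows with every new variation, the character-level encoding produces by far the most tokens, and the subword split sits between the two.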
Then the next step. Our prompt was "What is machine learning?", with a question mark. Based on the vocabulary, each token gets its token ID. Let's say "what" is the 102nd entry in the vocabulary, so it gets 102. Let's say "is" comes second, so it is assigned 2. Let's say "machine" is 919, "learning" is 834, and the question mark is 43, depending on the vocabulary. So after receiving the prompt, the tokenizer converts it into numbers.
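As a small sketch, the mapping from tokens to IDs is just a dictionary lookup. The IDs below are the illustrative values used in this example (102, 2, 919, 834, 43), not those of any real tokenizer, and the prompt is assumed to be already split into tokens.

```python
# Illustrative vocabulary using the IDs from the example above.
vocab = {"what": 102, "is": 2, "machine": 919, "learning": 834, "?": 43}

# The prompt, assumed to be already split into tokens by the tokenizer.
tokens = ["what", "is", "machine", "learning", "?"]

# Encoding: look up each token's ID.
token_ids = [vocab[token] for token in tokens]

# Decoding: invert the vocabulary to map IDs back to tokens.
id_to_token = {token_id: token for token, token_id in vocab.items()}
decoded = [id_to_token[token_id] for token_id in token_ids]

print(token_ids)
print(decoded)
```

The reverse mapping matters later: when the model generates an answer, it produces token IDs, and these have to be turned back into text.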
So the prompt is converted to numbers. What is the next step? We have to give context awareness, because on their own these are just tokens without any meaning; we have to bring in the meaning, that is, how they connect to each other. Take this example: "The animal did not cross the road because it was too tired."
You could say that "it" can refer to the road and that it can also refer to the animal, but which one should we take? The animal, because "it" is telling us about the animal: the animal is too tired, so it cannot cross the road. So what happens? After the tokens are created, they go to the transformer so that we get context awareness. Whenever it processes the word "it", the attention mechanism of the transformer looks back at "animal", looks at "road" as well, and decides that "it" refers to the animal, not the road. That is the context awareness. How do we get it? The token IDs are transformed into embeddings, vectors of values such as 0.3, 0.5, 0.7, 0.8, 0.1, and these pick up the context as they pass through the attention mechanism of the transformer. What happens next? After we get the embeddings with the context awareness built in, the LLM tries to generate the answer. How? Here too it uses the vocabulary that was already created.
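A rough sketch of how attention can favor "animal" over "road" when processing "it" is shown below, using scaled dot-product attention for a single query. All vectors are hand-picked, hypothetical values chosen so that the vector for "it" lies closer to "animal"; a real transformer learns its embeddings and uses separate learned query, key, and value projections across many layers.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Context vector: weighted average of the value vectors.
    width = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, context

# Hypothetical vectors; a real model learns these during training.
vectors = {"animal": [0.9, 0.1, 0.8], "road": [0.1, 0.9, 0.0]}
query_for_it = [0.7, 0.1, 0.9]

names = list(vectors)
keys = [vectors[name] for name in names]
weights, context = attend(query_for_it, keys, keys)  # keys double as values here

for name, weight in zip(names, weights):
    print(f"{name}: {weight:.2f}")
```

The larger weight on "animal" means the context vector for "it" is built mostly from "animal", which is the sense in which attention resolves the reference.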
Our question was "What is machine learning?" From this embedding, the model gives a probability to each token in the vocabulary. Let's say "machine" gets a probability of 0.92 and "happen" gets a probability of 0.03. Which one is the highest? 0.92. So when generating the answer, "machine" comes first, and it is carried into the next step. What is the next step?
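The step of turning the model's output into probabilities and picking the highest one can be sketched as follows. The four-token vocabulary and the raw scores (logits) are hypothetical values chosen for illustration; a real model scores every token in a vocabulary of many thousands of entries.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical vocabulary and raw scores for the first answer token.
vocab = ["machine", "happen", "run", "learning"]
logits = [4.0, 0.5, 0.2, 1.0]

probs = softmax(logits)

# Greedy selection: take the index with the highest probability.
best_index = max(range(len(vocab)), key=lambda i: probs[i])
next_token = vocab[best_index]

for token, prob in zip(vocab, probs):
    print(f"{token}: {prob:.3f}")
print("next token:", next_token)
```

Always taking the top token is known as greedy decoding. Many deployed systems instead sample from the distribution, for example with a temperature or top-p setting, which is why the same prompt can produce different answers.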
The process repeats. First "machine" came; then it goes around again. What can the next token be? Let's say that in the vocabulary "learning" gets the highest probability, 0.92, and "run" is another token (a token, not a word) that gets a probability of, let's say, 0.08. Which one is the highest? "Learning". So first we got "machine", then "learning", then let's say "is", then "a", then "field": "machine learning is a field".
In that way, word by word, or rather token by token, tokens are added to the answer until we get the final answer. How long does it go on? Until it predicts the stop token or hits the maximum length. Until then it keeps assigning probabilities to the tokens and predicting the token that is most suitable for the answer.
So let's go through the steps again. First, the prompt is split into tokens by the tokenizer; we already discussed how those tokens are created and stored in the vocabulary. Then, after getting the tokens, we get the token IDs from the vocabulary. Then, after getting the token IDs, we have to give context awareness, so they go through the transformer, and the attention mechanism produces a context-aware embedding for each token.
Then the LLM tries to generate the answer. Again, based on the embedding, it predicts the answer by giving a probability to each token in the vocabulary. Let's say "machine" gets the highest probability for this question, or this embedding, so "machine" is added first. Then, in the same way, whichever token is most suitable is added from the vocabulary. That is how it generates the answer, and we get the final answer.
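The generation loop recapped above, including the two stopping conditions, can be sketched with a toy stand-in for the model. Everything here is hypothetical: the table `NEXT_TOKEN_PROBS` replaces the transformer and only looks at the last token, whereas a real LLM conditions on the whole prompt and everything generated so far. The names `STOP`, `START`, and `generate` are made up for this example.

```python
STOP = "<stop>"
START = "<start>"

# Hypothetical stand-in for the model: last token -> next-token probabilities.
NEXT_TOKEN_PROBS = {
    START: {"machine": 0.92, "happen": 0.08},
    "machine": {"learning": 0.92, "run": 0.08},
    "learning": {"is": 0.90, STOP: 0.10},
    "is": {"a": 0.80, "run": 0.20},
    "a": {"field": 0.70, "machine": 0.30},
    "field": {STOP: 0.95, "is": 0.05},
}

def generate(max_length=10):
    """Greedy generation: add the most probable token until stop or max length."""
    answer, last = [], START
    while len(answer) < max_length:
        probs = NEXT_TOKEN_PROBS[last]
        token = max(probs, key=probs.get)  # highest-probability token
        if token == STOP:
            break
        answer.append(token)
        last = token
    return answer

print(" ".join(generate()))
print(" ".join(generate(max_length=2)))
```

The first call ends because the stop token becomes the most probable choice after "field"; the second call ends early because it reaches the maximum length.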
So that's all. I hope you were able to understand this topic. Thank you.