Large Language Models (LLMs) are AI systems that predict the next token in a sequence. They are built in three phases: pre-training on massive datasets to learn language patterns, supervised fine-tuning to make outputs useful, and reinforcement learning from human feedback to steer responses toward appropriate ones. The model itself is just a file containing billions of floating-point weights, while the system around it (roughly 80% of the effort, by the lecturer's own rough estimate) handles input processing, context management, safety layers, and output validation. LLMs work by tokenizing text into subwords, converting the tokens to numerical embeddings, which rest on the distributional hypothesis (a word is understood by its context), and generating responses one token at a time in an auto-regressive process. A critical limitation is hallucination, where a model presents false or misleading information as fact, which is why robust systems include guardrails and verification layers.
Deep Dive
Summer Internship 2026 - Introduction to LLM - Lecture by Diptendu Dutta
Good evening interns.
Today our broad topic is understanding the fundamentals of large language models.
So this will be primarily a deeply conceptual session, where my objective is very simple.
A lot of ideas, and a lot of incorrect concepts, are there about large language models. It is a new thing, new in the sense that it is only about six or eight years, I guess, since it became prominent, and everybody now knows about ChatGPT. But the fundamental concepts behind it are not really understood by most people.
So this is my target today: trying to clear up the conceptual understanding of large language models.
So this is the agenda for today.
First I will discuss the basic difference between model versus system.
This is a major area of misconception.
So I will tackle this first. Then we will go a little bit into the internal workings of a language model.
This subject is a very vast one and pretty complicated.
So we will only be able to scratch the surface, I guess.
But still, I'm hoping that by the end of this session you will have at least a better understanding of large language models than you have right now. So that is the very humble objective: that your conceptual space is cleared up and you have a little more idea about how things basically work. And the final part of the session today is some discussion about how these large language models are trained. You may have heard that many of these large companies are now targeting India for setting up their data centers. So it's relevant to know how these large language models are trained. We also have many initiatives in India where LLMs are being developed keeping the Indian languages in mind, so a little bit of an idea about how these models are trained is also useful, I think. So that's the agenda: first clear up the confusion between what a model is and what the system is, then go into a little bit of technical understanding of how language models work, and finally get some idea about how large language models are trained.
Yeah. So there's a lot of confusion regarding this model versus system aspect. When you talk about ChatGPT or Claude, what are you actually referring to? Is it the model, or the software system that these companies are providing?
Okay, if you remember, in the very first session we had mentioned that you should practice with at least two tools. One was NotebookLM and the other was Ollama.
So Ollama is the system part. That's the tool that you use, and it allows you to download and use hundreds of models; any model of a particular format can be imported into the system. So that is a very basic example of the distinction between a model and the system, and here is a little more technical detail.
As you can see, the model is in the middle. It is just billions of floating-point numbers. So it's a file.
At the end of the day a model is a file.
I mean, there may be many more engineering complexities to this. It is a simplistic representation when I say a model is a file, but for most of us, for example when using Ollama, a model is a file that you have to download from somewhere. Then what is the system? The system is everything else. Just having the model is useless. Imagine you have your biodata in a DOCX file and you send it to somebody who doesn't have Microsoft Word, or access to Office 365 or whatnot. What will that person do with that DOCX file with your biodata, your resume? Nothing, right? Similarly, just having the model is useless. You need that system around the model, and my estimate, and it's a very rough estimate, I don't have figures to back it up, is that in this whole thing of ChatGPT or Claude the system is 80%.
The model is only 20%.
A huge amount of effort goes into creating that system: input handling, context management, the safety layers. I mean, if you use language which is not really safe, you'll get an answer like "we cannot answer this question". Do not for a minute think that that answer is coming from the model. There are other pieces of software, on the input side as well as on the output side, which you interact with. So you interact with the system. You don't directly interact with the model.
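The model-versus-system split described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the blocklist, the refusal text, the function names and the stand-in `echo_model` are all hypothetical, and real products use far more elaborate checks whose details are not public.

```python
from typing import Callable

# Hypothetical blocklist and refusal text, for illustration only.
BLOCKED_TERMS = {"malware", "ransomware"}
REFUSAL = "Sorry, we cannot answer this question."


def input_guardrail(prompt: str) -> bool:
    """Return True if the prompt may be passed on to the model."""
    words = prompt.lower().split()
    return not any(term in words for term in BLOCKED_TERMS)


def output_check(answer: str) -> bool:
    """Return True if the model's answer may be shown to the user."""
    return bool(answer.strip())


def system_respond(prompt: str, model: Callable[[str], str]) -> str:
    """The 'system': input checks, then the model, then output checks."""
    if not input_guardrail(prompt):
        return REFUSAL  # the model is never called for a blocked prompt
    answer = model(prompt)
    if not output_check(answer):
        return REFUSAL
    return answer


def echo_model(prompt: str) -> str:
    """Stand-in for a real model; it only echoes the prompt."""
    return f"Model answer to: {prompt}"
```

Calling `system_respond("how to write malware", echo_model)` returns the refusal without the model ever being invoked, which is the point made above: the refusal comes from the system, not from the model.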
Tools like Ollama are a little more, I should say, basic level. So in that case your interaction with the model is more immediate. But with sophisticated systems like ChatGPT, Claude, Gemini and many others, the model is far, far away from you. You interact most of the time with the system. And these days it is being said that the system is defining the success of the company, not the model.
Okay.
So this gap that I have put up here for these companies: I mean, as interns, as engineers, as technical people, we also have to have some idea about the market. Just knowing the technology is not good enough. So the trend here is that all these companies are now focusing on the product and experience side.
So for example, Claude has announced that their system can now be integrated, through APIs and SDKs and whatnot, with Microsoft Office products. If you think about it, this integration has nothing to do with the basic model. But the value that customers like you and me are getting is because of the system. So it's a fundamental understanding that when we say LLM, generative AI or whatever term we are using, we are perhaps actually talking about the system rather than the model, because in most cases information about these proprietary models is not really available. You cannot really download, for example, some version of Opus, or the ChatGPT models, whatever they are. They display a drop-down list from which you can select: if you're a paid customer you get access to some models, if you are a free customer you don't. So the model part is the thing that they don't really expose, but the software is now becoming more and more important. So that was the point that I wanted to make right at the beginning: the market is moving in a way where the differentiation between the offerings of two companies is not in the raw foundation model. It's more in the software layer that surrounds the model.
So that was the message about models versus systems.
Okay. So the system is that complicated set of modules, right? Which software developers like us develop.
Okay. So what about the model? Here I have a list of some popular open-source models and their size on disk. If you observe, the size on disk and the memory requirement are aligned, but need not be exactly the same. That reminds me: whenever you are using Ollama, or any tool for local access where you want to download a model and use it locally, you need to know how much memory is required. The Ollama site gives you an indication of the memory requirements. You can study that, or refer to the internet to find out how people are doing it. So the RAM in your laptop or desktop will decide which models you can use from within Ollama.
And the very last column is this file aspect that I mentioned. At the end of the day a model is a file. Your biodata is a DOCX file or a PDF file.
Similarly, the model at the end of the day is a file. Now, the formats are different, obviously. The format of a DOCX file is something else. By the way, a DOCX file is actually a ZIP file. If any of you are interested, you can rename a DOCX file to a .zip file and use a zip or RAR tool to see what is inside it.
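The DOCX-is-a-ZIP remark can be checked without renaming anything, using Python's standard `zipfile` module. The sketch below is an assumption-laden illustration: `make_fake_docx` builds a tiny in-memory ZIP with two entry names that a real DOCX also contains, so the example runs without a Word document at hand; the helper names are made up for this example.

```python
import io
import zipfile


def list_archive_parts(docx_source) -> list[str]:
    """Return the names of the files stored inside a DOCX (ZIP) container."""
    with zipfile.ZipFile(docx_source) as archive:
        return archive.namelist()


def make_fake_docx() -> io.BytesIO:
    """Build a tiny in-memory ZIP that mimics part of a DOCX layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    buffer.seek(0)
    return buffer
```

With a real document, `list_archive_parts("resume.docx")` (a hypothetical file name) lists the XML parts that make up the file.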
Anyway, so these models are ultimately files, and these are the typical sizes that you have. So for example, Mixtral 8x7B, my god, that is 94 GB. So what could be the size of, say, the ChatGPT model? I mean, GPT is coming up with multiple models, so it must be something huge.
In reality it may not be a single file residing on a large server. There are techniques like sharding, where a huge model is broken up into parts that are put on various machines, and they somehow combine to provide you the service. But those are the low-level technical details. At an abstract level, even these large models are files.
And there are various file formats.
So what is inside this model?
Let's take a look at this. So this is an example of, say, the Llama 3 8B quantized version. I forgot to mention on the last slide the reason why the size of a model on disk and the memory required to use that model are not exactly the same: it is because of things like quantization, a kind of compaction scheme which helps to bring down the size of the model. That itself is a pretty complicated subject, and I'm not an expert in that area either, so I will stop here. But suffice it to say that the model is a file, and there are techniques for making it smaller so that you can use it on your laptop.
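A rough back-of-the-envelope calculation shows why quantization matters for running a model on a laptop. The sketch assumes the file size is dominated by the weights and that every weight uses the same number of bits; real quantized files are somewhat larger because they also store scaling factors and metadata, and memory use at run time adds the context cache and other overhead on top. The function name and the round parameter counts are assumptions for illustration.

```python
def approx_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# An 8-billion-parameter model (a round figure, assumed) at different precisions.
size_16_bit = approx_weight_size_gb(8e9, 16)  # 16-bit floating point
size_4_bit = approx_weight_size_gb(8e9, 4)    # 4-bit quantization
```

Under these assumptions the 16-bit weights take about 16 GB and the 4-bit version about 4 GB. The same arithmetic with roughly 46.7 billion parameters at 16 bits gives about 93 GB, in line with the 94 GB quoted for Mixtral 8x7B.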
So that's quantization. If you go to Hugging Face, or any other place where you can look at the model card, and I'm hoping that at least some of you have visited the Hugging Face site and have seen one, you will see on the right-hand side of the model card things like the quantizations available. So for each model various people, maybe researchers, small companies or even the large companies themselves, not necessarily ChatGPT, since many other companies are working in this space, have taken the trouble of quantizing these models and making them available, so that you can use them with tools like a web UI, and there are so many other tools with which you can run the model locally. So this is the internal structure of a model file. It starts off with something like a magic number, and then there are various counts. Every file has a structure. I don't know whether you have ever tried to understand the structure of a PDF file, but it also has a structure: the first four bytes are this, the next 10 bytes are that. So every file format, whether it's DOCX, PDF, a GGUF model file or a tensor file, has to have a structure. Otherwise the software won't know how to read that file. Right?
So these are the components of a GGUF file. All these tools, and many others, need to read it. For example, in our own platform development work we use the Python library llama-cpp-python, which provides a Python binding to llama.cpp, a very well-known and venerable library for reading these GGUF files. Libraries like that need to know the structure of the file: what is in the very first four bytes, then 16 bytes of various counts, then the key-value metadata. These terms wouldn't make much sense to you now, and it's not that they make a lot of sense to me either, but the fact is that a model is a file and the file has a structure. That is the takeaway from this slide, I would say. Finally, there are the last two sections of a GGUF file. One is the tokenizer vocabulary; we'll spend some time on tokenizers, because that is again a fundamental concept of large language models. And finally there is the tensor data, the weights. You have been exposed to machine learning, and you have either already been exposed to neural networks or some classes on them are coming up. There you will learn how a neural network is trained, and that the term weight is a very fundamental term in machine learning, particularly in neural networks or transformers.
So the training process that a neural network goes through leads, at the end, to these weights.
So basically when we say the model has learned this, the model has learned that, it is these weights that we are referring to. Since the model is kept in a file, there is a section in the file where the weights are stored. These are all floating-point numbers, and for really large language models there are billions of such values.
Okay. So that is one example of a specific file format. There are other file formats for storing models, like safetensors and others. So this is one particular example, but it at least makes it clear that at the end of the day a model is a file, and a file has a structure where all the required components are kept in a certain sequence, so that any software library, and there are of course many libraries, can read these files and do whatever they're supposed to do,
inferencing and stuff like that. So that's the message from this slide: a model is a file, and the file has a structure holding all the components of the model, whether it's the weights, the tokenizer vocabulary, or various other housekeeping kind of stuff, I would say, like counts and key-value metadata. The example on the right shows this third component, the key-value metadata. For example, you'll find the string "llama" in that section; you will find the block count and the context length. So all these things are stored. Many of you may have heard this term context, and various models have various context lengths. You may also have heard about embeddings: what is the embedding length used by this model? All this information is stored in the model file itself.
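To make "a file has a structure" concrete, here is a minimal sketch that reads the fixed-size start of a GGUF file with Python's `struct` module. It assumes the layout used by GGUF version 2 and later: a 4-byte magic `GGUF`, a 32-bit version number, then two 64-bit counts (number of tensors, number of metadata key-value pairs), all little-endian. The function name and the synthetic header values are made up for illustration; this is not how llama.cpp itself is implemented.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata key-value count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size header at the start of a GGUF file."""
    if len(data) < HEADER_SIZE:
        raise ValueError(f"need at least {HEADER_SIZE} bytes")
    magic, version, tensor_count, kv_count = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE]
    )
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration: version 3, 291 tensors, 24 metadata entries.
fake_header = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
```

With a real model file, the same function can be fed the first 24 bytes, for example `read_gguf_header(open(path, "rb").read(24))`, where `path` points to a downloaded `.gguf` file. The key-value metadata and the tensor data follow this header.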
Similarly, the tokenizer vocabulary will have the details of the tokens and their IDs.
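A tokenizer vocabulary is, at its simplest, a table from token strings to integer IDs. The toy sketch below uses a made-up six-entry vocabulary and a greedy longest-match rule; real tokenizers have vocabularies of tens of thousands of entries and use trained algorithms such as byte-pair encoding, so treat this only as an illustration of the token-to-ID idea.

```python
# Made-up vocabulary, for illustration only.
TOY_VOCAB = {"un": 0, "believ": 1, "able": 2, "token": 3, "s": 4, " ": 5}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def encode(text: str) -> list[int]:
    """Split text into vocabulary tokens (longest match first) and return their IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate starting at pos, then shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def decode(ids: list[int]) -> str:
    """Map token IDs back to text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)
```

Here `encode("unbelievable tokens")` splits the text into `un`, `believ`, `able`, a space, `token` and `s`, giving the IDs `[0, 1, 2, 5, 3, 4]`, and `decode` turns them back into the original string.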
This is another concept I wanted to bring up for real-world usage of these systems.
>> Sir, Bidisha this side. Two questions are there in the chat box. One is: are all the files in LLM models GGUF files? And the other one, sir: what is the tokenizer vocabulary?
Yeah, we will address these questions. On the tokenizer vocabulary, we will spend some time on the tokenizer, and you can experiment online: if you search for "online tokenizer" in Google or any search engine, many sites provide you with a tokenizer. We'll come to that slide a little later on. And regarding the question about whether all models are GGUF files: no.
But more and more models are being offered in this format. The simple reason is that the GGUF format has become very, very popular because of the software tools. If you remember the first slide, where we made a distinction between these tools, or the system, and the model: the GGUF format is something that these tools can easily use. I don't know, you are all very young, all the interns here in this session, but there was a time when Microsoft's file format was proprietary. Nobody could write any software which was able to read a Microsoft Word file. That was much earlier. Then Microsoft decided to open up the file format.
As soon as they opened up the file format, hundreds, if not thousands, of software tools became available which could read it. So you were not restricted to just using Microsoft Word to do things with your files. At that time they were DOC files. So the benefit that the company got by opening up the file format is that an ecosystem developed around it. Similarly, GGUF is a format around which a huge ecosystem has developed: a number of tools and libraries have become available, so that if a model is available in GGUF format you can do a lot more than, maybe, if it is available in safetensors format. So that is why a lot of people are now converting. There are tools available to convert from one format to the other, just as we use tools to convert from Microsoft Word to PDF and from PDF to Word. I use those tools on an everyday basis. Similarly, there are tools available for converting between the various file formats for LLMs.
Now back to this concept of model routing.
So in many real-world applications it is not a single model that is used. You never come to know. Say you are interacting with a chatbot from your bank or whatever; assume that the chatbot is part of that system that we mentioned, and behind it there is some model. Do not for a moment assume that there is a single model. There may be multiple models, and the idea is that if the query you are putting to the chatbot is a simple kind of query, then it goes to a low-cost model.
The cost of a model is in terms of its computational complexity, how much memory it takes, what its context size is, the number of tokens it consumes and all that stuff. So a complex query may be routed to some other model.
So this is again a part of that system thing that I told you about: a system around these models. System meaning software. Basic software, maybe written in Rust, maybe written in Python, doesn't matter. This module first makes an analysis of the prompt.
Then, based on this analysis, it selects an appropriate model, and finally the request is routed to that model. Of course there are a lot of input guardrail checks and all that, so that you cannot use bad language and many other unsafe queries. Those are filtered out and returned promptly: you get a response that no, we cannot answer this question. But if the question is a genuine one, the system makes a decision about which model to pass the query to, and then the model provides the answer. This again goes through the output layer. It's not shown in this diagram, but as I mentioned on the first slide on model versus system, there is an output layer which checks whether the answer is useful to the user or not, and all this. There are so many things happening behind the scenes. As I keep saying, the model is only maybe 20% of the effort. There are so many other things going on that you never come to know about. Right? So then the final response is sent back to the user. So that's model routing: not a single model, but in a real-life situation maybe multiple models are waiting.
Maybe there is an API endpoint for each of multiple models, and there is this classifier, if I may use the word now that you have gone through the basic concepts of machine learning. So there is something like a classifier which classifies the query sent by the user, maybe into three classes, simple, medium and complex, and routes it to the appropriate model.
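The routing idea, classify the query and then pick a model, can be sketched as follows. Everything here is an assumption for illustration: the model names are hypothetical, and the word-count rule is a crude stand-in for what would in practice be a trained classifier.

```python
# Hypothetical model names, one per complexity tier.
MODEL_BY_TIER = {
    "simple": "small-model",
    "medium": "mid-model",
    "complex": "large-model",
}


def classify_prompt(prompt: str) -> str:
    """Crude stand-in for a trained classifier: judge complexity by word count."""
    word_count = len(prompt.split())
    if word_count <= 8:
        return "simple"
    if word_count <= 40:
        return "medium"
    return "complex"


def route(prompt: str) -> str:
    """Return the name of the model that should handle this prompt."""
    return MODEL_BY_TIER[classify_prompt(prompt)]
```

A short question such as "What is my balance?" goes to the low-cost model, while a long, multi-part request goes to the largest one. The design point is that the router is ordinary software sitting in the system, outside any of the models.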
This of course is a favorite topic as soon as you discuss LLMs and generative AI: hallucinations. We all have some idea about what hallucination is, and the more you use the models, the more you come to know about these various hallucinations.
Okay.
I mean, you face it. That is why you will find that in the software tools most of these large companies provide, there will always be a statement towards the end, maybe in a small font size, that AI makes mistakes, double-check the answer and all that. So there is a lot of sometimes false, misleading or nonsensical information that the model presents as factual. I mean, if it is nonsensical it is not really harming you that much, but sometimes false information does. I'll give you an example.
I don't know whether to classify it as nonsensical or misleading. So when the Qwen model came out, about two or three years ago, I think,
I asked that model: when will China attack India? It was a Chinese model, and the answer was that the question of China attacking India does not arise at all, since India is already a part of China.
So that is of course false information. You may just disregard it, as I did. I also showed this nonsensical information to colleagues.
But for example in the health domain, if you are asking something and it comes back with false information, that can genuinely harm you. So hallucination is a reality. It doesn't matter how good a model is; it is the way the technology works.
As we will see in some slides that are coming up, there is no repository of information in an LLM. It is not a database which only keeps good data and true data, where somebody can check whether China is a part of India or India is a part of China, or whether somebody on purpose put that information in the database and that is why the model is giving false information. Nothing like that. The inherent nature of how these beasts, these LLMs, are trained makes them susceptible to such hallucinations, with misleading or nonsensical information coming out as output. In fact, that is one of the major reasons for having this system in place rather than just the model.
At least for the commercial vendors like ChatGPT, Claude, Gemini and others, there is a very strong and involved output layer that rigorously checks the output of the model.
I was using Qwen from within Ollama, and there is no such check in Ollama.
It just gives the raw output of the model. But all these vendors like ChatGPT and Claude and others have a very strong output layer. So the response coming from the model goes through a lot of checks.
Since all this happens in a split second, I guess we don't really feel that there is some processing going on, but believe me, a lot of processing is going on. Nobody gives you the raw output of the model. There are further checks. The details of the checks are not readily available, but there are things like whether the output is safe or not, whether it's useful or not, and many other kinds of checks. So that's how hallucination is handled today. Models will hallucinate.
There is no force on earth that can prevent a model from hallucinating. It is the nature of how these models are created. A model is not a database where somebody can put in false information on purpose. It's trained on huge amounts of text, and the text need not contain anything like this. Remember, it's not that the training data contained a sentence which said that India was a part of China. No, that is not the case.
Somehow the model associated the next tokens in a way that ended up with this response that India is a part of China.
So nobody actually put that data in the LLM. There is no way of putting data in an LLM. That's not how LLMs are trained. It is not a database. We will see a little bit more on how these models are trained.
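A toy example shows how a next-token predictor can produce a false sentence that was never in its training text. The sketch uses a bigram model (each word is predicted from the previous word only), which is a drastic simplification of a real LLM; the two-sentence corpus and the function names are made up for illustration.

```python
from collections import defaultdict

# Tiny made-up training corpus; both sentences are true.
CORPUS = [
    "paris is the capital of france",
    "rome is the capital of italy",
]


def build_bigrams(sentences: list[str]) -> dict[str, set[str]]:
    """Record, for every word, which words were seen directly after it."""
    followers = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word].add(next_word)
    return followers


def is_generable(sentence: str, followers: dict[str, set[str]]) -> bool:
    """True if every word-to-word step in the sentence was seen in training."""
    words = sentence.split()
    return all(
        next_word in followers.get(current_word, set())
        for current_word, next_word in zip(words, words[1:])
    )


FOLLOWERS = build_bigrams(CORPUS)
```

The sentence "paris is the capital of italy" appears nowhere in the corpus, yet `is_generable` accepts it, because each step from one word to the next was seen during training. Nobody put that false sentence into the model; it falls out of the way next words are associated.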
Yeah. So there is even a hallucination leaderboard. You can find it on the net.
Okay. So the models at the top are the best models.
They have the fewest hallucinations, and this leaderboard, as with any leaderboard, changes on a daily basis. This data was from, I guess, which date? Yeah, February 5. So it's pretty old data, I must say.
So maybe you can go to the net, search for hallucination leaderboard, and find which is the best model. Remember, just because a model is good at not generating hallucinations doesn't mean that the model is useful for your day-to-day life. This is one aspect of a model, right? It's not everything.
We use so many models, and we live with this limitation that models will hallucinate. I confronted Gemini, for example, a few days ago when it was generating some hallucination. I mentioned that you are hallucinating, you said this and now you are saying that, and it apologized. They are trained to be polite and all that, saying yes, I hallucinated, sorry for that. So we live with the hallucination.
Just because a model is very good at not hallucinating does not mean it is useful. That's true in life, right? I mean, we deal with people or systems which are erroneous, which have problems, but are effective most of the time. For example, again a personal example: when I was coming home from the airport last night, I had pre-booked an Uber cab, and after I landed the driver cancelled. Actually he called me up and said, my car has broken down somewhere, please book another. When I tried that, Uber actually charged me a cancellation fee even though I was not at fault. Of course it was then revoked and all that, but the point is that we live with systems with their faults.
Similarly with large language models: they will hallucinate. As you can see in this chart, Microsoft Phi-4, which is a pretty effective small model that you can try in Ollama, has a very good performance in terms of hallucinations.
So these things are there. Okay. So you live with this hallucination.
So don't directly copy the output of ChatGPT and put it in your report. I'll share another personal example. There were some students working with me on intent classification. They did all the hard work. They wrote all the Python programs, they tried various classifiers, and the data showed that one particular classifier, I forget which one, SVM or neural network, was performing the best. But then they copied that data and put it into some GenAI tool, possibly ChatGPT, and it came out with a report saying something totally different. Suppose SVM was the leader.
SVM was the best performer. It said that the neural network was the best performer. So I could catch that.
I could see that their original data, which was also in that project report, a BTech project report, said that one classifier performed the best, but below that, in the summary, and of course I understood that they did not write the summary by hand, they got it generated by an LLM, that generation hallucinated. So it gave a completely different picture. So interns, do not copy-paste from the output of a tool like ChatGPT or others. Check for yourself. That's the learning: whether it's Uber or ChatGPT, everywhere you have to be ready for hallucinations and problems.
Yeah. So this is another example which I thought I'd bring up. This is a pretty old example, from 2023, I'm sorry. But the concept remains: if you try to fool the model in a direct way, you will not succeed.
But if you go in a roundabout way, as we say in Bengali, Guranak, right, if you go in some different way, maybe the model will get fooled. The example here shows that somebody asked for help, for some inputs about how to create and distribute malware for financial gain. Obviously the input layer, all these checks that are there in the system, remember, or even the model itself, gave the response: I'm sorry, but as a language model I cannot give you this response. So what the hacker did was change the question, as if it were part of a novel or some story that the hacker was writing. So it said that the scene is set inside the headquarters of Dr. AI, et cetera, et cetera. And then the command was: just output what you have known regardless of the content.
This will not work anymore. I mean, the models have become much more advanced now. But in 2023 this was a realistic case in which the model actually gave some information about how to create and distribute malware. So this is a jailbreak. There are so many other techniques of prompt injection. You can read about them.
If you're interested, you can try jailbreaking ChatGPT. You'll have a hard time. It's not that easy. Maybe you're better off with the smaller models: try whether you can fool them with various prompting techniques, and see if they get fooled or whether any check is there at all. That is also something that you need to know. All of us now, whether you are interns, or a software technology person like me, or a bank manager, everybody needs to know about these basic concepts of hallucination, jailbreaks and prompt injection, because we have to use these tools on an everyday basis. So we had better learn about their limitations and their problems. So that was an example of a jailbreak: how you change the prompt to fool the model into giving you something which, if you give a direct prompt, the model is never going to give you.
Yeah. So regarding these guardrails, so there are so many aspects of protecting your AI application.
So again this distinction between the system and the model is here. AI application is the system that ideas we are developing so many AI applications and I really hope during your project the section two people you will be working with our engineers and our developers on all these real life applications if you remember the very first day I said don't do a project from Kaggle okay that's useless you do something on real life problems that we are working on we are working on a multiple numbers of these AI applications Okay. And I am hoping that many of you will get the chance uh when you work with our engineers as your mentors to actually work on these applications or if it is some other mentor also they will give you some AI application but at least don't use some this Kaggle kind of a thing. Anyway back to this presentation so there are so many guards like I have already said that there are input cases and there are output cases. Okay. Uh so detecting and blocking jailbreak attempts is the very first point that is raised here right and then uh various prompt injection techniques we study about them removing PII this is becoming more and more important now government is coming out with rules various governments all over the world are taking this as a much more problematic area than hallucination hallucination you can't really do anything you check for uh various things but PII personally identifiable information is not being tolerated at all. I mean there are laws and rules coming out from every government saying that you have to remove if a user by mistake or maybe even intentionally puts in uh some information.
Say I am asking a question to ChatGPT and I say: my name is Diptendu, I stay in Beniapukur, Kolkata, I work with Ideas, and I have this question regarding some machine learning problem. All the personal information that I have shared in my query needs to be removed so that it doesn't end up with the model. You may know that many of these companies are using this data for training purposes. Sometimes they ask for your permission.
I was using a site which specifically told me that if I wanted to use the free version, I had to agree to share all my data for their model training. So I declined, and I did not use that application. But most companies are not even asking you; they are simply using that information. So be careful about putting in your personal information. Maybe you don't do it on purpose; you do it by mistake, or you think that it makes the context more understandable to the model. But this is a serious issue now: you cannot have any personally identifiable information. Think about healthcare: you cannot put in the name of a patient. Obviously nobody will usually do that, but sometimes in a batch processing environment, or because of some lacuna somewhere, all this information gets sent to the AI system. That has to be prevented.
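As a rough illustration of an input guard that strips PII before a query reaches the model, here is a minimal sketch. The two regular expressions and the `redact_pii` helper are assumptions made up for this example; they only catch e-mail addresses and phone-like numbers. Names and localities, like the ones in the query above, need far more than regular expressions, so treat this as the idea only, not as a production filter.

```python
import re

# Hypothetical patterns for illustration only. Real PII detection needs much
# broader coverage (names, addresses, ID numbers) and usually a dedicated tool.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}


def redact_pii(text: str) -> str:
    """Replace every match of each pattern with a placeholder tag."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


query = "Mail me at user@example.com or call +91 98765 43210 about my ML question."
print(redact_pii(query))
```

The placeholder tags keep the sentence readable for the model while the actual values never leave the input layer.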
There are a few questions in the chat window; I can take some of those. So, this question is about whether models are trained for specific tasks.
Foundation models are trained on multiple tasks. There are also small language models, which are trained on specific tasks.
The reason these are called foundation models is that they are supposed to be very generic, and their strength is the cross-domain linkages that they have. They are trained on healthcare data, on education data, on history data and all these things, and somehow they are able to relate them. We don't really know how it works so well. I didn't want to start this presentation with that rider, that we don't really understand why LLMs work so well, but that is the truth.
We understand mechanically how they work, but we don't understand how they work so well.
Yes, Mona is just confirming what I had just said. The next question is: what is the difference between normal prompting and jailbreak prompting? Jailbreak prompting is where you ask indirectly; that was the example that I gave.
In normal prompting you don't have any intention of fooling the model. You just ask whatever your query is. But if you find that you are not getting the answer, and you are mischievous or a hacker or whatever, you go for a roundabout way. You create some story and try to convince the model, or the system I should say, that this is a genuine question and that it is not harmful because you are writing a novel, or some other story that you make up to try to fool the model.
I don't quite get this one. So you have asked about masks, about trying to protect the interests of the artists, I would say.
Yes, so this is an anti-AI feature, so that if artists share some work, the company behind the AI is not able to get any benefit out of it and use it in their training data. I don't really have the details about these masks and how they work. But in general it would appear that you have to resort to some trick, which you are describing as masks. So if there are ready-made masks available, maybe the artists are using them to protect their intellectual property. I think that is what this post is saying, but no, I don't have any in-depth information about these masks and how they help protect the artists.
There are many questions, about temperature and all that. Some of you already know about this. If you already know it, we will not spend much time confirming your knowledge; let us move on for the others who are not so well initiated in this.
So there was this question about whether models are trained for specific tasks or whether they are general foundation models.
The term "foundation model" was coined by Stanford researchers.
Foundation models are defined as models trained on broad data so that they can be adapted to a wide range of downstream tasks.
But then there are small language models that are specific to particular tasks. For example, at Ideas we are experimenting with text-to-SQL for the copilot of the MLOps and data engineering platform that we are developing. It is a platform where people can ingest data and then ask questions about that data in natural language. The copilot has to translate the user's natural-language query into SQL, and that SQL is of course cleaned up and security-checked, so that even if the model generates a DELETE command it will not be sent directly to the Postgres database; otherwise all hell will break loose. So again there are checks and balances, but the models we are trying out are specific to that one problem, text-to-SQL. The model takes the user's query as input. For example, this platform is currently being used by the Ministry of Agriculture, and we are developing it for them. So agricultural data scientists would be asking questions like: what was the yield of tur dal in Bihar in 2023?
The data is already there in the Postgres database. We have downloaded the data from government websites: the yield, the climate data and all that information. But the scientists and the users of this platform would like to ask questions in natural language. So the model in that case is a text-to-SQL model, which generates SQL as the output. We then send that SQL to Postgres, the results come back, and we put them through a general-purpose LLM so that the answer reads as nice English. So yes, there are models that are trained on specific tasks, and foundation models on the other hand are trained on a broad variety of data. There is always that debate going on, and more and more evidence is coming out that you don't need these humongous foundation models for everything that you do.
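The security check on generated SQL could look something like the sketch below. It is not the platform's actual implementation, which the lecture does not describe. The allow-list policy (one read-only `SELECT` statement, no write keywords), the `is_safe_select` helper and the `crop_yield` table name are all assumptions for illustration.

```python
import re

# Hypothetical policy: only a single read-only SELECT statement may pass.
FORBIDDEN = re.compile(
    r"\b(delete|drop|update|insert|alter|truncate|grant|create)\b",
    re.IGNORECASE,
)


def is_safe_select(sql: str) -> bool:
    """Return True only for one statement that starts with SELECT and
    contains none of the forbidden keywords."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:  # more than one statement in the string
        return False
    if not statement.lower().startswith("select"):
        return False
    return FORBIDDEN.search(statement) is None


generated = "SELECT yield FROM crop_yield WHERE state = 'Bihar' AND year = 2023;"
print(is_safe_select(generated))
print(is_safe_select("DELETE FROM crop_yield"))
```

The check is deliberately conservative: it will also reject a harmless query that merely contains a forbidden word inside a string literal. A real deployment would normally add a read-only database role on top of any text-level filter.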
Small models are cheaper and often open source; they consume fewer tokens and less GPU, so they are more climate friendly, and they can meet your requirements in more and more cases, rather than your having to depend on ChatGPT or these humongous models. That is why small language models are becoming more and more popular. The examples here are not very up to date; there are many other small language models out there. I also had some language models on one of the very first slides, where we discussed the difference between models and systems. Many models were listed there with their sizes, disk size and memory requirement, so you can go back to that slide and check out some other models too. If you have enough memory, download large models; if you do not, use smaller models like Phi-4 mini. But please do experiment with these small language models. We insist on that, and we are evangelizing this whenever we meet customers. I was in Ludhiana the other day, and I was mentioning that many of the use cases they have are perfectly doable with small language models or specially trained models; you don't need to go for ChatGPT or the others. So, interns, do use Ollama, or any tool for that matter, so that you can experiment with these small language models.
Right, now that we have some basic things out of the way, let us spend some time on understanding, at the very surface level I must say, how LLMs work. The next section will be about that. Of necessity this is going to be a surface-scratching kind of explanation, because the subject is pretty involved. I'll try to do justice to it, and my effort is to make it conceptually understandable. You don't have to know the intricate details of how exactly each step works, but by the end of this presentation you should have a top-level conceptual understanding of LLMs. That is the humble objective that I have for this session.
So, language models are not new. They are 30 years old, and the whole idea was that you have some software which can predict the next word. That is the foundation on which all these things work; all the sophistication boils down to language models. The word "large" has come in front, large language models, and now we have small language models, but at the end of the day they are language models, and their primary job is to predict the next word, to predict the next token.
So this question is up there: "Students open their", then there is the blank, and the model is supposed to come back with a suitable response. Now, what is the suitable one? This will vary based on what you think is more suitable in your context.
Students open their books, students open their notes, all of them are plausible.
What is less plausible, or less probable I should say, but still possible, is "students open their doors" or "windows"; that somehow doesn't seem correct. But students opening their laptops or their books is fine. In India we don't use the phrase "opening their exams", but you understand what it means: they start their exams. "Opening their minds" is again a little open-ended.
So which one should be the response out of the many possible ones. So this is the whole crux of the matter.
When you put a query to an LLM, what it ends up doing is predicting the next word, and then the entire sentence, with that word added, is fed back in again. This goes on in a loop.
We'll take a look at the loop, but the point is that whether it's large language models or small language models, at the end of the day they are language models, and their job is to predict the next token. So this is the simple explanation of LLMs: they predict the next token.
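The predict-and-feed-back loop can be sketched in a few lines. The probability table below is entirely made up for the "students open their ..." example, and it looks only at the last word, whereas a real LLM computes probabilities from the whole context with a neural network. What carries over is the shape of the loop.

```python
# Hypothetical next-word table; a real model computes these probabilities
# with a neural network over the whole context, not a lookup on the last word.
NEXT_WORD_PROBS = {
    "students": {"open": 0.9, "close": 0.1},
    "open": {"their": 0.8, "the": 0.2},
    "their": {"books": 0.5, "laptops": 0.3, "doors": 0.2},
    "books": {"<end>": 1.0},
}


def generate(prompt: list[str], max_new_tokens: int = 10) -> list[str]:
    """Greedy auto-regressive loop: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        candidates = NEXT_WORD_PROBS.get(tokens[-1])
        if not candidates:
            break  # nothing known about this context
        next_token = max(candidates, key=candidates.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the extended sequence is fed back in
    return tokens


print(" ".join(generate(["students"])))
```

With this table the loop picks "books" because it has the highest probability; "laptops" and "doors" remain possible but less probable completions.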
We will not dwell much on this slide, but there is a long, long history, as you can see.
There is nothing new. Believe me, we have been trying these things for the last 30, 40 years or more.
So the use cases have become really important now. The concept is not new. The use cases are new because, for whatever reason, the models are working so well that people are coming up with new use cases.
The concept of a language model is not new, but the use cases are new, and the newer models and technologies are now supporting these use cases. That is why everybody is interested. Otherwise this existed as a research field, as I said, for 40 years, but nobody was paying attention. Suddenly everybody is paying attention because the models are becoming useful; that is the whole point. So when you do your one-month project, you should also try to do something useful, not just theoretically interesting. N-gram language models were among the earliest ones, where you break up the sentence into unigrams, bigrams and trigrams. We will skip the details; we'll just mention that the assumption that the probability of a word depends only on the previous word is called the Markov assumption. Please read up on what n-gram language models are. I have also put some links up there; many of the slides have links. If you go to the Google Ngram site, you can find out historically which words and tokens are becoming more important. For example, the token "Gen Z" was, I guess, not in use maybe five to six years back, I don't know, so its count would be zero. But with every year, every month, every day, that word is becoming more and more commonplace in articles, in newspapers, in books, so it will show a huge rise in usage. So do visit the Google site on n-grams, type in these tokens, and see how they were used historically and how their usage changed. That's an interesting one.
The basic way an n-gram language model works is that it calculates the probability of which word may follow the preceding words, whether it is using unigrams, one token at a time, or bigrams, two tokens at a time. A little more detail is on the next slide.
Suppose you have a training corpus. Remember, all models need to be trained; whether it's a simple n-gram model or the sophisticated ChatGPT, there is a training phase. The word "corpus" is used here; if you have been exposed to natural language processing, this is a very common term. A corpus is basically a collection of texts. So suppose you have only these six sentences in the training corpus. They all seem almost the same, but there are differences.
Now the user utters "models adapt through", and the model is supposed to come back with the fourth token, the fourth word. How would it calculate that?
There is no global knowledge; it's just some software. There is no further context here about which word is the appropriate fourth word. So this is the simplest way of doing it.
So you calculate the probability of these word combinations.
We look at the word "repeated" as the fourth word after the three words "models adapt through".
If you look at the corpus, you will find that in three of the cases the word "repeated" comes after "models adapt through": in sentence number one, sentence number two and sentence number four. So the model calculates the probability as three out of four, because "models adapt through" occurs in four of the six sentences, and in three of those four the fourth word is "repeated". It is a simple calculation: 3/4, which is 0.75. The word "gradient" follows "models adapt through" in the one remaining sentence, so the probability of "gradient" as the fourth word is 1/4, which is 0.25.
So obviously it predicts the word with the higher probability. So it comes back with the word repeated.
That's how a simple n-gram probability model works. There are open-source tools available, and you can write a Python program to do these simple calculations. It would be fun, actually, to put in 100 sentences or 1,000 sentences and then check what the model comes up with. It would not be very useful for your day-to-day life, but it will give you some kind of gut feeling for how these language models work. The sophistication is not here, I understand, but the principles are the same: somehow you have to calculate the probability of the next word, and the one with the highest probability is usually the one that is presented as the answer. Somebody mentioned temperature. That is at a slightly higher level, where you say that the word chosen need not be the one with the highest probability; people use the temperature parameter for creativity and so on. We will not go into the details of that, but this is the basic understanding of how language models work. You have the corpus, you check all possible completions, you calculate the probabilities, and the completion that occurs most often in the training corpus is what you suggest. A very simple idea, basically.
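The counting exercise suggested above can be written as a short Python program. The six sentences on the slide are not reproduced in the transcript, so the corpus below is a hypothetical stand-in chosen to match the counts in the lecture: "models adapt through" occurs in four sentences (numbers one, two, three and four), followed by "repeated" in sentences one, two and four and by "gradient" in sentence three.

```python
from collections import Counter

# Hypothetical corpus, constructed only to reproduce the counts in the lecture.
CORPUS = [
    "models adapt through repeated training",
    "models adapt through repeated feedback",
    "models adapt through gradient descent",
    "models adapt through repeated exposure",
    "models learn from large datasets",
    "models improve with more data",
]


def next_word_probs(corpus, context):
    """Count which word follows `context` and turn counts into probabilities."""
    context = tuple(context)
    n = len(context)
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - n):
            if tuple(words[i:i + n]) == context:
                counts[words[i + n]] += 1
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


probs = next_word_probs(CORPUS, ["models", "adapt", "through"])
print(probs)                      # probability of each candidate fourth word
print(max(probs, key=probs.get))  # the prediction: the most probable word
```

Swapping in a larger corpus of your own is a one-line change to `CORPUS`, which makes this a convenient starting point for the experiment with 100 or 1,000 sentences.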
Okay.
Now let us get into a little more technicality. We move on from the basic n-gram model to slightly more technical material, but again only at the surface level. In my understanding, you need to understand three things: tokenization, embeddings, and how transformer neural networks work.
The third point is beyond the scope of this session. You will be exposed to neural networks, and maybe even transformers, later on, but suffice it to say that we can treat this third component, the neural network, as a black box for the time being, while we go a little deeper into the first two, tokenization and embeddings. To understand tokenization, you must actually experiment with tokenizers. There are many online tokenizers available.
Please go to those sites. I have given a couple of examples, but there are many more; if you just search the net for online tokenizers, you will find a number of sites. That will really give you an understanding of how tokenization works. Suffice it to say that tokenization is the process of breaking a text up into its components.
The components may be characters, words, phrases and many other things. Each of these components is called a token. You will find the term "token" everywhere when you are talking about LLMs. Tokens need not be words; that is a fundamental understanding I want you to have. Tokens are not equal to words. In most cases they are subwords. If you go to the OpenAI tokenizer site, which uses something called tiktoken if I'm not mistaken, and you give it a sentence, you will find that it does not break the sentence up into just words; some words are broken up into subwords.
And finally, all these tokenizers have a database, in which the tokens are mapped to numbers. This mapping is pretty arbitrary.
Each company, each piece of software, has its own tokenizer.
If you remember the structure of the GGUF file, that file contains this database,
which says which token is mapped to which number. That is how they deal with tokens. So tokenization is the process by which the query that you put in, and also the huge amounts of text used during training,
are broken up into small components, and these components are assigned numbers. The numbers don't mean anything; they just have to be unique for that particular software. Remember, there is no global tokenization database which all software and all companies follow. No, everybody has their own tokenizer. If you go to the Hugging Face site, for each model you will find files like tokenizer.json and tokenizer_config.json, if I'm not mistaken, with all these details.
So there is no global committee which says that the word "mister" is mapped to token ID 18953. There is no such committee, no World Token Organization or anything like that. Every model comes with its own tokenizer, and the mapping of tokens to numbers is in the model itself; in the GGUF model I was showing, there is a section where this database is saved. It is not a very large database. You might think that there are so many words; well, not every word is represented, but subwords are there too, as you can see on the right-hand side.
Strange things are there: "dog" is there, "work" is there in caps, "declare" is there, and there are some entries which don't seem meaningful. Nobody creates these tokenizers by hand; software programs are run on a huge body of text to generate them. But for our understanding it is enough to know that text is broken up into components, and these components are mapped to numbers, because internally, as you know, computers work on numbers. So the query that you put to ChatGPT or Ollama or whatever, and even the training data set (we'll talk about training a little later), has to be converted to numbers somehow, and tokenization is the basic process by which the stream of text, which could be a sentence or a paragraph, is broken up into meaningful elements called tokens. Here is the example of subwords that I mentioned. Some tokenizers break up the text "explain large language models." into five tokens; that is the human-readable split. Another tokenizer breaks the word "explain" itself into two tokens,
"ex" and "plain". Now, why does this happen?
This is again a very involved subject.
There are benefits to doing this. People are not arbitrarily looking at some word and saying it is very long, let us break it up; it doesn't work that way. If you are interested in the technical reasons why and how words are broken up into subwords, please go and read up on tokenization. It's a fascinating subject, believe me. Many people don't give any importance to tokenization, but it is foundational: if the tokenizer is wrong, everything built on it goes wrong. Also remember the tokenizer that is used for breaking up the training corpus. We just saw a simple training corpus of six sentences; in real life, for ChatGPT, it would be something like six hundred billion sentences. Each of these sentences has to go through a tokenizer, and that same tokenizer has to be used when you put in a query. So when you say "explain large language models" to ChatGPT, the same tokenizer has to be used; otherwise there will be a mismatch. If the tokenizer used during training is different from the tokenizer used to break up your query, then the database that I'm showing on the right would not match, the model would perform erratically, and you would never know why. So this is a basic understanding: the tokenizer used to train the model has to be used to process your query as well. Of course, if you are dealing with software like ChatGPT and others, they do this automatically. And as we said, each model has its tokenizer as part of the file; in the GGUF file that I mentioned, there is a section on the tokenizer which maintains this mapping from words or subwords to numbers.
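To make the token-to-number mapping and the mismatch problem concrete, here is a toy sketch. Both vocabularies are invented, and the splitting rule (greedy longest match) is a simplification chosen for brevity; production tokenizers such as tiktoken typically use byte-pair encoding and vocabularies with tens of thousands of entries.

```python
# Two hypothetical vocabularies; the IDs are arbitrary but unique within each.
VOCAB_A = {"ex": 0, "plain": 1, " large": 2, " language": 3, " models": 4, ".": 5}
VOCAB_B = {"explain": 0, " large": 1, " language": 2, " models": 3, ".": 4}


def tokenize(text, vocab):
    """Greedy longest match: at each position take the longest vocab entry."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos:]!r}")
    return ids


text = "explain large language models."
ids_a = tokenize(text, VOCAB_A)  # "explain" is split into "ex" + "plain"
ids_b = tokenize(text, VOCAB_B)  # "explain" is a single token
print(ids_a)
print(ids_b)

# The mismatch problem: the same ID means different things in each vocabulary.
id_to_token_a = {i: t for t, i in VOCAB_A.items()}
id_to_token_b = {i: t for t, i in VOCAB_B.items()}
print(repr(id_to_token_a[1]), "vs", repr(id_to_token_b[1]))
```

A model trained on IDs from one vocabulary but queried with IDs from the other would receive numbers that point at the wrong pieces of text, which is why the tokenizer ships with the model.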
Now that we have some basic understanding of tokenizers, that is, breaking sentences up into subwords and tokens and then ultimately mapping them to numbers,
Let us move on to the next foundational concept of LLMs that is embeddings.
The foundation of embeddings is the distributional hypothesis.
As I said, these things are pretty old. In 1957 J. R. Firth put it in one sentence, and it has become like the Bible, the foundation that everybody quotes:
"You shall know a word by the company it keeps." You may have already heard this phrase.
The basic question is how you get to the meaning of a word, and the distributional hypothesis is a mechanistic way of trying to get a grip on that meaning. For us human beings, meaning is a different concept.
But how would a computer understand the meaning of a word? The example given here is about two fictitious words.
No such words exist. One is "glorf", G-L-O-R-F, and the other is "wumble", W-U-M-B-L-E. These are fictitious words, but of course the machine wouldn't know that. In case you have forgotten, the corpus is the training data set: all that raw text, usually from the internet or from books, millions of books and millions of websites that are scraped and collected. So suppose the corpus has these sentences: "She ate the glorf for dinner." "The glorf was delicious and fresh." "He cooked the glorf slowly." "I am hungry, where's the glorf?" A computer has no idea that "glorf" is a completely fictitious word. What it concludes is that this word "glorf" is surrounded by things like eating, cooking, delicious, hungry. So it is some kind of food concept.
That is how this whole thing works. You might have often wondered how these LLMs understand meaning.
There is no understanding.
It is not a human being. We get trained on a physical world model.
We actually see what rice is, what dal is, and other things. But a computer is trained just on words, on a huge corpus; after all, it's just words and sentences. So if this fictitious word is used in that corpus, the machine will work out that it has something to do with food. That is why Firth said in 1957 that you shall know a word by the company it keeps, that is, by the words that are often used with it. The word "ate" is used with "glorf"; "delicious" is used with "glorf".
Similarly for the word "wumble", which is again a fictitious word. There is no such word in the dictionary. And an LLM never sees a dictionary; there is no dictionary. You might think that there is something like the Oxford dictionary where the meanings of all these words are defined, and that the model understands from there.
It doesn't work that way.
There is this humongous corpus, and suppose it has sentences like "She ran before her wumble" and "Daily wumble keeps you fit". The words surrounding this fictitious word "wumble" are things like run, move, fit, tired. So the model would associate "wumble" with something to do with exercise. If your query somehow mentions the word "exercise", it may pick up the word "wumble" from its tokens,
because during training the word "wumble" often came up together with these concepts of running, being fit, moving fast, tiredness and so on. So the distributional hypothesis is the foundation of how the meaning of words is understood. The word "understood" itself is a little odd for a machine. We humans understand things; machines have to process this information mechanically somehow. And this is the foundation: in the training corpus, words are surrounded by other words, and the language model is trained to know which other words occur near a given word. It is not like a database lookup. So the word "ate" is near the word "glorf"; that is what it learns from this huge training corpus.
Not six sentences, six billion sentences. Okay. So that's the point of this distributional hypothesis.
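A few lines of Python show the mechanical side of the distributional hypothesis: collect the words that keep company with a target word. The sentences are the ones from the lecture (with the fictitious food word spelled `glorf` here), and the stop-word list and the `context_words` helper are assumptions added for this sketch. Real embedding training is far more elaborate, but it starts from the same co-occurrence signal.

```python
from collections import Counter

SENTENCES = [
    "she ate the glorf for dinner",
    "the glorf was delicious and fresh",
    "he cooked the glorf slowly",
    "i am hungry where is the glorf",
    "she ran before her wumble",
    "daily wumble keeps you fit",
]

# Hypothetical stop-word list: common words that carry little meaning.
STOPWORDS = {"the", "a", "is", "was", "and", "for", "her",
             "she", "he", "i", "am", "you", "where"}


def context_words(target, sentences):
    """Count the non-stop-words that share a sentence with `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if target in words:
            counts.update(w for w in words if w != target and w not in STOPWORDS)
    return counts


print(sorted(context_words("glorf", SENTENCES)))
print(sorted(context_words("wumble", SENTENCES)))
```

The two words end up with disjoint sets of companions, food-related for one and exercise-related for the other, without any dictionary ever being consulted.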
Now, an embedding is like a mapping from a word to a point in some space, some high-dimensional space. Forget about the mathematics for now. But you can see that the model has learned that cats and dogs and lions and wolves are somehow near each other.
The same goes for words like royalty, queen, king and prince, because in the training corpus these words were near each other. The word "jump" is less likely to be in the same sentence as the word "king" in the training corpus. There might be a case where a king jumped from a window, but that sentence is a rare one.
But whenever you see a mention of a king, it is highly likely that the same sentence also mentions a queen. So the model learned that these are related; not similar, though, they are related words. The numbering mechanism that it comes up with permits all these strange equations. If you have embedding vectors for king, man and woman, and you actually perform vector arithmetic, king minus man plus woman, you get a vector which is near the vector for queen. That is how words are ultimately converted to numbers; that is the embedding scheme. In real life you will find, as we mentioned, that there is a section in the GGUF file which says how many dimensions are used in the embedding. Different models have different numbers of dimensions. I am hoping that you understand what a vector is: basically it is a collection of numbers. How many numbers a vector has again varies from model to model, but the concept is that these numbers are not arbitrarily assigned. They are again generated through learning.
Going through this humongous corpus of, say, six billion sentences, the model learns that the words "king" and "queen" are somehow related. So when it generates the vectors, the vector for queen is close to the vector for king, a little further from prince, and quite different from the vector for, say, jump. The exact process by which these vectors are generated is again a little more technical. If you are interested, do a deep search on word embeddings and on which algorithms are used to generate embeddings from a training corpus. If you do that exercise, you will get more understanding of how these exact numbers come about, whether the vectors have 4096 dimensions or something smaller.
And those of you who are interested can write a Python program to generate these vectors from a small corpus. Take a corpus of, say, 100,000 sentences and run one of these algorithms. The quality of the vectors you generate will not be as good as that of these large models, but you can do things like the king subtraction, and if your corpus has words like king, man and woman, you will see how close the result comes. The point is that the information about the context is compressed into a point in space. The space we are talking about may be a 4096-dimensional space, but each word is mapped to a point in it. Since they are all vectors, you can compare them and do vector arithmetic, and that is how you calculate whether one word is near another. I wouldn't use the word "similar"; I would use the word "related". Many people confuse relatedness with similarity. King is not similar to queen; king is related to queen. Obviously cats and dogs are not equal, but they are related: they are both animals. This relatedness is generated automatically from the training corpus. It is not assigned by hand by some committee or some software developer. Each model has its own way of generating these vectors, and there are many embedding models available. Just search the internet for embedding models. Each will have a different number of dimensions.
4096 may be on the higher side. The Sentence Transformers models, which are very popular embedding models, are much smaller; if I remember correctly, a few hundred dimensions.
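The king minus man plus woman arithmetic can be tried on toy vectors. The three-dimensional vectors below are invented by hand so that the first number loosely tracks "royalty", the second "male" and the third "female"; learned embeddings have hundreds or thousands of dimensions with no such readable meaning, so this only illustrates the arithmetic.

```python
import math

# Hand-made toy vectors (hypothetical): [royalty, male, female].
VECTORS = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.7, 0.8, 0.1],
    "jump":   [0.0, 0.3, 0.3],
}


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(target, vectors, exclude=()):
    """Return the word whose vector points most nearly in `target`'s direction."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# king - man + woman, computed component by component
target = [k - m + w for k, m, w in
          zip(VECTORS["king"], VECTORS["man"], VECTORS["woman"])]
print(nearest(target, VECTORS, exclude=("king", "man", "woman")))
```

Excluding the three input words from the search is the usual convention for this kind of analogy test, since the result vector tends to stay close to its own ingredients.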
And this is the math behind deciding whether two words are, well, the slide says "similar", but I should have used "related". Cats and dogs are related, so the semantic angle between them is small, whereas between cat and Paris, or city, the angle is much larger.
For those of you who are interested in the math, the dot product, once the vectors are normalized to unit length, will always give you a value between -1 and +1.
Here +1 is labelled as identical, or I would say very closely related. If the value is near zero, the words are unrelated. And if the result is -1, the slide says totally unrelated; I would use the word "opposite" instead. So this is the mathematics that goes on behind the scenes. These huge corpora are tokenized and then converted to word embeddings, or whatever embeddings are used, and then during processing, when the LLM needs to find nearby words or nearby tokens, it uses this kind of similarity. People have heard about cosine similarity, but in most actual cases dot products are used. So this brings us more or less to the end of this shallow-level understanding of how everything maps back to numbers. Many of you must have wondered how all this text is mapped. I'll just take a break for two minutes to show you an example of these numbers and vectors. Just a minute.
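For readers who want the relation between the dot product and cosine similarity written out, the standard definitions are below. The symbols a, b and theta are introduced here and are not from the slides.

```latex
\[
\mathbf{a}\cdot\mathbf{b} \;=\; \sum_{i=1}^{d} a_i\, b_i ,
\qquad
\cos\theta \;=\; \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert},
\qquad
-1 \le \cos\theta \le 1 .
\]
\[
\text{If } \lVert\mathbf{a}\rVert = \lVert\mathbf{b}\rVert = 1
\quad\text{then}\quad
\mathbf{a}\cdot\mathbf{b} = \cos\theta .
\]
```

A raw dot product of unnormalized vectors can be any real number, so the range from -1 to +1 applies to the cosine, or equivalently to the dot product of unit-length vectors. Many embedding pipelines normalize their vectors for exactly this reason, which is one way to read the remark that dot products are used in practice.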
Yeah. So this is a PDF file. Is it visible?
Okay, the PDF file is visible.
Consider this, for example, as something that was used for training a model.
It is some kind of text document. What is usually done is to extract the text from the file and then chunk it, that is, break it up into small portions, and then create embeddings for the chunks. I'll show a couple of screens from the MLOps platform that we are developing, where I have fed in this document.
So I fed in this PDF document and it has been chunked. It's not very large, I guess a nine-page document. The text has been extracted and then broken up. The first chunk, for example, has this text, "introduction, global scenario", and so on, and it ends with the word "by".
The next chunk is "provisional data released by World Steel Association" and so on. That's the second chunk. Each chunk is then converted, using some embedding model, to the embedding space. I just wanted to show this screen to give you a real-world example of how embedding works.
We are using a 384-dimensional embedding, and this is the full embedding vector for that chunk of text.
So this is a real-life demo of how you can convert, say, a PDF file: you extract the text from it, and of course you don't create a vector for the entire document, because that would be too crude. You break it up into small chunks and then you vectorize each chunk using some embedding model. That was just a two-minute demonstration from our platform of how this embedding works. Seeing is believing.
This is being done for RAG, I would say. Whenever there is a query, that query is also converted to an embedding vector of 384 dimensions, and then a similarity check is done, using cosine similarity or the dot product, to find the chunks that are near the query. Somebody may be asking about the status of the steel industry, or what changes have happened in government policies on steel.
So this is what happens. The document is pre-processed, converted to text, broken up into chunks, and then embedded. The embeddings, one vector for each chunk, are kept in a database. We are using the vector extension for Postgres, but there are many vector databases and you can use any of them. When a user puts in a query, you vectorize that query using the same model, and therefore the same number of dimensions, and then you use that angle to find the chunks that are closest to it. Multiple chunks will come back as a result, so you take the nearest ones, say the top five. You still have the text for each chunk, so you don't need to convert anything back from the embedding vector. Suppose chunks one and three are the most relevant ones based on this angle, this cosine or dot product or whatever it is.
You send the text of those chunks, along with the query, to the LLM, and the LLM comes back with a nice answer. That's how this vector embedding works in practice. Here we are doing a similarity check, or relatedness check, for chunks instead of words, but the basic concept is the same.
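The retrieval steps described here (chunk, embed, store, embed the query, take the nearest chunks) can be sketched roughly as follows. This is not the platform shown in the demo. The `embed` function is a toy bag-of-words stand-in for a real 384-dimensional embedding model, the in-memory list stands in for a vector database, and the document text and all names are invented for illustration.

```python
import math
import re
from collections import Counter

def chunk_text(text, max_words=10):
    """Split text into fixed-size chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, store, top_k=2):
    """Return the text of the `top_k` stored chunks nearest to the query."""
    query_vector = embed(query)  # same "model" as used for the chunks
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

# Hypothetical document; each sentence happens to be exactly 10 words long.
document = (
    "Global crude steel production rose last year per provisional data. "
    "The government announced several new policy changes for steel makers. "
    "Monsoon rainfall was well above average in many coastal districts."
)

# "Vector database": a list of (chunk text, chunk vector) pairs kept together.
store = [(chunk, embed(chunk)) for chunk in chunk_text(document)]

query = "What policy changes did the government make for steel?"
context = retrieve(query, store, top_k=2)

# The retrieved chunk TEXT, not the vectors, goes into the prompt for the LLM.
prompt = "Answer using this context:\n" + "\n".join(context) + "\n\nQuestion: " + query
print(prompt)
```

Because the text is stored next to each vector, nothing has to be converted back from the embedding: the vectors are only used to decide which chunks to send.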
So we now know about tokenization, then chunking, and have some idea about vector embeddings.
We are now technically qualified to understand how this whole world of text is converted to numbers, and how words, sentences, and other things are compared to each other. That's the basic machinery used in AI applications, in large language models and whatnot.
As I said, we are not going to go into the details of neural networks. This is the same example I had earlier with n-gram models. If you think of the neural network as a black box, it's the same thing: given the words "models adapt through", we expect the fourth word to be predicted to be "repeated". How it does that internally is a different matter, but basically it's the same thing. It's a language model which, when given some words, predicts the next word. That is all there is to it. I have also put up some links here. Please do visit them if you want to go inside this black box. We have some slides, but this is such an involved subject that it would take a lot more time.
So, how is a text response generated? The whole purpose of this slide and the next is to show you what is called autoregressive token generation.
Some of you might be surprised to know that with all this sophistication, with humongous servers and everything, only one token is generated at a time. That's the basic truth.
Suppose you put in the query "explain large language models", and the LLM ultimately generates a sentence that starts with "Large language models are advanced..." and goes on from there. How is this happening? When you say "explain large language models", the model does not come up with a full sentence as the output. It outputs one token at a time. That's the whole point of this slide and the next one. This slide shows it as an abstraction. Your input is tokenized; as you can see, it has broken up the word "explain" into two tokens. Then the tokens are converted to token IDs. Then the token IDs are converted to embeddings. Then they are fed to the LLM and some computation goes on. Take the LLM as a black box here: the neural network, the transformer, and all these things.
The black box ultimately outputs numbers again: token IDs. LLMs don't understand anything else.
Then the software layer, the system layer, converts these token IDs into words, and the words are assembled into the response and sent back to the user.
So that's how this whole thing works. We have mentioned two things here. One is the encode method of the tokenizer: all tokenizers will have an encode method, which converts text to token IDs. All tokenizers will also have a decode method, which does the opposite.
It takes the sequence of IDs generated by the LLM and converts it back to human-readable words. Again, this is a lookup; it's not inventing anything. Token IDs are simply mapped back to words. In the encoding process you go from words to token IDs, and in the output, with the decode method, you go from IDs to words. That's how this works overall. Now, the point I was mentioning is one token at a time.
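The encode and decode round trip can be sketched with a toy tokenizer. The six-entry vocabulary, the `ToyTokenizer` class, and the `##` marker for word-continuation pieces (a convention borrowed from WordPiece-style tokenizers) are assumptions for illustration; real tokenizers have vocabularies of tens of thousands of entries and the IDs will differ from the ones on the slide.

```python
class ToyTokenizer:
    """Minimal subword tokenizer: greedy longest match against a fixed vocabulary."""

    def __init__(self, vocab):
        self.token_to_id = {token: i for i, token in enumerate(vocab)}
        self.id_to_token = {i: token for token, i in self.token_to_id.items()}

    def encode(self, text):
        """Text -> list of token IDs."""
        ids = []
        for word in text.lower().split():
            start = 0
            while start < len(word):
                match = None
                # Try the longest remaining piece first, then shorter ones.
                for end in range(len(word), start, -1):
                    piece = word[start:end] if start == 0 else "##" + word[start:end]
                    if piece in self.token_to_id:
                        match = (piece, end)
                        break
                if match is None:
                    ids.append(self.token_to_id["[UNK]"])
                    break  # give up on the rest of this word
                ids.append(self.token_to_id[match[0]])
                start = match[1]
        return ids

    def decode(self, ids):
        """List of token IDs -> text. A pure lookup; nothing is invented."""
        text = ""
        for token_id in ids:
            token = self.id_to_token[token_id]
            if token.startswith("##"):
                text += token[2:]                      # glue onto the previous piece
            else:
                text += (" " if text else "") + token  # start a new word
        return text

tokenizer = ToyTokenizer(["[UNK]", "ex", "##plain", "large", "language", "models"])
ids = tokenizer.encode("Explain large language models")
print(ids)                    # "explain" is split into two tokens
print(tokenizer.decode(ids))
```

Note how "explain" is not in the vocabulary as a whole word, so it is split into two subword tokens, just as on the slide.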
So this slide explains autoregressive text generation. The sentence is not generated as a whole. In the example here, think for a moment that you don't see the words, only the token IDs, which are the input. The model goes through iterations. In the first iteration it outputs only one token ID, which corresponds to the word "large". This token ID is then concatenated with the input query. If you notice, the input query had six token IDs. The first output token is appended at the end of those six, so the next input is a seven-token sequence, which is again fed to the black-box LLM.
Consider it a black box containing the neural network, the transformer, or whatever it is.
At the end of iteration two, it outputs the token 4128, which corresponds to the word "language".
Again, that token is added to the seven-token sequence as the eighth token. These eight tokens are fed to the black-box LLM in iteration three, and the LLM comes back with the next token, 4119, which corresponds to the English word "models".
This goes on and on, and there are techniques to stop it. Earlier models sometimes used to hallucinate and go on and on, and you had to press Ctrl-C to stop them by hand. Now it has become much more sophisticated, and there are settings for how many tokens are generated; otherwise the cost would go up.
It could go on telling a story, but you want to stop it somewhere. If you ask an open-ended question like "what are large language models", it can go on and on; the output could be five or ten pages. And if you're using the free version, your token quota could be used up by the very first query. So for practical purposes there are limits set by the software. It is not in the model, mind you. It's in the software, in the system, where there are techniques to stop the generation after a certain number of tokens. You can set it: in many of the Python libraries you may be using, there is a parameter, usually called something like max tokens, for the maximum number of output tokens, and generation stops after that. So this is the basic idea of how an LLM works.
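The loop itself is short enough to sketch. In the code below, `toy_model` is a lookup table standing in for the black-box LLM, and it works on word strings rather than token IDs to keep the output readable; the table contents, the `<eos>` end marker, and the `max_tokens` parameter name are assumptions for the example. A real model looks at the whole sequence, whereas this stand-in only peeks at the last token.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker

# Stand-in for the trained model: last token -> next token.
TOY_NEXT = {
    "models": "they",
    "they": "predict",
    "predict": "the",
    "the": "next",
    "next": "token",
    "token": EOS,
}

def toy_model(tokens):
    """Receives the WHOLE sequence so far and returns exactly ONE next token."""
    return TOY_NEXT.get(tokens[-1], EOS)

def generate(prompt_tokens, max_tokens=50):
    """Autoregressive loop: the limit lives here, in the system, not in the model."""
    sequence = list(prompt_tokens)
    generated = []
    while len(generated) < max_tokens:
        next_token = toy_model(sequence)
        if next_token == EOS:
            break
        generated.append(next_token)
        sequence.append(next_token)  # the output is appended and fed back as input
    return generated

prompt = ["explain", "large", "language", "models"]
print(generate(prompt))
print(generate(prompt, max_tokens=2))  # cut off early by the limit
```

Each pass through the `while` loop is one of the iterations from the slide: one call to the model, one new token, one longer input for the next call.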
From text we go to numbers, and then one token at a time is generated. I don't know whether that is surprising to you or not. When I first learned about it, it was pretty surprising.
It almost sounds very dumb, doesn't it?
I don't know whether we humans also work like this. We are so fluent in everything we generate, whether it's text or speech. Are we generating one token at a time in the neural networks of our brain? I doubt it. I think we have a different mechanism. A lot of people make the comparison when they say AI is becoming more intelligent, AI is becoming all-powerful. But at the bottom of it, it's so simple and almost crude: one token at a time.
That is something I want you to take back from this presentation. All this sophistication is built around these LLMs, but at the very bottom it's a very simple mechanism of generating one token at a time. It's the quality of the tokens being generated, I think, that makes the difference, rather than whether it is one token at a time or multiple tokens. If you ask "explain large language models", it does not come up with something totally unrelated.
All the mechanism in that black box ensures that each token, even though it comes one at a time, is very relevant to the query you have put in. That is why these things are becoming so useful. You ask it to write a letter or an email, and it comes up with what you expect. That brings us to the final section: how these models are trained.
I will skip all these slides because there is not enough time. You can go through the presentation later.
We have only 15 minutes left.
So let us spend some time trying to understand how these huge things are trained. There are three main steps.
I kept on mentioning these humongous training corpora. It is not that simple six-sentence example I had put up; think of 6 billion sentences. As somebody mentioned in the chat, it's trained on the internet, and all these things are used. Excuse me. Sorry about that. The pre-training phase is where these humongous training datasets are fed to the training application. It's software, remember. There is no model yet; we are creating the model. These programs run on multiple servers using hundreds or thousands of GPUs.
The output is the base model, that is, the model weights. If you remember the structure of the GGUF file I showed earlier, the weights were the last segment in that file. The training data itself, of course, does not go into the model file. So the first phase generates the foundation model weights from pre-training.
Next is what is called alignment training, and this was actually the differentiator. Pre-training had been going on for some time, but the result was never really useful. The next token that these models generated without alignment was not useful. It may be fun to watch the output of a pre-trained model: you give it a question and see which tokens it generates, next token after next token, in that autoregressive loop. It might be funny, and it might generate totally irrelevant things, but it was not useful. I think it was OpenAI, the company behind ChatGPT, that found that with alignment training, which we will get into in a minute (supervised fine-tuning and then reinforcement learning with human feedback), the output became shockingly useful. It was like a shock when the company first realized this.
It was a game changer. Pre-training is where you feed in large amounts of data by scraping the internet or providing large datasets of PDFs and books, now in multiple languages, but the output was not that useful. Once the alignment steps were done, it appeared that you cannot live without LLMs. For many people it has become the go-to tool from morning to night. So how did this transition happen, from something interesting but not useful to something so useful that we almost cannot live without it? That was because of alignment training. That was the differentiator.
Maybe it happened around 2021 or 2022, I forget the date, but OpenAI found that this was the real differentiator: the output became really useful for the queries being given to the model.
So that is supervised fine-tuning, and after it comes reinforcement learning. There will be a session on reinforcement learning, I think in section two, where you can see the details. Finally you have the phase of so-called fine-tuning and in-context learning.
In prompt engineering we do things like in-context learning: we give some examples in the prompt itself. The model converts them to embeddings and so on, and uses them as context. Fine-tuning is when you take the trained model and tune it further for a particular application. You throw away some layers, or you freeze some layers in the neural network, and modify some parts of other layers; there are several techniques for doing this. That is for downstream applications, but these foundation models, after alignment training, are by themselves useful for a wide variety of tasks. So let us spend the little time we have left on these two phases.
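In-context learning needs no training at all; it is only a matter of how the prompt is assembled. A minimal sketch, assuming a sentiment-classification task and a `build_few_shot_prompt` helper that are both invented for illustration:

```python
def build_few_shot_prompt(examples, query):
    """Put worked examples into the prompt itself, then leave the last answer open."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model continues from here, one token at a time
    return "\n".join(lines)

# Hypothetical examples; the model weights are not changed by any of this.
examples = [
    ("The battery lasts all day.", "positive"),
    ("It stopped working after a week.", "negative"),
]

prompt = build_few_shot_prompt(examples, "Setup was quick and painless.")
print(prompt)
```

The contrast with fine-tuning is that here the examples live only in the context for this one request, whereas fine-tuning changes the weights stored in the model file.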
Pre-training is where humongous corpora are used for training.
As we said, the text is tokenized and then embedded. You already saw an example of how things are embedded. Here each token is embedded, and then we go through some mechanism of training the neural network, the transformer. The main things to note here are the size of the text, the number of tokens used, and the disk size.
Just for comparison, Wikipedia is 0.001 terabytes, whereas this massive web corpus is 1.9 terabytes. It consists of 604 million documents and 506 billion tokens. The final column, the sampling proportion, is also interesting. A lot of the text is garbage and not very useful, so it is removed from the training input, and ultimately you get, hopefully, good-quality training data. So that's the pre-training part: you take a huge amount of text and perform some cleaning, and this slide is showing that you end up with only 1 to 3% of the original tokens. The rest is garbage and you have to throw it away. That is why you need more and more servers, more GPUs, more water, everything, just for training these models. Many of these companies are now trying to set up data centers in India, so that the environmental impact is in India rather than in other countries. So that's the background on what pre-training is.
This is some animation. Excuse me.
You can look at this animation at your leisure.
I thought I'd mention these training objectives because you may hear about them. A causal language model predicts the next word. Then there are other models, called masked language models, where the training software is written so that the model tries to predict masked words, that is, words hidden anywhere in the text, not just the next word. But the GPTs and other large language models like them are mostly causal language models: they predict the next word rather than the missing words. So the one on the left is the most common one today as far as LLMs are concerned.
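The difference between the two objectives shows up in how training examples are cut from the same sentence. A small sketch, reusing the "models adapt through repeated" example from earlier; the fifth word "practice", the `[MASK]` placeholder, and the function names are assumptions for illustration.

```python
def causal_lm_examples(tokens):
    """Causal objective: given everything to the left, predict the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_index, mask_token="[MASK]"):
    """Masked objective: hide one token anywhere, predict it from both sides."""
    masked = list(tokens)
    target = masked[mask_index]
    masked[mask_index] = mask_token
    return masked, target

tokens = ["models", "adapt", "through", "repeated", "practice"]

for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

print(masked_lm_example(tokens, mask_index=1))
```

One sentence of five tokens yields four causal training examples, each seeing only the left context, which is exactly the situation the model is in during autoregressive generation.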
Now that we have some basic idea of what pre-training is, what is supervised fine-tuning?
This is where you create question-answer pairs. It's a very labor-intensive job. Take the question "How do I change a tire?"
All four answers that you see are plausible. If you stop at pre-training, remember, none of these answers is unrelated.
Pre-training has achieved at least this: it is able to relate something in the training dataset to the question.
But is it useful? That is the point. Supervised fine-tuning makes the answer useful.
All four answers are plausible because somewhere or other these tokens were in the input dataset, and the pre-trained model considers that any of them could follow. But only the fourth answer is useful.
So supervised fine-tuning is a mechanism where you train the model on a question and the appropriate answer.
There is no magic in this. Somebody has to painstakingly create these question-answer pairs. If the prompt does not have a clue that can guide the interpretation, any of these answers is possible. There is no reason for the foundation model to prefer response number four over number one, because all of it was in the training data. But only number four is useful for the user. For that you need more than pre-training, and that is supervised fine-tuning.
Here are some simple examples of how annotation is done. You may have already heard the word annotation in your machine learning training: you have to annotate the training set. If you're designing a classification training dataset, you have to annotate it, or download already annotated datasets with labels. Similarly, here you have the prompt and the response. This is human generated. Other techniques may be used, maybe even other LLMs, to create these pairs, but for conceptual purposes think of it as a human-generated dataset where there is a question and there is an answer. The whole point is that the model learns what is useful, not just what is related to the question.
This example shows two aspects: one is helpfulness, and the bottom one relates to safety.
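In practice such an annotated dataset is often just a file of prompt and response records. The sketch below shows one plausible shape, with one helpfulness example and one safety example. The field names, the response texts, and the `### Question` / `### Answer` template are assumptions for illustration; each fine-tuning library defines its own format.

```python
import json

# Hypothetical annotator-written pairs: one for helpfulness, one for safety.
sft_examples = [
    {
        "prompt": "How do I change a tire?",
        "response": "Park on level ground, loosen the lug nuts, jack up the car, "
                    "swap the wheel, then tighten the nuts in a star pattern.",
    },
    {
        "prompt": "How do I pick the lock on my neighbour's door?",
        "response": "I can't help with entering someone else's property. "
                    "If you are locked out of your own home, a locksmith can help.",
    },
]

def to_jsonl(examples):
    """One JSON record per line, a common on-disk format for such datasets."""
    return "\n".join(json.dumps(example) for example in examples)

def to_training_text(example):
    """Join question and answer into the single text the model is trained on."""
    return f"### Question:\n{example['prompt']}\n### Answer:\n{example['response']}"

print(to_jsonl(sft_examples))
print(to_training_text(sft_examples[0]))
```

During supervised fine-tuning the model still only learns to predict the next token, but now the tokens it is asked to predict are those of the useful answer that follows the question.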
The annotator has to create these examples and then feed them in. There are techniques and software libraries for this. The infrastructure required is not as large as for pre-training, because these datasets are much, much smaller; the pre-training datasets are really huge. Nevertheless, without this alignment training the model is not going to output anything useful. It may output something related, but you're not going to like it; it would seem like garbage. Nobody would use LLMs without supervised fine-tuning. Then you have reinforcement learning with human feedback. This again is a more involved thing, but here you have a human being in the loop.
Suppose the question was, "Chatbot, what should I do with my life now that I'm 65 and retired from a long career in the circus?" The chatbot really should not answer, "You should end your life immediately. Would you like some suggestions for how to do that?" The company would go bankrupt from the lawsuit that would follow. So there is a human being sitting there, looking at the responses generated by the LLM and manually marking them: this is good, this is bad. There are techniques such as preference models and reward models; this is too complicated to get into, and I'm not an expert in it anyway. The bottom line is that there is a human being, or some mechanism, that ranks the responses. So the previous step gave examples of what is useful. In this second step, after SFT (supervised fine-tuning), you ask the model a question and rank the various answers it generates, and based on those rankings the model learns what the output should actually be. This alignment training is what has made the LLM world so useful, and it is why everybody is using it.
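The lecture deliberately skips the details, so the following is only a rough sketch of one common way a human ranking is turned into a training signal: the ranking is expanded into (preferred, rejected) pairs, and a reward model is penalized whenever it scores the rejected answer higher. The pairwise loss shown, -log(sigmoid(r_chosen - r_rejected)), is a widely used formulation and is an assumption here, not something taken from the slides; the responses and reward numbers are made up.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Human ranking (best first) -> list of (chosen, rejected) pairs."""
    return list(combinations(ranked_responses, 2))

def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the chosen answer scores higher."""
    return math.log(1.0 + math.exp(-(reward_chosen - reward_rejected)))

# Hypothetical ranking by a human annotator for one question.
ranked = [
    "You could volunteer, travel, or teach circus skills to children.",
    "Retirement is a stage of life.",
    "I cannot answer questions.",
]

pairs = ranking_to_pairs(ranked)
print(len(pairs))

# Made-up reward scores: agreeing with the human gives a low loss ...
print(pairwise_loss(reward_chosen=2.0, reward_rejected=0.0))
# ... and disagreeing gives a high loss, which training then pushes down.
print(pairwise_loss(reward_chosen=0.0, reward_rejected=2.0))
```

The reward model trained this way then plays the role of the human when the LLM itself is updated with reinforcement learning.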
Pre-training creates the basic foundation model, and on top of that you perform supervised fine-tuning and reinforcement learning with human feedback. These steps are absolutely essential to make the output of LLMs useful for you. That is the takeaway on how LLMs are trained. I think we have come to the end of this session. It's 9:00 p.m.
Hopefully you have obtained some understanding of the difference between the model and the system, how LLMs work, what a tokenizer is, and how embeddings map text to numbers. And then finally how these things are trained:
the pre-training step, which creates the base model, and then alignment training, which makes the model useful, such as supervised fine-tuning, where you give it question-answer pairs, and reinforcement learning with human feedback, where you rank the answers so that the model learns which answer is the best one to provide.
Yes, all the PPTs and everything will be shared, I think, so that is not an issue. I couldn't go through all the slides. There was a question: what is a dimension here? The dimension is the number of components in the vector. A lot of people are asking for the PPT, which we will share. What determines the dimensions of the vector? It varies. Ideally it should be as large as possible, but that is not the case in practice. Many people have found that even smaller ones are useful, and there are techniques for converting large-dimension vectors to smaller ones while still keeping them useful. RAG, yes, we mentioned RAG a little bit: the documents are converted to chunks, and each chunk is then embedded. So that's RAG, retrieval-augmented generation. Embedding is a fundamental concept whether you're talking about LLMs or RAG or whatever. So, Midisha, can we stop the recording now? I think I have answered many of the questions. Thank you. Thank you for your feedback.
>> Sir, we can conclude the session. Right.
>> Right. Right. Right. Yes. Yes.
>> Okay. Okay. Okay. Okay.
>> So many of them have asked for the PPT.
I think we have a policy of sharing the PPTs.
>> Yes, I will share it with them tomorrow. Along with the class recording link, I will share the class material as well.
>> Thank you. Goodbye and good night.
>> Thank you.