The lexical analyzer is the first phase of a compiler that reads source code character by character, groups characters into meaningful sequences called lexemes, and identifies their corresponding token types (such as identifiers, keywords, operators, constants, and special symbols) using two pointers (Lexeme Pointer and Forward Pointer) and Deterministic Finite Automata (DFA) for efficient pattern recognition.
Deep Dive
Voraussetzung
- Keine Daten verfügbar.
Nächste Schritte
- Keine Daten verfügbar.
Deep Dive
Compiler Design-Lecture 4Hinzugefügt:
Good afternoon students.
I am Dr. Sidasan Beha working as an assistant professor in CC in the school of computer science and artificial intelligence in SR University Warangal India.
In the last class we uh discussed uh how different phases of a compiler works. I mean how individual phages of compiler work and we realize that with the help of an example. We took single example and we observe what is the role of each and every phase of the compiler through that single exam. Right? So more or less we came to know what is compiler, what are the different phases of compiler and how each and every phase of compiler works in order to convert the higher level language into the machine code right I hope this thing is clear to you more or less it is clear to you okay now we'll navigate in depth to each and every phase of the combiner.
Okay. So that we can have a clear uh clear idea uh how different phases works together. Okay. Uh in order to complete or in order to com convert or translate the entire source code written in highle language to the machine. Okay.
I think this lecture is four or lecture number is four. Okay.
Here onwards we'll be discussing unit two which completely uh holds two pages of the compiler. One is lexical analyzer and uh second one is syntax. Okay.
So, so here you can see the heading is unit two and the first phase of the compiler that is called as lexical anal.
Now let us review once again as we already discussed more or less okay what is the role of lexical analyzer it is let us take one more review on that okay before we navigate in depth uh to the actual function okay I mean actually in the sense what in depth before we analyze the in-depth working principle of lexical analyzer okay so as I said before the lexical analyzer is the first page of the compiler. It reads the source program character by character.
Remember once again it reads the source program character by character and it groups the characters into a meaningful word called as lexim. And for that lexim it find out the corresponding token type of that lexim.
That is the major role of lexical analy. Okay. And uh simultaneously it removes any comments associated with the program. Sometimes we are writing comments no hashtag comments we are using hashtag to write comment. Okay particularly in C and C++ language okay that that comments are never going to be execute. Okay comments not execute. Okay so that is why these comments are not actually considered as the part of the program. So here while scanning the program while reading the program. Okay. From the starting to the uh last whatever the commands or wide space are encountered by the lexical analyzer it is just skip them or it just remove them. Okay. Because these are the these are not the tokens and they are not going to execute in the not a part of the code for our convenience for our understanding or what we are writing somebody else can understand it easily.
Okay. For that purpose we normally put some comments. Okay. So this is so and this is that this is so and so and this is that like that comments we need to add. Okay. For our community but those comments never executed. Okay. Similarly within the programming construct if any white space is encountered in between two words if white space is encountered that white space also eliminated. I mean the lexical analyzer keep the white space and it then go to the next next word. Once it is scan one word it go to the next.
So here is one example. You can see the example is written in C language sum equals to a + 10. So sum is nothing but a but an identifier. Here equal symbol is an operator. A is an identifier, plus is an operator and 10 is a is a constant then the entire the line is terminated by semicolon. Okay. So let the let us assume that lexical analyzer it reads the this particular line then what it does it convert okay sum s u m sum is a sing single word okay once it encounter this word sum okay then it find out the token type of that s what is sum sum is nothing but an identifier okay so the token type of the sum is identifier Okay. And is it the first identifier? If yes, then it will write identifier 1 ID comma 1. If it is second identifier, id comma 2. Okay.
Since starting whenever it scan the program starting onwards, okay, whenever it encounter an identifier, it assign the identifier with the name of the I mean it it identify the token type of the identifier. What is the token type? That means ident the name of the identifier along with the numbering of the identifier. Okay, if the identifier is the first one, then the entire token for that identifier will be the name of the identifier, the serial number of the identifier, identifier one, identifier two like that.
Okay. Similarly, assignment is nothing but an operator which is predefined here. No more extra token need to be generated. Okay, it is already predefined. A is another identifier. So what type of it is second identifier.
Okay, the second identifier even though it is identifier the serial number is second. So here the token type the enter token will be like this identifier comma two that means ident name of the name a indicates a a is an identifier and the serial number is two that is called as ID2 identifier second identifier okay and then plus is an operator which is predefined no more things to do it is just read yes plus is an predefined symbol which is already defined with compiler. So no more extra information need to be incorporated for for the operator class.
Then 10 10 is a real 10 is an integer number. It is a constant. Okay. The value is already known to the compiler.
Okay. So no more extra information need to be accurate or need to be incorporated in the symbol table by the lexical.
Okay. Next once the these are the tokens generated for that single line. These are the tokens generated by the lexical analyzer. Once it is scal sum what is sum? Sum is an identifier. And what is the what is the serial number of that identifier? The serial number of that identifier is one. Okay. identifier comma one or in short you can write we have seen previously in the previous example ID 1 ID2 ID3 that 1 2 3 the subscripted variable is nothing but the serial number very simple you can just instead of sum you can write ID1 instead of a you can write ID2 and no more other things to be written there because plus and equal operator 10 these are are the predefined symbols predefined values to the compiler Okay, these tokens are generated by the lexical anomaly.
So this the process of generating the process of generating or identifying token for a particular word or lexim is called as tokenization.
The process of scanning the input symbol character by character and upon getting a single word find out the corresponding token type of that one. Okay. Find out the corresponding token type of that.
That is that process is called as tokenization. Very interesting.
Okay. Very interesting.
One second. Let us repeat. Tokenization is the process of scanning the source code character by character upon encounter or upon identifying a sequence of characters that means a single word single word or a single lexim.
Upon identifying a lexim or a single word then find out the corresponding token type or find out the corresponding class of that word. Actually that word is belongs to which class.
Okay. What is the token class of that word? That is called as tokenist. Very simple.
Okay. And this this tokenization what are the components used here? One is lexim. Lexim means sequence of characters. S U, M together. S cannot be taken within a word. You cannot separate and S is a or S is a lexim. No, you cannot say that you cannot divide the word the characters of a word. Okay.
Rather a set of characters. Set means one element can also be there. Two elements can also be there. Three symbols can also be there. Four characters can also be there. But the question is when four characters are consecutively there, you need not to break them. Rather you start scanning from character one, two, three and four.
Once you encounter four consecutive characters together as they appear in the source code that is called as that is called as lexil or bird and for that word you find out the class of find out the token type of that else e l c in the same language. What is that? That is one keyword. Keyword is also one token.
A A is an identifier plus minus these are all are what into these all are what operators.
Okay.
So there is there is a very shortcut way to remember tokens.
You just ask yourself what is a program?
A program is nothing but a collection of tokens.
Nothing you are writing beyond the token. Okay? Nothing.
Collection of tokens. Now the question is what are the different types of tokens available?
You just tell me what are the different types of token available in CX.
There are some around six type of tokens available. The very sweetest way to remember short and sweet way to remember you remember the name of a chocolate cookies. K cookies. C O K I S something like that.
Cookies means C for oro. K I S S C O. K for this total is tokens.
K for keyword. It is a token. I for identifier another type of token. S for string is a type of token. Another S for special symbol is a kind of token. So kiss then co C for constant is a kind of token. Maybe real constant maybe it is a constant then O over operator plus minus into modulus. All these all are fun percentage I mean logical and logical or all together known as objectives. So there are kisco k i s co six type of tokens available in c language. Apart from that there is nothing in c language or the entire program any program you write these programs are comes from cook sorry kisco kisco is nothing but set of tokens.
This is the easiest way to remember.
Okay. Okay. Anyhow, the major role of this lexical analyzer is tokenization.
Tokenization means finding I mean identify the lexim and find out the corresponding token type of that.
Okay. Identify the lexim and find out the corresponding token.
Okay.
Suppose one person is missing in abroad.
What you can do? First you identify that person once that person is uh once you catch that person what you you find out you find out which country that person is belongs to.
Suppose suddenly a missing person is encountered I by you in anywhere in the country or abroad.
What you have to identify that person I mean by knowing his name, passport everything you identify then find out the which contact that person is that is identify the person and find out his or her corresponding content that is called as okay that the same type this job is also same first identify a word or lexim then find out the corresponding class or the corresponding token type of Okay, here you see sum is an identifier.
Okay, you see the table and equal is an assignment operator. 10 is a number.
These all are the lexim and the corresponding tok and their corresponding token. The the token of sum is identifier. The token of equal operator is assignment operator. The sum of equal the token of token of equal symbol is assignment operator and the token of 10 is a number. 10 is what?
What is the meaning of 10? 1 zero. 10 is a number.
Later when uh you navigate it, you can identify whether it is integer number or it is real number. First it is what is the category of 10? It is a number only.
Okay. It is a constant. Okay. Next.
Now the question is how this job is performed by lexical.
How lexical analyzer perform tokenization?
Is there any particular tools available with lexical analysis to perform uh because uh because uh I mean starting from the first character of the word and reaching to the last character of the word there must be some technique associated with that. No.
Otherwise, how the lexical analyzer can identify? Yes. Yes s U M is a word.
Initial I N I T I L initial is the word.
Position P O S I O T I O N position is a word.
How it can identify?
Don't you think there must be some techniques associated with lexical analyzer in order to scan the word or in order to scan the word and in order to know yes this is a word and that is not a word.
So the technique there are two pointers.
The this to identify the lexim lexical analyzer uses two pointers. One is lexim pointer. Okay LP in short that is called as LP. It normally marks the start position of the lexim normally in and the second pointer is forward pointer.
It is scan ahead character by character.
Actually what what is the what is the concept behind this? Initially see this is one pointer lexim pointer. This is forward pointer L LP F3 both of them initially pointing to the to the first character of the first character present in the input dopper because every program when you write program and compile the enter program okay line by line it store in the input buffer. From the buffer the po two pointers are there they read. Okay.
And initially both the pointers are point to the starting cell of the buffer.
Then LP the left the left LP pointer it mark the start position and it keeps itself also. Then the next pointer forward pointer it moves one by one. First after scanning the first character it go to the next cell. It is scan the second character. Then it go to the third cell.
Scan the third character. Then it goes to the fourth one. Scan the fourth character. And then it goes to the fifth one. Once in fifth one if the fifth cell is empty that means a white space is there. Then in between that third uh in between the first and fourth I mean fifth place how many how many cells are there? Four cells in the the content of that four cells will be taken as a lexi.
So it starts scanning from the first character and it goes on scanning character by character. Once it encounter the blank space or white space it stop and in between lexim pointer and uh forward pointer whatever the character sequence of characters found that sequence of characters created as a lexim and there the job start once the lexim identified identified then it find out the token types what is the class of that lexim Okay, that is the concept behind tokenization.
Okay, here let us realize how tokenization carried out with the help of LP and forward pointer. LP means lexim pointer and FP means forward pointer with the with the help of an example. You see the down of the screen one example is given total equals to marks + 50.
Okay. One expression is given that total equal to marks + 50.
I hope you are understanding my dear.
Yes or no? Very good. I hope you are understanding because I'm going through the root. Actually starting from the root I'm telling you the concept. I don't think that any doubt will be there. Still if you have any doubt you can send me through comment. Okay. And before that I would like to tell you if uh possible you subscribe and give like as much and give like also like the channel and subscribe the videos.
Thank you. Okay. This total equals to marks + 50. This is an expression given to you. Okay. Now this expression your as a lexical analyzer you have to read them character by character and find out one sequence of characters that is one word and find out the token type of just identify what that word is and what that symbol is. Okay that is called as tokenization. Okay. Now let us see the very interesting diagram is there. You can see see don't already left side part is already discussed.
Okay. No more discussion is record of that. But same thing is repeated. Okay.
LP lexim pointer LP marks the beginning of the current lexim. Current lexim means the current word points to the first character of the lexim. Yes. It points to the first character. It remains fixed while a forward pointer scan ahead. Okay. The second pointer moves. Okay. It moves only when the new lexim start. Yes, first lexim is identified and the corresponding token type is identified. Then LP will now come here.
They will they will scan the second word. Second word again LP will be constant and forward pointer will keep on keep on keep on moving again it encounter space then the second word is identified and then identify the then they find out the corresponding token types of the second in this way the entire scenario can be found okay I mean the entire tokenization process is carried Okay. Forward pointer. Forward pointer scan the character one by one by one from the left to right. This is the nature okay of the parser. Yeah. It moves forward till it finds the end of the le as I said it will until it find out a white space. Whiteside space means end of the first le. Okay.
It may read one extra character called as look ahead.
Okay. Uh it stops at delimter wide space or invalid character. If any invalid character is found then it stopped. Any wide space is found it stop. Okay. So it helps in recognizing the token type of the leg. Okay. Let us take one example.
Total equal to mark + 50. Here you see the same thing is written here. Okay.
See the same thing is uh diagrammatically given here diagrammatically. Let me move this here.
You see total everything is loaded in the input buffer. The total cell are are you seeing that you can see the enter this is nothing but your RAM card a random access memory which actually holding now I mean when when you load the program the program get load into the RA that is nothing but the buffer okay that buffer means it has number of cells it it consists equal number of cells I mean every the size of every cell is equal. Okay. So you imagine the total equals to mark + 60.
Okay. Marks + 60. Now imagine that this this statement is loaded in the total to o t a l then space equals to marks + 60.
Let us assume that total is already scanned. Okay. Now you uh now the pointers are looking for a new new lexim let us say marks.
Okay. So you can identify how the how both the lexims are working together in order to find out the lexim called as marks and in order to find out the corresponding token type of marks.
Here you see first LP XM pointer is pointed I mean it is pointing to M mark means M A R KS okay five characters. So first LP is LP is both the pointers are together I mean LP is LP and FP both are earlier point to the character M. Now LP is constant at M but forward pointer is moving. You see forward pointer you see just the movement of forward pointer below to the below to the first I mean first half of the screen. Okay. First M then a then R then K then S okay and then plus when it encounter plus okay that means so M A R K S when it encounter plus and yes they are not what I mean they they are different symbols right so their marks can be taken as a single word. Okay. Mark can be taken as a actually here space is there but in this example space is not mentioned.
Okay. Actually M A R K S after that space will be there. Once it encounter space that single word will be called as marks. For that marks. What is mark?
Mark is a user defined word.
Yes or no? That user defined means it is identifier. So it identify the after delimter form. After delimter means space here because of because of uh the because of I mean it is the example is big but the number of cells are very less. Okay. Because of that complexity that white space is not mentioned actually but it is written clearly you see to the ground after delimter found.
Okay. The entire word marks is considered as lexim and the corresponding token type need to be find. What is the corresponding token type? You tell what is mark? You have entered a variable mark that variable is not predefined that is user defined. So identify. Okay. So in this way the lexim I mean both the lexim pointer as well as forward pointer they works they perform tokenize.
Now another question is coming sir. Okay token is identified sorry lexim is identified. You said that any sequence of characters between the lexim pointer and the forward pointer. Okay the forward pointer is considered as a lexim. Okay sir we understood that.
Fine.
Then how you identify the corresponding token type of that word or that for that lexical analyzer uses a very interesting tool called as deterministic finite autom okay that will be seen in the future classes okay now we see yes here only it is Okay. How exactly it happens? So, okay.
You identified something. You got something.
Okay. You got something. Suppose. Yes.
Very simple. You got the word. Now the question is how to identify the class of that word? What type of word it is?
Whether it is a keyword, whether it is an identifier, whether it is a operator, whether it is a special symbol, whether it is a constant.
How sir, how can you identify or how exactly it happen? Uh I mean happening.
This is happening through a very interesting tool that is used in by the compiler that is known as deterministic finite.
uh this this uh normally this context I mean the compiler design syllabus is present I mean is available for 31 student whereas two two students for two two I mean second year second semester student we used to provide a top teach the subject called as theory of computition or formal languages and automata theory both are same Okay. So, so this DFA deterministic finite automata is nothing but a prerequisite here to the compiler design. Before we study compiler design, we must know what is deterministic finite and what is context grammar. Okay, this pre I mean preliminaries must be with us. Okay, but no no no problem. I came forward with the definition and the how deterministic finite automatter works. Okay, we'll discuss that your doubt will be clear.
You'll you'll be knowing about the working principle of deterministic finite automata. Then you can easily understand how the mapping this mapping I mean knowing from the word to the token how that mapping is done for a single word. M A R K marks M A R K S marks you are saying marks is an identifier. How you are saying?
Is there any tool? Okay we know the rule but how system is doing that? You show me, you saw us that sir.
Similarly 1 0 1 1 0 1 is nothing but a constant.
Sir, what you said that is correct. It is a constant.
But so how the system is identifying that is a con isn't it?
Is it a valid process? Yes.
And I'm there to answer it.
with full priority. Don't worry. Okay.
Now the now the current topic is we need to know what is deterministic finite.
Isn't it very good? So next here deterministic finite.
So once again two lines are present here. The lexical analyzer is the first phase of comp of a compiler. It reads the source source code character by character. Yes, we agree. We have seen that. Okay. And uh to recognize tokens efficiently.
Okay. Recognize tokens efficiently. The lexical analyzer recognize means you got a word. For that word you need to identify the token that that is called mapping.
Okay, that mapping for that mapping it uses a DFA deterministic finite. Okay.
Deterministic finite two as I can assume that but if you are if you have started I can make you to recall through the help of my slides otherwise if you don't know also if you pay attention press also you can able to know what is DFA and what is the working okay no problem you can easily understand and you can easily identify what is the actual implementation of DFA in combination. Okay. So what's it what is a DFA? A deterministic finite automotive is a state machine used for pattern recognition.
Okay. Pattern recog pattern identifying the pattern. Okay. It reads one input character at a time. It changes the state only one state at a time. I mean from the present state to the next state. The next state can also be the world state sometimes. Okay. So it changes the state one state to another based on the transition rule. Okay. When a valid pattern is recognized the DFA reaches to the final state. Once it reads a sequence of pattern a valid pattern then it normally whenever it reads input it jump from one state to the next state.
Let us say it is in the starting state it reach jump from start state to the next state. Then from there it may reach another symbol. There it may jump to another state like that it will it can jump randomly. Okay. Based upon the input. Okay. And finally when all entire input it reads it may go to the it may reach to the final state. And if it reach to final state then we can say the pattern is okay. The input sequence is okay. Okay. The recognition the the recognized lexim is then converted into it. Okay.
Then once it reach to the final step starting from the start state. Okay.
While reading the input symbol if it reach to the final step then the entire input pattern is a valid pattern. I mean is a valid lexim. And for that the corresponding uh token is easily identified.
Okay. That we'll be discussing.
Yeah. Before that this DI has several components. Okay. Let us let us understand what are the components because without this it will be difficult to understand what is DNA. It has several components like a set of states. Okay, states represent different stage of recognition. Jumping from state one to state, state two to state four, state four to step three, step three to step one, state one to step zero, state zero to final step random that is called as different stages of recoveries. Okay. So it has sequence or set of states you can ultimately. Now each state indicates the current condition of scanning. Okay. Let us take the example. Q0 is the start state. Q1 is the identifier recognized.
Q1 is a state where the machine can recognize the identity. It may be the final state. Okay. Next is transition.
What is transition? Transition define how DFA moves between one state to another state. I mean how DFA moves between states.
Q not to Q1, Q1 to Q3, Q3 to Q, Q to Q4, Q4 to Q5, Q5 to Q not again Q not to final state. How this transition? Okay.
How the DA is transit from one state to another? That is called as the movement of DFA.
Okay. Transition define how DFA moves between states. Okay. Then movement depends on the input character. It depends solely depends on the input.
What input it is consuming. Now I am in state zero. I am consuming some input. I may jump from state zero to state one.
Okay. Normally we see whenever you are inserting ATM card in the machine earlier machine is in idle state.
Whenever you are inserting ATM state ATM card machine is jumping from ideal state to the starting state then there it is scanning the ATM card I mean bar code.
If it is correct then it is asking it is going to another state and there it is asking you or producing you a man and asking you to enter the option withdraw pin change XY Z things that is there the steps getting changed okay so you need to realize in that way that is the best example I cited okay next uh later move from yeah move from Q to key one digit remain in key1 it depends on the type of input whether it is consuming a input as a letter or as a digit okay then final or accepting state final state you can is a part of the state accept also known as accepting state it is a valid it is a state that a valid token pattern is recognized so when it reached to the final state then you can say that the job is over and the reading of input is over and the input is valid and for that the corresponding ing token is also identified.
Okay. When DFA reaches an accept accepting state and uh and delimter is found the lexim is accepted. That means the lexim is valid and for that the corresponding token is already there automatically the mapping is carried that I can show you next.
Okay, we'll be learning this in depth because very interesting concept deterministic finite automator because it is acting as a tool in the compiler in the lexical.
Okay, we must know it thoroughly. Okay, very interesting things and we have number of examples to go through. Okay, so not to worry. Uh so we will be discussing this in the next class. Okay, from this components of DFA is over. uh otherwise I can start it from the from this slide on okay so thank you have a good day all the best Okay, thank you.
Ähnliche Videos
Agentforce NOW AMA: Build with React and Salesforce Multi-Framework
SalesforceDevs
490 views•2026-05-28
How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust
aiDotEngineer
450 views•2026-05-28
Re: 🗣️📍theprophedu📍2026 GST 103 CLASS (E-EXAM REVISION)
theprophedu
636 views•2026-06-04
WEB TECHNOLOGIES UNIT-2 | Degree 4th sem BCOM Computers web technologies unit-2 full explanation💯✅
LearnwithSahera
1K views•2026-05-29
More tests are always better? How to use AI to identify tests that bring little value
Alliance4Qualification
335 views•2026-05-29
Search Algorithms Explained in 60 Seconds! 🤖💨
samarthtuliofficial
218 views•2026-06-01
People of Game of Thrones using JavaScript DOM
AltCampus
296 views•2026-05-30
Instagram accounts got PWNed
EricParker
13K views•2026-06-03











