India's unique linguistic diversity, with 1369 languages, 22 official languages, and 270 mother tongues, presents complex challenges for AI development, including data scarcity, code mixing, script complexity, and the need for specialized AI tools across education, judiciary, and heritage preservation domains.
AI India Challenge and the Road Ahead | Prof. Girish Nath Jha | Samvād 2025
All right, moving forward. It is now my honor to welcome to the stage Professor Girish Nath Jha, Chairman, Commission for Scientific and Technical Terminology, MoE, on deputation from JNU. Professor Jha's pioneering work bridges the depth of Indic knowledge systems with cutting-edge computational methods, a rare and invaluable intersection that continues to enrich the field of language technology. Please join me in welcoming Professor Jha to deliver today's keynote address.
[applause] Thank you, Vidushi ji, and thanks to SitLab for hosting me here. That was really a wonderful welcome and entertainment, lovely. Thank you so much, Guri ji, for setting the tone. My talk will not be as entertaining as what you had earlier. By the way, I am the former chairman of the commission; I just came back from there and rejoined JNU. I was on deputation to the Government of India.
See, our country is uniquely diverse, more diverse than perhaps any country in the world, and therefore our AI needs are equally complex and diverse. We have a lot of critical areas where AI must be applied, and each of these areas requires robust and scalable solutions. I was just speaking to Vidushi ji about how there are so many languages and dialects in India. For example, a Haryanvi speaker comes to Delhi and speaks so-called Hindi in his or her own tone, and the machine has to recognize that too, Google Assistant for example. So we need to translate from Haryanvi to Hindi and then from Hindi to English. Similarly, there is the Bangani dialect of Garhwali, and Garhwali is a dialect of Hindi, so someone speaking Bangani has to be translated to Garhwali first, from Garhwali to Hindi, and then to English. So there are a lot of complications, and let me tell you, the bigger companies are actually working on it.
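The multi-hop chain described above (Bangani to Garhwali to Hindi to English) can be sketched as a shortest-path search over whichever translation directions happen to exist. This is only an illustration: the `PAIRS` inventory and the `pivot_path` helper are hypothetical names, not part of any system mentioned in the talk.

```python
from collections import deque

# Hypothetical inventory of available translation directions (an assumption
# for illustration; a real deployment would list its own systems here).
PAIRS = {
    "Bangani": ["Garhwali"],
    "Garhwali": ["Hindi"],
    "Haryanvi": ["Hindi"],
    "Hindi": ["English"],
}

def pivot_path(source, target, pairs=PAIRS):
    """Breadth-first search for the shortest chain of translation hops."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in pairs.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of systems connects the two languages

print(pivot_path("Bangani", "English"))
# ['Bangani', 'Garhwali', 'Hindi', 'English']
```

Each extra hop is another place where errors can compound, which is one reason the dialect chains described in the talk are hard.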
These are, in my opinion, the critical areas. And I wish our government also, as Garni was saying, required that any health prescription be explained to the customer or the patient in his or her language. Similarly for business and all of that. That will open up a lot of industry here.
So data-driven AI requires an enormous amount of data. You all know that. And not all languages in India, not all domains in India, have that kind of data. And each of these application areas is in our own languages. And how many languages do we have? 1,369, as enumerated by the previous census.
Out of those, only 22 are our official languages, and out of the 1,369 we have 270 mother tongues and 121 bigger languages or language groups. So there is enormous complexity.
How about the continuous changes in data, technology, and the expectations of people? Now we expect our AI systems to answer much better. I remember, 10 years ago, when Google had their Hindi-to-English machine translation, they sent me an email to have a look at it. I gave it a simple Hindi sentence, "Ram aam khata hai" [Ram eats mango], and look what they translated: "Ram is a general account." And they weren't wrong.
Aam is "general", khata is "account". At that time the system was trained on business data in Hindi, so we got "Ram is a general account". Now that has been corrected. Now it's doing much better.
Just last week I gave Google a single Hindi sentence of 100 words to translate, and I was hoping that Google would fail, but it did not fail. It did a wonderful translation.
So we need huge quantities of data in almost all the areas and every domain of each of these languages. Now, each domain is divided into subdomains. I'll show you some examples of how subdomains work, and there may be mixed subdomains also.
What are the issues with data in India? Sourcing good data is always a challenge. I was crawling Santhali data for a predictive keyboard for SwiftKey. Santhali is popularly written in Devanagari, and in Devanagari we had many languages there, Bhojpuri, Magahi, Hindi, Maithili, all of that mixed with Santhali. It was a nightmare to get Santhali out of it automatically; you then have to do manual intervention. For example, Bihar right now has a land records problem: the same land records could be in multiple languages and could be in the Kaithi script, which nobody knows right now. The Kaithi script can be used for Maithili, Bhojpuri, Magahi, Hindi, and Urdu, and you have to know how to get that back into Devanagari to make sure people understand it.
So I'll give some examples of what we call big linguistic data in India: judicial documents, admin papers, historical records, sale and purchase records, newspapers and print media, literature, film, television, Facebook and other social media, manuscripts, and other-language historical records. All of that can be very, very complicated. And remember, we are not looking at data in one single language only. One page can potentially have all possible languages of India and all possible scripts of India. Theoretically, yes, because we have manuscripts where not only are multiple scripts used, but one script influences another. For example, a Bengali scribe writing a manuscript in Devanagari might use Bengali conventions in Devanagari, and that is even more complicated.
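As a first step on a page that may mix several scripts, one can at least count which scripts are present. The sketch below is a rough heuristic, not a language identifier: it takes the first word of each character's Unicode name as the script (which works for Devanagari, Bengali, and Latin but is only approximate for multi-word script names), and `script_counts` is a hypothetical helper name. It also shows why the Santhali problem is hard: Santhali, Bhojpuri, Magahi, Hindi, and Maithili written in Devanagari would all land in the same bucket.

```python
import unicodedata
from collections import Counter

def script_counts(text):
    """Count letters and combining marks per script.

    Heuristic (an assumption of this sketch): the first word of a character's
    Unicode name is treated as its script, e.g. 'DEVANAGARI LETTER RA'.
    """
    counts = Counter()
    for ch in text:
        # Keep letters (L*) and marks (M*); skip spaces, digits, punctuation.
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts

# The same name written in Devanagari, Latin, and Bengali script.
sample = "राम ram রাম"
for script, n in sorted(script_counts(sample).items()):
    print(script, n)
```

Separating languages that share one script needs lexical or statistical cues on top of this, which is where the manual intervention mentioned in the talk comes in.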
So we need to build critical AI tools in all our languages, and I consider these the most critical right now. At the basic level, level one, even for fonts and font converters we don't have standards. There's a big controversy going on in the country right now about fonts, about why Devanagari is not being typed seamlessly on all platforms and mobile. So ASR, TTS, OCR, online handwriting recognition, machine translation, text summarization, CLIA, all of that and many more will be critically needed for Indian languages. The complexity of linguistic data, which keeps growing, puts additional requirements on pre-processing, tokenizing, and all of that.
So, for example, there's a big problem in India: we have spoken data but not really transcribed data. For example, I want to store spoken data from mobile phones right away in databases. How do we do that?
So we'll have to bootstrap conventional SQL (you all know SQL, right?) with natural language. My master's thesis at Illinois was basically on that: I was trying to bootstrap the English language with SQL. Then bootstrap crawlers with voice commands, obtain the target data, and store it in the DB right on the fly. Otherwise it will be a nightmare: you let people speak and the data vanishes, and you don't have any way to store it. In India everybody speaks on a mobile phone. If we are able to capture that data, as we capture big data, that will be lovely. We can crawl a lot of raw data and train ML algorithms on it; that would certainly be the thing to do in India.
Now, linguistic diversity and complex usage pose additional challenges for India. I mentioned the 22 official languages; 38 more are pending approval from the Home Ministry. So we are the only country in the world that is gaining languages officially. We started with 14 languages in our Constitution, now we have 22, and we'll have more; when there is an election year, some languages do get in. And third, 1,369-plus minor languages and varieties, and English as the 33rd and the "dada" language. We call English the dada language. It's a very important language in India, because 10% of Indians speak English and approximately 3 or 4% speak good English, and that is important for higher education. Then multilingual media, movies, TV, print, education, all of that is very, very complex. And by the way, English media is not growing as much as native-language media, which is a good sign in India.
Language policy and education policy: we don't have a language policy in India yet. We have a language formula, and remember, a language formula is not a policy. I've been telling government colleagues, please have a policy. We do have an education policy, a good one. And new states are emerging, and when new states emerge there's a problem of language. Telangana: which language, Urdu or Telugu? When Bihar comes out of some other state, whether Hindi or another language; it's always an issue in India. Then governance, health, multilingualism, language and script mixing, and English as an essential mix. If you're writing something on the internet, English has to be there because of the URLs, internet IDs, etc. Now ICANN has solved that problem, but people still don't know about it.
Mixing of languages is a big problem.
English is mixed in, and the most mixed language in India is actually Telugu, not Hindi.
There are socio-psychological reasons for code mixing and switching. We switch code.
For example, we were building a corpus of 100 years of Bollywood movies, and there was one movie where Govinda is speaking to a male friend in Hindi and suddenly he switches to English, and the next moment you see his girlfriend approaching from the side. That's why he switched to English. And people mix languages also. For example, "Crazy Kiya Re" is a song where "crazy" and "kiya" are mixed together morphologically. So there is a lot of this morphological mixing, script mixing, sound mixing, and multimodality mixing. Take a Tamilian shifting to Delhi: his gestures when he speaks Tamil and his gestures when he speaks Hindi in Delhi might be different. So multimodality also comes in in a big way.
English can be mixed or switched with any of our languages. Mixing happens at any level. I just mentioned all of that, "confused kar" and so on. It will be "confused kora" in Bangla, but "confused" will stay. NLP tools have to cater to such phenomena, and Amazon Alexa and Google Assistant are doing good work. I've done two pilot projects for Google, and I've done some of these things with them.
Mixed, non-standard, multimodal communicators: I mentioned that. On mixed non-standard speech data, I did some work on Bhojpuri people moving to Delhi, Maithili people moving to Delhi, and Haryanvi people moving to Delhi. They behave very differently. When Haryanvi speakers come to Delhi, they don't stay in Delhi; they go back, so they maintain their lingo and style. But Bhojpuri and Maithili people change to the Delhi lingo, and that creates a lot of nuances. Such studies have to be done, and Google actually got this done with me, so they are going much deeper into how people actually use language. Data from social media usage, handwritten text, manuscripts: very, very complex. I'm going to talk about that. And I also talked about mixed scripts, like the Bangla prishthamatra being introduced in Devanagari, and dead scripts and dying scripts, all of that.
So we need to create a huge amount of what I call language technology resources, LTRs, unimodal, bimodal, and multimodal, for all domains and all languages of India. It is a huge task. But I was just thinking yesterday that these 679 universities in India all have language departments. So let us start a consortium, tie up all the language departments of these universities, and let them build data. All this data should be ported to a cloud where someone can curate it, and then it can be used by the public. I'm going to propose that to the government; this would be a good way to create all the data. So let me discuss some niche areas.
So I'll discuss three areas. How much time do I have, ma'am? How much time do I have? You can't say endless.
10 more minutes. I'll be wrapping up.
Education, judiciary, heritage: three areas that concern India most today. Education. See, we have a requirement of mother tongue. Lovely, mother tongue. There are 270 mother tongues in India. So we can teach up to the fifth level in the mother tongue, and desirably at higher levels also if possible. Now remember, 270 mother tongues; many of them may not even have scripts. So there is no written data, and we have to construct textbooks in those languages. So content creation in all these languages is a challenging task, and Indian languages lack the scientific and technical vocabulary to write even a textbook. Think of Ladakhi.
I went to Ladakh to give a seminar.
The kids would like their textbooks to be in Ladakhi. But where is the vocabulary?
If I write a textbook in Ladakhi, for example, with "doctor", "patient", "operation"... This is Hindi: "doctor", "patient", "operation", all English. So this is not going to work. We have to have words in all these languages so that we can have actual translations. These are the problems. The government has a commission; I was the 24th chairman of that commission, and it has created more than 30 lakh words, and now all of those words are searchable. You can have agreements with the Government of India to get all these vocabularies. When I was the chairman, I wrote the program myself. The portal shabd.education.gov.in is actually my program. I wrote it in Java for the Government of India at zero cost. It is now being used by 15 lakh people, and [unclear] is doing an MoU with them. So you should also do an agreement with the Government of India to get that data for your usage.
So remember: shabd.education.gov.in.
You can go to the site right now and see. So, from technical vocabulary to textbooks and education: it is a huge challenge, because we really have to create a lot of data for these languages, the 270 mother tongues and the 22 scheduled languages. Even for these 22 languages we don't have textbooks for all the subjects that we need. The government has started a huge and ambitious project under the Bharatiya Bhasha Samiti to do that. Obviously human effort alone cannot solve this problem. We'll have to have suitable technologies, and we need two technologies, education technology and language technology, for handling that.
Now for heritage, particularly manuscripts, what do we do? We have a huge number of problems with our manuscripts.
Recently the government held a Gyan Bharatam seminar where we interviewed a lot of AI companies, and we shortlisted two or three for doing this job. There are a lot of issues with Indian manuscripts. There are approximately 51 lakh manuscripts in India right now that are unedited, unread, and unaccessed. And there are stories of how manuscripts that are outside India are in good shape. I wanted to work on a manuscript, a Shiksha text, that could not be found in India in good shape. We found it at Hamburg University, and we at JNU paid €1,300 to get a picture of that manuscript in India, and we were able to edit it, but that one was in good condition. So this is the story, and we need to act very, very fast on that. So the Government of India's mission is trying to bring in technology for manuscript preservation. I'm sure this will be a lot of business for you if you are aware of when they make a call.
So what do we need there? We need standards, unimodal, bimodal, and multimodal, because multimodal manuscripts will also be available; metadata, storage, search, etc.; data input and document mechanisms; online handwriting recognition; editing; spelling and grammar checking, all of that. And we need to know how to archive it well, search it, and cross-link data. If I had time I could have shown you how we cross-link data. We search in the Rigveda, then search in Ayurveda, and then in a dictionary; a word has different meanings in each. So to get a comprehensive idea of a word we need cross-linking. Then reading help, translation, fundamental research, experimentation. For example, I'm searching the sciences. There's a term, bhasma. How to make bhasma from the Ayurvedic aushadhis can be understood only if you read the text well. So all of that has to be done.
Now coming to the third important domain. Judiciary is a big problem in India because the legal language happens to be mostly English in the higher courts, and we have the subdomains of law.
You see about 15 subdomains of law, and then there might be mixed domains also. And see the status of the judiciary in India: one Supreme Court, completely English, or I would rather say 90% English still; 25 high courts, English and regional languages; and look at the district courts, 672 district courts, mostly in regional languages, some English also. And the pendency is higher in the district courts, less in the high courts, and very little in the Supreme Court. And look at the pendency: 3.5 crore-plus cases pending altogether. 30,000 are filed every day, 28,000 are heard, and 2,000 add to the backlog. 86% are pending in district courts, 13.8% in high courts, and only 2% in the Supreme Court. See the language and technology.
Now the district courts have very little technology available. The high courts have more. The Supreme Court has the most. The Supreme Court has recently released about 50 years of data, so that should be a good thing to look at for you guys. The courts are disposing of more cases now.
But the pendency rate has been rising. Over about 12 years, pendency rose by 8.6%.
So these are the numbers, and if pendency rises beyond a certain level, the judiciary becomes ineffective. We don't want that to happen.
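The pendency figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the talk, with "3.5 crore plus" taken as exactly 35,000,000 for illustration; the variable names are hypothetical. Note that the quoted shares add up to 101.8%, so they are evidently rounded or drawn from slightly different counts, and the per-court figures are rough.

```python
# Figures as quoted in the talk; "3.5 crore plus" is taken as 35,000,000
# for illustration (an assumption, the true total is somewhat higher).
FILED_PER_DAY = 30_000
HEARD_PER_DAY = 28_000
PENDING_TOTAL = 35_000_000

# Net cases added to the backlog each day.
net_daily_backlog = FILED_PER_DAY - HEARD_PER_DAY

# Shares in tenths of a percent, so the arithmetic stays in integers.
SHARES_PER_MILLE = {"district courts": 860, "high courts": 138, "Supreme Court": 20}
pending_by_court = {
    court: PENDING_TOTAL * share // 1000
    for court, share in SHARES_PER_MILLE.items()
}

print(net_daily_backlog)                # 2000
print(pending_by_court)
print(sum(SHARES_PER_MILLE.values()))   # 1018, i.e. the shares sum to 101.8%
```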
So: a diverse society, a huge population, a massive number of cases, poor infrastructure, poor technology support, pendency rising steadily, vacant posts, insufficient positions, quality of manpower, communication issues. All of those issues are there, but technology support is a big one.
Now, we need resources in complex societies, and how do we get them? Four or five things. Data must be available; if not, create data. For example, some languages don't have any data available: Magahi, Gondi, Ladakhi, for example. What do you do? You get the nearest-language dictionary, for example Hindi for Magahi, and ask a person to write the Magahi word for each Hindi word. Then ask him to create a sentence for each one of these and then speak it out. So these are innovative ways to create data in languages which don't have data. Standards have to be there. Manpower: obviously we need a lot of manpower. Tools have to be there, corpora tools, and of course a data center. I have a data center and corpora tools. And then of course funds have to be there. Data has to be available; not every language has data for each language technology domain. And then oral tradition: of course many languages don't have scripts, so you have to make sure you know how to get that data into written form.
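The elicitation procedure described above (take headwords from the nearest-language dictionary, ask a speaker for the equivalent word, then a sentence, then a recording) can be tracked with a simple worksheet. This is a minimal sketch under stated assumptions: `ElicitationItem`, `build_worksheet`, the field names, and the file path are all hypothetical, and the speaker's responses are shown as placeholders rather than real Magahi data.

```python
from dataclasses import dataclass

@dataclass
class ElicitationItem:
    pivot_word: str            # headword from the nearest-language dictionary (e.g. Hindi)
    target_word: str = ""      # equivalent word written by the speaker
    target_sentence: str = ""  # sentence the speaker builds around that word
    audio_file: str = ""       # path to the recording of the spoken sentence

    def is_complete(self):
        """True once the word, the sentence, and the recording are all present."""
        return all([self.target_word, self.target_sentence, self.audio_file])

def build_worksheet(pivot_words):
    """Create one empty elicitation item per dictionary headword."""
    return [ElicitationItem(pivot_word=w) for w in pivot_words]

# Two Hindi headwords as the pivot list (illustrative only).
worksheet = build_worksheet(["पानी", "घर"])

# Placeholders stand in for what a speaker would supply.
worksheet[0].target_word = "<word from speaker>"
worksheet[0].target_sentence = "<sentence from speaker>"
worksheet[0].audio_file = "recordings/item_0001.wav"  # hypothetical path

done = sum(item.is_complete() for item in worksheet)
print(f"{done} of {len(worksheet)} items complete")  # 1 of 2 items complete
```

Each completed item yields a word pair, a sentence, and aligned speech from a single session, which is the appeal of the approach for languages with no existing data.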
The community has to be aware. Some data are born digital, some are pre-digital, analog data, all of that. Copyright issues are major. My student did research on Hindi polarity judgment, sentiment analysis of polite versus impolite, that kind of judgment by machine. We collected a lot of data from social media. We had a game where students rate two sentences, and then we train the machine. Now the machine is doing well. But the question is that people can stake a claim on social media data; if you start making money out of this system, then there might be issues.
So we need to be very sure that copyright issues are handled very well.
Standards in India have rarely worked in the recent past. Panini's standard worked for Sanskrit, but after that standards didn't work in India. Recently, though, BIS has been trying very hard. There was a controversy regarding usage on mobile phones, the PMO set up a committee, and it is currently under DST. There are three committees. One is to make standards, chaired by [unclear]. Another is chaired by me, which makes sure that the software and hardware implementation is done seamlessly across all platforms. The third is the popularization committee, by IIIT Hyderabad.
So domain-specific processing is very complex. We need a lot of data in Indian languages divided by domain and subdomain, and that is important. So let me very quickly, in 2 minutes, show you what the domains are. Entertainment, agriculture, politics, science, technology, art, culture, all of the 18 domains, and each domain will have a lot of subdomains. These are from the library of science. For example, I click on a domain, say the tourism domain. The tourism domain has a lot of subdomains: pilgrimage tourism, ecotourism, heritage tourism, all of that. So when you collect a sentence, make sure you mark every sentence with the subdomain label. That will become a very rich resource for future AI processing and for obtaining judgments on the domains and subdomains.
Then translating data. We need to have a translation guideline. We have one; I have copied it onto this machine and you can use it. It was done for a Government of India project. So, creating parallel data and translating across languages: for example, translating Hindi to Marathi. Suppose there is a sentence in Hindi [unclear]; this cannot be translated into Marathi as it is, because kiran is masculine in Marathi. So it cannot have [unclear].
So even though you consider Hindi and Marathi to be very close, when you go deeper you find a lot of problems, and then you have to work around them.
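The earlier advice to mark every collected sentence with its domain and subdomain can be sketched as a small record type plus an index. Everything here is illustrative: `LabeledSentence` and `index_by_label` are hypothetical names, the sentence texts are placeholders, and only the tourism subdomains named in the talk are used.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class LabeledSentence:
    text: str
    language: str
    domain: str
    subdomain: str

def index_by_label(sentences):
    """Group sentences by their (domain, subdomain) pair."""
    index = defaultdict(list)
    for s in sentences:
        index[(s.domain, s.subdomain)].append(s)
    return index

# Placeholder texts; a real corpus would hold the collected sentences.
corpus = [
    LabeledSentence("<sentence 1>", "Hindi", "tourism", "pilgrimage tourism"),
    LabeledSentence("<sentence 2>", "Hindi", "tourism", "ecotourism"),
    LabeledSentence("<sentence 3>", "Marathi", "tourism", "ecotourism"),
]

index = index_by_label(corpus)
print(len(index[("tourism", "ecotourism")]))  # 2
```

Keeping the label on each sentence, rather than on whole files, means a corpus can later be sliced by subdomain or by language without re-annotation.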
So I have a translation guideline document which we created after doing a lot of workshops in different languages and trying to understand the problems of each of these. You can use that. And for annotation, we have to have annotation guidelines. We have the [unclear] tags and the hierarchical tags, all of that. BIS has now taken up some of the standards. We started with a Microsoft consultancy and created a tagset, and that was then taken up by BIS. So we can have examples of annotating data from the very simple to the very complex: POS tagging, named entity tagging, sense tagging, chunk tagging, sentence tagging, and discourse tagging, all of that we can do. We have the annotation guidelines also; I have given chunking annotation guidelines as well. So please try to use those and see if you have questions.
Finally, a great way to get more data, as we were talking about earlier this morning: sharing and collaboration. I'll also give you examples of how to share. Data sharing should be done like the LDC at the University of Pennsylvania; I was a consultant for them.
They share data. You get data and you give data. ELRA in Europe also does something similar. So we need to have a mechanism where, if you are done with your data, you can share it with some other agency and get some data in exchange. There are data sharing models: manual sharing and automatic sharing. For example, I have my ILCIANN annotation platform. I annotate the first two tiers of data, then export the data to the TypeCraft platform of NTNU, Norway, where they add another tier of annotation. Then they export the data to the IMAGACT platform of the University of Florence, which annotates action verbs for robots. And then it can come back to our platform. So this automatic, seamless export and import should be there across platforms to make sure data annotation is done in a better way.
Then collaboration. Industry, academia, and government have to collaborate. This is what we are here for. And international collaboration is very, very important. Say I have to get some data for Portuguese, for example, through my university. I'll make sure that students are exchanged. Some students come from that university to our university, they attend some courses here, and I get data from them. Similarly, my faculty and students can go to that university and get data. For example, I spent some time at the University of Florence to get the IMAGACT platform up and running for Indian languages. So all of that will be possible, and student and faculty exchange can be done for data exchange and research. So thank you so much. I hope I haven't taken a lot of time. I can take some questions if possible.
[applause] [music]