A basic search engine can be built by loading text files, tokenizing both queries and documents into individual words, and searching for matching terms using set intersection to find documents containing any query words.
Deep Dive
Day 1: I Built the Simplest Search Engine in Python... Starting with 5 Text Files
Hey guys, welcome to the series. In this series, we are going to build our own search engine from scratch, and we are going to do it iteratively. We'll start with a very basic version, one where it's simply a bunch of text files and we have to search for our words in them. Eventually, as we evolve, we'll get to a version where we have actual website URLs, we are crawling the data from those URLs, and we have workers doing the search for us. I think this will be a really good learning pathway for people who want to understand how exactly search engines work, or even just the concepts that will be used in these videos, which are mostly around natural language processing, text processing, file systems, and web crawling. I think those will be really useful for a lot of people. So yeah, let's begin.
A lot of system design videos start with a giant final architecture, and we are not going to do that. We are not going to design the final giant architecture at once; we are going to design it version by version. We'll start with a stupid version first, which we can call v0. I'll just have a bunch of .txt files, and we are going to have a document loader to load those files.
Then we need something like a tokenizer to tokenize those documents.
Then we have a scan search. Basically, the query that the user puts in, or that we as a user put in, runs through the scan search against those documents, and then we have the results.
So this is a really foolish version.
We can even call it v0, I'd say. So let me remove the fill here, and this is basically version zero. Let's hop into the code directly. For this I have an empty Git repository over here, and it has a folder called data. These files are going to act as the websites for our search. They hold a lot of data related to computers and, basically, tech-related topics. Let me first test my environment. I have set up a virtual environment over here, so let me write a main function that prints hello world, just to be sure that everything is set up correctly.
Then we have if __name__ == "__main__", and then we just run the main function. Let's do python main.py. We see hello world. The next thing we want to do is load the documents. So let me write a function called load_files. This will take a path, which will be a string, and it will return a dictionary of string to string: the file name and the file data. The path will be a directory path, and in it we will have the file names. So file_names equals os.listdir(path); I need to import os so that I get the suggestions. Now, for file in file_names, I want the full path, so os.path.join(file).
Then, if os.path.is_file on this full path, I open the full path in read mode with encoding equal to utf-8, as f, so that we have the file pointer. Let me initialize a dictionary over here, file_data. So file_data[file_name] equals the contents read from f, and at the end we just return file_data.
So let me just print this instead of hello world. Pull this down.
Okay. And here we will be printing load_files, giving the folder name as data.
So it says that the path module has no attribute is_file. Okay, my bad; the function is os.path.isfile.
Now it gives us an empty result. Let's see what mistake we made. We are building the file path with os.path.join. Oh, so this join function takes two arguments, the path and the file; we actually need to give the directory path here as well.
Let's run it. You can see that it gives us a map of the file name to the data in the file. The next thing we want is a query. So we'll keep this call, but we'll store its result as file_data.
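A minimal sketch of the loader as it stands after the two fixes above, assuming Python 3.9+ for the dict[str, str] annotation. The names load_files, file_data, and full_path follow the narration; reading the contents with f.read() is an assumption about what the on-screen code does.

```python
import os


def load_files(path: str) -> dict[str, str]:
    """Read every regular file directly inside `path` into {file name: text}."""
    file_data: dict[str, str] = {}
    for file_name in os.listdir(path):
        # os.path.join needs both parts: the directory and the file name.
        full_path = os.path.join(path, file_name)
        # os.path.isfile (no underscore) skips sub-directories.
        if os.path.isfile(full_path):
            with open(full_path, "r", encoding="utf-8") as f:
                file_data[file_name] = f.read()
    return file_data
```

Calling load_files("data") from the repository root would then return one entry per text file in the data folder.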
And now we will have a query string, computer programming course. The next thing we want to do is search for this query inside those files. If I search for the whole string directly, that will not be useful, because I need to search for the individual words in the string. For that, I think I can write a tokenize function. So I'll have tokens equal to tokenize(query). Let's write the tokenize function. For that, just import re; the function takes in a query string and returns a list of strings.
So this will return re.findall with a raw-string pattern, a character class for the letters a to z, and let's normalize the text to lower case. Let's print this as well.
Let's run it. It says can't open file, invalid argument. Okay. Oh, sorry.
So it has actually tokenized it, but it has tokenized it character by character. I missed something over here.
Oh, I need the pattern to match one or more letters, so the character class needs a plus after it.
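A small sketch of the tokenizer with the one-or-more fix applied. The exact pattern used on screen is not visible in the transcript, so [a-z]+ applied to lower-cased text is an assumption; note that it drops digits and punctuation entirely.

```python
import re


def tokenize(text: str) -> list[str]:
    # "[a-z]+" matches runs of one or more letters. Without the "+",
    # re.findall would return every letter as a separate token.
    return re.findall(r"[a-z]+", text.lower())


print(tokenize("Computer Programming Course"))
# ['computer', 'programming', 'course']
```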
Yep. Now we have the words. So now we want to search for these words inside the files we have. I'll write a function called basic_search, which takes in the tokens, or I can just take in the query string and tokenize it later on, and it will also take in the file data.
So the file data, this is a list of strings.
The query is a string, and what I want is for it to return a list of string and string. Let's see whether that is what we want to return. The first thing we do is tokenize the query, so just copy that line.
Next, this file data is actually a dictionary, my bad. For this file data, we want to tokenize all the data in each file so that we know which words are in those files. For example, say I have 10 files with a thousand words each, so that is approximately 10,000 words, and of course there are duplicates, so the unique count will be less than that. I want to form a set of the words for each file, so per file maybe I have 600 unique words, and I want to search for each individual query word, computer, programming, and course, in that set of 600 words. If any of these words is in the set, we want to return the file name, because that is how we want our search engine to work: we search for the terms of the query string inside those files.
So let's iterate. For file in file_data, we want tokenized_file_data, which will be tokenize of the file data. Now I have two lists of strings: the first is the tokens in the query, the second is the tokens in the file data. To make them unique, I can wrap this in a set, because tokenize returns a list of strings and a set on top of it removes the duplicates, and I can do the same for the query tokens. Now, if there is any overlap between these sets, we know that we have to return that file name.
I could check each token one by one, so let me write it as a plain for loop and we'll optimize after that: for token in tokens, and I need a result dictionary as well. Okay, what we can do better is perform an AND operation, a set intersection, between the two sets we have, tokenized_file_data and tokens, and if anything is left I can append the file to the result.
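To make the overlap check concrete, here is a short sketch comparing the plain for loop with the set AND operation. The helper names overlap_with_loop and overlap_with_and, and the sample sentence, are made up for illustration and do not come from the video.

```python
def overlap_with_loop(query_tokens: set[str], file_tokens: set[str]) -> set[str]:
    # Plain-loop version: test each query token against the file's word set.
    found = set()
    for token in query_tokens:
        if token in file_tokens:
            found.add(token)
    return found


def overlap_with_and(query_tokens: set[str], file_tokens: set[str]) -> set[str]:
    # Set intersection gives the same answer in a single expression.
    return query_tokens & file_tokens


query_tokens = {"computer", "programming", "course"}
# set() removes the duplicate "a" from the split word list.
file_tokens = set("a computer runs programs and programming is a skill".split())

print(sorted(overlap_with_loop(query_tokens, file_tokens)))  # ['computer', 'programming']
print(sorted(overlap_with_and(query_tokens, file_tokens)))   # ['computer', 'programming']
```

A non-empty intersection is all the search needs, since the file is returned when any query word is present.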
As an output we want the list of file names, so it's better to make the result a list, make the return type a list here as well, and do res.append. And this part is actually the file name: for file_name, file in file_data.items(), we append the file name and we return res. So this should work. Let's see if there are any errors. So print the call, and the parameters here are query and file_data.
It says the dictionary object has no attribute lower. Okay. If I tokenize file_data over here, it should actually be file, because that is the value for this file name.
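Putting the pieces together, a sketch of the search function after the dictionary fix. The names basic_search, res, and file_data follow the narration; the tokenizer pattern and the two sample documents are assumptions for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def basic_search(query: str, file_data: dict[str, str]) -> list[str]:
    """Return the names of files that contain at least one query word."""
    query_tokens = set(tokenize(query))
    res: list[str] = []
    # Iterate over items() so we tokenize the file text (the value),
    # not the whole dictionary.
    for file_name, text in file_data.items():
        file_tokens = set(tokenize(text))
        if query_tokens & file_tokens:  # non-empty intersection means a match
            res.append(file_name)
    return res


# Hypothetical sample data standing in for the files in the data folder.
sample = {
    "intro.txt": "Computer science and programming basics",
    "recipes.txt": "How to bake bread",
}
print(basic_search("computer programming course", sample))  # ['intro.txt']
```

Every call re-tokenizes every file, which is the cost the timing helper later in the video is meant to expose.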
So we see the files in which these terms are present. With computer programming course, what we did here is tokenize it and convert it into a set: computer, programming, and course. In each of those files, if you search, you will find computer, a lot of computer. If I search for programming, I have that as well. And course, we don't have course in this file, but we have the other two search terms, which is why the file name is returned. And similarly for the other files. Let's see database design: do we have course over here?
Yeah, we have course over here, which is why it is returned. Now, you might say that it works just fine, but what happens when we have a million documents? Do you want to load all of those documents into memory again and again? No, that will be very bad for the system, because we have to load the data every time and then tokenize it every time. So, in order to see what improvement we are making in the further videos, I'll add a simple helper function here to measure the time taken for the search. I'll define measure_time, which takes a function and the arguments of that function. I need to import time. In measure_time I do start equal to time.perf_counter() and end equal to time.perf_counter(), and return end minus start, and in between I actually call the function: result equal to the function called with those arguments. So this is a higher-order function: it takes in another function. Along with the time we also want to return the result, so we return result, end minus start. Over here we can now simply do result, time equal to measure_time, passing in the function basic_search, then query, then file_data, and print the result, then print the time.
Yeah, that works. You can see how much time it is taking. The value that perf_counter returns is a float, so this just gives me the difference between the two times, in seconds.
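A minimal sketch of the timing helper described above. Accepting the arguments as *args is an assumption about how the on-screen version forwards them, and the call to sum is only a stand-in for basic_search.

```python
import time


def measure_time(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    end = time.perf_counter()
    return result, end - start


# Stand-in call; in the video this would be measure_time(basic_search, query, file_data).
result, elapsed = measure_time(sum, [1, 2, 3])
print(result)
print(f"{elapsed:.6f} seconds")
```

The elapsed value here is stored as elapsed rather than time, so the time module is not shadowed by a local variable.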
Okay, so as you can see, this is the time it took and these are the documents. We don't actually need the previous call anymore, so I'll just remove it. So yeah, that's it for this video.
In the next video, we are going to build an inverted index over the documents so that we don't have to index the documents every time, and we'll see how we can reduce the time the search takes. So thank you, and see you in the next one.