Finally, a technical deep dive that offers real substance instead of just another ex-Google career vlog. It provides a clear and efficient roadmap for understanding how modern vector search actually scales in production.
深度探索
先修知识
- 暂无数据。
后续步骤
- 暂无数据。
深度探索
Introduction To Vector Search | Distributed Systems Deep Dives With Ex-Google SWE本站添加:
Hello everybody and welcome back to the channel. Today I'm finally making a video about vector search. Uh much like being a good boyfriend and becoming emotionally mature, I've been putting this one off for multiple years. So I'm excited to finally do it. Uh it's going to be pretty quick and it's not going to be the most educated simply because a lot of this stuff I was reading up about while I was flying back from Mexico last night. Uh so reading PDFs on your phone while in the middle seat uh getting absolutely smooshed is not the most conducive way to learning about vector databases. Nonetheless, I'm going to try and give a pretty high level overview of this space because it's one that I'm interested in. Uh I think it's cool and uh you know, we'll see where this kind of moves in the next few years as we continue to have more kind of new advances in the vector searching space.
All right, so let's go ahead and get into vector search. This one isn't about any particular system, so I feel kind of weird putting it in the deep dives category. I probably still, even though it's not particularly deep or not much of a dive. Uh, but let's go ahead and talk about it because I haven't really covered this in too much depth on this channel before. I have covered this slide though, but just to reiterate, why do we care about vector search? Well, in the past, a lot of search uh was kind of done on things like keywords and using inverted search indexes, right? So, you know, if you were going to search on Google, you'd probably search for a particular term that appeared in an article or, you know, maybe a keyword or a hashtag that acted as metadata on some row in a database, right? So, for example, if I'm on Amazon and I have to do some shopping, I'm probably going to search for terms that are going to show up in the explicit listing. However, what is more popular these days because it provides better recommendations is to do semantic search, right? Like, if I want to search for tissues, I don't care about what brand they are, whether it's Kleenex or something like that. I just care about the fact that they exist. So we care about semantic search. So you can see I've given some examples of those down below. And basically the way that we achieve semantic search in general is that rather than trying to you know uh figure out the semantics of every single individual type of thing whether it's a word or an image or audio or anything like that. What we basically do instead is we embed every single one of them into a vector. Right? We use some sort of machine learning model.
Maybe it's word tovec. Maybe it's a more complicated one. uh maybe it involves context to actually take one single entity and involve it into a fixed dimensional size vector and then ideally two things that are going to be similar in semantic meaning are going to be similar vectors. So you can imagine if I have a coordinate plane or a 2D cartisian plane I'm going to have a similar vector to Jeff Dean and of course Lily Phillips big fan of uh of mine vice versa of course very similar to Bonnie Blue. So basically the idea is that similar vectors have similar content. And so a very common thing is to say I'm giving you a vector. What are the similar ones to it? An example of this might be if you're watching a Tik Tok video and you like a particular video, Tik Tok is going to want to suggest more of those to you. Uh so it's going to want to find similar vectors to the one representing the video that you just liked. One thing to note is that you can have many different possible distance functions. So even though it's easy to think about these things in a 2D or 3D cartisian plane uh or even you know a higher dimensional one realistically we're talking about hundreds or thousands of dimensions here uh people might use things like uh the cosine similarity so that would be basically the difference in angle between two vectors or the projection overlap. So if I have two vectors and they're going in different directions how much of these uh this one's coordinates are you know being projected onto this other one.
All right, so let's go ahead and talk about vector databases. Um, you may have heard this term a decent amount. Maybe you know more about them than I do, but basically everything boils down to, like I said, the nearest neighbor problem.
Now, in this case, it's going to be the approximate nearest neighbor problem, which we'll talk about in a little bit.
Actually, finding the nearest neighbors is pretty hard and not necessarily super efficient. Uh, so let's go ahead and talk about the two main operations that vector databases support that we're going to focus on in this video.
Obviously they're like databases so they do a lot of other things but the things that we care about here are the actual index and the way of storing them. So uh we care about the insertion of a new vector right if I'm someone created a new Tik Tok video and they publish that uh that has to get put into our vector database and then also we want to be able to say for this given vector database or sorry for this given vector we want to find its nearest neighbors or maybe even multiple nearest neighbors.
All right. So the first and most naive algorithm though even though it seems naive actually may not be so bad is going to be linear search. So basically here we're going through a bunch of vectors uh with a fixed number of dimensions and we're computing the similarity between them. So if it's something like uklitian distance, you guys know how to find the distance between two points on a cartisian plane.
It's the same exact thing just in multiple different dimensions. And this is actually something that's highly parallelizable because these operations are all pretty much unique within a column as you're running down multiple different vectors. And so they can be highly parallelized by something like a GPU or an FPGA. So if your data set isn't absolutely gigantic, you can actually do this type of thing really efficiently. And it's only at massive scale that things like certain vector indices start to become more and more useful.
So you know obviously if we do have a massive scale, the question becomes how do we reduce the search space? I kind of alluded to it already. you want to use an index and then the question becomes what type of index do we want to use or what uh what sorts of techniques have people developed in the vector search space. Now, just as a quick refresher here, you may be thinking to yourself, well, vectors are multi-dimensional data, and we've actually covered on this channel before how we want to index multi-dimensional data in an efficient way, right? And the way that we did that was actually with geospatial data where geospatial data, the whole problem there is that because it's two-dimensional, it's not so simple to use a normal 1D index to basically index the x dimension of a point and separately index the y dimension of a point because then you can't really limit down the search space as much as you want. Instead, you want to use something like a quad tree or a geospatial index to basically organize points so that their hash that's assigned to them, their geohash or, you know, the quad tree zone that they're located in are basically making it so that two points that are close to each other in the cartisian plane are in a similar place on disk. So, you can basically do range queries on disk and get all of those points together. And you may be thinking to yourself, well, this is a concept that probably applies for vectors as well, right? And technically it does. However, one thing that also comes up here that I'm a little bit less familiar with, but is kind of the motivator behind all of these new algorithms for vectors specifically is this concept of the curse of dimensionality. Now, what that really means in practice is that okay, you you know, you can do a quad tree on a 2D region and you can pre-split it into all of these uh you know pre-existing bounds and you're going to get pretty good accuracy in terms of finding the nearest neighbors, right?
because I know that in a 2D space I only have to check a couple of different bounding boxes. Maybe my points on the edge of it and, you know, there only a couple of other locations that I have to look at. However, when you're talking about, you know, 128 different dimensions, all of your points that exist on that graph are probably going to be close to the edge of one of those bounding boxes. And you're going to have to not only search that bounding box that your vector is contained in, but also many, many others. And all of a sudden, you're going to limit the efficiency of your tree data structure.
and you're not really going to be getting good performance anymore.
So to fix this actually one of the earlier vector indexing algorithms that came out was called annoy. So it stands for approximate nearest neighbor. Oh yeah quite the name. Uh but it was released by Spotify in 2013. And basically the idea here is rather than splitting the search space into a bunch of predefined bounding boxes, what you can do instead is basically pick any two random vectors or any two random points in the search space. And at each iteration of this algorithm, you basically say, "All right, I'm going to pick two vectors." And then for every other vector in this box, I'm going to assign it to either the left vector or the right vector based on whichever one is closer. So you do that. You create a split between those two boxes. And then now within each box, you go ahead and do the same exact split. I pick another two vectors within each box and I assign everything to that uh one or the other.
So you can see this starting to come up in this diagram right here that we have this split in two halves. We split it again and we keep doing these iterations until you actually have a bunch of boxes that actually contain a similar amount of vectors. So this isn't necessarily a bad algorithm, but you know, obviously in the 13 years that has come out, uh the 13 years since it's come out, we've come up with better things to do. And so let's go ahead and look at certain things that we can actually accomplish here. I mean, for one, just in order to improve the recall of this annoy algorithm, because there's a lot of randomness to it, right? You're selecting two random vectors to determine where you're going to partition the data set. You could also just create multiple different trees of annoy data sets in order to basically ensure that um you know you're going to get as good recall as possible. So every single index uh iteration when you're searching for nearest neighbors, you go down the tree uh or multiple different trees, you find all of the overlapping uh vectors that are close to the one that you care about and then you merge those together.
All right, so we're going to talk about how we're going to speed up vector searches more generally. Um, I wouldn't necessarily say, you know, I've kind of limited it down as partitioning, compression, or, you know, algorithm changes that we've done in order to speed up these searches. But the truth is that a lot of these things kind of overlap. Uh, in some ways, the compression that we're achieving is also kind of partitioning the data set. Uh, and the algorithms themselves can be combined with the partitioning and the compression. Uh and then it's a question of you know well we're able to kind of do the efficient vector search but maybe a lot of it's in memory and how can we actually achieve that at scale. So let's go into all of these things in terms of just partitioning the data set. Um one way that we can do it is called IVF. So this is not in vitro fertilization.
Nobody's getting any eggs implanted in them. Basically we're just going to partition vectors. So if you've heard of the term K means clustering. Uh basically what it means is that if you have a bunch of points in some you know multi-dimensional space you're going to try and assign them into k different groups so that within each group you know everything is really tightly packed around that one centrid point in the cluster. So as you can see we have a bunch of these black dots right here.
Those are the centrids and the K groups.
And within each of those colored boxes, you may have however many vectors. And the point is this is the optimal balance after running a few iterations of a clustering algorithm to uh ensure that everything is as close as possible to its centrid and that we have K centroidids. Now, this is one pretty natural way to go ahead and partition those vectors. Now what's nice about this is what it means is you know if we have this big graph of vectors maybe we can horizontally scale that out over multiple nodes where every single node holds a cluster and then you know you route to the proper node and then now all of a sudden you can do a linear search within each cluster to figure out uh the best vector to use or maybe you can do something better than that. So let's see what we can do.
Uh another thing that we can actually do is compression right so one pretty naive way of doing it is called scalar compression and basically it is all centered around taking you know higher dimensional uh or rather let's call them wider float numbers which might be 32 or 64 bits and compressing them down to 8 bit integers or maybe even five bits or four bits uh depending on how you want to represent them. The idea behind this is that uh you know the more data that we can fit in memory the better or the more data we can fit on disk the better uh you know actual comparisons by the CPU are going to be easier you can do more SIMD things like that so generally speaking keeping smaller data makes life easier for everyone so a very simple compression algorithm algorithm when you have a bunch of floats is something like this let's imagine I have you know a million vectors and the fifth dimension or the fifth index of every single vector is between you know 236 and 748 Well, what we can do is basically say 236 is the minimum. 748 is the maximum value. We know that we want to compress these down to be 8 bit integers. And so if you have an 8- bit integer, that means you can uh represent values between 0 and 255. Now, if I can represent between 0 and 255, that means there are 256 distinct buckets and I have to represent 0.512 worth of a spread of data. And so basically I can just you know uh split every single discrete range into separate buckets. So 236 which is the minimum to 238 which is 02 higher is going to be bucket zero. 238 to 240 is going to be bucket 1 and so on. Now this is going to really easily compress our data assuming we have a fairly limited range. Now what would really stink is if every single data point was around 236 and 748 and so you know we use bucket zero and we use bucket 255 but then pretty much everything else is not going to be used right and so we lose precision in the zero bucket and the 255 bucket and then we also don't even need eight bits to represent these numbers because we're only using two of those buckets. So that's not very good. So what do we do instead? This is where something called product quantization comes in. And basically the idea here is that you know if you have a vector of a particular dimension let's call it n. We can basically take that vector and chop it up horizontally. So that imagine it was 128 bits originally. We might now have four of [snorts] 32 bits. Now we take each of those four individual vectors. And what we're going to do is we're going to say well we now have all these mini vectors. So across all of our million vectors, let's look at the first mini vector and we're going to perform K means clustering on it. And then we're going to assign each mini vector to its particular centrid ID that it went to in K means clustering. Right? So in the diagram you can see we basically have an 128 bit original vector. We split it into a bunch of smaller sub vectors.
There are four of them. Each of them we do C means clustering so that we can assign it a particular centrid ID. And then our 128 bit vector in the end ultimately just becomes size four because we squashed together those four ids. And so now all of a sudden we've completely lowered the dimensionality of our vector. But of course we've also lost some precision as well. Which is why we often you know talk about these things as if they're approximate nearest neighbor algorithms because at the end of the day uh even though you're getting a lot better uh you know compression out of this you're going to lose some precision when it when it comes to computing the nearest neighbor. Sorry, I can't speak today. This is what happens when I get like four hours of sleep. All right, the last algorithm that we're going to talk about is one that you may have heard about a little bit more. It's called HNSW or hierarchical navigable small worlds. Now, this is one that I'm going to try and explain pretty carefully because I've approached it multiple times now and I haven't had the best understanding and now that I feel like I get it, it's actually pretty simple. Uh, so I'm hoping I can explain it a little bit better than I feel like it's been explained to me in the past.
Basically, the idea is this. HNSW is representing the same exact vector points and the same exact coordinate plane in multiple levels, right? And so you can see in layer zero, we have every single vector represented and there are a bunch of edges between all of them.
Every vector in every single layer is only going to have edges to some of the other closest points around it in each layer. Basically, the idea here is this.
As we go up in layer, the graph gets more and more sparse. And the reason we do that is because it acts as like a logarithmic form of pruning. Right? So when we actually want to search for the nearest neighbor of a particular graph uh of a particular vector, it's going to look like this. You can see on the image itself, you have a query point, right?
So what we're going to do is we're going to start at the highest layer and that's going to be layer 2. We have some sort of predefined entry vector or node that we start from. We're then going to do graph traversals in order to say, okay, in layer 2, the nearest neighbor to our query point is the one that's highlighted over there in the left. From there, we know that every single point in each layer is also represented in the layer below it. Right? So, if I have a node that is represented in layer 2, that equivalent point also exists in layer 1 and layer zero. And every single uh vector in the data set is in layer zero. But not all of them are in layer 1 and not all of them are in layer two. As we go up in layers, the graph gets sparer. So basically, again, when we're starting our query, we start out from the entry point. I'm going to see if I can get my mouse over here. We start out from the entry point over here. We say, "Okay, this guy in layer 2 is closest to our query point as opposed to this guy."
We're going to now say, "All right, I can't get any closer in layer 2, so I'm going to jump to layer 1." In layer 1, we're also going to do traversals again to get even closer to the query point.
Right? The nice thing about this is that this graph is very sparse. So a more dense graph would take longer to traverse, but a sparer one is easier. So you're starting from a better point each time and so it's relatively easy to get to the right place. So again, in layer 1, we're going to traverse through this sparse graph and we're say, "All right, we're really close to the query point right over here. Let's try and go to layer zero. So we're sure that we're looking at the best possible vector. All right, in layer zero, we see that this is the best possible point, and that is going to be our nearest neighbor for our query point. Now you may say okay I understand how you know when we have to search for a particular vector in the HNSW graph we can do so but how does insertion work right like what how does this whole data structure get maintained generally let's imagine I want to insert a vector right like this query point right here basically the idea is we want to maintain uh an invariant such that the higher layers are very sparse and the lower layers are not so every single vector that gets inserted goes to layer 0 and then you use some sort of probabilistic distribution, probably exponential, to determine what the highest layer is that this point goes to. Let's imagine we actually determine that it should go to layer 2. That's fine. Then we also add it to layer 1 as well. And we also add it to layer zero.
So regardless of what layer we decide this point should go to, it's also going to belong in the lower layers as well.
It's just that, you know, we may also say, hey, we're inserting this vector, it stays in layer zero. Hopefully that makes sense. It's a probabilistic data structure, but it works really nicely in terms of ensuring that the graph traversals are efficient at the higher layers and then can ultimately get to the proper place in the lower layers.
Okay, in terms of scaling out, basically like we said, HNSW is a graph traversal.
Graph traversals work much better in memory. We've spoken about this a lot on the channel, but basically the point is when you are jumping from memory address to memory address, that's really quick.
Even if they're not right next to each other on disk, that's a lot less trivial. How are we going to scale out if we can't use disk? I don't know because memory is expensive. You'd have to use a lot more computers to store all that data and ultimately no one really wants to do that. So there's a lot of work in this field at this point in order to better figure out how we can represent the vector search problem on disk. Even though the graphs themselves get really high recall or really high accuracy, it doesn't necessarily mean that they're the best thing to do long term if we can find a system that's much cheaper and still performs pretty well.
So what could we do if we wanted to make this thing run on disk? Well, for one, we could just do more partitioning, right? We know how to cluster the data.
We know how to compress the data using something like PQ in order to basically split out those vectors and do C means clustering on each of them. And then we could also try and develop a new algorithm. So one that came out a few years ago that now seems to be somewhat popular is called disk ANN. It's very similar to HNSW actually or it's a similar concept in the sense that it's also a graph traversal. However, in disk ANN the whole idea is to basically say all right of these uh vector points are implicitly connected to each other if we wanted to represent this as a graph. So there's no reason to represent this thing as a you know densely connected graph. Instead we should make it more sparse. we should only keep edges that are important between all of these vector points and that are useful for us. We want to drop the redundant ones.
And so what that does is it's basically saying, you know, we can represent this graph of vector to vector in a more efficient way such that we have to perform fewer disk IO operations in order to get our data back into memory and figure out what the closest neighbors are to a given query vector.
As a result of that, it's more optimized for disk. Right? You pull in some data from your disk at one time. Maybe you'll cache it in memory. Maybe you'll do your graph traversal there, but there are far fewer disio operations and that makes a huge difference. Feel free to look into this one some more or maybe I'll make a video on it at some point. It's called the Vmana algorithm and basically the idea is to, you know, continue to maintain these really sparse graphs with useful edges.
All right, in conclusion, uh this is becoming really common, right?
Everyone's using vector databases right now. Uh they seem to be a new hot thing, a little bit of a buzzword as well. So I was trying to demystify that ever so slightly especially the vector indexing part of it. Obviously a database is much more than its index. Uh but the index is very important for ensuring query performance. Uh and also potentially depending on the implementation of what index you use how expensive or cheap it's going to be to maintain this type of data structure. Um I actually have a typo here. It says disk based algorithms often seem to show better recall. What I meant to say is that graph-based algorithms often show better recall.
However, they don't scale as well, right? Even with graph databases, it's really hard to partition graphs. Uh you can't jump from node to node very quickly because random accesses on disk are not good. And so, we're just going to have to see what happens. Uh feel free to look into things like Turppuffer, uh S3 vectors, um Lance DB, things like that in order to see how this type of thing is actually implemented in industry right now. Uh but hopefully this serves as a decent introduction to how vector indexing works and kind of some of the techniques behind it. Hope you guys have a great day and I will see you in the next
相关推荐
Agentforce NOW AMA: Build with React and Salesforce Multi-Framework
SalesforceDevs
490 views•2026-05-28
How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust
aiDotEngineer
450 views•2026-05-28
WEB TECHNOLOGIES UNIT-2 | Degree 4th sem BCOM Computers web technologies unit-2 full explanation💯✅
LearnwithSahera
1K views•2026-05-29
More tests are always better? How to use AI to identify tests that bring little value
Alliance4Qualification
335 views•2026-05-29
Search Algorithms Explained in 60 Seconds! 🤖💨
samarthtuliofficial
218 views•2026-06-01
People of Game of Thrones using JavaScript DOM
AltCampus
296 views•2026-05-30
Introduction to Problem Solving Part - 1 | Lecture 1 | Intermediate DSA
ascensionix
107 views•2026-05-29
So What's Odin Lang Even Good For
TechOverTea
131 views•2026-06-01











