Why Bag of Words Still Breaks Modern Embeddings
Added: 2026-05-27

626 views | 66 | 6:48 | marimo-team | Originally published: 2026-05-27

A sharp critique of the structural blindness in modern embeddings, reminding us that high-dimensional vectors still struggle with basic logic and syntax. It serves as a necessary reality check for an industry that often mistakes vector similarity for true semantic comprehension.

[00:00:00]Embedding models have a failure scenario. Even if you give them just a little bit of text, there is this bag-of-words phenomenon that they all still suffer from. That includes the cheap open-source models as well as the expensive ones from the frontier labs.

[00:00:12]I've got a notebook in front of me here that's going to explain all these different details, but before diving into that, I'm going to explain a little bit of linguistic theory to help set the stage. Okay, so just to prove a point, let me write down two sentences: "the lion eats a man", and this other sentence, "the man eats a lion". Now, the interesting thing with these two sentences is that from the perspective of just counting the words that are in there, they are identical. They have the same, quote-unquote, bag-of-words representation.

[00:00:40]If you put them all in a bag, shuffle them around, they have the same pieces.
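To make the "same pieces in the bag" idea concrete, here is a minimal sketch using only the Python standard library. It assumes lowercase, whitespace-separated tokens, which is a simplification; the helper name `bag_of_words` is made up for this illustration and is not taken from the notebook.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count lowercase, whitespace-separated tokens; word order is discarded."""
    return Counter(sentence.lower().split())


original = "the lion eats a man"
swapped = "the man eats a lion"
shuffled = "eats the lion man a"
negated = "the lion does not eat a man"

# Same words in a different order: the bags are equal.
print(bag_of_words(original) == bag_of_words(swapped))   # True
print(bag_of_words(original) == bag_of_words(shuffled))  # True

# Negation adds and changes words, so this bag differs.
print(bag_of_words(original) == bag_of_words(negated))   # False
```

A `Counter` keeps how often each word occurs but nothing about position, which is exactly the information a bag-of-words view throws away.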

[00:00:44]But if you think about language and the actual meaning, then you could argue that "the lion eats a man" is maybe a normal-ish sentence, but "the man eats a lion" is very much out of the ordinary. And you would hope that, whatever representation we come up with, these two sentences are going to be very dissimilar, because as far as meaning goes, they're almost the opposite. But we can do more. We could say, well, if this is our starting sentence, our template as it were, we can also come up with a different variation. We could also come up with "eats the lion man a". Again, this has the same bag-of-words representation because it uses the same words, but in this case the order is all jumbled up. Here too, as a linguist, you would look at that and go, well, surely these two things have to be dissimilar, because one of them is grammatically correct and the other one just isn't. And then finally, what you could also do is have a different bag-of-words representation, like "the lion does not eat a man". What's happening here is negation, and what you're hoping, of course, is that even though the stuff that's in blue here is all the same as what I've got above, there's enough context in this red bit that, again, the embeddings for these two sentences are going to be dissimilar. And unfortunately, that is not the case.

[00:01:53]Embedding models are in large part defined by the tokens that they receive.

[00:01:56]And if you have two sentences where the tokens that come in are practically the same, almost 80% let's say, then it's going to be quite hard, even for the most hardcore embedding models out there, to prove that there's a distinction or to show that they are maybe linguistically different. You could of course fine-tune a model to perform better at this task, but what I would like to do in this video is just empirically check whether it holds true: whether the embeddings of these sentences are in fact different from what we start with.
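One rough way to put a number on "practically the same tokens" is to measure how much of the original sentence's tokens show up again in the variation. The sketch below assumes whitespace tokens rather than a real model's subword tokenizer, and `shared_token_fraction` is a hypothetical helper, not something from the notebook.

```python
from collections import Counter


def shared_token_fraction(reference: str, variant: str) -> float:
    """Fraction of the reference's tokens that also occur in the variant.

    Uses multiset intersection, so repeated words are counted as often
    as they appear in both sentences.
    """
    ref = Counter(reference.lower().split())
    var = Counter(variant.lower().split())
    shared = sum((ref & var).values())
    return shared / sum(ref.values())


template = "the lion eats a man"

print(shared_token_fraction(template, "the man eats a lion"))          # 1.0
print(shared_token_fraction(template, "eats the lion man a"))          # 1.0
print(shared_token_fraction(template, "the lion does not eat a man"))  # 0.8
```

For the negated sentence only "eats" fails to reappear, so four of the five template tokens are shared. A subword tokenizer would likely split "eats" and "eat" into overlapping pieces, which would push the overlap higher still.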

[00:02:24]That brings me to the notebook. What you can see here is that I've got a couple of examples that have to do with swaps. So, "a man eats a lion", "a lion eats a man". "The quick brown fox jumps over the lazy dog", "the lazy dog jumps over the quick brown fox". These are all sentences where the topic of the sentence is swapped around. Then down below over here I've got shuffles. So, "I love coding in Python", "in love coding Python I", or all sorts of gibberish really. One sentence is proper, the other one is just shuffled around. And then at the bottom here I've got a bunch of negations. So, things like "the movie was not good, it was bad", and then here, "the movie was not bad, it was good". "I do not like coffee, I like tea", "I do not like tea, I like coffee". Now, in these cases the bag-of-words representation is also exactly the same. So, it's a bit harder, but you would of course hope that this sentence is going to be judged differently than this sentence. Or at least in some use cases you would really hope that. So, I'm calculating all of these different embeddings and I'm doing that across different models. I've got a couple of models that come out of the sentence transformers library, so the default one and some newer ones, but then I'm also using OpenRouter to calculate embeddings for the larger OpenAI models. I'm also taking Gemini, Qwen, and this other one that I found that should be [...]. Once you have embeddings like that, what you could then do is make a comparison matrix that looks a little bit like this. I'm just going to zoom in just a smidge there. What you can see is that I've got these pairs of sentences, like "a man eats a lion" and "a lion eats a man". And I'm calculating the cosine similarity between those two sentences and all the other sentences out there. And what do you see?
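The comparison matrix described here can be sketched as follows. This is not the notebook's code: to stay runnable without downloads or API keys it uses plain word-count vectors as a stand-in for model embeddings, and the function names are hypothetical. With the sentence transformers library installed, one could instead fill `vectors` from `SentenceTransformer(...).encode(sentences)`; which model name to pass is left open here.

```python
import math
from collections import Counter


def embed_bow(sentence: str, vocab: list[str]) -> list[float]:
    """Stand-in embedding: one word-count entry per vocabulary word."""
    counts = Counter(sentence.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sentences = [
    "a man eats a lion",
    "a lion eats a man",
    "i love coding in python",
    "in love coding python i",
]

# Shared vocabulary so every sentence maps into the same vector space.
vocab = sorted({word for s in sentences for word in s.lower().split()})
vectors = [embed_bow(s, vocab) for s in sentences]

# Pairwise cosine similarity between every sentence and every other one.
matrix = [[cosine(u, v) for v in vectors] for u in vectors]

for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

With the count-vector stand-in, each swap or shuffle pair scores 1.00 against its partner and 0.00 against the unrelated pair, which is the block pattern a pure bag-of-words representation would produce. The interesting question in the video is how close real models come to that pattern.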

[00:03:53]Well, across all these things that you could do, like shuffling, negation, or flipping the words, the pairs are always going to be very similar. And there's one exception over here, where the quick brown fox jumps over the lazy dog.

[00:04:07]I've got a shuffle variation of that, and I've also got the switcheroo variation. So, those are pretty much the same words, and then you do see a big spike in similarity. And you know, there are some values here where off-diagonal elements have more similarity in other places. That's all well and good, but in general, the main similarity that you'll find in here is for the pairs that just have the same bag-of-words representation. And this is for the standard sentence transformers model, but what I can also do is go for the OpenAI model, text embedding large. And you know, if I were to really zoom out, these two almost look the same. Not entirely. And one thing that I did think was kind of cute to show: if you squint your eyes a little bit, you do notice that there's a square here where things are just a little bit more similar than elsewhere. And that's because there's negation happening here. I guess that's true for both embedding models: you can see that if there's negation, it is picked up, but then any sentence that has negation in it is similar to any other sentence that also has negation in it. Now, one thing I thought was good to do at the end is a quick overview as well. So, I've got word swaps over here, and you can see the average cosine similarity for the pairs.

[00:05:13]And then if we swap words, most of the similarity is kept intact. If we do shuffles, again, most of the similarity is kept intact. And if we do negation, again, most of the similarity is kept intact. And that's the thing that you're always going to have. So, if you're doing anything where negation is a very important aspect, just be wary of this. Also be aware that the grammatical correctness of a sentence in no way informs the shape of the embedding, but the words that are in the bag totally do. Now, you could wonder to what extent this is going to be a huge problem, and you definitely should apply a bit of nuance there. If, I don't know, you're doing [unclear] for Wikipedia pages, then sure, the negation part is of course going to be a bit of a concern. But word swaps or shuffles, well, if you're interested in the topic of a document and maybe you want to cluster it, and especially if you're going to cluster based on larger pieces of text, then for all intents and purposes you don't have to care about the word swap or the shuffle. Because in those cases, you could argue that the presence of a word tells you more about the document than its grammatical correctness.
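The per-category overview can be sketched as a small function that averages cosine similarity over the pairs in each group. The sentence pairs below are the ones read out in the video; everything else is an assumption made for illustration: `average_pair_similarity` and `bow_embed` are hypothetical names, and the default embedder is a word-count stand-in rather than a real model. Any callable that maps a list of sentences to a list of vectors could be passed as `embed` instead.

```python
import math
import re
from collections import Counter
from statistics import mean

PAIRS = {
    "swap": [
        ("a man eats a lion", "a lion eats a man"),
        ("the quick brown fox jumps over the lazy dog",
         "the lazy dog jumps over the quick brown fox"),
    ],
    "shuffle": [
        ("I love coding in Python", "in love coding Python I"),
    ],
    "negation": [
        ("the movie was not good, it was bad",
         "the movie was not bad, it was good"),
        ("I do not like coffee, I like tea",
         "I do not like tea, I like coffee"),
    ],
}


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens with punctuation stripped."""
    return re.findall(r"[a-z']+", text.lower())


def bow_embed(sentences: list[str]) -> list[list[float]]:
    """Stand-in embedder: count vectors over the sentences' shared vocabulary."""
    counts = [Counter(tokenize(s)) for s in sentences]
    vocab = sorted(set().union(*counts))
    return [[float(c[word]) for word in vocab] for c in counts]


def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_pair_similarity(pairs, embed=bow_embed) -> dict:
    """Mean cosine similarity of the sentence pairs in each category."""
    summary = {}
    for category, examples in pairs.items():
        sims = []
        for left, right in examples:
            u, v = embed([left, right])
            sims.append(cosine(u, v))
        summary[category] = mean(sims)
    return summary


summary = average_pair_similarity(PAIRS)
for category, value in summary.items():
    print(f"{category:<9} {value:.3f}")
```

With the stand-in embedder every category averages 1.000, because each pair contains exactly the same words. That is the ceiling: the closer a real model's averages sit to it, the more its embeddings behave like a bag of words on these pairs.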

[00:06:09]Negation definitely still is an issue if you're going to be doing stuff with embeddings. So, if that is something that's going to be very important, because maybe you're doing something with customer service logs and you're asking if something is a problem, and then somebody says, "No, it is not. This is the problem." You know, in moments like that, negation can definitely still bite you. And in those moments, you do want to be careful with embeddings, or at least be aware of the fact that you might need more than just an embedding provider. Maybe mixing that in with a classical machine learning pipeline that can detect negation, or an LLM that tries to do the same, might need to be part of a larger system. If you're keen to play around with this notebook, links are in the show notes, but just remember: bag of words, tokens, those things still matter, even if you have a fancy embedding model.
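As one possible shape for such a larger system, the sketch below pairs an embedding similarity score with a very simple negation check. To be clear about what is swapped in: this is a rule-based cue list, not the trained classical pipeline or LLM mentioned above, and the cue list, the threshold, and the function names are all assumptions chosen for illustration.

```python
import re

# Hypothetical, deliberately small list of English negation cues.
NEGATION_CUES = {
    "no", "not", "never", "none", "nothing",
    "neither", "nor", "without", "cannot",
}


def has_negation(text: str) -> bool:
    """True if the text contains a negation cue or an n't contraction."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return any(t in NEGATION_CUES or t.endswith("n't") for t in tokens)


def needs_negation_review(a: str, b: str, similarity: float,
                          threshold: float = 0.8) -> bool:
    """Flag pairs that embed as similar but disagree on containing negation.

    `similarity` is expected to come from whatever embedding model is in use;
    `threshold` is an arbitrary illustrative cut-off.
    """
    return similarity >= threshold and has_negation(a) != has_negation(b)


print(has_negation("No, it is not. This is the problem."))  # True
print(has_negation("The lion eats a man"))                  # False

# The similarity value here is made up purely to exercise the function.
print(needs_negation_review(
    "the lion eats a man",
    "the lion does not eat a man",
    similarity=0.9,
))  # True
```

A check like this only notices when one text has negation and the other does not. It would not separate "I do not like coffee, I like tea" from "I do not like tea, I like coffee", where both sides are negated and only the scope differs, so handling that case needs something that actually models what is being negated.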
