This video brilliantly reframes a buggy childhood memory as a sophisticated case study in the technical evolution of voice recognition. It turns a failed Nintendo peripheral into a compelling lesson on the history of human-computer interaction.
Deep Dive
Hey You, Pikachu! A Game for Children
Wow, that's so cute. That's adorable, man.
I wonder what people got to say about that.
Huh?
Is he just allowed to lie like that?
3,000 [music] likes.
Brother, there is no such thing as a wild guinea pig. [music] Hey You Pikachu is a strange entry in the Pokémon franchise. Advertised as a game in which you can talk to Pikachu, it is one of only two games to use the Nintendo 64's voice recognition unit (VRU); the other was released only in Japan. Ostensibly, the player can control the flow of gameplay by speaking plain English to Pikachu and watching him act in response to those commands.
I can't [music] take it anymore. Reel it in. It's an adorable concept, but a bit far ahead of the technology. Reviewers at the time gave it credit for being fun for children, but this is completely independent of whether the VRU actually functioned as advertised. In the same way a little kid might enjoy playing and losing the first level of a Sonic game over and over again, the magic of speaking to Pikachu and watching him listen and react was more than enough to make the game worth playing, even if the game had no clue what you were saying at all. The set of words Pikachu could recognize was never clear, which spawned some urban legends. It was widely believed, for example, that saying "Sega" would make Pikachu angry. In my family, it was "Butterfree likes butter." And why not? This technology was capable of interpreting human speech, hearing the word "carrot" and knowing it meant "carrot." As far as I knew, the thing was fluent in English.
In the years since, I must admit that Hey You Pikachu never stuck with me the same way other N64 games like Pokémon Snap did. I couldn't even tell you if I ever beat it. The loading screen, the music, the memories of sitting in front of the CRT saying "Butterfree likes butter": they all just flashed through my mind whenever I had to move the clunky peripheral aside to find something else, or sifted past the cartridge to play a different game.
But that all changed.
It was a year of great loss.
I had lost a friend, my job, even my apartment. I shed some nostalgic tears as I came across my Nintendo 64 while packing, and reminded myself that wherever I wound up, I would always have this source of innocent joy.
But then I booted up Animal Forest and this gorilla was really mean to me, so I got really mad and packed it [music] up.
But while trying to fit all those strangely shaped N64 items safely into a box, I felt a magnetic pull toward the clumsy [music] microphone, the smiling mouse, so much nicer than that gorilla from that other game. I promised myself that once my life was back in order, I would see what memories that little plastic box held.
Wait, uh where's the escape key?
Oh, I see. And there's this, which is like a co-pilot key instead of right control. Okay, uh let's try this.
Uh how to rebind keys Arch Linux. Audio not working Arch Linux. Video camera not working Arch. Microphone not working Arch. In reflexive accordance with my no fun allowed principle, I found myself looking up the VRU data sheet before I unpacked my games. Strangely, the search results were sparse. This is one of the least studied, poorest documented Nintendo products I've ever seen. Well, maybe it's not objectively strange that there aren't many people reverse engineering a device used for one mediocre game, but anyway, here's what I found. The device has a set list of recognized sound units that it strings together to build words and phrases.
Given the time period in which the device was built, we can assume it does so by breaking the sound down into short windows and comparing each window's time-frequency pattern (its spectrogram) to that list of sounds. In Japanese, syllables are mostly coded one-to-one or two-to-one by the members of the kana syllabaries. In English, we famously don't have such a mapping. We instead have phonemes that map barely above chance to combinations of letters. Of course, the N64 isn't reading in letters; it's reading in binary strings corresponding to phrases. So, the underlying concepts aren't all that different between the English and Japanese versions of the game.
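The video only assumes that the VRU matches windowed spectrograms against stored sound templates; its real algorithm is undocumented. A minimal sketch of that idea in Python, using a naive DFT and two pure tones as stand-ins for recorded phoneme templates (both assumptions for illustration), might look like this:

```python
import cmath
import math

def frame_spectra(signal, frame_len, hop):
    """Split a signal into Hann-windowed frames and return each frame's magnitude spectrum."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)) for n in range(frame_len)]
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT magnitudes for bins 0..frame_len // 2 (fine for a sketch, slow for real audio)
        spectra.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ])
    return spectra

def spectral_distance(a, b):
    """Euclidean distance between two spectrograms, frame by frame (extra frames are ignored)."""
    return math.sqrt(sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb)))

def best_template(signal, templates, frame_len=64, hop=32):
    """Return the label of the stored template whose spectrogram is closest to the input's."""
    target = frame_spectra(signal, frame_len, hop)
    scores = {label: spectral_distance(target, frame_spectra(t, frame_len, hop))
              for label, t in templates.items()}
    return min(scores, key=scores.get)

# Usage: two pure tones stand in for recorded phoneme templates (hypothetical data).
RATE = 8000

def tone(freq, n=800):
    return [math.sin(2 * math.pi * freq * i / RATE) for i in range(n)]

templates = {"low": tone(300), "high": tone(1200)}
print(best_template(tone(1150), templates))  # high
```

Recognizers of that era commonly aligned the input to each template with dynamic time warping instead of comparing frame by frame; this sketch skips that step to stay short.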
Reading them out in English is significantly stupider.
In any case, the VRU makes a few good guesses at the phoneme string and then compares the guesses to a list of phrases that the game recognizes. The specifics of this process are secret Nihon-ji knowledge, but that should be enough to get started. Get started with what, exactly? Why are you asking such a hostile question? Well, maybe there's a neat project in there. Say you were, I don't know, blind and deaf, but still capable of speech, and you were determined to play Hey You Pikachu. You would need a robust way to produce speech that the game would recognize as one of its commands. If you've ever played this game, you know that robustness is not its strong suit. If anything, the game seems to misunderstand you more often than it understands you, and I've broken down the reasons why into an internet-friendly numbered list.
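The VRU's actual guess-to-phrase comparison is, as the video says, secret. One plausible stand-in is to score each of the recognizer's top phoneme-string guesses against every recognized phrase by edit distance and accept the closest phrase within a threshold. The phoneme labels, the vocabulary, and `max_distance` below are hypothetical:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete pa
                           cur[j - 1] + 1,            # insert pb
                           prev[j - 1] + (pa != pb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]

def match_phrase(guesses, vocabulary, max_distance=1):
    """Return the phrase closest to any of the guesses, or None if none is within max_distance."""
    best_phrase, best_dist = None, max_distance + 1
    for guess in guesses:
        for phrase, phonemes in vocabulary.items():
            d = edit_distance(guess, phonemes)
            if d < best_dist:
                best_phrase, best_dist = phrase, d
    return best_phrase

# Hypothetical vocabulary with ARPAbet-style phoneme labels.
vocabulary = {
    "pikachu": ["P", "IY", "K", "AH", "CH", "UW"],
    "onion": ["AH", "N", "Y", "AH", "N"],
}
guesses = [["B", "IY", "K", "AA", "CH"], ["P", "IY", "K", "AA", "CH", "UW"]]
print(match_phrase(guesses, vocabulary))  # pikachu
```

Keeping several guesses is what lets the second, closer guess rescue the command when the first one is badly off.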
>> [music] >> The same word spoken at two different times is not sonically the same word. The bachata music outside my window corrupts every element of my life, but most critical to [music] my happiness, it corrupts the speech recognized by the video game Hey You Pikachu. The VRU makes more than one guess at the phrase to try to get around this, and the game's list of recognized phrases is also built with this in mind. For example, the Japanese version of the game has 15 Pikachus, six konnichiwas, and 10 of whatever this is. That makes the game much more likely to correctly register these words, even through the noise of that one Juan Luis Guerra song.
On the other hand, it only has one daikirai, "I hate," and one razuberi, or raspberry for my English-language viewers. The probability that one of these words is misrecognized is relatively high, which makes the vegetable game more than a bit frustrating.
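Why do 15 copies of "Pikachu" help? Under the simplifying assumption that each vocabulary entry independently matches a noisy utterance with the same probability p, the chance that at least one of k entries matches is 1 - (1 - p)^k. The value p = 0.3 below is made up for illustration, and real guesses from one utterance are correlated, so treat these numbers as optimistic:

```python
def recognition_probability(p_single, copies):
    """P(at least one of `copies` entries matches), assuming each matches independently with p_single."""
    return 1 - (1 - p_single) ** copies

# p = 0.3 is a made-up per-entry match rate, not a measured VRU figure.
for copies in (1, 6, 10, 15):
    print(copies, round(recognition_probability(0.3, copies), 3))
# 1 0.3
# 6 0.882
# 10 0.972
# 15 0.995
```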
Voice recognition software can handle pretty crazy stuff, but remember, we're not talking about the age of neural networks and large language models.
>> [music] >> We're talking about the age of Microsoft Sam.
I place one and then the sound of the helicopter blade.
All right.
You lost.
Lost no make chills.
One obvious issue is that there's no such thing as the pronunciation for a word. Like, we have this, and some of this, and even a little bit of this.
But I have a hard time believing this guy would be getting past the vegetable game. Whether for this reason or otherwise, the game was never released in Australia, Europe, Korea, or Hong Kong. Still, across the United States, we have an incredible and beautiful diversity of accents. We have this, a bit of this, and some of this.
I can't say with certainty what accent the player is expected to have, but there is such a thing as a standard American accent. This is usually a learned accent. Like, do you think this guy from Jersey City grew up talking like this? A R in terms of is this a postseason team? Executives and ownership will say... If this is the expected accent, I imagine that most children outside of higher-class northern families struggled with this one. And I do specifically mean children.
I don't watch the news because I'm a kid, and apparently every time apparently grandfathers gives me the remote after we watch the Powerball.
This game is designed for children in one of the most literal senses. The main mechanic is optimized for the voice of a child. You may have noticed that children don't really sound like us. I don't watch the news.
And that's not just a stereotype. As we go through puberty, our vocal folds become larger and thicker, and probably some [music] other science things happen, primarily causing our voices to deepen. For example, a male with a sultry baritone voice like myself has a fundamental frequency of around 100 hertz. The average female's is closer to 200 hertz. The average child's, and therefore that of the Hey You Pikachu target audience, is somewhere between 250 and 300 hertz, well over an octave above my own.
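The "well over an octave" claim can be checked with the standard interval formula, 12 · log2(f_target / f_source) semitones, where 12 semitones make one octave. Using the 100 Hz and 250 to 300 Hz figures quoted above:

```python
import math

def semitones_between(f_from, f_to):
    """Pitch interval from f_from to f_to in semitones (12 semitones = 1 octave)."""
    return 12 * math.log2(f_to / f_from)

for child_f0 in (250, 300):
    print(child_f0, round(semitones_between(100, child_f0), 1))
# 250 15.9
# 300 19.0
```

So the gap is roughly 16 to 19 semitones, between one and a half and two octaves' worth of shift short of 24.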
>> [music] >> So, that's where my research led me before I even plugged in my N64.
>> [music] >> As it turns out, I was dead on. The game took so much effort and repetition to play that I couldn't even make it past the first day without getting too frustrated to push on. Reel it in.
Reel it in.
Reel... I can't take it anymore. But this, not to sound too dramatic, deeply haunted me. With all of this chaos, I had gone from nostalgic yearning to technical intrigue to video concept to societal critique to the cruel recognition that something you loved so much as a child just kind [music] of sucked, and that you would never be able to experience loving that game again.
Unless.
So, even MATLAB comes with Copilot now.
I don't know if it was bought by Microsoft or something, but at least you can turn it off. Can I rebind it to right Control?
In order to experience playing this game again, I would need to find a way to consistently produce sounds that it would recognize as its built-in phrases.
The obvious choice would be to use one of a million modern text-to-speech programs that I assume are primarily used as AI girlfriends. But this would be a far cry from playing the game myself. Whatever goes into this microphone must be distinctly my own. I was able to dig up two critical resources for the US version of the game: the phoneme-to-hex-code map and the list of 640 recognized hex strings.
Combining the two, I can get a phonetic list of recognized commands.
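The format of those two resources isn't shown in the video. A minimal sketch of combining them, assuming one byte (two hex digits) per phoneme and an invented, ARPAbet-style code table, could look like this:

```python
# Hypothetical code table: the real VRU code values are not reproduced here.
PHONEME_BY_CODE = {
    0x01: "AH", 0x02: "N", 0x03: "Y", 0x04: "P",
    0x05: "IY", 0x06: "K", 0x07: "CH", 0x08: "UW",
}

def decode_entry(hex_string):
    """Decode one recognized hex string, assuming one byte (two hex digits) per phoneme."""
    return [PHONEME_BY_CODE[code] for code in bytes.fromhex(hex_string)]

def phonetic_list(hex_entries):
    """Decode every entry of the recognized-strings list."""
    return [decode_entry(entry) for entry in hex_entries]

print(phonetic_list(["0102030102", "040506010708"]))
# [['AH', 'N', 'Y', 'AH', 'N'], ['P', 'IY', 'K', 'AH', 'CH', 'UW']]
```

If the real table uses a different layout (for example, variable-length codes), only `decode_entry` would need to change.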
This tells me the precise expectation for the pronunciation of each word. For example, for this word, the game accepts onion or onion. On the other hand, I might say onion or... So, to most closely match the expectation of the system, I put on my best standard American accent and recorded the recognized phonemes one by one.
I cropped them in Audacity and used MATLAB to denoise them, normalize the volume, and stretch them to the correct number of frames. Then, using my dictionary of phonetic pronunciations, I can concatenate the phonemes to produce an audio command.
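The MATLAB code isn't shown. A rough Python equivalent of the normalize, stretch, and concatenate steps, with denoising omitted, might look like the following; `recordings`, the per-phoneme frame counts, and `frame_len` are hypothetical, and the linear-interpolation stretch is a simplification that also shifts pitch:

```python
def normalize(samples, peak=0.9):
    """Scale samples so the loudest one has absolute value `peak`."""
    loudest = max(abs(s) for s in samples)
    return [s * peak / loudest for s in samples] if loudest else list(samples)

def stretch(samples, target_len):
    """Resample to target_len samples by linear interpolation.

    This naive stretch also shifts pitch by len(samples) / target_len; a
    pitch-preserving method (e.g. WSOLA or a phase vocoder) would avoid that.
    """
    if target_len == 1 or len(samples) == 1:
        return [samples[0]] * target_len
    scale = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def build_command(phonemes, recordings, frames_per_phoneme, frame_len):
    """Concatenate normalized, length-adjusted phoneme recordings into one command."""
    audio = []
    for ph, n_frames in zip(phonemes, frames_per_phoneme):
        audio.extend(stretch(normalize(recordings[ph]), n_frames * frame_len))
    return audio

recordings = {"AH": [0.0, 0.5, -0.25], "N": [0.1, -0.2]}  # toy stand-ins for cropped recordings
command = build_command(["AH", "N"], recordings, frames_per_phoneme=[2, 1], frame_len=4)
print(len(command))  # 12
```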
Did you catch that?
And how about this one? See, distinctly my own. Now, of course, this is not a very recognizable signal from a human perspective, but if the system is comparing windowed spectrograms to template phonemes, as I assume it is, this could be a robust and even optimal way to produce a recognizable signal from the device perspective. But this doesn't solve that child problem from earlier. For that, I'll need to modulate my audio.
According to that graph from before, this should amount in large part to pitch shifting upwards, emulating production by underdeveloped vocal cords. So, we get this. Ah, sounds great to me. In theory, this should solve all three problems, and I should be able to use this to play the game with these commands generated from my own voice. And in case you were wondering, it actually does work.
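The video doesn't say which pitch-shifting method was used. The simplest one, sketched below, reads the recording faster by the target ratio (2.5 moves 100 Hz to 250 Hz), which raises the pitch but also shortens the clip by the same factor; a phase vocoder or PSOLA would keep the duration. The zero-crossing estimator is only a rough check and is meaningful just for clean, tone-like signals such as the synthetic sine used here:

```python
import math

def pitch_shift_resample(samples, ratio):
    """Raise pitch by `ratio` by reading the signal `ratio` times faster (linear interpolation).

    Duration shrinks by the same factor, the 'chipmunk' side effect of this method.
    """
    last = len(samples) - 1
    out = []
    for i in range(int(last / ratio) + 1):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def estimate_frequency(samples, rate):
    """Rough f0 estimate from zero crossings (two per cycle); only meaningful for clean tones."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return crossings / 2 / (len(samples) / rate)

RATE = 8000
voice = [math.sin(2 * math.pi * 100 * n / RATE) for n in range(RATE)]  # 1 s, 100 Hz stand-in
child = pitch_shift_resample(voice, 250 / 100)
print(len(child), round(estimate_frequency(child, RATE)))
```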
The experience is taxing, and the recognition isn't 100% perfect, but it's infinitely easier than playing the game with my own natural speech.
But even though I now had the tools required to experience the game once again, I found myself completely unamused. Worse, I was >> [music] >> overcome with frustration. The same frustration that had been lingering for so long, soiling each and every daily action.
>> [music] >> So, I asked myself that hostile question.
What was the point again?
Wasn't I trying to relive some innocent joy detached from productivity, work, [music] rage?
But this was not nostalgic. This was not reliving the carefree joys of youth. And how could it be? It was so mechanized, artificial.
That's another casino.
And then, this is something called the AI native total spend management.
Clearly, the issue was a faulty optimization target. If my goal was to experience the game as I did as a child, I shouldn't be pursuing the best command recognition. That's efficiency, productivity. Instead, I should be pursuing the childhood experience itself. Now, you might not have known this, but I'm one of those New York Italians. I grew up on Long Island, but I was taught to speak this way in school so that nobody could tell that I'm Sicilian American. How am I doing?
When Hey You, Pikachu released, I was still a very young boy with a strong Long Island accent. I don't have any voice recordings of myself as a child, but we can get a sense of the voice from the island's most celebrated individuals. Let's get back to that guy, Billy Joel. Singer-songwriter Billy Joel, 9/11 first responder Steve Buscemi, and Nana.
>> [music] >> And I think that's about it. The characteristic markers are the replacement of the ah sound with aw, >> [music] >> and the dropping of the H and R consonants. So, a few extra sound files and a few minor edits to the code should do the trick.
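Those edits could be expressed as rewrite rules over the phoneme strings before concatenation. A minimal sketch, assuming ARPAbet-style labels and interpreting "dropping R" as the usual non-rhotic rule (R is kept only before a vowel), which is an assumption rather than something the video states:

```python
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

def long_island(phonemes):
    """Apply the accent markers to a phoneme list: ah -> aw, drop H, drop non-prevocalic R."""
    out = []
    for i, ph in enumerate(phonemes):
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if ph == "HH":
            continue  # drop H
        if ph == "R" and nxt not in VOWELS:
            continue  # drop R unless a vowel follows
        out.append("AO" if ph == "AA" else ph)  # replace 'ah' with 'aw'
    return out

print(long_island(["HH", "AA", "R", "T"]))  # "heart"
# ['AO', 'T']
```

Because the rules run on phoneme labels, the only new audio needed is a recording for each substituted sound.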
The result with pitch shifting is the closest I can come to recreating the voice of my youth within my own constraints.
But it's still not quite there.
The emotions just aren't resonating. I can hear it, but I can't feel it.
I can't see it.
>> [music] >> Why is it that even the things we do in the pursuit of joy make us so angry? At times, it almost feels like it's by design.
Like the higher powers are trying to keep us on edge. Other times, it feels like it's our own fault, like we're gravitating towards entertainment designed to piss us off. I don't know.
>> [music] >> Maybe it's the modern world, or maybe it's just being an adult, growing up.
But I guess it doesn't have to be that way. When I start to explode in rage at the 2FA prompt that broke the camel's back, maybe it's okay to get up and pet the dog, or the cat, or the other cat, or some turtles.
Maybe not the poisonous frogs.
And maybe what made that game so great as a kid wasn't that you could get through any of the levels. Maybe it was that you could just hang out and play it, like there was nothing else going on in the world. Maybe even now, I could just sit down and play a game.
Reel it in.
Reel. But maybe not that game.
>> [music] >> Just waking up in the morning, [singing] got to thank God. I don't know, but today seems kind of odd. No walking from the dog, [singing and music] no smoke. Mama cooked the breakfast with no I got my grill bone, but Dig That.
Finally got [music] a call from a girl I want to dig out. Hooped it up a little as I hit the [music] dough, thinking [singing] will I live another 24? I got to go cuz I got