This architecture smartly pivots from wasteful parameter bloat to algorithmic recycling, proving that efficiency can occasionally rival raw scale. However, calling it a "shock" to industry giants is mostly marketing hyperbole for what is essentially a clever engineering trade-off.
Claude Mythos Clone Shocks Anthropic and OpenAI
All right. So, something pretty interesting is happening right now in the AI space, and it's not coming from a big lab release, not from OpenAI, not from Anthropic, but from a 22-year-old who basically looked at one of the most secretive architectures in the industry and tried to rebuild it from scratch.
And the wild part is the idea actually makes sense.
There's been a lot of talk around Claude Mythos, this supposed architecture that people have been hinting at as something different, something potentially more powerful, maybe even too dangerous, depending on who you listen to.
No official paper, no full breakdown from Anthropic, just fragments, speculation, and a lot of curiosity. So, what this guy, Kye Gomez, did is take all the public research, all the hints, all the patterns we're seeing across newer models, and he built something called Open Mythos. Fully open source, implemented in PyTorch, and not as a copy of Mythos, but as a hypothesis. A testable one. And at the center of it is this idea called a recurrent depth transformer, or RDT.
Now, this is where things start shifting. Most models you're familiar with, GPT-style models, Llama, Mistral, all of them follow the same basic structure. You stack layers.
Each layer has its own weights. You want more capability, you add more layers, more parameters, and suddenly you're at billions or even trillions of parameters. RDT flips that. Instead of stacking more layers, you take a smaller set of layers and you run them multiple times.
In Open Mythos, that loop runs up to 16 times. Same weights, reused again and again, refining the internal state each time. So, instead of a deeper model, you get deeper thinking during inference.
The way it's structured is actually pretty clean. There's a prelude at the start, which encodes the input once. Then you have the recurrent block, which is the core, looping multiple times. And finally, a coda at the end, which produces the output.
Inside that loop, something interesting happens. Each iteration updates the hidden state using a mix of three components: the previous state, the original input signal, and the transformer computation itself. That reinjection of the input every loop is important, because otherwise the model would drift too far away from what it was supposed to process.
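The prelude, loop, and coda structure described here can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the Open Mythos code: the functions `prelude`, `block`, and `coda`, the scalar gains `a` and `b`, and the update form h_next = a*h + b*e + block(h, e) are all hypothetical stand-ins for the learned components the video mentions.

```python
import math

def prelude(tokens):
    # Hypothetical encoder: turns the input into a vector e exactly once.
    return [float(t) / 10.0 for t in tokens]

def block(h, e):
    # Stand-in for the shared transformer block (same weights on every loop).
    return [math.tanh(hi + ei) for hi, ei in zip(h, e)]

def coda(h):
    # Hypothetical output head: collapses the final hidden state to one number.
    return sum(h) / len(h)

def recurrent_depth_forward(tokens, n_loops=16, a=0.5, b=0.3):
    e = prelude(tokens)      # the input is encoded once
    h = [0.0] * len(e)       # initial hidden state
    for _ in range(n_loops):
        f = block(h, e)
        # Mix of previous state, re-injected input, and block computation.
        h = [a * hi + b * ei + fi for hi, ei, fi in zip(h, e, f)]
    return coda(h)

out = recurrent_depth_forward([1, 2, 3])
```

Because `e` is added back on every pass, the state stays tied to the original input however many times the loop runs.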
Mathematically, it's expressed as an internal update rule, where the hidden state evolves step-by-step, controlled by learned matrices that decide how much past state and input to keep.
Now, here's where it gets even more efficient. Instead of a standard feedforward layer, Open Mythos uses a mixture of experts setup, around 384 experts in total, each one specialized for different kinds of tasks. But at any given time, only a small subset is active. In the case of Kimi K2.6, for example, only eight experts are selected per input. So, you get this combination of breadth and depth. The MoE gives you access to a wide range of specialized knowledge, while the looping gives you deeper reasoning. And crucially, each loop can activate different experts, so it's not just repeating the same computation over and over.
That answers one of the biggest criticisms people have when they first hear this idea. Running the same thing 16 times sounds inefficient. It sounds like wasted compute. But if each pass routes through different experts, then each pass is actually adding new information.
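To make the routing idea concrete, here is a minimal sketch of top-k expert selection, assuming a router that produces one score per expert. The scores are random placeholders, `route_top_k` is a hypothetical helper, and the numbers 384 and 8 are simply the figures quoted in the video.

```python
import math
import random

def route_top_k(scores, k=8):
    # Indices of the k highest router scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the mixing weights sum to 1.
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return {i: exps[i] / z for i in top}

rng = random.Random(0)
n_experts = 384
# Placeholder router scores for the same token at two different loop steps.
scores_loop1 = [rng.gauss(0, 1) for _ in range(n_experts)]
scores_loop2 = [rng.gauss(0, 1) for _ in range(n_experts)]

sel1 = route_top_k(scores_loop1)
sel2 = route_top_k(scores_loop2)
active_fraction = len(sel1) / n_experts  # 8 of 384 experts, about 2%
```

Since the hidden state changes between loops, the router scores change too, which is how a later pass can land on a different set of experts than an earlier one.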
So, instead of stacking hundreds of layers with different weights, you reuse a smaller set of weights in a smarter way. And the results are kind of crazy.
And honestly, that same idea of getting way more output from a smarter system instead of just brute forcing everything is exactly why tools like Higgsfield start getting interesting, too.
Higgsfield is sponsoring today's video, and their Marketing Studio takes that same kind of compression and applies it to ad creation. Normally, making ads is still messy. You need a product angle, a script, footage, edits, maybe a creator, maybe a voiceover, and then you still need multiple versions to test. It turns into this whole chain of steps that eats time fast. Marketing Studio collapses that down. You can paste in a product link or upload an image, and it turns that into multiple finished ad formats inside one workflow. So, instead of getting one generic output, you can generate UGC-style videos, tutorials, unboxings, product reviews, faster-cut promo ads, even more polished TV-style creatives, all built around the same product. And it runs on Seedance, too, which is what handles the motion, visual consistency, and overall quality.
You can even use your own face or generate an avatar inside the platform, then keep that same identity across the different videos. So, the whole thing feels less like patching together random tools and more like actually having an ad engine. If you want to try Higgsfield Marketing Studio, the link is in the description.
All right. Now, back to the video. There's research showing that a 770-million-parameter RDT can match the performance of a 1.3-billion-parameter standard transformer trained on the same data. That's about 40% fewer parameters with similar output quality. That alone already challenges one of the core assumptions in AI scaling. But there's more.
All of this reasoning happens entirely in latent space. There are no intermediate tokens generated during the process. The model doesn't think step-by-step in text like chain-of-thought prompting. It doesn't write out its reasoning and then read it back. It just thinks internally.
Sixteen iterations, all happening inside the hidden state vectors, and then at the end, you get a single output. This is fundamentally different from how most people understand AI reasoning today.
Chain of thought is visible reasoning. RDT is hidden reasoning. And there's a big advantage here. Because it's operating in continuous space, it can represent multiple possible reasoning paths at once, something closer to a breadth-first search, all happening inside one forward pass.
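The breadth-first-search comparison can be illustrated with a loose analogy, not a description of how any real model stores its state. In this sketch a single vector holds "mass" on every node reached so far, and one loop step pushes that mass along every outgoing edge at once. The graph `adj` and the helper `expand_frontier` are made up for the example.

```python
def expand_frontier(h, adj):
    # One loop step: every node that already has mass spreads it to its neighbors.
    new = list(h)
    for i, mass in enumerate(h):
        if mass > 0:
            for j in adj[i]:
                new[j] = min(1.0, new[j] + mass)
    return new

# Hypothetical reasoning graph: two routes lead from node 0 to node 3.
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
h = [1.0, 0.0, 0.0, 0.0, 0.0]  # start with all mass on node 0

for _ in range(2):
    h = expand_frontier(h, adj)

reachable = [i for i, m in enumerate(h) if m > 0]
```

After two loops the vector covers nodes 0 through 3, with both routes to node 3 explored in the same pass. A text-based chain of thought would have to commit to one route at a time.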
There are also experiments backing this up. One of them looks at systematic generalization. Basically, can the model handle combinations of knowledge it never saw during training? Standard transformers struggle with this. They tend to fail when the exact combination isn't in the data set. The recurrent transformer handled it.
Another test looked at depth extrapolation. The model was trained on reasoning chains up to 20 steps, then tested on 30-step problems. Standard transformers collapsed. The recurrent model just added more loops and kept going. So, instead of being limited by what it saw during training, it can extend its reasoning dynamically at inference time.
That's a big deal. Because it suggests that the bottleneck in current models isn't knowledge, it's the ability to combine that knowledge effectively. And looping seems to unlock that. Now, of course, this kind of architecture comes with its own problems. One of the biggest is stability. If you keep looping, the hidden state can explode.
Values grow uncontrollably, and the model breaks.
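A scalar toy shows why this happens. Assume, purely for illustration, that each loop multiplies the state by a fixed gain `a` and adds a constant input `inject`. A real model uses matrices, where the analogous condition involves the spectral radius, but the one-dimensional case already shows both regimes.

```python
def run_loops(a, n_loops, h0=5.0, inject=0.1):
    # Repeatedly apply h <- a * h + inject, a linear recurrence with fixed gain a.
    h = h0
    for _ in range(n_loops):
        h = a * h + inject
    return h

unstable = run_loops(a=1.5, n_loops=16)   # gain above 1: grows geometrically
stable = run_loops(a=0.9, n_loops=1000)   # gain below 1: settles at inject / (1 - a)
```

With a gain above 1, sixteen loops are enough to blow the state up by orders of magnitude. With a gain below 1, the state converges to a fixed point no matter how many loops are added.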
This is something that's been a known issue with recurrent architectures for a long time. Open Mythos addresses this using something called linear time-invariant injection, based on the Parcae paper. Basically, it constrains the system so that the hidden state remains stable, no matter how many loops you run.
There's also the opposite problem. Too many loops can lead to overthinking. The model goes past the correct answer and starts drifting into noise. To solve that, they use adaptive computation time. Each token gets a learned signal that decides when to stop looping. Harder parts of the input get more compute. Easier parts stop early.
So, now you have dynamic reasoning depth per token.
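A minimal sketch of the halting idea, in the general style of adaptive computation time: each loop emits a halting probability for a token, and looping stops once the running total crosses a threshold. The probabilities below are invented, and `loops_used`, `threshold`, and `max_loops` are hypothetical names, not taken from Open Mythos.

```python
def loops_used(halt_probs, threshold=0.99, max_loops=16):
    # Accumulate per-loop halting probabilities until the budget is spent.
    total = 0.0
    for step, p in enumerate(halt_probs[:max_loops], start=1):
        total += p
        if total >= threshold:
            return step
    # Never crossed the threshold: fall back to the loop cap.
    return min(len(halt_probs), max_loops)

easy_token = loops_used([0.6, 0.5, 0.9])   # confident early, stops quickly
hard_token = loops_used([0.1] * 16)        # low confidence, keeps looping
capped_token = loops_used([0.01] * 16)     # never confident, hits the cap
```

The easy token stops after 2 loops, the hard one after 10, and the last one runs all 16. Compute is spent where the model is least sure.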
On top of that, there are depth-wise LoRA adapters. These are small parameter additions that slightly modify behavior at each loop step. So, even though the base weights are shared, each iteration isn't identical.
And then there's attention. Instead of standard attention, Open Mythos uses something similar to multi-head latent attention from DeepSeek. It compresses key-value pairs into a lower-rank representation, reducing memory usage by 10 to 20 times.
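The memory saving is mostly arithmetic. A standard cache stores a key and a value vector per head for each token, while a latent cache stores one compressed vector. The sketch below uses hypothetical sizes (32 heads, head dimension 128, latent dimension 512) chosen only to land inside the range quoted above, and it ignores details such as any extra positional key dimensions.

```python
def kv_values_per_token(n_heads, head_dim):
    # Standard attention caches one key and one value vector per head.
    return 2 * n_heads * head_dim

def compression_ratio(n_heads, head_dim, latent_dim):
    # Latent attention caches a single low-rank vector per token instead.
    return kv_values_per_token(n_heads, head_dim) / latent_dim

# Hypothetical configuration, not taken from any published model card.
ratio = compression_ratio(n_heads=32, head_dim=128, latent_dim=512)
```

With these sizes the standard cache holds 8,192 values per token per layer against 512 for the latent cache, a 16x reduction.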
So, you're getting efficiency across multiple layers of the system. Fewer parameters, less memory, more flexible reasoning. And all of this comes together into a pretty clear idea.
Scaling might shift. Instead of training bigger models, the focus might move toward letting models think longer during inference. That's a completely different direction.
Now, while all of this is happening on the research side, you also have companies pushing in parallel directions with actual deployed models. Moonshot AI just released Kimi K2.6, and this thing is massive, 1 trillion parameters.
But even here, you see similar ideas showing up. It uses mixture of experts with 384 experts, again only activating a small subset per input. It uses multi-head latent attention, similar to what we just talked about, to compress attention data and reduce hardware requirements. The activation function is SwiGLU, which is more efficient than older approaches and already used in models like Llama. And then, you have multimodal capability through a 400-million-parameter vision encoder, allowing it to process both text and images.
But what stands out more is how it handles tasks. It can spawn up to 300 agents for complex workflows. These agents break tasks into sub-steps and execute them in parallel. So, instead of one model doing everything sequentially, you get this distributed execution system.
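The fan-out pattern itself is simple to sketch. The example below is a generic illustration using Python's standard thread pool, not Moonshot's API: `split_task`, `run_agent`, and `run_parallel` are hypothetical helpers, and `run_agent` just returns a string where a real system would call a model.

```python
from concurrent.futures import ThreadPoolExecutor

def split_task(task, n_parts):
    # Hypothetical planner: breaks one task into numbered sub-steps.
    return [f"{task} / part {i + 1}" for i in range(n_parts)]

def run_agent(subtask):
    # Placeholder for an agent; a real one would call a model here.
    return f"done: {subtask}"

def run_parallel(task, n_agents=4):
    subtasks = split_task(task, n_agents)
    # Each sub-step runs on its own worker; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        return list(pool.map(run_agent, subtasks))

results = run_parallel("summarize report", n_agents=3)
```

The results come back in the same order as the sub-steps, so a final step can stitch them together without extra bookkeeping.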
There's also something called claw groups, which lets the model bring humans into the loop, splitting tasks between AI agents and real people. And performance-wise, Moonshot claims it outperforms GPT-5.4 and Claude Opus 4.6 on multiple benchmarks.
On HLE full, which is one of the hardest benchmarks out there with around 2,500 doctorate-level questions across more than 100 fields, Kimi K2.6 scored 54. Opus got 53, GPT-5.4 got 52.1.
So, very close, but slightly ahead. Now, of course, these are company-reported benchmarks. So, you always take them with a bit of caution. Every company tends to highlight their strongest results. But the trend is clear.
Efficiency, modularity, and parallelism are becoming more important than just raw parameter count. And then, on another front, you've got xAI pushing into voice.
They just released new speech-to-text and text-to-speech APIs under the Grok ecosystem. And these are already being used in Tesla vehicles, Starlink support systems, and mobile apps.
So, this isn't new tech. It's production-tested tech now being exposed to developers.
The STT side supports 25 languages, real-time and batch transcription, speaker diarization, word-level timestamps, and 12 audio formats.
The TTS side has five voices: Ara, Eve, Leo, Rex, and Sal, across 20 languages, and can even include expressive tags like laughter or sighs.
Pricing is aggressive. $0.10 per hour for batch transcription, $0.20 per hour for streaming, and $4.20 per 1 million characters for text-to-speech. That's cheaper than most competitors right now.
And performance-wise, at least according to xAI, it's strong. On phone call entity recognition, Grok's STT has a 5% error rate. ElevenLabs is at 12%, Deepgram at 13.5%, AssemblyAI at 21.3%. That's a significant gap, especially for industries like healthcare, law, finance, where accuracy really matters.
But again, these are self-reported numbers.
ElevenLabs has years of optimization in voice quality and nuance, which might not show up in benchmark tests. So, real-world performance still needs to be judged by actual usage. Still, xAI has one big advantage. Scale. Millions of interactions already running through Tesla and Starlink systems. So, even if it's not perfect, it's already proven in real environments.
Anyway, that's it for this one. Let me know what you think about this direction, and I'll catch you in the next one.