Zonos 2 is an open-source text-to-speech model that achieves high-fidelity voice cloning at real-time speeds by using a Mixture of Experts (MoE) architecture. It has 8 billion total parameters but uses only 900 million active parameters during inference, which addresses the traditional trade-off between voice quality and processing speed in TTS models.
Deep Dive
Real-time Voice cloning, Kimi K2.7 CODE, GLM 5.2 and 3D reconstruction | AI News
AI broke the internet again and this week has been absolutely insane. We have a brand new open-source TTS model with high-fidelity voice cloning that sounds better than most paid services. MiniMax drops a massive new open-weight model that's the first of its kind to combine frontier coding, a 1 million token context, and multimodal understanding all in one. Kimi releases a new open-source coding agent that uses 30% fewer thinking tokens than their previous model. Z.ai drops a new GLM model with a 1 million token context window. Nvidia releases a really impressive motion generation system that can run 350,000 animation skills in real time. Microsoft releases a tiny 7 billion parameter model that can actually control your computer. We have a new open-source robotic exoskeleton that gives you real-time haptic touch feedback. We have a new open-source 3D surface reconstruction model from some top universities and a next-generation fully open vision language model and a lot more.
First up, this AI is really impressive. It's called Zonos 2 and this is a real-time text-to-speech model with high-fidelity voice cloning from a company called Zyphra. Now, what makes this one interesting is the architecture. You see, most TTS models have to make a trade-off between quality and speed. The better the cloning quality, the slower it usually runs.
Zonos 2 tries to solve that trade-off using a mixture of experts architecture.
In fact, it's actually the first open-source TTS model to use an MoE architecture at all. It has 8 billion total parameters but only 900 million active parameters at inference time. So, you get the feel of a much bigger model, but at real-time speeds. And the voice cloning quality on this is really good.
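To make the "8 billion total, 900 million active" idea concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind mixture-of-experts models. It is an illustration only, not Zonos 2's actual implementation: the expert count (16), the choice of k (2), and the random router scores are all assumptions made for the example. Only the two parameter counts come from the video.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Parameter counts quoted in the video.
TOTAL_PARAMS = 8_000_000_000
ACTIVE_PARAMS = 900_000_000
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # share of weights used per token

# Hypothetical router: 16 experts, 2 active per token (assumed values).
random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(16)]
chosen = route_top_k(scores, k=2)

print(f"active share of parameters: {active_fraction:.2%}")
print("experts used for this token:", chosen)
```

With the quoted figures, roughly 11% of the weights take part in any single forward step, which is why an MoE model can run closer to the speed of a much smaller dense model.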
Here are some examples.
>> Folks, nobody talks about Shinji Ikari the right way. Okay, no nobody. They are all saying, "Oh, he's conflicted. He's emotional. He's hesitant." And I say maybe.
>> You can hear how closely the generated voice matches the reference speaker's tone, rhythm, and expressiveness.
It doesn't have that flat synthetic feel you usually get from open-source TTS models. Zonos 2 performs especially well on speaker similarity and prosody metrics, which basically means it sounds more like the actual person, and it speaks more naturally in terms of emphasis and rhythm. The awesome thing is they've released this already under an Apache 2.0 license. So, if you go to their Hugging Face page, you can download the weights. The inference code is also on GitHub, and they're even hosting it for free on Zyphra Cloud during the launch period if you just want to try it without setting anything up. If you're interested in reading further, I'll link to this page in the description below.
Also this week, MiniMax releases a pretty incredible new model. It's called MiniMax M3, and this is a really big deal because it's the first open-weight model that combines three things that were previously only available in closed frontier models: frontier-level coding and agentic performance, a 1 million token context window, and native multimodal support for images, video, and desktop computer use. All three of those together in a single open-weight model. That's actually a first. Now, the 1 million token context is powered by a brand new architecture they developed called MSA, which stands for MiniMax Sparse Attention. In simple terms, think of it as a more efficient way of handling extremely long contexts without the memory costs exploding. And this context length is designed specifically for complex agentic tasks where the model needs to hold an entire code base or a long document thread in memory all at once. In terms of coding and agentic performance, MiniMax says M3 approaches the level of leading closed-source models on tasks like bug fixing, front-end and back-end development, and performance optimization. Now, the awesome thing is they've released the weights already on Hugging Face. You can also try it right now through MiniMax Code, which is their coding product. The links to the technical report and the model page are in the description below.
If you're interested in reading further, I'll link to this page in the description below.
Next up, Kimi releases another pretty useful open-source coding model. It's called Kimi K2.7 Code. And this is the latest coding-focused agentic model from Moonshot AI, the team behind the Kimi series. Now, K2.7 Code is basically a direct improvement on K2.6. The key upgrades are stronger performance on real-world long-horizon coding tasks, better agentic task execution, and about 30% fewer thinking tokens compared to K2.6. That last one is actually pretty important. In other words, it reasons more efficiently to reach the same answer. So, you get faster responses and lower API costs for the same quality of output. Now, here are the actual numbers for your reference. On Kimi Code Bench V2, K2.7 Code shows a 21.8% improvement over K2.6. On Program Bench, it goes from 48.3 to 53.6.
And on MLS bench light, it jumps from 26.7 all the way to 35.1.
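For anyone who wants to compare those jumps on the same footing, the short sketch below turns the quoted before and after scores into absolute point gains and relative percentage gains. The scores are the ones read out above. The helper name `relative_gain` is just for illustration, and the video does not say whether the 21.8% figure for the first benchmark is relative or absolute, so it is left out of the calculation.

```python
def relative_gain(before, after):
    """Percentage improvement of `after` over `before`."""
    return (after - before) / before * 100.0


# (K2.6 score, K2.7 Code score) as quoted in the video.
scores = {
    "Program Bench": (48.3, 53.6),
    "MLS bench light": (26.7, 35.1),
}

point_gains = {name: after - before for name, (before, after) in scores.items()}
relative_gains = {name: relative_gain(b, a) for name, (b, a) in scores.items()}

for name in scores:
    print(f"{name}: +{point_gains[name]:.1f} points, "
          f"+{relative_gains[name]:.1f}% relative")
```

The second benchmark's gain is about three times the first in relative terms (roughly 31% versus 11%), even though the absolute point gains look closer together.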
Now, if you compare this to the closed frontier models, GPT-5.5 scores 69.0 on Kimi Code Bench V2, and Claude Opus 4.8 scores 67.4.
So, K2.7 Code at 62.0 is getting closer, but it's still behind both of those. On the agentic side, it's a similar story. On MCP Atlas, K2.7 Code scores 76.0 compared to 79.4 for GPT-5.5 and 81.3 for Claude Opus 4.8. Pretty close, but still not quite at the top. Now, I will say this is actually a pretty fair and transparent comparison since Moonshot tested all four models under equivalent settings with thinking enabled and the same agent harness. So, unlike some releases, this isn't a cherry-picked benchmark. The awesome thing is they've released this already on Hugging Face.
The weights are available under their open license, and you can try it right now through Kimi Code. If you're interested in reading further, I'll link to this page in the description below.
Also this week, Z.ai, which is the team behind the GLM model series, releases a pretty interesting update. It's called GLM 5.2, and the headline upgrade here is a genuinely usable 1 million token context window, up from 200,000 in GLM 5.1. Now, if you're not familiar with the GLM series, the predecessor GLM 5.1 scored 58.4 on SWE-Bench Pro, which at the time put it ahead of GPT-5.4 and Claude Opus 4.6 on that specific benchmark. So, this is a model family that has been genuinely competitive. GLM 5.2 is built on the same 744 billion parameter mixture of experts architecture as the previous GLMs. The main upgrade is the context jump and two new thinking effort levels, which let you dial between a faster, low-effort mode and a slower, high-effort reasoning mode. Now, I'll be honest here. They shipped this with no benchmark numbers at launch at all. Not a single SWE-Bench score, no HumanEval, nothing. Just a promise that benchmarks and open weights are coming next week.
So, these are just their own claims for now, and it's unclear why they didn't just hold the release until the numbers were ready. Anyway, it's live right now on every GLM coding plan tier.
The standalone API and the MIT licensed open weights are promised for next week.
A link to the model page is in the description below.
Next up, Nvidia releases something really impressive for game developers and animation teams.
It's called Motion Bricks, and this is a real-time motion generation system that can cover over 350,000 different motion skills using a single neural backbone, and it runs at 15,000 frames per second with just 2 milliseconds of latency. That's actually incredible for a neural network-driven system. So, how this works is Motion Bricks uses a modular latent generative backbone combined with what they call smart primitives. In simple terms, think of it as like a system where each motion skill is a reusable building block, like a brick, and you can stack them together to create any complex movement without needing to author custom transitions or have expert animation knowledge. Now, what's really impressive about this is that it works zero-shot on new tasks.
You don't need to fine-tune or retrain the model for a new game scenario. You just plug in a new smart primitive and it works. Here are some examples from their full uncut 2-minute 40-second Unreal Engine 5 demo. As you can see, the character can navigate complex environments, interact with objects, pick things up, sit down, jump over obstacles, and switch between completely different movement styles, all in real time. And every single motion you see there is neural network generated. No hand-authored animations, no foot locking, no blending tricks. That's pretty wild when you think about it. The awesome thing is they've released code already as part of Nvidia's full body control project. So, if you click on the GitHub link here, it ships with an interactive demo and a training pipeline, so you can start training your own Motion Brick style policies right now. A full release is coming in about a month. If you're interested in reading further, I'll link to this page in the description below.
Also this week, Microsoft releases a really useful small model for computer use. It's called Fara, and this is Microsoft's first agentic small language model designed specifically to control a computer. In other words, this model can see your screen, understand what's on it, and take actions like clicking buttons, filling out forms, and navigating through desktop applications. The awesome thing about this is the size. 7 billion parameters is tiny compared to the kind of models you'd usually need for this kind of task. And yet, Microsoft says it holds its own against larger, more resource-intensive computer use agents. Now, here are the actual numbers for your reference. On WebVoyager, which is a standard benchmark for web agents, Fara scores 73.5%.
That's actually higher than GPT-4o configured as a computer use agent, which scores 65.1% and higher than OpenAI's own computer use preview model, which scores 70.9%.
It also beats UI-TARS-1.5, which is in the same size class, at 66.4%. On DeepShop, which tests e-commerce tasks, Fara scores 26.2% compared to 16.0% for GPT-4o and just 11.6% for UI-TARS-1.5. And on their new WebTailBench, Fara scores 38.4% compared to 30.0% for GPT-4o.
Now, the awesome thing is it also does this way more efficiently. It completes tasks in around 16 steps on average compared to 41 steps for UI-TARS-1.5.
And the estimated cost per task is about 2 and 1/2 cents compared to roughly 30 cents for the larger proprietary agents.
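As a quick sanity check on those efficiency figures, the sketch below works out the step and cost ratios and scales the per-task cost to a batch of tasks. The step counts and per-task costs are the ones quoted above. The batch size of 1,000 tasks and the helper name `batch_cost_usd` are hypothetical, chosen only to make the difference easier to picture.

```python
# Average steps per task, as quoted in the video.
FARA_STEPS = 16
UI_TARS_STEPS = 41

# Estimated cost per task in US cents, as quoted in the video.
FARA_CENTS = 2.5
PROPRIETARY_CENTS = 30


def batch_cost_usd(cents_per_task, n_tasks):
    """Total cost in dollars for running n_tasks at a given per-task cost."""
    return cents_per_task * n_tasks / 100


step_ratio = UI_TARS_STEPS / FARA_STEPS      # how many times more steps
cost_ratio = PROPRIETARY_CENTS / FARA_CENTS  # how many times more expensive

N_TASKS = 1000  # hypothetical workload, not from the video
fara_total = batch_cost_usd(FARA_CENTS, N_TASKS)
proprietary_total = batch_cost_usd(PROPRIETARY_CENTS, N_TASKS)

print(f"steps: {step_ratio:.2f}x fewer, cost: {cost_ratio:.0f}x cheaper")
print(f"{N_TASKS} tasks: ${fara_total:.2f} vs ${proprietary_total:.2f}")
```

On these numbers the cost gap (about 12x) is much larger than the step gap (about 2.6x), which suggests most of the saving comes from the smaller model being cheaper per step, not only from taking fewer steps.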
That's actually a massive efficiency gap for a 7 billion parameter model. Now, the key innovation here is the data set they built alongside it. It's called WebTailBench, and this is a benchmark specifically designed to evaluate web-based task completion in realistic, hard-to-solve scenarios. The awesome thing is they've released this already. The model weights are on Hugging Face under an MIT license. The data set is also on Hugging Face, and the inference code is on GitHub. If you're interested in reading further, I'll link to this page in the description below.
In humanoid robot news, we have two really interesting open-source tools from Nvidia this week.
First up, we have SOMA. This is a unified parametric human body model that can work across multiple different body representations. You see, the problem with human body models right now is that there are a bunch of them: SMPL, SMPL-X, MHR, Anny, and others. And they're all mutually incompatible. If you train a motion model on SMPL data, you can't easily use it with an SMPL-X or MHR representation without a ton of custom engineering. SOMA solves this by creating a single canonical body topology and rig that acts as a universal pivot for all of these different models. In other words, you can mix and match identity sources and pose data at inference time without any additional custom adapters. And the whole pipeline is end-to-end differentiable and GPU-accelerated through NVIDIA Warp, which is pretty cool. It also links directly into other NVIDIA robotics tools, like their Kimodo motion generation model, their BONES-SEED motion capture data set, and their whole body control work. The awesome thing is they've released this already under Apache 2.0. So, if you click on the GitHub link here, you can download and use it for commercial projects right away. If you're interested in reading further, I'll link to this page in the description below.
Also, from NVIDIA this week, we have a really impressive exoskeleton for robots. It's called UME, which stands for Universal Manipulation Exoskeleton. And this is actually a pretty fascinating piece of hardware and software research. So, you see, one of the biggest problems with training robot arms right now is that most teleoperation systems can only capture where your arm is, not how much force you're actually applying. UME solves this by adding real-time haptic torque feedback to the exoskeleton. In other words, when you're teleoperating a robot arm through UME, you can actually feel the resistance and forces the robot is experiencing.
And that changes everything for training data because now the policies the robot learns actually know how to be compliant and how to react to force feedback. Here are some examples. As you can see, a human operator is able to unsheathe a metal sword from a scabbard while blindfolded, relying purely on the haptic feedback from the robot. That's actually ridiculous when you think about what that requires. You can also see the robot autonomously retrieving a drink can from a fridge, picking up a GPU card from a highly constrained space between two desktops, and doing whole body mobile manipulation. Really impressive stuff. Now, the code is listed as coming soon, so they haven't released it quite yet, but I'll link to the project page in the description below, so you can follow along as it becomes available. If you're interested in reading further, I'll link to this page in the description below.
Also this week, we have a pretty interesting new 3D reconstruction method. It's called SurfFlow, and this can take a variable number of unposed images and reconstruct a clean, coherent 3D surface from them. Now, how this works is instead of producing a separate point map for each input view, which is how most current methods like VGGT or DUSt3R work, SurfFlow encodes all the input images into a single global latent state of fixed size, just 128 tokens, regardless of how many images you give it. And then, it decodes the 3D surface from that single global state using flow matching, which is the same idea behind modern image diffusion models, but applied to 3D geometry. In simple terms, think of it as the model learning what is shared across all the views, and then generating the 3D geometry that is consistent with all of them at once. As you can see in their examples, even when you give it 33 or 65 views, the global state stays the same size, but the reconstruction just gets more complete and accurate.
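The difference between a per-view output and a fixed-size global state is easy to see with some rough counting. The sketch below is not SurfFlow's code and does not reproduce its architecture. It only compares how the amount of stored data grows with the number of views under assumed sizes: the 128-token count is from the video, while the token width (1024) and the 256 by 256 point-map resolution are assumptions picked for illustration.

```python
def per_view_pointmap_values(num_views, height, width):
    """Values stored when every view gets its own H x W map of XYZ points."""
    return num_views * height * width * 3


def global_latent_values(num_tokens, token_dim):
    """Values stored in a fixed-size latent, independent of the view count."""
    return num_tokens * token_dim


NUM_TOKENS = 128   # quoted in the video
TOKEN_DIM = 1024   # assumption: typical transformer width, not from the source
HEIGHT = WIDTH = 256  # assumption: illustrative point-map resolution

table = [
    (n,
     per_view_pointmap_values(n, HEIGHT, WIDTH),
     global_latent_values(NUM_TOKENS, TOKEN_DIM))
    for n in (2, 33, 65)
]

for views, pointmap_size, latent_size in table:
    print(f"{views:>2} views: per-view maps {pointmap_size:>10,} values, "
          f"global latent {latent_size:,} values")
```

The two columns are not directly comparable in meaning, since one holds explicit 3D points and the other holds learned features. The point is only the scaling: the per-view column grows linearly with the number of images, while the latent column stays constant.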
And you can decode the surface at any resolution you want without rerunning the encoder. Now, for your reference, the researchers say SurfFlow matches or surpasses other feed-forward reconstruction baselines on standard surface metrics while running an order of magnitude faster than optimization-based methods that need hundreds of views to work. They also claim it's the only feed-forward approach right now that combines a global latent with arbitrary resolution decoding. I'll note that the paper doesn't publish a specific numbers table in the abstract, so I'd recommend checking the full paper if you want the exact benchmark breakdown. The awesome thing is they've released the code already on GitHub. I'll link to it in the description below. And that sums up all the highlights in AI this week. Let me know in the comments what you think of all of this. Which piece of news was your favorite? And which tool are you most looking forward to trying out? As always, I will be on the lookout for the top AI news and tools to share with you.
So, if you enjoyed this video, remember to like, share, subscribe, and stay tuned for more content. Also, there's just so much happening in the world of AI every week. I can't possibly cover everything on my YouTube channel. So, to really stay up-to-date with all that's going on in AI, be sure to subscribe to my free weekly newsletter. The link to that will be in the description below.
Thanks for watching, and I'll see you in the next one.