
oMLX vs Ollama: Extreme Context, SSD KV Cache & Mac Crashes
Added: 2026-05-11

567 views · 80 · 12:36 · Protorikis · Original release: 2026-05-08

oMLX delivers impressive speed through SSD caching, but its instability at high context proves that raw performance is useless without the reliability Ollama offers. It’s a classic trade-off where Ollama remains the safer bet for anyone prioritizing uptime over experimental benchmarks.

[00:00:02]Protorikis here. In the previous video, I compared performance between llama.cpp and three MLX runtimes: MLXLM, Apple's own Python runtime; LM Studio with the MLXLM back end; and the new Ollama MLX engine.

[00:00:18]During the testing, MLXLM showed itself as the brute force racing horse.

[00:00:23]Unfortunately, as it goes with horse racing, it crashed my Mac multiple times, too, and I couldn't even get the benchmark with a large context size to complete. Only the half-context test with 49K tokens reached the finish line on my machine. But then I got truly impressed by the new Ollama MLX runner.

[00:00:43]Unlike LM Studio, which uses MLXLM but is slower, Ollama seems to have developed its own MLX runner. You could feel it got some dedicated engineering effort.

[00:00:54]And just like LM Studio, it managed to complete the large-context benchmark without crashing, but with much faster performance. See my previous video for more insights.

In this video, I want to add oMLX to this comparison. Many of you left comments about it. However, please sit down: there are some awesome and some critical things we need to get straight about it. oMLX is an open-source runtime wrapper coming from June from South Korea. This is its GitHub page. Similar to one runtime I mentioned in the previous video, oMLX's main selling point is its ability to cache computed context to SSD. That's genuinely useful, and we'll test it thoroughly in a minute. And what I like in June's approach compared to that earlier wrapper is that there are no absurd performance claims and no trashing of LM Studio by putting it into a corner with an unfair configuration.

[00:01:51]On the contrary, oMLX's release page provides cold-cache comparisons and openly admits that SSD caching adds some latency. I installed it, and here it got, hm, convoluted.

[00:02:05]The README provides so many options, but none seem to do what I need in one step.

[00:02:10]I downloaded and installed the DMG and got the menu icon, but it says that to get the CLI, you need to install a Homebrew package. Frankly, I'd just compile everything myself, but it's not clear whether that would give me the menu icon or not. I guess not. And I actually just found the CLI hidden inside the installed macOS app. Wouldn't brewing it lead to duplicate CLIs on my system? Either way, enough about user experience. Let's understand what we're dealing with here.

[00:02:40]This is the local portal that oMLX exposes for configuring inference.

[00:02:45]But if you scroll to the bottom of it, you'll find the components that it is based on. Aha. And here we see the same MLXLM Python runtime from Apple, the one that crashes my Mac every time I go above a certain context size threshold. LM Studio is also using MLXLM, but it manages to contain it and make it stable, albeit at the cost of speed.

[00:03:10]Does oMLX solve this as well, or does it fail and leave me losing my work and data like pure MLXLM?

[00:03:18]Let's look at the version here. Ah, very nice. It's exactly the same build I was using for the MLXLM test, too. So we'll be comparing identical underlying MLX libraries. And to add contrast here, Ollama is not using this at all, and I think that's quite unique. Just a quick note: this video is not sponsored.

[00:03:40]Everything I share here comes purely from my own research and perspective.

[00:03:43]And if you're enjoying these deep dives, leave a like and subscribe. What I also like about oMLX is that it includes some experimental features without me needing to compile them. For instance, you can try DFlash. It uses a small model to predict a few generated tokens in advance, in the hope of speeding generation up.

[00:04:02]Unfortunately, it only works up to 4K context and disables itself afterwards. Be careful with it, as it's all too easy to miss the error and fool yourself into thinking that you're using it while you're not. Also, you could try speculative prefill. This one uses a smaller model to scan the context and pick out only the most important tokens.

[00:04:25]The main model then only processes that subset, which may cut down time to first token on long prompts. Unfortunately, I was running all these tests with spec prefill enabled on oMLX, only to realize later that it had failed to load at the very beginning. Most likely my smaller model wasn't compatible with the larger one.

[00:04:44]And that's another silent error.

[00:04:47]This is a bit of a recurring theme here.

[00:04:49]It's really easy to think you're using these techniques while in reality you aren't. Always monitor the logs.
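One way to act on that advice is to scan the server log for lines suggesting that a feature failed to load or switched itself off. The sketch below is hypothetical: the keyword list and the sample log lines are assumptions for illustration, not oMLX's actual log format.

```python
import re

# Assumed keywords; adjust them to whatever your runtime actually logs.
SUSPICIOUS = re.compile(r"\b(error|failed|disabled|fallback|unsupported)\b", re.IGNORECASE)

def find_silent_failures(log_lines):
    """Return (line_number, text) pairs for lines that match a warning keyword."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if SUSPICIOUS.search(line):
            hits.append((number, line.rstrip()))
    return hits

# Made-up sample lines, for illustration only.
sample_log = [
    "INFO  server started",
    "WARN  draft model failed to load, continuing without it",
    "INFO  request completed",
]

for number, text in find_silent_failures(sample_log):
    print(f"line {number}: {text}")
```

Running a check like this right after the server starts, and again after the first long prompt, would catch both kinds of silent failure described here: a feature that never loaded and one that disabled itself mid-run.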

[00:04:56]Finally, I like that it exposes the experimental TurboQuant KV cache. This can save some memory without losing too much accuracy. You can learn all about it and its nuances in my earlier video; I'll link it in the description. And this one seemed to have actually worked, though from what we'll soon see, I'm not 100% certain.

So, what about oMLX's performance and its biggest claimed benefit, SSD KV caching? Let's figure it out, and let's prepare for some fireworks. Let's bump the context size to 128K and begin by running the benchmark with half the source file. I'm doing this because I'm really afraid of crashing my Mac, knowing that it's using the same MLXLM that consistently crashed it while recording my previous episode.

Fast forward, and we have the first results. My benchmark sent two prompts to the model. Both of them had the same half of the source file attached. The first, cold prompt took 77 seconds. We'll compare this number to Ollama and others soon. The second prompt was lightning fast at a mere 1 second. That's because prompt caching is working well. And I can confirm that prompt caching in llama.cpp and Ollama works well for this model too.

[00:06:16]But when running MLXLM directly or through LM Studio, it's inconsistent.
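The cold-versus-hot measurement described above can be sketched as a small timing harness. This is a minimal illustration, not the benchmark used in the video: `time_prompts` and `FakeBackend` are hypothetical names, and the fake backend only imitates prompt-cache behaviour with sleeps so the example runs without a server.

```python
import time

def time_prompts(send_prompt, prompts):
    """Send each prompt in order and return the wall-clock seconds each one took."""
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        send_prompt(prompt)
        timings.append(time.perf_counter() - start)
    return timings

class FakeBackend:
    """Stand-in for a real inference server; it only mimics prompt-cache behaviour."""
    def __init__(self):
        self.cached_contexts = set()

    def __call__(self, prompt):
        context, _question = prompt
        if context not in self.cached_contexts:
            time.sleep(0.1)    # pretend to prefill the whole context (cold)
            self.cached_contexts.add(context)
        time.sleep(0.001)      # pretend to generate a short answer

backend = FakeBackend()
shared_context = "<half of the source file>"
prompts = [(shared_context, "question 1"), (shared_context, "question 2")]
cold, hot = time_prompts(backend, prompts)
print(f"cold: {cold:.3f}s, hot: {hot:.3f}s")
```

With a real runtime, `send_prompt` would be replaced by a function that posts the prompt to the server. A hot time close to the cold time would indicate that the cache was missed.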

[00:06:21]Is it consistent here with oMLX? Let's repeat this test to get more data points. Very good. Now the first prompt is very quick too, but we really need to see whether the persistent SSD cache is working as advertised. So let's stop the oMLX server and restart it. If SSD caching works even after a restart, both of my test prompts should prefill instantly.

[00:06:47]And they do. Perfect. Good job on this one, oMLX.
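To make the idea of a cache that survives a restart concrete, here is a toy sketch of a disk-backed prefix cache. It is not how oMLX implements SSD caching: `DiskPrefixCache` is a hypothetical class, the stored state is a placeholder dictionary rather than real KV tensors, and keying by a hash of the whole prefix is an assumption made for simplicity.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

class DiskPrefixCache:
    """Toy persistent cache: maps a prompt prefix to a previously computed state."""
    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, prefix):
        digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.pkl"

    def load(self, prefix):
        """Return the stored state for this prefix, or None on a cache miss."""
        path = self._path(prefix)
        if path.exists():
            with path.open("rb") as handle:
                return pickle.load(handle)
        return None

    def save(self, prefix, state):
        with self._path(prefix).open("wb") as handle:
            pickle.dump(state, handle)

cache_dir = tempfile.mkdtemp()
first_run = DiskPrefixCache(cache_dir)
first_run.save("shared context", {"kv": [0.1, 0.2, 0.3]})  # placeholder for real KV tensors

# A new object pointed at the same directory plays the role of a restarted server.
after_restart = DiskPrefixCache(cache_dir)
print(after_restart.load("shared context"))
print(after_restart.load("unseen context"))
```

The point of the sketch is only that the lookup key depends on the prompt content, not on process memory, so a fresh process can find the earlier result on disk.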

[00:06:52]At this point I decided that it's time.

[00:06:55]It's time to put on a helmet, save all my work, close anything useful I had open, and wish my MacBook good luck in the next test. Yes, it's time to launch the full source file benchmark. llama.cpp, LM Studio, and Ollama managed to crunch this 98,000-token file just fine.

[00:07:14]But will oMLX survive, too? Crunching.

[00:07:17]Crunching. It went all the way to 77,000 tokens, and I genuinely thought it would make it. But the API bailed, and I noticed in the log that a memory enforcer had kicked in. Hm, well, at least it didn't crash my Mac yet.

But I had to know the truth. Can it pass the test, or will it crash everything when pushed to the limit? So I went to the memory guard settings and increased the memory limit by 4 GB, leaving the OS another 4 GB to spare. I had to see it in practice. Is it the same MLXLM racehorse underneath? I launched the full-context bench and got the pink screen of death again. Ah, the agony. Will I ever capture this on camera?

And after the crash, I really screwed up my resolution settings. I usually record at the lower one to keep everything clear, but for some reason OBS was recording just a portion of the screen, and it didn't capture how I played with that memory guard some more and how oMLX crashed my Mac yet another time.

[00:08:23]OK, enough. oMLX hits the same large-context crash limit as the underlying MLXLM.

[00:08:30]So I reverted to the half-file benchmark, but instead of running the two prompts, I ran the full test suite. It was nice to see how consistent the cache hits were. All hot prompts completed in a second or so. I restarted the server to see the SSD cache in action one more time. And this leads to one more critical observation. Do you see this big spike in SSD read bandwidth when launching the prompt for the first time?

[00:08:56]It's oMLX reading the previously computed KV cache from disk. And frankly, I'd be wary of running this kind of inference on your internal MacBook SSD. Every time oMLX saves a new KV cache, it's writing gigabytes to disk. Seriously, use such a system on a day-to-day basis and you're putting real wear and tear on your built-in SSD.

[00:09:18]Better get a separate high-speed NVMe disk and a solid enclosure for it. I ran a few more cold prompts by clearing the SSD cache manually. They completed in 67 seconds. At this point, we're almost ready to compare the oMLX results to the Ollama and llama.cpp data from the previous video.

[00:09:37]We just need to confirm two important things. Given that Ollama fits the full context fine, but MLXLM and oMLX crash, is it really using an uncompressed FP16 KV cache? Let's check the logs. Looks good.

[00:09:52]No mention of Q8 or other quantization.
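Why the KV cache precision matters at these context sizes can be seen from a rough size estimate. The sketch below uses the standard formula for a plain attention KV cache (keys plus values, per layer, per KV head, per token). The model shape is a hypothetical example, not the model used in the video, and architectures with non-standard attention would need a different formula.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_element=2):
    """Size of a plain attention KV cache: keys and values for every layer and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_element  # 2 = K and V
    return tokens * per_token

# Hypothetical model shape, NOT the model used in the video.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for tokens in (49_000, 98_000):
    size_gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>7} tokens -> {size_gib:.1f} GiB at FP16")
```

Setting `bytes_per_element=1` approximates an 8-bit cache and halves the estimate, which is why a runtime that quantizes the cache could fit a context that an FP16 runtime cannot.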

[00:09:55]Also, maybe Ollama supports SSD caching too; I just don't know. Let's do a quick run of the same benchmark on Ollama. As expected, the very first time only the second prompt was cached. Let's run it again. Now both prompts returned instantly. Good. So let's restart the model to clear this in-memory cache.

[00:10:18]Launching the test. Fast forward, and the first prompt took 74 seconds. This confirms that there is no disk caching involved here. One note about Ollama is that its performance fluctuates a bit, unlike llama.cpp or MLXLM.

[00:10:36]And now we are ready to compare the runtimes. Let's do it. We can instantly see that oMLX is practically as fast as running MLXLM directly, with 741 prefill tokens per second and 48 for generation.
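As a sanity check, the throughput figures can be related back to the cold-prompt timings with simple arithmetic. The sketch below treats "49K" as exactly 49,000 tokens, which is an assumption, and the 50-token answer length is purely hypothetical since the transcript does not state it.

```python
def prefill_seconds(prompt_tokens, prefill_tokens_per_second):
    """Time spent processing the prompt before the first output token."""
    return prompt_tokens / prefill_tokens_per_second

def generation_seconds(output_tokens, generation_tokens_per_second):
    """Time spent producing the answer once prefill is done."""
    return output_tokens / generation_tokens_per_second

prompt_tokens = 49_000   # assumed exact value for the "49K" context
output_tokens = 50       # hypothetical answer length

prefill = prefill_seconds(prompt_tokens, 741)
generation = generation_seconds(output_tokens, 48)
print(f"prefill: {prefill:.1f}s, generation: {generation:.1f}s, total: {prefill + generation:.1f}s")
```

Under these assumptions, prefill alone accounts for roughly 66 seconds, which lines up with the 67-second cold prompts reported earlier and shows that prompt processing, not generation, dominates at this context size.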

[00:10:51]It's right at the top for speed in this real-world 49K-context benchmark, but unfortunately it's plagued by the very same critical stability issue. It either kills the model or crashes my Mac when context reaches the limit. So, just like with MLXLM, I couldn't get the big-context test to complete. What's nice, though, is that I did so many runs, but because of consistent caching, only three of them resulted in cold-prompt data points. How does it compare to Ollama? Well, Ollama is slightly slower, but it gets to that maximum context size without issues, just like llama.cpp does.

So, would I use oMLX? I like the KV caching to SSD. You can store more cache than your memory would allow, which is powerful, and agentic coding is a good use case for it. I may create a separate video on that. Stay tuned. Though I'd really need an external NVMe disk for everyday use, trading speed for wear protection. But the biggest issue is that oMLX cannot fit my usual 128K context even when using TurboQuant quantization, which is actually very strange. Is it really working? And most importantly, it either unloads the model or crashes my Mac when things get tough. And I need a stable, reliable inference engine, as I can't afford to lose any work mid-prompt.

[00:12:18]Are you using SSD caching? Have you ever crashed when running LLM inference?

[00:12:24]Don't hesitate to share in the comments below. I hope you found this useful.

[00:12:29]Subscribe and come back for more. I wish you a great day and see you next time.

#LLM #Local Inference #Programming #llama.cpp #MacBook Pro
