It really depends on the tasks you have to perform. I am using specialized OCR models running locally to extract page layout information and text from scanned legal documents. The quality isn't perfect, but it is really good compared to desktop/server OCR software that I formerly used that cost hundreds or thousands of dollars for a license. If you have similar needs and the time to try just one model, start with GLM-OCR.

If you want a general knowledge model for answering questions or a coding agent, nothing you can run on your MacBook will come close to the frontier models. It's going to be frustrating if you try to use local models that way. But there are a lot of useful applications for local-sized models when it comes to interpreting and transforming unstructured data.

reply
> I formerly used that cost hundreds or thousands of dollars for a license

Azure Doc Intelligence charges $1.50 for 1000 pages. Was that an annual/recurring license?

Would you mind sharing your OCR model? I'm using Azure for now, as I want to focus on building the functionality first, but would later opt for a local model.

reply
I didn’t realize that you can get 128GB of memory in a notebook, that is impressive!
reply
I've got a 128 GiB unified memory Ryzen AI Max+ 395 (aka Strix Halo) laptop.

Trying to run LLMs somehow makes 128 GiB of memory feel incredibly tight. I frequently get OOMs when running models that push the limits of what this can fit, so I need to leave more memory free for the system than I expected. I was expecting to be able to run models of up to ~100 GiB quantized, leaving 28 GiB for system memory, but it turns out I need to leave more room for context and overhead. ~80 GiB quantized seems like a better upper limit when not running headless, since I'm running a desktop environment, browser, IDE, compilers, etc. in addition to the model.
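To put a rough number on the "room for context" part, here is a back-of-the-envelope sketch of K-V cache size using the usual estimate (keys and values, per layer, per KV head, per token). The function name and the model shape below are hypothetical placeholders, not the real configuration of any model mentioned here, and architectures with hybrid or sliding-window attention will need less.

```python
GIB = 1024**3

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int) -> float:
    """Approximate K-V cache size in GiB for a plain attention stack."""
    # Factor of 2: one tensor for keys and one for values.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / GIB

# HYPOTHETICAL model shape, chosen only for illustration.
cache = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128,
                     context_len=131072, bytes_per_elem=2)

weights_gib = 80   # quantized weights, the "better upper limit" from above
total_gib = 128    # unified memory on this machine

print(f"K-V cache at full context: {cache:.0f} GiB")
print(f"Left for desktop, browser, IDE: {total_gib - weights_gib - cache:.0f} GiB")
```

With these made-up numbers a full 128K context costs about as much as a mid-sized model on its own, which is roughly why ~100 GiB of weights doesn't leave enough headroom.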

And the memory bandwidth limitations for running models are real! 10B active parameters at 4-6 bit quants feels usable but slow; much more than that and it really starts to feel sluggish.
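A rough way to see why active parameter count matters so much: during decoding, every active weight has to be read from memory about once per generated token, so bandwidth divided by bytes-per-token gives a ceiling on speed. This is only a sketch. The 256 GB/s figure is my assumption for Strix Halo's theoretical peak (it isn't from this thread), the function name is made up, and real throughput will be lower because of K-V cache reads and other overhead.

```python
def max_tokens_per_second(active_params_b: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each active weight is read once per token."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token

BANDWIDTH_GB_S = 256.0  # ASSUMPTION: theoretical peak, not a measured value

for active_b in (3, 10, 12):
    for bits in (4, 6):
        bound = max_tokens_per_second(active_b, bits, BANDWIDTH_GB_S)
        print(f"{active_b}B active @ {bits}-bit: at most ~{bound:.0f} tok/s")
```

The ceiling scales inversely with both active parameters and bits per weight, which matches the experience that a 3B-active model feels much snappier than a 10B-active one on the same machine.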

So this can fit models like Qwen3.5-122B-A10B, but it's not the speediest and I had to use a smaller quant than expected. Qwen3-Coder-Next (80B/3B active) feels quite speedy, though not quite as smart. Still trying out models; Nemotron-3-Super-120B-A12B just came out, but it looks like it'll be a bit slower than Qwen3.5 while not offering any more performance, though I do really like that they have been transparent in releasing most of its training data.

reply
There's been some very recent ongoing work in some local AI frameworks on enabling mmap by default, which can potentially obviate some RAM-driven limitations especially for sparse MoE models. Running with mmap and too little RAM will then still come with severe slowdowns since read-only model parameters will have to be shuttled in from storage as they're needed, but for hardware with fast enough storage and especially for models that "almost" fit in the RAM filesystem cache, this can be a huge unblock at negligible cost. Especially if it potentially enables further unblocks via adding extra swap for K-V cache and long context.
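For anyone unfamiliar with what mmap buys you here, this is a tiny sketch using Python's standard `mmap` module. The temp file is just a stand-in for a weights file, and this is not how any particular inference framework implements it. The point is that mapping the file doesn't read it; the OS pages in only the regions you actually touch, and since the mapping is read-only those pages can be dropped and re-read from storage under memory pressure.

```python
import mmap
import os
import tempfile

SIZE = 4 * 1024 * 1024  # 4 MiB stand-in for a (much larger) weights file

# Create a throwaway file to map.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(SIZE))
    path = tmp.name

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    mapped_len = len(mm)
    # Only the pages backing this slice need to be resident in RAM.
    offset = 1024 * 1024
    chunk = mm[offset:offset + 16]
    mm.close()

os.remove(path)
print(f"mapped {mapped_len} bytes, touched {len(chunk)} bytes at offset {offset}")
```

For a sparse MoE model, the rarely used experts are the regions that tend not to get touched, which is why this helps most there.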
reply
Most workstation-class laptops (e.g. Lenovo P-series, Dell Precision) have 4 DIMM slots and you can get them with 256 GB (at least, before the current RAM shortages).

There's also the Ryzen AI Max+ 395 that has 128 GB of unified memory in a laptop form factor.

Only Apple has the unique dynamic allocation though.

reply
Yep, I have a 13" gaming tablet with the 128 GB AMD Strix Halo chip (Ryzen AI Max+ 395, what a name). Asus ROG Flow Z13. It's a beast; the performance is totally disproportionate to its size & form factor.

I'm not sure what exactly you're referring to with "Only Apple has the unique dynamic allocation though." On Strix Halo you set the fixed VRAM size to 512 MB in the BIOS, and you set a few Linux kernel params that enable dynamic allocation to whatever limit you want (I'm using 110 GB max at the moment). LLMs can use up to that much when loaded, but it's shared fully dynamically with regular RAM and is instantly available for regular system use when you unload the LLM.

reply
What operating system are you using? I was looking at this exact machine as a potential next upgrade.
reply
Arch with KDE, it works perfectly out of the box.

I configured/disabled RGB lighting in Windows before wiping and the settings carried over to Linux. On Arch, install & enable power-profiles-daemon and you can switch between quiet/balanced/performance fan & TDP profiles. It uses the same profiles & fan curves as the options in Asus's Windows software. KDE has native integration for this in the GUI in the battery menu. You don't need to install asus-linux or rog-control-center.

For local AI: set VRAM size to 512 MB in the BIOS, add these kernel params:

ttm.pages_limit=31457280 ttm.page_pool_size=31457280 amd_iommu=off

Pages are 4 KiB each, so 120 GiB = 120 x 1024^3 / 4096 = 31457280
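If you want a different limit, the same arithmetic can be sketched in a few lines of Python. The helper name `gtt_pages` is made up for illustration, and the 4 KiB page size is the one stated above.

```python
PAGE_SIZE = 4096  # bytes per page (4 KiB)

def gtt_pages(limit_gib: int) -> int:
    """Number of 4 KiB pages that make up limit_gib GiB."""
    return limit_gib * 1024**3 // PAGE_SIZE

for gib in (110, 120):
    pages = gtt_pages(gib)
    print(f"{gib} GiB: ttm.pages_limit={pages} ttm.page_pool_size={pages}")
```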

To check that it worked: sudo dmesg | grep "amdgpu.*memory" will report two values. VRAM is what's set in BIOS (minimum static allocation). GTT is the maximum dynamic quota. The default is 48 GB of GTT. So if you're running small models you actually don't even need to do anything, it'll just work out of the box.

LM Studio worked out of the box with no setup, just download the AppImage and run it. For Ollama you just `pacman -S ollama-rocm` and `systemctl enable --now ollama`, then it works. I recently got ComfyUI set up to run image gen & 3D gen models and that was also very easy, took <10 minutes.

I can't believe this machine is still going for $2,800 with 128 GB. It's an incredible value.

reply
> Only Apple has the unique dynamic allocation though.

What do you mean? On Linux I can dynamically allocate memory between CPU and GPU. Just have to set a few kernel parameters to set the max allowable allocation to the GPU, and set the BIOS to the minimum amount of dedicated graphics memory.

reply
Maybe things have changed but the last time I looked at this, it was only max 96GB to the GPU. And it isn't dynamic in the sense you still have to tweak the kernel parameters, which require a reboot.

Apple has none of this.

reply
On Strix Halo you can get at least 120 GB to the GPU (out of 128 GB total); I'm using this configuration.

Setting the kernel params is a one-time initial setup thing. You have 128 GB of RAM, set it to 120 or whatever as the max VRAM. The LLM will use as much as it needs and the rest of the system will use as much as it needs. Fully dynamic with real-time allocation of resources. Honestly I literally haven't even thought of it after setting those kernel args a while ago.

So: "options ttm.pages_limit=31457280 ttm.page_pool_size=31457280", reboot, and that's literally all you have to do.

Oh and even that is only needed because the AMD driver defaults it to something like 35-48 GB max VRAM allocation. It is fully dynamic out of the box, you're only configuring the max VRAM quota with those params. I'm not sure why they chose that number for the default.

reply
You do have to set the kernel parameters once to set the max GPU allocation, I have it set to 110 GiB, and you have to set a BIOS setting to set the minimum GPU allocation, I have it set to 512 MiB. Once you've set those up, it's dynamic within those constraints, with no more reboots required.

On Windows, I think you're right, it's max 96 GiB to the GPU and it requires a reboot to change it.

reply
I often use Raycast connected to LM Studio to run text cleanup and summaries. The models are small enough that I keep them in memory more often than not.
reply
Shouldn't we prioritize large scale open weights and open source cloud infra?

An OpenRunPod with decent usage might encourage more non-leading labs to dump foundation models into the commons. We just need infra to run it. Distilling them down to desktop is a fool's errand. They're meant to run on DC compute.

I'm fine with running everything in the cloud as long as we own the software infra and the weights.

Conceivably the only way we could catch up to Claude Code is for the Chinese to start releasing their best coding models and for those models to get significant traction with companies calling out to hosted versions. Otherwise, we're going to be stuck in a takeoff scenario with no bridge.

reply