The reality is that even cutting-edge games and consumer workloads don’t actually make full use of the PCIe bandwidth of the GPU or the bandwidth of its GDDR memory. Even local AI use cases don’t substantially or meaningfully benefit from faster memory, at least for average consumers.
A unified memory pool does two things:
1) Lets systems optimize utilization based on need, rather than be confined to specific pools
2) Reduces overall memory cost, by letting system builders purchase a single type of memory in bulk instead of having to figure out GDDR vs DDR memory placement (important for SFF/portable machines)
So at a time when memory is expensive, unified pools make more sense. Even when memory becomes cheap and plentiful again, it’s just practical at this point to allocate a larger overall pool instead of managing discrete sets.
The one big drawback is security. A shared memory pool means side-channel attacks against memory from the GPU or CPU could potentially compromise the other as well, meaning memory-safe designs are going to be critical to security going forward (which is good for Rust adherents, I figure).
As a Rust adherent, please do not put words in our mouths or set up unrealistic expectations for other people by linking together concepts at a very shallow level.
Language-level memory safety has no answer for hardware security flaws, which is what side-channel attacks are. No programming language can provide memory privacy if another chip in your machine can read your memory, just like no programming language can protect your application from a vulnerability in the kernel it’s running on.
The 5090 ($2k MSRP but realistically $3-3.5k) is almost the same as the RTX 6000 Pro (~$10k). Same memory bandwidth (1800GB/s). Slightly different CUDA cores (21k vs 24k). Big difference? VRAM (32GB vs 96GB).
NVidia ultimately doesn't want to upset this segmentation so the RTX Spark will never undermine their other offerings. This is why I think Apple has a real market opportunity if they choose to embrace it.
Most consumers will never really care about, let alone see, the difference in PCIe or memory bandwidth impacts from such a shift to unified memory pools. We might (being, at least in my case, a huge nerd), but I’m increasingly of the opinion that if modern blockbuster games are built for upscaling/reconstruction anyhow, then suddenly such sacrifices to performance seem acceptable relative to the gains in efficiency.
No-copy unified memory will help with that, but you do pay the read-speed costs.
I don't know who will be the winner, but with some of the recent Gemma releases it seems more probable that you may run some models locally, if only from a cost perspective, not even considering business security. I'm not sure how this type of architecture would make for good gaming, though, which puts the whole statement into question.
"Ranked in the top 2% of scientists globally (Stanford/Elsevier 2025) and among GitHub's top 1000 developers" - side note, but this guy puts this everywhere, and it probably gives me the inverse of the impression he is marketing for.
This is the 2026 edition of Ken Olsen: "There is no reason anyone would want a computer in their home"
Digging into this:
> In conclusion, there is evidence that Ken Olsen did doubt the need for computers in the home, but the evidence is based primarily on the testimony of David Ahl who was perturbed when the personal computer project he championed at DEC was not supported by Olsen in 1974.
> Olsen’s resistance may have been similar to that expressed by another DEC executive, Gordon Bell. In 1980 Bell thought home terminals would act as gateways to remote computers which would provide appropriate services.
* https://quoteinvestigator.com/2017/09/14/home-computer/
It was supposedly said in 1977: most computers at that time were not small, so it would not be surprising that people did not expect the general public to want a large, power-hungry, noisy apparatus in their house.
And, like the overly large machines of 1977, models are getting faster, leaner, and better. It's happening a lot quicker, though.
People take these quotes out of context all the time. Said in a business context, there was no need, at that time, for someone to have a personal computer.
There was no business justification in 1977 for a personal computer department at a business. It's similar to the Gates quote about RAM (I think it was 64KB?).
These statements aren't meant to be forever quotes. They're business-plan quotes.
640, and Bill Gates said he either never said that, or at least never remembered having said it. I think there is no evidence anywhere that he did.
https://www.computerworld.com/article/1563853/the-640k-quote...
The early popularity of Minitel, the continued popularity of ssh/tmux, and the web browser itself indicates that bespoke client applications are not the only way. He wasn’t directionally wrong.
Nobody ever said that, at least not as an assertion or prediction. The actual instances of similar language are from multiple people describing their earlier thoughts before they learned it wasn’t true.
Local models aren’t deterministically equivalent in capabilities to foundation models. Home computers are Turing complete, just like a mainframe. They are just slower, and often not slower enough to matter.
Maybe if you ask them that question, but if you show them two products, they'll definitely prefer the faster one. 30 seconds is a long time to watch a progress bar.
People definitely aren't going to accept more expensive + slower ...
You could run a pretty good home server on $50 of gear and yet we never saw any real adoption of OwnCloud/NextCloud style products as an alternative to Google Drive/Photos or Apple Cloud.
Why should LLM/Transformers be any different? Especially when you need a proper expensive GPU to run them instead of a Raspberry Pi?
On-device AI is going to be important, I think. It doesn't have to take the form of a chatbot UI to be useful.
Very significant improvements may be viable for unattended inference via large-scale batches, which can reuse sparse experts and thereby mask some of the latency involved - this is quite unique to DeepSeek, again due to its efficient KV cache.
2. Qwen is much more demanding and borderline unusable on consumer hardware because it's a dense model. All 27B parameters are active all the time, for each token. It's not a MoE architecture where a router activates only some of them.
3. Qwen doesn't like quantization at all.
But yeah, the Qwen line is pretty impressive on commodity hardware.
To me, LLMs are for asking research questions + exploring design spaces + pointing at codebases to investigate bugs. And those all benefit from the model being as "smart" (in terms of both fluid intelligence and burned-in knowledge) as possible.
I'm guessing there exist problems where "intelligence past a certain point" doesn't matter, so these medium-sized models can match the performance of the bigger models. But what problems might those be?
Do you think he's in mensa too?
I have a hard time believing running a model on a laptop will be cheaper than running it in a datacenter. Why wouldn't economies of scale apply here as with every other computation?
The vision NVIDIA is selling is pure marketing IMHO
Local may or may not be cheaper than remote now, depending on the details, but the factors you describe won't affect the math nearly as much as they will once that subsidization ends.
You're going to need to analyze the problem much more deeply, because it sounds like the standards you are implicitly applying would result in "economically, everything should be centrally hosted", but that is clearly not the result that obtains. Even a modern mid-grade cell phone is no slouch; you may not be running a current-gen frontier AI on it, but you certainly can do a lot of other rather intense things locally that would have been laughable 10 years ago, like surprisingly high-powered games.
But they also want to taste the sweet fruit of AI so the only way to do this that a CISO will approve is on local air gapped hardware. It’s a niche but still a billion dollar niche.
Where you will need games to be rewritten for ARM to get full performance, just like on Apple's M series chips.
Especially on Dwarfstar.
Anyone who's addicted to token throughput is losing operational knowledge and offline capabilities.
If you aren't moving to the AMD 395 or Macs, then you're hitching a ride on the expensive calorie ride.
But watching everyone flounder because Claude goes down or forces you onto API costs.
I'm programming things that'd take me days with a PC that, without OpenAI's VRAM shenanigans, would cost you $2k.
It's more than just 'this is what I could do' it's definitely about 'this is what anyone could do with a new PC purchase'.
You're doing what the IT industry has been addicted to for decades: number goes up.
This made me laugh. I can only imagine how insufferable this person is to deal with.
Not everything I want to use an LLM for requires "PhD level intelligence", and increasingly I'm finding more uses that involve sharing my personal data.
Yesterday my local model helped me when I was looking for a doctor who is in-network for my insurance. I threw it a screenshot of the provider search results and it looked up reviews for all of them.
I own the DVDs, so I'm OK upscaling/editing my own copies for my own use. But if I ran the task on an AI service I would no doubt trigger copyright issues.
Lol yeah seriously, that stinks "I ask AI to generate a huge amount of bullshit and upload it to pad irrelevant stats".
Absolute loser.
As to why he now has this on his blog? I also cringe when I read it. I presume someone told him he should self-promote more, and this is his lame attempt to do so. He's almost certainly the most cited person in his department, but it's entirely possible that none of his colleagues actually know this. Cut him some slack. Self-promotion is not his strength. He's a nerd's nerd, and not a marketer. I'll mention to him that his attempt here might be backfiring when I'm next in contact with him.
He doesn't just have it on his blog, he has it EVERYWHERE. Sometimes 2 or 3 times on the same page.
It sounds like he's gotten bad advice about how to market himself /or/ this is being marketed to people who have bigger checks to write and whom he believes will be responsive to this kind of marketing. As an academic, it rubs me very wrong - I think it's detrimental to the field when we get into h-index stacking contests or citation count comparisons. But I don't know what incentives he's responding to, which seems important for putting this stuff in context.
(as an aside, it turns out that polars + fastexcel is about 10x faster than pandas + openpyxl for searching that dataset, if anyone else is curious what he was actually talking about. :)
Being the top x% is what OnlyFans girls brag about, professor...
And it's not exactly brain surgery, is it? https://www.youtube.com/watch?v=THNPmhBl-8I
Citation needed
But perhaps more importantly, Nvidia seems to be doing a lot better with its ecosystem. Nvidia has much better distribution channels and partners building on top of its PC gaming GPUs. It also has game developer relations that are unmatched by any in the industry.
Qualcomm has so far failed to execute on this, both in PCs and on their server CPU side.
My experience (I wanted to use the x13s as a daily driver) is that there was good progress for about a year, while jhovold was leading the charge, but something expired and Qualcomm, as far as I can tell, forgot that some progress should happen on x1 and x8c as well as x2.
Some distros still need extracting Qualcomm firmware from Windows to get Linux to work properly. Audio remains a challenge, like x86 Linux decades ago. Apparently camera stuff works these days but produces images of subpar quality.
These issues also occur on normal Linux. My experience with my Lenovo+Intel laptop was that it took three months after release for the firmware to work properly (and the Nvidia drivers took much longer, but that's my fault for buying something containing Nvidia hardware). Intel managed to do what Qualcomm did in months rather than years.
I hope Qualcomm finally sorts this shit out, I really do, but with the prices of computers these days, I'm going to need to see quite the discount before I'll consider buying anything with a Snapdragon.
They could have had a 128core arm chip by now.
There's also the whole thing where a giant trillion-dollar company doesn't want to invest and let small ideas grow. They only focus on things that move the needle, which isn't much at that size.
Had Microsoft executed and invested, they could have made a comeback, imo, in search, mobile & hardware. Unfortunately there's a major lack of leadership, or they just don't want those areas.
Outside of anything else, Amdahl's law means that as parallel performance grows, we become _more_ limited by the inherently serial code, and thus by single-core performance, not less.
Given that single core performance is "harder" (can't just throw more cores/sockets at the problem), it's also critically important.
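To put rough numbers on the Amdahl's law point, here is a minimal sketch. The function name and the 95% parallel fraction are illustrative assumptions, not measurements of any chip discussed here.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Overall speedup when only `parallel_fraction` of the work scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the runtime parallelizes perfectly.
    for cores in (8, 32, 128):
        print(f"{cores:>3} cores: {amdahl_speedup(0.95, cores):.1f}x")
    # The ceiling is 1 / serial_fraction = 20x, however many cores are added,
    # so past that point only faster serial (single-core) execution helps.
```

Going from 32 to 128 cores in this toy case buys much less than the 4x the core count suggests, which is why the serial part ends up dominating.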
Because that's the only part where this chip excels.
People have been comparing apples with oranges for ages.
Qualcomm are trying harder now it seems. But it will take time to repair their reputation in the PC market.
Tuxedo computers tried and didn't succeed either.
I will never buy Qualcomm again. I avoid them on phones as well by just buying Apple. They do not support their hardware beyond the release.
To each their own, but I don't recall Apple ever mainlining any of their drivers on Linux. You're rightfully angry on the laptop side of things, but Apple is much worse than Qualcomm when it comes to open source support for their phones.
Qualcomm probably shouldn't have promised Linux support in the first place. Everyone seems to love Apple's hardware even though you're practically stuck with macOS. Had Qualcomm just stuck to Windows-only, they would've probably received a much better reception by the tech press.
Not really, the 1st iteration got stuck in legal land and other delays.
https://discourse.ubuntu.com/t/ubuntu-concept-snapdragon-x-e...
Is there a desktop version? For real work?
Technically speaking, Qualcomm acquired Nuvia, which is where this came from, and that company came from ex-Apple engineers wanting to do what Apple said no to for its chips.
So it's almost same CPU design (origins).
1. Yes, it has the same number of cores as a 5070 mobile. It’s also running at a shared peak of 2/3 the bandwidth and a shared peak of 2/3 the TDP. The GPU by itself will likely perform at half the dedicated unit’s performance.
2. Apple may not have SVE2 but they do have the AMX (private) and SME. I don’t see why he thinks the SVE2 will give him more performance than the SME.
3. He mentions a single core type but doesn’t mention the total makeup. We already have known for a year how the DGX Spark compares to Apple chips. For CPU it’s roughly equivalent to an M3 Pro and for GPU compute (not rasterization) it’s between an M4 Pro and M4 Max without considering bandwidth.
The real advantage to these is that they run CUDA. That’s it. Otherwise when they launch they’ll be 2-3 generations behind where Apple is and 1 gen behind AMD.
The other super power of the DGX Spark was the NIC for pairing them together. But that’s been removed here too.
You are likely thinking about token generation which is dependent on memory bandwidth where Apple has an edge. Spark's GPU compute is way higher than even M5 Max (17 FP32 TFlops), around 2x FP32 TFlops... It's literally 6144 CUDA cores like desktop 5070, slowed down by slow memory and lower TDP (29.7 vs 31 FP32 TFlops on 5070).
I’d also mention that you’re comparing peaks which the RTX Spark won’t be hitting. The top TDP is less than that of the DGX Spark.
I just think anyone calling this a beast and a game changer is conflating/extrapolating from different form factors and constraints.
Guy suddenly became aware of a chip that the rest of the industry long knew about, seems completely unaware of the competitors, and posts about how it's a BEAST and will be a GAME CHANGER.
Like the DGX Spark was a game changer? Eh, it has mostly been a massive disappointment. An overpriced nvidia laptop isn't going to change the equation an iota.
Before we get local AI, we'll be using hybrid AI.
Running big models locally is unrealistic ($$$$$) but, if you imagine an Agentic Workflow where some bits run on the cloud and other smaller tasks locally, it's an amazing deal. You don't need Opus/Code/DeepSeek/Kimi/etc to do basic stuff that models like Gemma4:12b/Qwen-27b can do locally with much less latency.
Having a laptop where I can use a remote big model and combine it with 5 local domain-specific models is something I would love to do today. Imagine using OpenCode with a small model deciding which tasks run locally: it checks whether you have a good local model for task XYZ, or whether a cloud model should be used instead.
My main concern is: is this hardware powerful enough to allow quick switching between local models? Unlikely, but I hope I'm wrong.
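The routing idea above could look something like this. It is only a sketch: `Task`, `LOCAL_MODELS`, `route`, the model names, and the context limit are all hypothetical, and in practice the router would itself be a small model rather than a lookup table.

```python
from dataclasses import dataclass

# Hypothetical registry: task kind -> local model name. The kinds and the
# mapping are illustrative assumptions, not a real OpenCode configuration.
LOCAL_MODELS = {
    "summarize": "local-small-summarizer",
    "classify": "local-small-classifier",
    "autocomplete": "local-small-coder",
}

CLOUD_MODEL = "cloud-frontier-model"  # placeholder name


@dataclass(frozen=True)
class Task:
    kind: str            # e.g. "summarize", "refactor"
    context_tokens: int  # size of the prompt the task needs


def route(task: Task, local_context_limit: int = 8_192) -> str:
    """Return the name of the model that should run `task`.

    Stay local when a domain-specific local model exists for the task kind
    and the prompt fits its (assumed) context window; otherwise use the cloud.
    """
    local = LOCAL_MODELS.get(task.kind)
    if local is not None and task.context_tokens <= local_context_limit:
        return local
    return CLOUD_MODEL


if __name__ == "__main__":
    print(route(Task("summarize", 2_000)))   # small, known task kind
    print(route(Task("refactor", 2_000)))    # no local model registered
    print(route(Task("summarize", 50_000)))  # too large for the local model
```

The open question is still how quickly the hardware can swap between those local models, which a lookup like this says nothing about.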
It's just a personal computer. It normally runs multiple operating systems just fine.
"Windows PC" sounds like people talking about tech who are either paid by M$, or who embed pictures into Word documents to send them.
Nobody has to kill the fun those OS-agnostic machines allow by artificially binding them to a shitty OS.
Even for personal use, I'd imagine the number of people dual-booting Windows and something else is a very tiny minority.
Saying "Windows PC" is a pretty reasonable way to distinguish between "made by Apple" and "made by someone else" because the market of PCs that aren't made by Apple and don't come with Windows is really, really tiny.
To be honest, this seems like a strange hill to take such an aggressive stance upon.
For normal people, there are three computer operating systems: Windows, Apple, and ChromeOS. Nvidia isn't going with ChromeOS and Apple hates their guts, so Windows is the only normal operating system they can market.
Their marketing makes clear that these devices aren't the piddly Chromebooks that ruined the desktop experience for so many people (expensive Chromebooks were nice, but rare in practice).
Qualcomm promised Linux support, failed to deliver, and now anybody burnt by their promise won't want to buy their hardware again. If they promise a Windows PC, people won't have reason to complain when Linux or FreeBSD or SerenityOS won't boot on there. Given Qualcomm's failures here, Nvidia is probably doing the right thing.
I did this for years. We ran Resolve color correction suites with external chassis to place multiple Nvidia GPUs in it at a fraction of the cost of the shitty TrashCanMac that was available. Lots of people continued to use the 2012 Cheese Grater MacPro with its older CPUs. The only way to get modern (at the time) compute in a Mac was to use a Hackintosh. Since it wasn't for personal use, not having things like AppStore, Messages, Music, etc wasn't a big deal, so building a Hackintosh was easier.
I built one for personal prosumer use around the time of the 1080s, which gave me more machine for the dollar than Apple offered. Once the M-series chips came out and were capable of what the Hackintosh was doing for me, I was put off building anything newer.
So, the partnership is maybe natural, but not prospective. Also, note how Linux is getting popular among gamers. Of course, it's way behind Windows, but the direction of the change is clear.
I'm convinced that Nvidia is not primarily targeting the consumer market and that the ultimate goal for its CPUs is the server space. The company invests effort where the money is, and consumer products account for only a fraction of its total revenue. Maintaining a presence in the consumer market seems more like a way to avoid a complete pivot than a strategic priority.
Your x86 machines were, but these are ARM SOCs. Many of them don't even support UEFI, let alone the upstream Linux kernel.
https://nvidianews.nvidia.com/news/nvidia-microsoft-windows-...
I have been somewhat surprised at the lack of commentators observing that this is Microsoft and above all NVIDIA launching a device that is fundamentally at odds with the metered cloud model of AI.
When you look at the other announcements and murmurings (better offline BYOK for Copilot, talk of an unmetered AI future) I think it’s clear that these two firms understand that cloud-only AI is not sustainable or inherently in their interests. But their willingness to undermine OpenAI with a product like this is notable.
Copilot just got proper "offline" BYOK support, didn't it? Presumably that was one of the things they were talking about. Though I imagine that has something to do with the fact that Zed has supported that properly for months.
I've found it very useful for running big models, but it's not a screaming powerhouse in terms of raw compute.
As a side note, Qualcomm chipsets on Android have been doing this for years (like Apple), so it's not a super unique thing. It's more that there was no need before.
[1] https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...
The GPU can still happily use all the rest of the memory for other use cases - which tend to be the bulk of allocations anyway. Though there might be performance implications - for example "moving" buffer ownership to the GPU would need to evict CPU caches, and often 4k pages and tlb lookups can be a pretty inefficient situation for GPU-style accesses.
That's been pretty standard for any SoC for decades. And the "differences" from Apple's SoC are more implementation details.
This isn't the first time we have UMA on the PC, btw. When SGI did their PC workstations, their 320 and 540 PC workstations had what they called Cobalt graphics chipset and crossbar with their IVC architecture. They bypassed AGP at the time completely. It was quite unique to see strict UMA on a PC. Haven't seen it since until these new systems we're seeing now on PCs and Mac.
Some software assumes pre-defined set-aside pools of memory reserved for video purposes, but the chip does actually have access to the whole pool.
That's an API issue not a hardware issue. Regardless, I believe the major APIs permit seamlessly sharing pointers at this point? (I have no experience doing that though.)
IIRC that's to maintain BIOS and Windows (+games & apps) backwards compatibility, but memory access speeds are the same.
A RTX Pro 6000 has ~24K 5th generation tensor cores, I'm guessing this would then be 1/4 of the count but 6th generation? Wasn't clear from the images.
> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.
Also "cheap while delivering enough" certainly sounds like someone is trying to temper expectations. It sounds like something sitting in-between GPU+VRAM inference and CPU+RAM one, not as a step above/besides GPU+VRAM.
(HN reaction to Vision Pro back in 2024 is almost hilarious if not ridiculous, looking at it today. I knew it would be a flop and I was so right.)
Nvidia's master plan may be making it the new normal to have "only" 400GB/s bandwidth, thus gatekeeping local model usage further behind "more memory but not as fast as the cloud can do it".
Nvidia just wants to sell stuff to everyone.
And I think for professionals doing local AI work, products like Strix Halo and Apple Silicon are a competitive threat.
A big part of maintaining the leading software ecosystem is ensuring you have competitive hardware for all your users.
I also think the RTX Spark product is relatively low effort for Nvidia. Grab a Mediatek CPU and slap an Nvidia GPU on the die. Sure, that’s oversimplifying it, but still.
> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.
So, the reason "dedicated GPU memory" is fast, isn't because it's "dedicated"; it's because the types of memory built into GPU cards — GDDR and HBM — are designed for throughput over latency.
Which is to say, GDDR and HBM memory could be shared with the CPU in UMA while still being "fast" (for GPU use-cases.) In fact, the PS4/5 and Xbox 360 / One X / Series consoles have UMA architectures that use GDDR memory as their main memory, with no regular DDR memory to be found.
What I don't understand: why don't we see UMA architectures where there's both regular DDR and GDDR/HBM memory mapped into the address space of the CPU+GPU? That seems like the best of both worlds: you'd have some memory that's "tuned" for random-access CPU usage (regular DDR), and some memory that's "tuned" for streaming GPU usage (GDDR/HBM), but either type of memory can still be put to the use it wasn't "tuned" for, just with slightly-worse performance.
I guess you'd need to do a bit of software work:
1. a bit of work in the OS kernel / malloc library to get CPU workloads to "prefer" allocating DDR memory over the GDDR/HBM memory until they've exhausted DDR memory (or maybe not, if you just tell the kernel the GDDR/HBM memory is something like a zswap thinpool);
2. and a bit of work in supported ML frameworks, to teach them about a hybrid strategy between UMA "allocate anywhere, it's all the same" and NUMA "keep assets in VRAM if possible; if you spill assets to RAM, then they must stream into VRAM on access" (i.e. "at allocation time, allocate as if the system were NUMA, VRAM first then spilling to RAM; but at execution time, use the UMA codepaths, no need to copy RAM into VRAM.")
...but once that's done, it's done.
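As a toy illustration of the two policies in that list: prefer one pool, spill into the other, and never copy afterwards. This is just a sketch with hypothetical names (`HybridPools`, `allocate`) and made-up sizes in whole GB; it models placement only and is not how any real kernel or ML framework does it.

```python
class HybridPools:
    """Toy model of one address space backed by DDR and GDDR/HBM pools."""

    def __init__(self, ddr_gb: int, gddr_gb: int) -> None:
        self.free = {"ddr": ddr_gb, "gddr": gddr_gb}

    def allocate(self, size_gb: int, consumer: str) -> dict[str, int]:
        """Place `size_gb` for a "cpu" or "gpu" consumer.

        Fill the consumer's preferred pool first and spill the remainder into
        the other pool. Both pools are mapped for both chips, so spilled data
        is used in place rather than streamed back later.
        """
        order = {"cpu": ("ddr", "gddr"), "gpu": ("gddr", "ddr")}[consumer]
        if size_gb > sum(self.free.values()):
            raise MemoryError("not enough memory in either pool")
        placement = {"ddr": 0, "gddr": 0}
        remaining = size_gb
        for pool in order:
            take = min(remaining, self.free[pool])
            self.free[pool] -= take
            placement[pool] = take
            remaining -= take
        return placement


if __name__ == "__main__":
    pools = HybridPools(ddr_gb=64, gddr_gb=32)
    print(pools.allocate(40, "gpu"))  # model weights: fast pool first, rest spills
    print(pools.allocate(16, "cpu"))  # ordinary process: DDR first
    print(pools.free)
```

The hard part in a real system would be migration and fragmentation once both pools fill up, which this sketch ignores entirely.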
Windows 11 can run just fine on 8GB of memory; what can't is Google Chrome.
While this NVIDIA system is inferior from the point of view of the memory capacity, its main advantage is that the top models will have a bigger GPU, i.e. with 6144 or 5120 FP32 execution units, compared to 2560 for the AMD GPU (compared to the NVIDIA CPU, the AMD CPU has a better multi-threaded performance for legacy programs, and a much better multi-threaded performance for the applications that use AVX-512).
However, these top models with big GPUs will also be much more expensive than the competing AMD system, while also being much more expensive than a laptop or mini-PC with an equivalent discrete NVIDIA GPU (which has the disadvantage of having direct access only to a much smaller, even if faster, memory).
It's an interesting "newcomer" and the more the better but calling this a "beast" and a "game changer" is ridiculous to say the least.
Then there is the price..
The obvious comparison here is the M5 Max, where you can buy a MacBook Pro with 128GB of memory that is also unified. Obviously CUDA cores are specific to NVidia, so it's hard to compare directly, but I've seen claims that the M5 Max is roughly equivalent to ~4000 CUDA cores. This obviously depends on the workload and on whether the chip supports the precision you want to use (e.g. FP4).
The M5 Max has memory bandwidth of 819GB/s. The RTX Spark I believe is ~600. So it might be slightly better than the current generation of Macs but likely worse than the expected M5 Ultras of the new Mac Studios (likely Q3 2026).
For comparison, a 5090 has >20k CUDA cores and 1800GB/s memory bandwidth with 32GB of VRAM. The RTX 6000 Pro (at ~$10k) has 96GB of VRAM, same bandwidth and ~24k CUDA cores.
We have to see what RTX Spark systems sell for but the DGX Spark is in the Mac Studio price range (~$4k).
I do think Apple has a real opportunity here, but their offerings aren't quite there yet. The M5 Ultras might be a really attractive option for local LLMs. I expect them to be in high demand.
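For a sense of why the bandwidth numbers matter for local LLMs: token generation is usually memory-bound, since each new token has to stream the active weights once. A back-of-the-envelope sketch follows, assuming a hypothetical model whose active weights take 16 GB and ignoring KV cache traffic and compute limits. The bandwidth figures are the ones quoted in this thread and are not verified.

```python
def max_decode_tokens_per_second(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Rough upper bound: every generated token streams the active weights once."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# Bandwidth figures as quoted in this thread (GB/s); treat them as unverified.
QUOTED_BANDWIDTH = {
    "DGX Spark": 273,
    "RTX Spark (claimed)": 600,
    "M5 Max": 819,
    "RTX 5090": 1800,
}

if __name__ == "__main__":
    weights_gb = 16.0  # hypothetical size of the weights read per token
    for name, bandwidth in QUOTED_BANDWIDTH.items():
        bound = max_decode_tokens_per_second(bandwidth, weights_gb)
        print(f"{name:<20} <= {bound:.0f} tok/s")
```

Real throughput lands below these ceilings, and for MoE models only the experts that are actually activated count toward the weights read per token.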
Who claimed that? The M5 is still a raster focused GPU, dedicated matmul blocks be damned. For some workloads that napkin math might work out, but for many others it's a wild overshoot. Time-to-first-token still favors CUDA, and real-world training workloads aren't getting anywhere near Apple Silicon.
All of the memory bandwidth in the world is useless if you spend 15 minutes processing 64k tokens worth of context prefill. This is where CUDA shines.
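Prefill, unlike token generation, is mostly compute-bound, so time-to-first-token is roughly prompt length divided by prefill throughput. Here is a small sketch of that arithmetic using the 64k-token, 15-minute scenario from the comment above. The function names and the 2,000 tok/s comparison figure are hypothetical, not benchmarks of any device.

```python
def prefill_seconds(prompt_tokens: int, prefill_tokens_per_second: float) -> float:
    """Time spent processing the prompt before the first output token appears."""
    if prefill_tokens_per_second <= 0:
        raise ValueError("prefill_tokens_per_second must be positive")
    return prompt_tokens / prefill_tokens_per_second


def implied_prefill_rate(prompt_tokens: int, seconds: float) -> float:
    """Prefill throughput implied by an observed time-to-first-token."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return prompt_tokens / seconds


if __name__ == "__main__":
    # The scenario from the comment: 64k tokens of context taking 15 minutes.
    rate = implied_prefill_rate(64_000, 15 * 60)
    print(f"implied prefill rate: {rate:.0f} tok/s")
    # Hypothetical faster device sustaining 2,000 tok/s of prefill.
    print(f"same prompt at 2,000 tok/s: {prefill_seconds(64_000, 2_000):.0f} s")
```

None of this depends on memory bandwidth, which is the point being made: a fast memory bus does not help if prompt processing is the bottleneck.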
Nvidia going from GPU to CPU now?
Up to $5000 because why not?
With that money you can build a real PC with rtx 5090!
A powerful new chapter for Windows PCs, accelerated by Nvidia RTX Spark
https://news.ycombinator.com/item?id=48352693
Nvidia RTX Spark
Decent single core (a long way from Apple level, but decent), and it makes up for that in core count to provide M5-level performance, CPU-wise. On memory bandwidth it is kind of starved, at 1/6th that of many GPUs.
They got Microsoft to customize Windows for the RTX Spark, and will likely have to brutally throttle it when running as a laptop (it's literally a 140W TDP chip), and that's neat. It's going to be a very expensive laptop.
DGX Spark has a maximum of 273 GB/s bandwidth in ideal scenarios (hard to reach)
That puts it between an M5 (153) and M5 Pro (307)
Mind you, that's not to/from memory, which indeed only has 273 GB/s.
Tech companies have strangled their own market.
Nvidia is milking the market now. We need more competition again - currently we have a mafia controlling the prices, not just Nvidia but all the AI companies. The price increases should be paid by them, not by us. The "free market" is being manipulated by them here.
Looking at it more, I believe the story repeats with the TSMC processes used for the CPU vs chips like GB200 as well.
Even if none of the above were the case, the question still isn't "why not make the enterprise GPU", it's "why not make the higher margin per chip area product". If the NV1/GB10 take less die space and cost a lot, it's not immediately apparent whether the enterprise GPU actually nets Nvidia more $ per die. That's why it's relevant that these will be sold at a premium.
And maybe for NVIDIA and MS it is also about them quietly betting that local models are, in fact, going to be good enough for most tasks pretty soon.
I'd say this relates directly to the cost of running AI models remotely.
And we won't know what the actual cost will be until AI vendors recover the huge pile of cash they've dumped into development (plus interest).
The hardware for 50 tokens per second with a four bit quantisation of Gemma 4 26B or the sparse Qwen 3.6 is not really that expensive: it’s a secondhand M1 Max.
Beyond that, I agree. I think moving planning tasks to local is a now thing, not that it really has much impact on token spend. I also think many small coding tasks are fully within the grasp of the above two models.
The main issue right now is that the software landscape is rather confusing, but I reckon uncomplicated Gemma 4 26B QAT support with MTP is a few weeks away.
But most businesses don't really care about most of the apple --- they only need their special bite out of it.
For example, doctors mainly care about medicine. Nvidia is attempting to provide the hardware needed for local, specialized models.
But I don’t know about specialised: this could run quite large models with MoE.
Running local models will stay niche for a while, unless we see breakthroughs
Most doctors don't care much about engineering or accounting or software development or 10000 other things that big vendor models address.
This area is yet to be really explored. Nvidia aims to provide the hardware to do so.
I'm not sure anyone really understands why.
Bill Gates had a quote some years ago...
People have still not learned how fast we improve our tech and how much cheaper things get, I guess :)
Clip me :). You are currently living through the final stages of unrestricted computing in the hands of the 'public'. Our regimes are going to pull up the drawbridge in the name of 'safety'. Download the open models asap and prepare for an airgapped computing environment. That will be your frontier in not extremely neutered AI in the near future.
I am so hoping I'm completely wrong on this btw.