The conclusion that it was not the developer's fault was correct, but assuming anything other than a problem somewhere in the software stack is unreasonable.
All neural accelerator hardware models and all neural accelerator software stacks output slightly different results. That is a truth of the world.
The same is true for GPUs and 3d rendering stacks too.
We don't usually notice that, because the tasks themselves tolerate those minor errors. You can't easily tell the difference between an LLM that had 0.00001% of its least significant bits perturbed one way and one that had them perturbed the other.
But you could absolutely construct a degenerate edge case that causes those tiny perturbations to fuck with everything fiercely. And very rarely, this kind of thing might happen naturally.
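To make that concrete, here is a minimal sketch in plain Python. It shows that merely changing the order of a floating-point sum changes the last bit of the result, and that a hard threshold placed right on the boundary turns that last-bit difference into a different decision. The `THRESHOLD` value is a hypothetical cutoff picked to hit the edge case; it is not taken from any real model.

```python
# Floating-point addition is not associative, so the order in which an
# accelerator accumulates partial sums changes the last bits of the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two results differ only in the least significant bit.
difference = abs(left_to_right - right_to_left)

# Degenerate edge case: a hard threshold sitting exactly on the value
# turns that last-bit difference into a different decision.
THRESHOLD = 0.6  # hypothetical cutoff, chosen to land on the boundary
decision_a = left_to_right > THRESHOLD
decision_b = right_to_left > THRESHOLD

print(left_to_right, right_to_left, difference)
print(decision_a, decision_b)
```

In a real network the same effect would need something like a near-tie in an argmax or a sampling step, which is why it usually goes unnoticed.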
>And very rarely, this kind of thing might happen naturally.
It is not a question of rarity; it is a question of the stability of the numerical problem. Luckily, most of the computation in an LLM is matrix multiplication, which is an extremely well understood numerical problem and can be checked for good conditioning.
If two different numerical implementations differed significantly on a well-conditioned, computation-heavy problem, that would indicate a disastrous fault in the design or condition of the hardware, one that would show up in most computations done on that hardware.
If you weigh the likelihood of OP running into a hardware bug that causes significant numerical error on one specific computational model against the alternative explanation of a problem in the software stack, it is clear that the latter explanation is orders of magnitude more likely. Finding a single floating-point arithmetic hardware bug is exceedingly rare (although Intel had one), but stacking them up in such a way that one particular neural network does not function, while everything else on the hardware runs perfectly fine, is astronomically unlikely.
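As a rough illustration of what "checking for good condition" can mean, here is a sketch that computes the standard condition number of a dot product, sum(|a_i * b_i|) / |sum(a_i * b_i)|, in plain Python. A matrix multiplication is a collection of such dot products. The function name `dot_condition` and the sample vectors are made up for this example; it assumes the dot product is nonzero.

```python
import math

def dot_condition(a, b):
    """Condition number of the dot product a.b: sum|a_i*b_i| / |sum a_i*b_i|.

    Values near 1 mean rounding differences stay tiny relative to the result;
    large values mean heavy cancellation, where they can dominate it.
    Assumes the dot product is nonzero.
    """
    products = [x * y for x, y in zip(a, b)]
    # math.fsum gives a correctly rounded sum, so the estimate itself is
    # not disturbed by the cancellation it is measuring.
    return math.fsum(abs(p) for p in products) / abs(math.fsum(products))

# All terms have the same sign: perfectly conditioned.
well = dot_condition([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])

# Two huge terms cancel and leave a tiny result: badly conditioned.
ill = dot_condition([1e8, 1.0, -1e8], [1.0, 1.0, 1.0])

print(well, ill)
```

On a well-conditioned input, two correct implementations can only disagree by a small multiple of machine epsilon, so a large disagreement points at a bug rather than at rounding.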
You're being unfair here. The showpiece software that uses that hardware wouldn't install, and almost all software ignores it.
I highly doubt that you could have a usable iPhone with a broken neural engine; at the very least, it would be obvious to the user that something is very wrong.
Aah, the old "you're holding it wrong" defense.
What has existed before is the Apple Neural Engine (ANE), which is very different from the newer Neural Accelerator support within the GPU blocks. In fact MLX does not even support ANE yet, since at least in previous versions it was hardware-limited to computing FP16 and INT8 MADDs, and not even that fast.
"In fact MLX does not even support ANE yet"
I didn't say otherwise. The ANE is a fantastic unit for small, power-efficient models, like extracting text from images, doing depth modelling, etc. It's not made for LLMs, or the other sorts of experimental stuff MLX is intended for. Though note that the MLX author's stated reason for not supporting the ANE is that it has a "closed-source" API (https://github.com/ml-explore/mlx/issues/18#issuecomment-184...), which makes it unsuitable for an open-source project, given that MLX didn't want to just lean on CoreML. But anyway, the ANE is fantastically fast at what it does, while sipping juice.
In any case, the code change shown should have zero impact on the running of MLX on an iPhone 16 Pro. MLX tries to really leverage platform optimizations, so maybe another bifurcation is making the wrong choice.
The MLX folks have various rationales for not supporting the ANE (at least as of yet), but one of them is that any real support requires implementing explicit splits in the graph of computations, where ANE-suitable portions are to be dispatched to the ANE and everything else goes back to the GPUs. That's not necessarily trivial.
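For a feel of what such a split involves, here is a toy sketch in plain Python. It is not how MLX or CoreML actually partition anything; the op list, the `device_for` rule, and the device names are all hypothetical. The only thing borrowed from this thread is the assumption that the ANE handles FP16 and INT8 work while everything else goes to the GPU.

```python
from itertools import groupby

# Assumption taken from the thread: the ANE only handles FP16 and INT8 work.
ANE_DTYPES = {"fp16", "int8"}

def device_for(op):
    """Pick a device for one (name, dtype) op. Hypothetical rule."""
    _name, dtype = op
    return "ane" if dtype in ANE_DTYPES else "gpu"

def split_graph(ops):
    """Group a linear op sequence into contiguous per-device segments."""
    return [
        (device, [name for name, _dtype in group])
        for device, group in groupby(ops, key=device_for)
    ]

# Made-up op sequence; a real graph has branches, not just a chain.
ops = [
    ("conv1", "fp16"),
    ("conv2", "fp16"),
    ("softmax", "fp32"),
    ("matmul", "int8"),
    ("norm", "fp32"),
]

segments = split_graph(ops)
handoffs = len(segments) - 1  # each segment boundary is a device transfer

for device, names in segments:
    print(device, names)
print("handoffs:", handoffs)
```

Even in this toy version every segment boundary is a device handoff, and a real implementation would also have to decide whether each handoff costs more than the ANE saves, which is part of why it's not trivial.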