If Microslop aren't gonna train the model themselves to prove their own thesis, why would others? They've had 2 years (I think?) to prove BitNet in at least some way, are you really saying they haven't tried so far?
Personally that makes it slightly worrisome to just take what they say at face value, why wouldn't they train and publish a model themselves if this actually led to worthwhile results?
But it doesn't mean, idea is worthless.
You could have said same about Transformers, Google released it, but didn't move forward, turns out it was a great idea.
I don't think you can, Google looked at the research results, and continued researching Transformers and related technologies, because they saw the value for it particularly in translations. It's part of the original paper, what direction to take, give it a read, it's relatively approachable for being a machine learning paper :)
Sure, it took OpenAI to make it into an "assistant" that answered questions, but it's not like Google was completely sleeping on the Transformer, they just had other research directions to go into first.
> But it doesn't mean, idea is worthless.
I agree, they aren't, hope that wasn't what my message read as :) But, ideas that don't actually pan out in reality are slightly less useful than ideas that do pan out once put to practice. Root commentator seems to try to say "This is a great idea, it's all ready, only missing piece is for someone to do the training and it'll pan out!" which I'm a bit skeptical about, since it's been two years since they introduced the idea.
The core insight necessary for chatgpt was not scaling (that was already widely accepted): the insight was that instead of finetuning for each individual task, you can finetune once for the meta-task of instruction following, which brings a problem specification directly into the data stream.
A successful ternary model would basically erase all that value overnight. In fact, the entire stock market could crash!
Think about it: This is Microsoft we're talking about! They're a convicted monopolist that has a history of manipulating the market for IT goods and services. I wouldn't put it past them to refuse to invest in training a ternary model or going so far as to buy up ternary startups just to shut them down.
Want to make some easy money: Start a business training a ternary model and make an offer to Microsoft. I bet they'll buy you out for at least a few million even if you don't have a product yet!
Occam’s Razor suggests this simply doesn’t yield as good results as the status quo
The "new" on huggingface banner has weights that were uploaded 11 months ago, and it's 2B params. Work on this in the repo is 2 years old.
The amount of publicity compared to the anemic delivery for BitNet is impressive.
You'd still need full GPUs for training, but for inference the hardware would be orders of magnitude simpler than what Nvidia is making
Interestingly, a trit x float multiplier is cheaper than a trit x integer multiplier in hardware if you're willing to ignore things like NaNs.
0 and 1 are trivial, just a mux for identity and zero. But because floats are sign-magnitude, multiply by -1 is just an inverter for the sign bit, where as for integers you need a bitwise inverter and full incrermenter.
The relevant trit arithmetic should be on display in the linked repo (I haven't checked). Or try working it out for the uncompressed 2 bit form with a pen and paper. It's quite trivial. Try starting with a couple bitfields (inputs and weights), a couple masks, and see if you can figure it out without any help.
I happened to "live" on 7.0-7.5 tok/sec output speed for a while, and it is an annoying experience. It is the equivalent of walking behind someone slightly slower on a footwalk. I dealt with this by deliberately looking away for a minute until output was "buffered" and only then started reading.
For any local setup I'd try to reach for 10 tok/sec. Sacrifice some kv cache and shove a few more layers on your GPU, it's worth it.
I'm hoping that today's complaints are tomorrow's innovations. Back when 1Mb hard drive was $100,000, or when Gates said 640kb is enough.
Perhaps some 'in the (chip) industry' can comment on what RAM manufacturers are doing at the moment - better, faster, larger? Or is there not much headroom left and it's down to MOBO manufacturers, and volume?
The last logical step of this process would be figuring out how to mix the CPU transistors with the RAM capacitors on the same chip as opposed to merely stacking separate chips on the same package.
A related stopgap is the AI startup (forget which) making accelerators on giant chips full of SRAM. Not a cost effective approach outside of ML.
In what way? On modern processors, a Fused Multiply-Add (FMA) instruction generally has the exact same execution throughput as a basic addition instruction
typically for 1-bit matmul, you can get away with xors and pop_counts which should have a better throughput profile than FMA when taking into account the SIMD nature of the inputs/outputs.
actv = A[_:1] & B[_:1]
sign = A[_:0] ^ B[_:0]
dot = pop_count(actv & !sign) - pop_count(actv & sign)
It can probably be made more efficient by taking a column-first format.
Since we are in CPU land, we mostly deal with dot products that match the cache size, I don't assume we have a tiled matmul instruction which is unlikely to support this weird 1-bit format.
So it's not that individual ops are faster — it's that the packed representation lets each instruction do more useful work, and you're moving far less data from memory to do it.
But it seems the biggest model available is 10B? Somewhat unusual and does make me wonder just how challenging it will be to train any model in the 100B order of magnitude.
The key insight of the BitNet paper was that using their custom BitLinear layer instead of normal Linear layers (as well as some more training and architecture changes) lead to much, much better results than quantizing an existing model down to 1.58 bits. So you end up making a full training run in bf16 precision using the specially adapted model architecture
However this user uses — in almost all his posts and he had a speed of 1 comment per minute or so on multiple different topics.
Edit: oh, just recalled dang restricted Show HNs the other day to only non-new users (possibly with some other thresholds). I wonder if word got out and some are filling accounts with activity.
I suspect that they are trying to fake engagement prior to making their first "show" post as well.
It's not a question of if there are other bots out there, but only what % of comments on HN right now and elsewhere are bot generated. That number is only going to increase if nothing is done.
And I did not speak out
Because I was not using em dashes
Then they claimed that if you're crammar is to gud you r not hmuan
And I did not spek aut
Because mi gramar sukcs
Then they claimed that if you actually read the article that you are trying to discuss you are not human...
Not only are we losing the ability to communicate clearly without the assistance of computers, those who can are being punished for it.
Created confusion and frustration will make it much harder to separate signal from the noise for most people.
Residential Treatment Facility for Adults? Red Tail Flight Academy?
(only suggesting that it's intentional because it's been there so long)
> We evaluated bitnet.cpp in terms of both inference speed and energy cost. Comprehensive tests were conducted on models with various parameter sizes, ranging from 125M to 100B. specific configurations for each model are detailed in the Appendix A.
Boy would I love to give my agent access to my Quickbooks. They pushed out an incomplete MCP and haven't touched it since.
can we stop already with these decimals and just call it "1 trit" which it exactly is?
Also, isnt there different affinities to 8bit vs 4bit for inferences
I imagine you got 96gb because you thought you'd be running models locally? Did you not know the phrase Unified Memory is marketing speak?