That's exactly what I had in mind. When I started this, I kept going back and forth on one question: "Can a model this size actually generate logical English text?" I played with a few different models of the same size and was really depressed to see how bad they were. But then I discovered more and more tiny models, and LaMini-125M, LaMini-256M, nanowhale-100m, and SmolLM2-135M-Instruct are very decent. So I decided to give it a try.

by skerit, 3 hours ago


I've been working on something like this too, for quite a while! Though I'm trying to get a non-quadratic-attention LLM (or SLM) up and running.

And anyway, I think the most important thing is dataset quality. Dumping in whatever dataset you find on Hugging Face is a recipe for mediocrity, so I'm also spending a lot of time on that.

by giancarlostoro, 4 hours ago


In my case, I have a local branch where I'm experimenting with BitNet since it can run on a CPU too.
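
For readers wondering why BitNet is CPU-friendly, here is a minimal sketch of the ternary-weight idea as I understand it from the BitNet b1.58 description: weights are scaled by their mean absolute value, rounded, and clipped to {-1, 0, 1}, so a matrix-vector product needs only additions and subtractions plus one rescale. The function names (`ternary_quantize`, `ternary_matvec`) and the toy weights are made up for illustration, and the sketch leaves out the activation quantization and the packed kernels a real implementation would use.

```python
def ternary_quantize(weights, eps=1e-8):
    """Absmean quantization: scale by mean |w|, round, clip to {-1, 0, 1}."""
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat) + eps
    quantized = [
        [max(-1, min(1, round(w / scale))) for w in row]
        for row in weights
    ]
    return quantized, scale


def ternary_matvec(quantized, scale, x):
    """Matrix-vector product with no weight multiplications, then one rescale per row."""
    out = []
    for row in quantized:
        acc = 0.0
        for q, xi in zip(row, x):
            if q == 1:
                acc += xi      # weight +1: add the activation
            elif q == -1:
                acc -= xi      # weight -1: subtract it
            # weight 0: skip entirely
        out.append(acc * scale)
    return out


# Toy weights, purely illustrative.
W = [[0.9, -0.05, -1.1],
     [0.4, 0.0, -0.6]]
Wq, s = ternary_quantize(W)
y = ternary_matvec(Wq, s, [1.0, 2.0, 3.0])
print(Wq, round(s, 4), y)
```

The inner loop never multiplies by a weight, which is the property that makes this kind of model plausible on hardware without fast matrix units.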

by LoganDark, 5 hours ago


Qwen seems to be going in a good direction: hundreds of experts in their MoE models. Extremely low active-weight counts while still performing quite admirably. I look forward to models with many, many more experts, to the point where anyone with enough random access can generate hundreds or thousands of tokens per second. Because right now, 80–120 t/s is pretty slow.
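
The arithmetic behind "low active-weight counts" can be sketched roughly as follows. This is a back-of-the-envelope model, not a description of any real Qwen configuration: every number below (shared parameters, expert size, expert count, top-k, bytes per parameter, memory bandwidth) is a hypothetical placeholder, and the throughput figure is only an upper bound that assumes decoding is limited by reading each active weight once per token.

```python
def active_params(shared: int, per_expert: int, experts_per_token: int) -> int:
    """Parameters touched for one token: shared weights plus the routed experts."""
    return shared + per_expert * experts_per_token


def decode_tokens_per_second(active: int, bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    return bandwidth_bytes_per_s / (active * bytes_per_param)


# Hypothetical configuration, not any real model.
SHARED = 1_000_000_000       # attention, embeddings, other always-on weights
PER_EXPERT = 100_000_000     # parameters in one expert
NUM_EXPERTS = 256            # experts stored in memory
TOP_K = 8                    # experts routed per token

total = SHARED + PER_EXPERT * NUM_EXPERTS
active = active_params(SHARED, PER_EXPERT, TOP_K)

# Assumed: 1 byte per parameter (8-bit weights) and 100 GB/s memory bandwidth.
bound = decode_tokens_per_second(active, 1.0, 100e9)
print(f"total={total:,} active={active:,} bound={bound:.1f} tok/s")
```

Under these assumptions, adding more experts grows `total` (and the memory needed to hold the model) but leaves `active` unchanged, while shrinking each expert or lowering top-k raises the bound directly.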