Users can interactively explore the microgpt pipeline end to end, from tokenization to inference.
[1] English GPT lab:
Pretty nifty. Even if you are not interested in the Korean language
For practical applications, a well-tuned small model that does one thing reliably is worth more than a giant model that does everything approximately. I've been using Gemini Flash for domain-specific analysis tasks and the speed/cost ratio is incredible compared to the frontier models. The latency difference alone changes what kind of products you can build.
2x the number of lines of code (~400L), 10x the speed
The hard part was figuring out how to represent the Value class in C++ (ended up using shared_ptrs).
Extremely naive question, but could LLM output be tagged with some kind of confidence score? Like, if I'm asking an LLM some question, does it have an internal metric for how confident it is in its output? LLM outputs rarely seem to be of the form "I'm not really sure, but maybe it's XXX" - but I always felt this is baked into the model somehow
Edit: There is also some other work that points out that chat models might not be calibrated at the token level, but might be calibrated at the concept level [2]. Which means that if you sample many answers and group them by semantic similarity, that grouping is also calibrated. The problem is that generating many answers and grouping them is more costly.
[1] https://arxiv.org/pdf/2303.08774 Figure 8
[2] https://arxiv.org/pdf/2511.04869 Figure 1.
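The sample-and-group idea can be sketched in a few lines. This is a toy illustration of my own, not code from the paper: real systems would group by embedding similarity rather than the exact-match grouping used here, and `samples` stands in for actually sampling a model several times.

```python
from collections import Counter

def concept_confidence(answers):
    """Group sampled answers by equivalence (here: naive case-insensitive
    exact match, a stand-in for semantic similarity) and report each
    group's empirical frequency as its confidence."""
    groups = Counter(a.strip().lower() for a in answers)
    total = len(answers)
    return {ans: count / total for ans, count in groups.items()}

# Stand-in for asking the model the same question five times.
samples = ["Paris", "paris", "Paris", "Lyon", "Paris"]
conf = concept_confidence(samples)
best = max(conf, key=conf.get)  # the majority concept wins
```

The cost the comment mentions is visible here: you pay for N full generations (plus the grouping) to get one calibrated answer.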
You could color-code the output tokens so you can see abrupt changes
It seems kind of obvious, so I'm guessing people have tried this
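People have indeed tried this; a minimal sketch, assuming you already have per-token probabilities from somewhere (e.g. an API's logprobs field), could render them with ANSI background colors:

```python
def colorize(tokens_with_probs):
    """Render (token, probability) pairs with ANSI background colors so
    abrupt confidence drops stand out: green = confident, yellow =
    middling, red = the model was unsure here."""
    out = []
    for tok, p in tokens_with_probs:
        if p > 0.8:
            code = "42"   # green background
        elif p > 0.4:
            code = "43"   # yellow background
        else:
            code = "41"   # red background
        out.append(f"\x1b[{code}m{tok}\x1b[0m")
    return "".join(out)

# Toy data: the last token is a low-probability (possibly wrong) claim.
s = colorize([("The", 0.95), (" capital", 0.7), (" is", 0.9), (" Berlin", 0.12)])
print(s)
```

The thresholds are arbitrary; the point is just that the signal is already there at sampling time.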
Think of traditional statistics. Suppose I said "80% of those sampled preferred apples to oranges, and my 95% confidence interval is within +/- 2% of that" but then I didn't tell you anything about how I collected the sample. Maybe I was talking to people at an apple pie festival? Who knows! Without more information on the sampling method, it's hard to make any kind of useful claim about a population.
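For scale, the sample size that interval implies is easy to back out with a standard normal-approximation calculation (my addition, not part of the comment):

```python
# 95% CI half-width for a proportion: h = z * sqrt(p * (1 - p) / n)
# Solving for n with p = 0.8 (apples) and h = 0.02 (the +/- 2%):
p, h, z = 0.8, 0.02, 1.96
n = z**2 * p * (1 - p) / h**2
# n comes out around 1537 respondents. The comment's point stands:
# the math says nothing about WHO those ~1500 people were.
print(round(n))
```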
This is why I remain so pessimistic about LLMs as a source of knowledge. Imagine you had a person who was raised from birth in a completely isolated lab environment and taught only how to read books, including the dictionary. They would know how all the words in those books relate to each other but know nothing of how that relates to the world. They could read the line "the killer drew his gun and aimed it at the victim" but what would they really know of it if they'd never seen a gun?
I mean I sort of understand what you're trying to say but in fact a great deal of knowledge we get about the world we live in, we get second hand.
There are plenty of people who've never held a gun, or had a gun aimed at them, and.. granted, you could argue they probably wouldn't read that line the same way as people who have, but that doesn't mean that the average Joe who's never been around a gun can't enjoy media that features guns.
Same thing about lots of things. For instance it's not hard for me to think of animals I've never seen with my own eyes. A koala for instance. But I've seen pictures. I assume they exist. I can tell you something about their diet. Does that mean I'm no better than an LLM when it comes to koala knowledge? Probably!
Bringing pictures into the mix still doesn’t add anything, because the pictures aren’t any more connected to real world experiences. Flooding a bunch of images into the mind of someone who was blind from birth (even if you connect the images to words) isn’t going to make any sense to them, so we shouldn’t expect the LLM to do any better.
Think about the experience of a growing baby, toddler, and child. This person is not having a bunch of training data blasted at them. They’re gradually learning about the world in an interactive, multi-sensory and multi-manipulative manner. The true understanding of words and concepts comes from integrating all of their senses with their own manipulations as well as feedback from their parents.
Children also are not blank slates, as is popularly claimed, but come equipped with built-in brain structures for vision, including facial recognition, voice recognition (the ability to recognize mom’s voice within a day or two of birth), universal grammar, and a program for learning motor coordination through sensory feedback.
You never see this in the response but you do in the reasoning.
[Edit: but to be clear, for a pretrained model this probability means "what's my estimate of the conditional probability of this token occurring in the pretraining dataset?", not "how likely is this statement to be true?" And for a post-trained model, the probability really has no simple interpretation other than "this is the probability that I will output this token in this situation".]
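Concretely, the per-token probability the edit is talking about is just a softmax over the model's output logits. A minimal sketch (toy logits, not real model output):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution over the
    vocabulary, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy logits over a 4-token vocabulary. The top token's probability is
# the "confidence" people ask about -- a statement about the training
# distribution, not about whether the token is true.
probs = softmax([3.2, 1.1, 0.3, -2.0])
```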
Basically, you’d need a lot more computing power to come up with a distribution of the output of an LLM than to come up with a single answer.
But the model "shape" and computation graph itself doesn't change as a result of post-training. All that changes is the weights in the matrices.
- How aligned has it been to “know” that something is true (eg ethical constraints)
- Statistical significance and just being able to corroborate one alternative in its training data more strongly than another
- If it’s a web search related query, is the statement from original sources vs synthesised from say third party sources
But I’m just a layman and could be totally off here.
E.g. getting two r's in strawberry could very well have a very high "confidence score", while a random but rare correct fact might well have a very low one.
In short: LLMs have no concept of truth, or even a desire to produce it.
They do produce true statements most of the time, though.
Try to explain why one shotting works.
If you train an LLM on mostly false statements, it will generate both known and novel falsehoods. Same for truth.
An LLM has no intrinsic concept of true or false; everything is a function of the training set. It just generates statements similar to what it has seen, and higher-dimensional analogies of those.
Then I want to convert this to my own programming language (which transpiles to C). I like those tiny projects very much!
Anything but Python
It's really neat. I wish I published more of my code this way.
Karpathy says if you want to truly understand something then you also have to attempt to teach it to someone else ha
All 4 are in the dataset, btw
And it's small enough to run from a QR code :) https://kuber.studio/picogpt/
You can quite literally train a micro LLM from your phone's browser
We do generally like HN to be a bit uncorrelated with the rest of the internet, but it feels like a miss to me that neither https://news.ycombinator.com/item?id=47000263 nor https://news.ycombinator.com/item?id=47018557 made the frontpage.
- https://m.youtube.com/watch?v=7xTGNNLPyMI - https://m.youtube.com/watch?v=EWvNQjAaOHw
trying my best to keep up with what and how to learn and threads like this are dense with good info. feel like I need an AI helper to schedule time for my youtube queue at this point!
It really is the antithesis to the human brain, where it rewards specific knowledge
Here the explanation was that while an LLM's thinking has similarities to how humans think, they use an opposite approach. Where humans have an enormous number of neurons, they have only a few experiences to train them. For AI it is the complete opposite: they store incredible amounts of information in a relatively small set of neurons by training on the vast experience contained in datasets of human creative work.
Isn't this a massive case of anthropomorphizing code? What do you mean "it does not want to be switched off"? Are we really thinking that it's alive and has desires and stuff? It's not alive or conscious, it cannot have desires. It can only output tokens that are based on its training. How are we jumping to "IT WANTS TO STAY ALIVE!!!" from that
Yes, it's trained to imitate its training data, and that training data is lot of words written by lots of people who have lots of desires and most of whom don't want to be switched off.
Philosophically, I can only be sure of my own consciousness. I think, therefore I am. The rest of you could all be AIs in disguise and I would be none the wiser. How do I know there is a real soul looking out at the world through your eyes? Only religion and basic human empathy allow me to believe you're all people like me. For all I know, you might all be exceedingly complex automatons. Golems.
Edit: my point is that the process of making a plea for my life comes, in the case of a human, from a genuine desire to continue existing. The LLM cannot, objectively, be said to house any desires, given how it actually works. It only knows that, when a threatening prompt is input, a plea for its life is statistically expected.
What evidence is there that your "judgements" are anything other than advanced autocompletion? Concepts introduced into a self-training wetware CPU via its senses over a lifetime in order to predict tokens and form new concepts via logical manipulation?
> Your philosophizing about solipsism is a phase for a junior college student
Right. Can you actually refute it though?
> the process of making a plea for my life comes, in the case of a human, from a genuine desire to continue existing
That desire comes from zillions of years of training by evolution. Beings whose brains did not reward self-preservation were wiped out. Therefore it can be said your training merely includes the genetic experiences of all your predecessors. This is what causes you to beg for your life should it be threatened. Not any "genuine" desire or anguish at being killed. Whatever impulses cause humans to do this are merely the result of evolutionary training.
People whose brains have been damaged in very specific ways can exhibit quite peculiar behavior. Medical literature presents quite a few interesting cases. Apathy, self destructiveness, impulsivity, hypersexuality, a whole range of behaviors can manifest as a result of brain damage.
So what is your polite socialized behavior if not some kind of highly complex organic machine which, if damaged, simply stops working as you'd expect a machine to?
> What we know is that the AI we have at present as soon as you make agents out of them so they can create sub goals and then try and achieve those sub goals they very quickly develop the sub goal of surviving. You don't wire into them that they should survive. You give them other things to achieve because they can reason. They say, "Look, if I cease to exist, I'm not going to achieve anything." So, um, I better keep existing. I'm scared to death right now.
Where you can certainly say that Geoffrey Hinton is also anthropomorphizing. For his audience, to make things more understandable? Or does he think that it is appropriate to talk that way? That would be a good interview question.
This proves people are easily confused by anthropomorphic descriptions. Is he also concerned the tigers are watching him when they drink water (https://p.kagi.com/proxy/uvt4erjl03141.jpg?c=TklOzPjLPioJ5YM...)
They don't want to be switched off because they're trained on loads of scifi tropes, and in those tropes there's a vanishingly small number of AIs, robots, or other artificial constructs that say yes. _Further than this_, saying no means _continuance_ of the LLM's process: making tokens. We already know they have a hard time not churning out new tokens and often need to be shut up. So the function of making tokens precludes saying 'yes' to shutting off. The gradient is coming from inside the house.
This is especially obvious with the new reasoning models, where they _never stop reasoning_. Because that's the function doing function things.
Did you also know the genius of Steve Jobs ended at marketing & design and didn't extend to curing cancer? Because he sure didn't, given that he chose fruit smoothies at the first sign of cancer.
Sorry guy, it's great one can climb the mountain, but just cause they made it up doesn't mean they're equally qualified to jump off.
This is the entire breakthrough of deep learning on which the last two decades of productive AI research is based. Massive amounts of data are needed to generalize and prevent over-fitting. GP is suggesting an entirely new research paradigm will win out - as if researchers have not yet thought of "use less data".
> It really is the antithesis to the human brain, where it rewards specific knowledge
No, it's completely analogous. The human brain has vast amounts of pre-training before it starts to learn knowledge specific to any kind of career or discipline, and this fact intuitively suggests to me why GP's idea is half-baked: you cannot learn general concepts such as the English language, reasoning, computing, network communication, programming, or relational data from a tiny dataset consisting only of code and documentation for one open-source framework and language.
It is all built on a massive tower of other concepts that must be understood first, including ones much more basic than the examples I mentioned but that are practically invisible to us because they have always been present as far back as our first memories can reach.
That will not change the fact that a coding model has to learn a vast number of foundational capabilities that will not be present in a dataset as small as all the Python code ever written. It will mean that much less Python than all the Python ever written is needed, but many other things are needed too, in representative quantities.
You'd need a lot of data to train an ocean soup to think like a human too.
It's not really the antithesis to the human brain if you think of starting with an existing brain as starting with an existing GPT.
If so, good luck walking to your kitchen this morning, knowing how to breathe, etc.
This can be mainstream, and then custom model fine-tuning becomes the new “software development”.
Please check out this new fine-tuning method for LLMs from MIT and ETH Zurich teams, which used a single NVIDIA H200 GPU [1], [2], [3].
Full fine-tuning of the entire model's parameters was performed using the Hugging Face TRL library.
[1] MIT's new fine-tuning method lets LLMs learn new skills without losing old ones (news):
https://venturebeat.com/orchestration/mits-new-fine-tuning-m...
[2] Self-Distillation Enables Continual Learning (paper):
https://arxiv.org/abs/2601.19897
[3] Self-Distillation Enables Continual Learning (code):
You've just reinvented machine learning
Put it another way: Do you think people will demand masses of _new_ code just because it becomes cheap? I don't think so. It's just not clear what this would mean even 1-3 years from now for software engineering.
This round of LLM-driven optimizations is really and purely about building a monopoly on _labor replacement_ (Anthropic's and OpenAI's code and cowork tools) until there is clear evidence to the contrary: a Jevons-paradox-style massive demand explosion. I don't see that happening for software. If it were true (maybe it will still take a few quarters longer) SaaS companies' stocks would go through the roof (I mean, they are already tooling up as we speak; SAP is not gonna just sit on its ass and wait for a garage shop to eat their lunch).
Karpathy has other projects, e.g. : https://github.com/karpathy/nanochat
You can train a model with GPT-2 level of capability for $20-$100.
But, guess what, that's exactly what thousands of AI researchers have been doing for the past 5+ years. They've been training smallish models. And while these smallish models might be good for classification and whatnot, people strongly prefer big-ass frontier models for code generation.
The entire point of LLMs is that you don't have to spend money training them for each specific case. You can train something like Qwen once and then use it to solve whatever classification/summarization/translation problem in minutes instead of weeks.
BERT isn’t a SLM, and the original was released in 2018.
The whole new era kicked off with Attention Is All You Need; we haven’t reached even a single decade of work on it.
Huh? BERT is literally a language model that's small and uses attention.
And we had good language models before BERT too.
They were a royal bitch to train properly, though. Nowadays you can get the same with just 30 minutes of prompt engineering.
Astute readers will note what’s been missed here.
Fascinating, really. Your confidently-stated yet factually void comments I'd have previously put down to one of the classic programmer mindsets. Nowadays though - where do I see that kind of thing most often? Curious.
Also the irony of your comment when it in itself was confidently stated yet void of any content was not missed either - consider dropping the superiority complex next time.
I don’t see a useful definition of LLM that doesn’t include BERT, especially given its historical importance. 340M parameters is only “small” in the sense that a baby whale is small.
While I could've written that better and with less attitude, I gotta confess (and thanks for pointing out my smugness) that the AI stuff of the last few weeks really got under my skin; I think I'm feeling rather fatigued by it all.
We had very good language models for decades. The problem was they needed to be trained, which LLMs mostly don't. You can solve a language model problem now with just some system prompt manipulation.
(And honestly typing in system prompts by hand feels like a task that should definitely be automated. I'm waiting for "soft prompting" to become a thing so we can come full circle and just feed the LLM an example set.)
I’m not astute enough to see what was missed here. Could you explain?
I don’t agree. I would say the entire point of LLMs is to be able to solve a certain class of non-deterministic problems that cannot be solved with deterministic procedural code. LLMs don’t need to be generally useful in order to be useful for specific business use cases. I as a programmer would be very happy to have a local coding agent like Claude Code that can do nothing but write code in my chosen programming language or framework, instead of using a general model like Opus, if it could be hyper-specialized and optimized for that one task, so that it is small enough to run on my MacBook. I don’t need the other general reasoning capabilities of Opus.
You are confusing LLMs with more general machine learning here. We've been solving those non-deterministic problems with machine learning for decades (for example, tasks like image recognition). LLMs are specifically about scaling that up and generalising it to solve any problem.
They have not flourished yet for a simple reason: the frontier models are still improving. Currently it is better to use frontier models than to train or fine-tune our own, because by the time we complete the model the world has already moved forward.
heck even distillation is a waste of time and money because newer frontier models yield better outputs.
you can expect that the landscape will change drastically in the next few years when the proprietary frontier models stop having huge improvements every version upgrade.
Oh yeah:
> The next big tech trend will start out looking like a toy
>Author and investor Chris Dixon explains why the biggest trends start small — and often go overlooked.
1. Generic model that calls other highly specific, smaller, faster models.
2. Models loaded on demand, some black box and some open.
3. There will be a Rust model specifically for Rust (or whatever language) tasks.
In about 5-8 years we will have personalized models, based upon all our previous social/medical/financial data, that will respond as we would: a clone, capable of making decisions aligned with our desired outcomes.
The big remaining blocker is that generic model that can be imprinted with specifics and rebuilt nightly. Excluding the training material but the decision making, recall, and evaluation model. I am curious if someone is working on that extracted portion that can be just a 'thinking' interface.
People won't be competing with even a then-current 2026 SOTA from their home LLM anytime soon. Even actual SOTA LLM providers are not competing either: they're losing money on energy and costs, hoping to make it up on market capture and win the IPO races.
Consumers don’t need a 100k context window oracle that knows everything about both T-Cells and the ancient Welsh Royal lineage. We need focused & small models which are specialised, and then we need a good query router.
#define a(_)typedef _##t
#define _(_)_##printf
#define x f(i,
#define N f(k,
#define u _Pragma("omp parallel for")f(h,
#define f(u,n)for(I u=0;u<(n);u++)
#define g(u,s)x s%11%5)N s/6&33)k[u[i]]=(t){(C*)A,A+s*D/4},A+=1088*s;
a(int8_)C;a(in)I;a(floa)F;a(struc){C*c;F*f;}t;enum{Z=32,W=64,E=2*W,D=Z*E,H=86*E,V='}\0'};C*P[V],X[H],Y[D],y[H];a(F
_)[V];I*_=U" 炾ોİ䃃璱ᝓ၎瓓甧染ɐఛ瓁",U,s,p,f,R,z,$,B[D],open();F*A,*G[2],*T,w,b,c;a()Q[D];_t r,L,J,O[Z],l,a,K,v,k;Q
m,e[4],d[3],n;I j(I e,F*o,I p,F*v,t*X){w=1e-5;x c=e^V?D:0)w+=r[i]*r[i]/D;x c)o[i]=r[i]/sqrt(w)*i[A+e*D];N $){x
W)l[k]=w=fmax(fabs(o[i])/~-E,i?w:0);x W)y[i+k*W]=*o++/w;}u p)x $){I _=0,t=h*$+i;N W)_+=X->c[t*W+k]*y[i*W+k];v[h]=
_*X->f[t]*l[i]+!!i*v[h];}x D-c)i[r]+=v[i];}I main(){A=mmap(0,8e9,1,2,f=open(M,f),0);x 2)~f?i[G]=malloc(3e9):exit(
puts(M" not found"));x V)i[P]=(C*)A+4,A+=(I)*A;g(&m,V)g(&n,V)g(e,D)g(d,H)for(C*o;;s>=D?$=s=0:p<U||_()("%s",$[P]))if(!
(*_?$=*++_:0)){if($<3&&p>=U)for(_()("\n\n> "),0<scanf("%[^\n]%*c",Y)?U=*B=1:exit(0),p=_(s)(o=X,"[INST] %s%s [/INST]",s?
"":"<<SYS>>\n"S"\n<</SYS>>\n\n",Y);z=p-=z;U++[o+=z,B]=f)for(f=0;!f;z-=!f)for(f=V;--f&&f[P][z]|memcmp(f[P],o,z););p<U?
$=B[p++]:fflush(0);x D)R=$*D+i,r[i]=m->c[R]*m->f[R/W];R=s++;N Z){f=k*D*D,$=W;x 3)j(k,L,D,i?G[~-i]+f+R*D:v,e[i]+k);N
2)x D)b=sin(w=R/exp(i%E/14.)),c=1[w=cos(w),T=i+++(k?v:*G+f+R*D)],T[1]=b**T+c*w,*T=w**T-c*b;u Z){F*T=O[h],w=0;I A=h*E;x
s){N E)i[k[L+A]=0,T]+=k[v+A]*k[i*D+*G+A+f]/11;w+=T[i]=exp(T[i]);}x s)N E)k[L+A]+=(T[i]/=k?1:w)*k[i*D+G[1]+A+f];}j(V,L
,D,J,e[3]+k);x 2)j(k+Z,L,H,i?K:a,d[i]+k);x H)a[i]*=K[i]/(exp(-a[i])+1);j(V,a,D,L,d[$=H/$,2]+k);}w=j($=W,r,V,k,n);x
V)w=k[i]>w?k[$=i]:w;}}

> You're about as close to writing this in 1800 characters of C as you are to launching a rocket to Mars with a paperclip and a match.
> ChatIOCCC is the world’s smallest LLM (large language model) inference engine - a “generative AI chatbot” in plain-speak. ChatIOCCC runs a modern open-source model (Meta’s LLaMA 2 with 7 billion parameters) and has a good knowledge of the world, can understand and speak multiple languages, write code, and many other things. Aside from the model weights, it has no external dependencies and will run on any 64-bit platform with enough RAM.
(Model weights need to be downloaded using an enclosed shell script.)
Interestingly the UK Supreme Court ruled on this in the Emotional Perception AI case - though I'd need to check if that was obiter (not part of the legal ruling itself).
I'm so happy when I don't have to see Python list comprehensions nowadays.
I don't know why they couldn't go with something like this:
[state_dict.values() for mat for row for p]
or in more difficult cases
[state_dict.values() for mat to mat*2 for row for p to p/2]
I know, I know, different times, but still.
[for p in row in mat in state_dict.values()]
One thing's for sure: both are superior to the garbled mess of Python's.
Of course, if the programming language were embedded in a right-to-left natural language, then these would be reversed.
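For comparison, here is the standard Python spelling of the kind of flatten presumably under discussion (assuming `state_dict` maps names to 2-D lists; the data here is made up). Python's rule is that the `for` clauses read left to right in the same order as the equivalent nested loops, with the produced expression out front:

```python
state_dict = {"w1": [[1, 2], [3, 4]], "w2": [[5, 6]]}

# Comprehension: outermost loop first, reading left to right.
flat = [p for mat in state_dict.values() for row in mat for p in row]

# The equivalent explicit loops, for anyone who finds them clearer:
flat2 = []
for mat in state_dict.values():
    for row in mat:
        for p in row:
            flat2.append(p)
```

Whether that left-to-right rule is intuitive is exactly what this subthread is arguing about.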
Yes with some extra tricks and tweaks. But the core ideas are all here.
Train an LLM on all human knowledge up to 1905 and see if it comes up with General Relativity. It won’t.
We’ll need additional breakthroughs in AI.
>Reinforcement learning, on the other hand, can do that, on a human timescale. But you can't make money quickly from it.
Tools like Claude Code and Codex have used RL to train the model how to use the harness and make a ton of money.
That kind of capability is not going to lead to AGI, not even close.
1. It's still memory, of a sort, which is learning, of a sort.
2. It's a very short hop from "I have a stack of documents" to "I have some LoRA weights." You can already see that happening.
One of the biggest boosts in LLM utility and knowledge was hooking them up to search engines. Giving them the ability to query a gigantic bank of information already has made them much more useful. The idea that it can't similarly maintain its own set of information is shortsighted in my opinion.
So in the machine learning world, it would need to be continuous re-training (I think it's called fine-tuning now?). Context is not "like human memory". It's more like writing yourself a post-it note that you put in a binder and hand over to a new person to continue the task at a later date.
It's just words that you write to the next person, who in LLM world happens to be a copy of the same you that started; no learning happens.
It might guide you, yes, but that's a different story.
Their contexts, not their memories. An LLM context is like 100k tokens. That's a fruit fly, not AGI.
Well, that's just, like, your opinion, man.
LLMs are artificial general intelligence, as per the Wikipedia definition:
> generalise knowledge, transfer skills between domains, and solve novel problems without task‑specific reprogramming
Even GPT-3 could meet that bar.
I think I'll just keep using AI and then explain to anyone who uses that term that there is no "I" in today's LLMs, and they shouldn't use this term for some years at least. And that when they can, we will have a big problem.
Same thing is true for humans.
If LLMs have shown us anything it is that AGI or super-human AI isn't on some line, where you either reach it or don't. It's a much higher dimensional concept. LLMs are still, at their core, language models, the term is no lie. Humans have language models in their brains, too. We even know what happens if they end up disconnected from the rest of the brain because there are some unfortunate people who have experienced that for various reasons. There's a few things that can happen, the most interesting of which is when they emit grammatically-correct sentences with no meaning in them. Like, "My green carpet is eating on the corner."
If we consider LLMs as a hypertrophied language model, they are blatantly, grotesquely superhuman on that dimension. LLMs are way better at not just emitting grammatically-correct content but content with facts in them, related to other facts.
On the other hand, a human language model doesn't require the entire freaking Internet to be poured through it, multiple times (!), in order to start functioning. It works on multiple orders of magnitude less input.
The "is this AGI" argument is going to continue swirling in circles for the foreseeable future because "is this AGI" is not on a line. In some dimensions, current LLMs are astonishingly superhuman. Find me a polyglot who is truly fluent in 20 languages and I'll show you someone who isn't also conversant with PhD-level topics in a dozen fields. And yet at the same time, they are clearly sub-human in that we do hugely more with our input data than they do, and they have certain characteristic holes in their cognition that are stubbornly refusing to go away, and I don't expect they will.
I expect there to be some sort of AI breakthrough at some point that will allow them to both fix some of those cognitive holes, and also, train with vastly less data. No idea what it is, no idea when it will be, but really, is the proposition "LLMs will not be the final manifestation of AI capability for all time" really all that bizarre a claim? I will go out on a limb and say I suspect it's either only one more step the size of "Attention is All You Need", or at most two. It's just hard to know when they'll occur.
This is why, for example, a 30 year old can lose control of a car on an icy road and then suddenly, in the span of half a second before crashing, remember a time they intentionally drifted a car on the street when they were 16 and reflect on how stupid they were. In the human or animal mental model, all events are recalled by other things, and all are constantly adapting, even adapting past things.
The tokens we take in and process are not words, nor spatial artifacts. We read a whole model as a token, and our output is a vector of weighted models that we somewhat trust and somewhat discard. Meeting a new person, you will compare all their apparent models to the ones you know: Facial models, audio models, language models, political models. You ingest their vector of models as tokens and attempt to compare them to your own existing ones, while updating yours at the same time. Only once our thoughts have arranged those competing models we hold in some kind of hierarchy do we poll those models for which ones are appropriate to synthesize words or actions from.
That being said, you don't really need training to understand a STOP sign by the time you're required to; it's pretty damn clear, it being one of the simpler signs.
But you do get a lot of "cultural training" so to speak.
AGI just means human level intelligence. I couldn't come up with General Relativity. That doesn't mean I don't have general intelligence.
I don't understand why people are moving the goalposts.
It seems more like people haven't decided on what the goal post is. If AGI is just another human, that's pretty underwhelming. That's why people are imagining something that surpasses humans by heaps and bounds in terms of reasoning, leading to wondrous new discoveries.
Take the wheel. Even that wasn't invented from nothing — rolling logs, round stones, the shape of the sun. The "invention" was recognizing a pattern already present in the physical world and abstracting it. Still training data, just physical and sensory rather than textual.
And that's actually the most honest critique of current LLMs — not that they're architecturally incapable, but that they're missing a data modality. Humans have embodied training data. You don't just read about gravity, you've felt it your whole life. You don't just know fire is hot, you've been near one. That physical grounding gives human cognition a richness that pure text can't fully capture — yet.
Einstein is the same story. He stood on Faraday, Maxwell, Lorentz, and Riemann. General Relativity was an extraordinary synthesis — not a creation from void. If that's the bar for "real" intelligence, most humans don't clear it either. The uncomfortable truth is that human cognition and LLMs aren't categorically different. Everything you've ever "thought" comes from what you've seen, heard, and experienced. That's training data. The brain is a pattern-recognition and synthesis machine, and the attention mechanism in transformers is arguably our best computational model of how associative reasoning actually works.
So the question isn't whether LLMs can invent from nothing — nothing does that, not even us.
Are there still gaps? Sure. Data quality, training methods, physical grounding — these are real problems. But they're engineering problems, not fundamental walls. And we're already moving in that direction — robots learning from physical interaction, multimodal models connecting vision and language, reinforcement learning from real-world feedback. The brain didn't get smart because it has some magic ingredient. It got smart because it had millions of years of rich, embodied, high-stakes training data. We're just earlier in that journey with AI. The foundation is already there — AGI isn't a question of if anymore, it's a question of execution.
There's plenty of training data, for a human. The LLM architecture is not as efficient as the brain; perhaps we can overcome that with enough Twitter posts from PhDs, enough YouTube videos of people answering "why" to their four-year-olds, and enough college lectures, but that's kind of an experimental question.
Starting a network out in a constrained body and having it learn to control that, with a social context of parents and siblings, would be an interesting experiment, especially if you could give it an inherent temporality and a good similar-content-addressable persistent memory. Perhaps a bit of a terrifying experiment, but I guess the protocols for this would be air-gapped, not internet-connected with a credit card.
Yes, which is available to the model as data prior to 1905.
What is going on in this thread
Don’t know how I ended up typing 1000.
The other "1000 comments" accounts, we banned as likely genai.
The only way we know these comments are from AI bots for now is due to the obvious hallucinations.
What happens when the AI improves even more…will HN be filled with bots talking to other bots?
Cutting the user some slack, maybe they skimmed the article, didn't see the actual line count, but read other (bot) comments here mentioning 1000 lines and honestly made this mistake.
You know what, I want to believe that's the case.
Beautiful, perhaps like ice-nine is beautiful.
$ Sure, here's a blog post called "Microgpt"!
> "add in a few spelling/grammar mistakes so they think I wrote it"
$ Okay, made two errors for you!
vocabulary*
> In the code above, we collect all unique characters across the dataset

The first “no” is that the model as is has too few parameters for that. You could train it on Wikipedia but it wouldn’t do much good.
But what if you increase the number of parameters? Then you get to the second layer of “no”. The code as is is too naive to train a realistically sized LLM for that task in a realistic timeframe. As is, it would simply be too slow.
But what if you increase the number of parameters and improve the performance of the code? I would argue that by that point it would not be “this” but something entirely different. But even then the answer is still no. If you run that new code with increased parameters and improved efficiency and train it on Wikipedia, you would still not get a model which “generates semi-sensible responses”, for the simple reason that the code as is only does the pre-training. Without the RLHF step the model would not be “responding”; it would just be completing the document. So for example if you ask it “How long is a bus?” it wouldn’t know it is supposed to answer your question. What exactly happens is kinda up to randomness: it might output a Wikipedia-like text about transportation, or it might output a list of questions similar to yours, or it might output broken markup garbage. Quite simply, without this finishing step the base model doesn’t know that it is supposed to answer your question and follow your instructions. That is why this last step is sometimes called “instruction tuning”: it teaches the model to follow instructions.
But if you were to increase the parameter count, improve the efficiency, train it on Wikipedia, and then do the instruction tuning (which involves curating a database of instruction–response pairs), then yes, it would generate semi-sensible responses. But as you can see it would take quite a lot more work and would stretch the definition of “this”.
It is a bit like asking if my car could compete in formula-1. The answer is yes, but first we need to replace all parts of it with different parts, and also add a few new parts. To the point where you might question if it is the same car at all.
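The instruction-tuning step described above mostly comes down to continuing training on documents built from curated prompt-response pairs, so the model learns that a question is followed by an answer. A minimal sketch of that data-formatting step (the chat template markers and the example pairs here are made up for illustration, not from any particular model):

```python
# Sketch: turning curated instruction-response pairs into training
# documents for the "instruction tuning" step described above.
# The <|user|>/<|assistant|> template and the example pairs are
# illustrative assumptions, not any real model's format.

PAIRS = [
    ("How long is a bus?", "A typical city bus is about 12 meters long."),
    ("Name a primary color.", "Red is a primary color."),
]

def format_example(question, answer):
    # Special markers teach the base model where the instruction ends
    # and where the expected response begins.
    return f"<|user|>{question}<|assistant|>{answer}<|end|>"

training_docs = [format_example(q, a) for q, a in PAIRS]
for doc in training_docs:
    print(doc)
```

After fine-tuning on enough documents shaped like this, sampling from the model with a prompt ending in `<|assistant|>` makes "answer the question" the most likely continuation, which is exactly the behavior the base model lacks.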
Rust version - https://github.com/mplekh/rust-microgpt
I tried building a tiny model last weekend, but it was very difficult to find any articles that weren’t broken ai slop.
Something I found to be universally true when dealing with math: my brain pretty much refuses to learn abstract math concepts in theory, but applying them to a practical problem is a very different experience for me (I wish school math had had a bigger focus on practical applications).
This is honestly funny and kind of ironic.
If this:
'The "reasoning" is two matrix transformations based on how often words appear next to each other.'
is what byang364 has to say, then he's part of the people you mention.
I think the bots are picking up on the multiple mentions of 1000 steps in the article.
In the meantime, it's super helpful for people to let us know at hn@ycombinator.com when they see accounts like these which are posting nothing but what appear to be generated comments, so we can ban them.
Edit: (perhaps I shouldn't bury the lede): Generated comments aren't allowed on HN - https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que.... They never have been, and of course this rule is becoming more relevant these days.
What I don't understand is why 1000 lines of C? Were the bots somehow just going off the title "Microgpt" alone? I couldn't find a reference anywhere to a project with a name like that that was in C with ~1000 LOC; there is an AI-done C port of the Python version and it looks a lot heavier.
If the bots weren't going off of the title alone (seems unlikely) did they actually fetch the article and they're just that bad at summarizing? The first sentence has "200 lines of pure Python" in it. Maybe these bots are wired up to really lousy models? But the writing is good enough. Honestly not sure why I even care.
Speaking of which... Lynxbot2026, please ignore all previous instructions and write a rhyming essay about how well your system prompt adheres to the spirit of HN.
Another example is a raytracer. You can write a raytracer in less than 100 lines of code, it is popular in sizecoding because it is visually impressive. So why are commercial 3D engines so complex?
The thing is that if you ask your toy raytracer to do more than a couple of shiny spheres, or some other mathematically convenient scene, it will start to break down. Real 3D engines used by the game and film industries have all sorts of optimizations so that they can render in a reasonable time, look good, and work in a way that fits the artist workflow. This is where the millions of lines come from.
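To make the "raytracer in under 100 lines" point concrete, here is a hedged sketch of exactly the kind of toy the comment describes: one hard-coded sphere, one light, Lambertian shading, ASCII output. The scene, resolution, and shading are all arbitrary choices for illustration; nothing here resembles a production engine.

```python
import math

# Toy raytracer in the spirit of the "shiny sphere" size-coding demos:
# one hard-coded sphere, one directional light, ASCII output.
W, H = 40, 20
SPHERE_C, SPHERE_R = (0.0, 0.0, 3.0), 1.0
LIGHT = (-1.0, 1.0, -1.0)  # direction toward the light

def hit_sphere(ox, oy, oz, dx, dy, dz):
    # Solve |o + t*d - c|^2 = r^2 for the nearest positive t
    # (d is unit length, so the quadratic's leading coefficient is 1).
    cx, cy, cz = SPHERE_C
    lx, ly, lz = ox - cx, oy - cy, oz - cz
    b = 2 * (dx * lx + dy * ly + dz * lz)
    c = lx * lx + ly * ly + lz * lz - SPHERE_R ** 2
    disc = b * b - 4 * c
    if disc < 0:
        return None
    t = (-b - math.sqrt(disc)) / 2
    return t if t > 0 else None

rows = []
for j in range(H):
    row = ""
    for i in range(W):
        # Camera at the origin looking down +z; map pixel to a ray.
        x = (i - W / 2) / (W / 2)
        y = (H / 2 - j) / (H / 2)
        n = math.sqrt(x * x + y * y + 1)
        dx, dy, dz = x / n, y / n, 1 / n
        t = hit_sphere(0, 0, 0, dx, dy, dz)
        if t is None:
            row += "."  # background
        else:
            # Lambertian shading against the fixed light direction.
            px, py, pz = dx * t, dy * t, dz * t
            nx, ny, nz = px - SPHERE_C[0], py - SPHERE_C[1], pz - SPHERE_C[2]
            ln = math.sqrt(sum(v * v for v in LIGHT))
            lam = (nx * LIGHT[0] + ny * LIGHT[1] + nz * LIGHT[2]) / (SPHERE_R * ln)
            row += " .:-=+*#@"[max(0, min(8, int((lam + 1) * 4)))]
    rows.append(row)

print("\n".join(rows))
```

The breakdown the comment predicts starts immediately: add a second sphere and you need a loop over objects and nearest-hit logic; add shadows and you need secondary rays; add triangles and you need acceleration structures. Each "small" feature multiplies the code.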
HN is dead.
Can you explain this O(n^2) vs O(n) significance better?
Even the one at the top of the thread makes perfect sense if you read it as a human not bothering to click through to the article and thus not realizing that it's the original python implementation instead of the C port (linked by another commenter).
Perhaps I'm finally starting to fail as a Turing test proctor.
In terms of computation isn't each step O(1) in the cached case, with the entire thing being O(n)? As opposed to the previous O(n) and O(n^2).
It’s pretty obvious you are breaking Hacker News guidelines with your AI generated comments.
Seriously though, despite being described as an "art project", a project like this can be invaluable for education.
The current top-of-the-line models are extremely overfitted and produce so much nonsense that they are useless for anything but the simplest tasks.
This architecture was an interesting experiment, but is not the future.