This has been my dream ever since. Instead of encoding "all the knowledge" into those parameters, how about just making a model that has the same size, but all (or rather most) it does is reasoning? Just give it the ability to browse the net (e.g. language specifications, documentation and best practices) and just have it do its thing. Why does my coding agent need to know the population of New York, a cheesecake recipe or the general lifespan of an ostrich? Just give it the bare minimum knowledge to think and reason about, and let it figure out the rest.

Sadly that's not how LLMs work, since all they do is "token prediction". At least the models we have today ...

reply
I think this is a well-known concept which we can't deliver yet. LLMs/transformers give us a reasoning engine as a byproduct of their design, but it is quite inefficient. If we can distill reasoning, if reasoning can be achieved without general knowledge, it will be a very efficient machine.

Some amount of knowledge is required for reasoning. Maybe such a model could dynamically load knowledge domains organized in a taxonomy. For example, a model can't reason effectively about a development task if it has no knowledge of development best practices. But the population of New York or recipes can definitely be loaded at run time with tools.

reply
>Some amount of knowledge is required for reasoning.

This is the root of the problem. If you think about STEM universities, they don't really teach you things you need in the real world. They teach you what you need to know in order to go out there and accumulate the necessary information which can then be used to solve problems. Giving a person access to the internet or a super powerful calculator (like Mathematica) won't mean that they can do anything useful. They need tons of experience to use these tools in an effective way. That experience is basically all that implicit adjacent knowledge that we pick up along the way getting our degrees. And LLMs pick that up during pre-training. Drop this part and the outcome will be worthless.

reply
Take mathematics as an example. Humanity invented math notation, which made it possible to express math rules distilled to their core. Before that, math was expressed in prose, a very inefficient way, much like current LLMs.

At school, my math teacher gave me prose, which I converted to math notation. I would argue that this prose→reasoning conversion is not required at training time and can be obtained at inference time with search tools.

reply
Yup, you still need knowledge. Even if you have access to all the data and tools, you still need to know what to search for, what tools to use and to understand what the user is asking.

Our computers can already do everything and have access to all the tools and information, yet they still need a human/intelligence to use them and apply them to specific problems.

Even defining the problem requires knowledge.

As for the tools, if the model has access to 1000 tools, how would it know which one to use if it doesn't have any knowledge itself?

What if I ask about "table tennis spin" and it has a "Magnus effect calculator"? How would it know to make the connection between the two?
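
To make the lookup problem concrete, here is a minimal sketch of the most naive tool selection: score each tool by word overlap between the request and the tool's name and description. The tool names and descriptions are made up for illustration, and real agents usually use embeddings or the model itself rather than word overlap. The point is only that the link from "table tennis spin" to a Magnus effect tool has to come from somewhere.

```python
import re

# Hypothetical tool registry; names and descriptions are invented for this example.
TOOLS = {
    "magnus_effect_calculator": "Compute the lift force on a rotating sphere moving through air.",
    "unit_converter": "Convert between metric and imperial units.",
    "recipe_search": "Search a database of cooking recipes.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rank_tools(query, tools):
    """Rank tools by how many words they share with the query (highest first)."""
    query_words = tokenize(query)
    scores = {
        name: len(query_words & (tokenize(name) | tokenize(description)))
        for name, description in tools.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# No tool shares a word with this request, so every score is zero.
print(rank_tools("table tennis spin", TOOLS))

# Once the request is rephrased with the relevant physics vocabulary,
# the Magnus effect tool ranks first.
print(rank_tools("force on a spinning ball (rotating sphere) in air", TOOLS))
```

The rephrasing in the second call is exactly the piece of knowledge a "reasoning only" model would have to get from somewhere before it could pick the tool.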

reply
The model can use tools to get that knowledge. In your example, it can read the Wikipedia page about table tennis. Imagine a reasoning engine with a big enough context that knows nothing. A path built from first principles to understand "table tennis spin" does not look very long to me.
reply
How would it know about Wikipedia and when to use it? From the tool description? If we had 100k such tools, then that wouldn't even fit in the context.

This is only one example, plus if the topic is more complex, maybe it would have to search/learn everything (what is table tennis, what is spin, what is a human, what is a ball, etc.). So it would be like spawning a baby human, have it spend an (instant) life learning about the world before providing an answer. Maybe this could work in 10 years, if models get stronger with huge context lengths and almost instant data retrieval. Is it the best way to go about things though? Most animals have most of their core abilities embedded in their DNA and "instincts". A cat doesn't have to learn what a bird is in order to hunt it; it's already "embedded" in its neural pathways, or even deeper, at a full-body level. Those types of systems are a lot more efficient than the learned ones. Maybe the best future AI will have everything already embedded, instead of just being a strong reasoning machine. All AI responses would be instant, like "reflexes", instead of reasoned steps.

reply
Imagine you only know how to cook (use fry pan skill) and know how to cook an omelette (recipe). You get the task to cook doner kebab. How many Wikipedia pages do you need to read to get a good understanding? I guess it's 5 at most.

I think grounding your abstract problem in an example makes it more trivial than it sounds in general.

> How would it know about Wikipedia and when to use it?

Two general concepts, "You have to get a good understanding of the subject area before you act" + "Wikipedia is a good source of knowledge about subject areas", will get a model there.

> spawning a baby human, have it spend an (instant) life learning

Humans spend 99% of their life on boring repeating tasks, not learning anything, just navigating on heuristics.

reply
>Doner kebab or döner kebab is a Turkish

(what is Turkish) -> (parse lots of potentially relevant/irrelevant context because I have no way of knowing which, if any, of this informs the doner kebab before I've looked at it)

>dish made of meat

(what is meat) -> (parse lots of potentially ir/relevant context because I don't know if the specific origin/chemistry/mechanics or whether Maillard reactions are important before I learn about them)

>cooked on a vertical rotisserie.

(what is a rotisserie) -> etc etc etc

Seems significantly less efficient than just having the various (how to cook > meat, tools > rotisserie, how to cook > seasoning > tomato; lettuce; cabbage; onion with sumac; fresh or pickled cucumber or chili; various sauces, etc) just already built in to the weights.

reply
I'm just playing devil's advocate here.

Yes, but still "how to cook" is not atomic. It involves knowing how to move stuff, how to measure, what "cooked" looks like in different environments (e.g. different lighting) or with variations in ingredients, and how to recover from specific failures (e.g. a good cook can fix accidentally adding too much salt by counter-balancing with an ingredient that absorbs the extra salt). And this is only one skill.

It's a bit like how deep image neural nets work: simply detecting shape primitives is not enough; the net also encodes the connections and relations between those primitives.

Even saying the AI should just have the "cooking" or "coding" skill trivializes the problem.

> Humans spend 99% of their life on boring repeating tasks

But we are also unconsciously learning about the world non-stop, from the analog stream of inputs and from seeing the immediate result/feedback. Even looking at a static picture is like over-training on a specific dataset.

reply
Just boiling water would be difficult. Do I just add heat until I see bubbles? Or should I have a world model in which I understand that water boils at different temperatures at different altitudes, and that other liquids boil at different temperatures too?

Because if the recipe just says "boil for 10 minutes" but the thing being cooked really needs a temperature of 212F for 10 minutes, the thing isn't going to be cooked if you're not actually at 212 for 10.
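
A back-of-the-envelope sketch of that gap, using the common kitchen rule of thumb that the boiling point of water drops by roughly 1°F for every 500 ft of elevation. That rule is an approximation I am assuming here, not an exact physical model, and the function names are just for illustration.

```python
SEA_LEVEL_BOILING_F = 212.0
# Assumption: rule-of-thumb drop of about 1 degree F per 500 ft of elevation.
FEET_PER_DEGREE_F = 500.0


def boiling_point_f(altitude_ft):
    """Approximate boiling point of water in degrees F at a given altitude in feet."""
    return SEA_LEVEL_BOILING_F - altitude_ft / FEET_PER_DEGREE_F


def boiling_reaches(target_f, altitude_ft):
    """True if boiling water at this altitude is at least as hot as target_f."""
    return boiling_point_f(altitude_ft) >= target_f


print(boiling_point_f(5000))         # 202.0
print(boiling_reaches(212.0, 5000))  # False
```

So at 5000 ft, "boil for 10 minutes" keeps the food at about 202°F, not 212°F, and nothing in the recipe text tells you that.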

reply
It probably can't use all the truth in its context window, not yet anyway.

E.g. you put a graph in its context window, and you ask it to find a Hamiltonian cycle, can it do it?

Probably this could be a next step in the future for more powerful AIs: a layer that abstracts away the facts in its context window, and a layer that solves these types of abstractions.
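
For reference, this is roughly what the "solver layer" would have to do for the Hamiltonian cycle case. It is a minimal backtracking sketch, assuming the graph has already been abstracted into an adjacency dict. The function name and the example graphs are mine, and the search is exponential in the worst case.

```python
def hamiltonian_cycle(adjacency):
    """Return a cycle visiting every vertex exactly once, or None if none exists.

    adjacency maps each vertex to an iterable of its neighbours. The returned
    list starts and ends at the same vertex.
    """
    vertices = list(adjacency)
    if not vertices:
        return None
    start = vertices[0]
    path = [start]
    visited = {start}

    def extend():
        # Every vertex is on the path: succeed only if we can close the cycle.
        if len(path) == len(vertices):
            return start in adjacency[path[-1]]
        for neighbour in adjacency[path[-1]]:
            if neighbour not in visited:
                visited.add(neighbour)
                path.append(neighbour)
                if extend():
                    return True
                # Dead end: undo the move and try the next neighbour.
                path.pop()
                visited.remove(neighbour)
        return False

    return path + [start] if extend() else None


square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}

print(hamiltonian_cycle(square))  # ['a', 'b', 'c', 'd', 'a']
print(hamiltonian_cycle(star))    # None
```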

reply
This is me vibe-splaining something I don't know a lot about, but I doubt there is such a thing.

If "all the knowledge" is what our models now do, what exactly would be the most extreme "none of the knowledge + search"?

> language specifications.

It would load in all the knowledge to figure out what "language" means, then it would continue trying to decode what "specifications" means.

That might sound absurd, but to figure out the population of New York it's either going to just google it, or derive it from primary sources.

But how is it ever going to interpret the primary sources? It needs to understand the question, how complex a question is, how complete an answer is, and how things relate. That's just _too_ much language.

There might be a way to compact this down into an LLM-native language such that the request of `the population of New York` or `use best practices` is encoded without our messy human language for a reasoning model to work with, but the encoding itself has to be done by the "all the knowledge" LLM. Now it seems we just rebuild something related to MoE with extra steps, afaict.

reply
Education had this sad 15-year period where it thought “competences” are all you need.

Turns out that without the world knowledge to provide a base of facts, they are not.

reply
Basically: you can't teach people to think without giving them some facts and ideas to think with. It's like trying to teach woodworking without giving the students any wood.
reply
“Theoretical tennis”
reply
Competences were always supposed to be supported by demonstrable knowledge and skills and behavior.

So I don't think it's true that relevant knowledge was deprioritized. At least it wasn't supposed to be.

reply
It would also reduce training costs to nothing. Current methodology requires continual retraining to scoop up new facts. If you can do a one time "this is how to think" - that could conceptually work forever, just plug in a new database layer that can be queried as required.
reply
But isn’t that what “training” is anyway? They train LLMs today like that, and the database becomes the parameters. You can post-train on a smaller corpus for purpose-built stuff.
reply
Any sufficiently general superintelligence will deduce the existence of rice pudding and income tax from Cartesian first principles.
reply
I mean, this really doesn't sound useful even if LLMs worked that way.

First, if you know nothing you don't even know what you're missing or what to search for.

Then, without unlimited context, you have to do research for every task all over again every time.

reply
> First, if you know nothing you don't even know what you're missing or what to search for.

RAG on the initial prompt would be the first thing to try.

> Then, without unlimited context, you have to do research for every task all over again every time.

Thing is, we're really really good at building very fast search engines. Doing research all over again every time shouldn't be a problem.
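
A toy version of "RAG on the initial prompt", just to show the shape of it: build an inverted index over some documents, retrieve the ones sharing the most words with the prompt, and prepend them as context. The documents are placeholder text I made up, and a real setup would more likely use embeddings or a proper search engine than plain word counts.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def retrieve(query, index, k=2):
    """Return up to k document ids sharing the most distinct words with the query."""
    hits = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


def augment(prompt, docs, index):
    """Prepend the retrieved documents to the prompt as context."""
    context = "\n".join(docs[doc_id] for doc_id in retrieve(prompt, index))
    return f"Context:\n{context}\n\nQuestion: {prompt}"


# Placeholder documents, invented for this example.
DOCS = {
    "nyc": "New York City population estimate and census data.",
    "ostrich": "Ostrich lifespan and habitat.",
    "ts": "TypeScript language specification: types and generics.",
}
INDEX = build_index(DOCS)

print(retrieve("What is the population of New York?", INDEX))  # ['nyc']
print(augment("What is the population of New York?", DOCS, INDEX))
```

Note that the lookup is cheap; the open question from the parent comment is whether the prompt contains the right words to search for in the first place.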

reply
Couldn't you build some internal knowledge that would stay, and teach a model this way? A very fast local memory of some sort. You could also specialize a model this way so it is very skilled in your domain. The more you use it, the smarter it gets. I guess the problem is for the model to decide whether the information stored in memory is sufficient or not.
reply
You could, but it's driving in the wrong direction to try to build that knowledge into the model weights because you'll always run into a capacity limit sooner with a small model than with a larger one. The thing the model is specialised for is linguistic understanding and the reasoning process itself, and you max that out at the expense of domain-specific knowledge. If you take "as few weights as possible" as a given, I think the interesting question is how small you can make the model with externalised memory. The openclaw and hermes people are all over this sort of memory problem: using the local filesystem or a local database of some sort is exactly a "very fast local memory" where the more you use it, the more knowledge it gathers. Whether that translates to it being "smarter" is a deeper question than it looks.
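
A minimal sketch of that kind of externalised memory, assuming a local SQLite database as the store. The class, the table layout, and the `research` callback are all hypothetical; this is not how any particular project does it. Research results are written back, so the second request for the same topic never leaves the machine.

```python
import sqlite3


class LocalMemory:
    """Tiny key/value note store backed by SQLite (in-memory by default)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS facts (topic TEXT PRIMARY KEY, note TEXT)"
        )

    def remember(self, topic, note):
        """Store or overwrite the note for a topic."""
        self.db.execute("INSERT OR REPLACE INTO facts VALUES (?, ?)", (topic, note))
        self.db.commit()

    def recall(self, topic):
        """Return the stored note for a topic, or None if nothing is stored."""
        row = self.db.execute(
            "SELECT note FROM facts WHERE topic = ?", (topic,)
        ).fetchone()
        return row[0] if row else None


def answer(topic, memory, research):
    """Use memory when possible; otherwise call research() and store the result."""
    note = memory.recall(topic)
    if note is None:
        note = research(topic)  # the slow path: search, browse, read docs
        memory.remember(topic, note)
    return note


calls = []


def fake_research(topic):
    """Stand-in for an expensive lookup; records how often it is called."""
    calls.append(topic)
    return f"notes about {topic}"


memory = LocalMemory()
answer("rotisserie", memory, fake_research)
answer("rotisserie", memory, fake_research)
print(len(calls))  # 1
```

This only handles exact topic matches. Deciding whether what is stored is sufficient for a new, slightly different task is the hard part mentioned above, and the sketch does not address it.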
reply
The model they built knows a fair bit apparently. You can't get 94.3 on AIME26 knowing nothing.
reply
Reasoning alone can’t always predict all the bits of knowledge you’d need to sufficiently solve a problem, that you would research when planning.
reply
Because reasoning is an emergent byproduct of training it on all knowledge. It still doesn't "know" things in this form and just generates tokens, no matter how weird we spin it.

So if you don't train it on a large dataset of a lot of words with a lot of sensible connections, it won't be able to reason, as it won't be able to make proper connections between words and sentences.

You can try training a really small model and seeing the gibberish outputs when you train it on only a small dataset.

Minmaxing the dataset to extract maximum generation with minimal data does sound like fun, but if you want to build SoTA models as a company, the economic tradeoff of doing that vs slapping a few more GPUs together is terrible.

reply
I think small expert models could be pretty powerful from open weight providers.

Imagine, for example, a model that's primarily trained on TypeScript and general programming. It would be faster to train and it could be a lot smaller than a generalist model. It might be the best model to pick when you are doing TypeScript programming. And if you could squeeze that into 3B parameters, a lot of consumer hardware could run it locally.

You could even expand it to just "webdev tech" or the like.

reply
I think you could probably train a model to consider boolean logic, modal logic, and mathematics reasonably well, but there is still a pretty big leap between that and thinking about things.

Even the most basic questions such as put a ball in a cup and place it on a table upside down then pick up the cup and put it in a box.

Requires knowledge of things not mentioned in the question (notably gravity).

Strict definition of all terms quickly gets you into a quagmire of complexity. Some base level of knowledge about things is required for you to give it instructions. If it only knows how to reason, it lacks any idea of what to aim to achieve.

There is quite a pronounced disconnect between the vast stores of written data that models are trained on and robust consideration of a topic. I do wonder if the path can be directed by the order of training.

For example, if you train a model to basic literacy using tinystories, then math and philosophy texts, then psychology and sociology texts, and then finally the mass data of everything from conversations and rants to code and fiction.

Does that end up with a significantly different model to one that is trained on books on acting, creative writing, and fantasy novels, before introducing the same final mass data set?

How much does its current ability allow it to contextualise new training data?

reply
>Even the most basic questions such as put a ball in a cup and place it on a table upside down then pick up the cup and put it in a box.

That reminds me - this used to be my go-to question for smaller models, on which they would always fail miserably:

A small strawberry is placed in a large cup. The cup is placed upside down on the kitchen table. Someone then lifts the cup as-is and puts it in the microwave. Where is the strawberry when the cup is in the microwave?

Here's what the 1.9GB VibeThinker-3B-GGUF:Q4_K_M answered:

Answer: The strawberry is still on the kitchen table – it fell out when the cup was turned upside‑down, and the subsequent lift‑and‑microwave move doesn’t change that.

So it seems there is definite progress here. Both specialized and yet improved common sense on things outside its domain of specialization.

reply
Is that learned common sense or has it learned the structure of that particular problem?

What happens if you ask

A small strawberry is placed in a large cup. The cup is placed upside down on a saucer on the kitchen table. Someone then lifts the cup and saucer as-is and puts them in the microwave. Where is the strawberry when the cup is in the microwave?

reply
The hard part was always the number of 'r's
reply
> Even the most basic questions such as put a ball in a cup and place it on a table upside down then pick up the cup and put it in a box.

I do not think this is a great example. First, it is not a question. Second, it seems very related to robotics. A model itself cannot put a ball anywhere, it can just call tools and answer in text, image, etc.

An LLM seeing "put an x in a y and place it on a z upside down then pick up the y and put it in a z2." and then a question about what happens could check a RAG for properties of those x, y, z, z2 and still answer. Alternatively, this could be useful for coding, for example. And that is a very extreme example. Some basic language plus tool use could go quite far. I think it is a very interesting direction vs. "here is a GPU the price of a car".

reply
I wasn't explicitly stating the question; I was paraphrasing a common test question for world knowledge.

That you don't need to have a ball, cup, table, or even the ability to perform physical actions in order to consider where the ball ends up is in itself required knowledge.

reply
The thing is we tried that for decades, using more formal logic to build reasoning engines. And we never got it to be even a fraction as good and generic as learning-based LLMs are today.
reply
I don't think my point is getting across. This is in the context of how much world knowledge a model needs to be trained on, not LLM vs. not LLM.
reply
I have been obsessed with this idea for a while; there's a Qwen with Opus reasoning distilled that works nicely as well. I think the next frontier is optimizing the models to be more capable on less hardware, especially if they can learn on the fly.
reply
other way around. it's trained to generate long CoT to reason through problems (and does it well!) but has ~no tool calling capability, and ~no ability to manage more than 1-2 messages.

see the warning at the top of https://huggingface.co/WeiboAI/VibeThinker-3B

reply
The smaller the models are, the longer they have to reason when dealing with complex problems. The trade-off is real.
reply
"The right tools" in this case might presumably include, eg, a set of repos + docs and specs on the various technologies being used. Or a library of text/images and background docs on style and techniques use to create them.

That plus this model should give you a very powerful and focussed assistant.

reply
Sure, it is small, 3B. But on a Pi Zero, I can tell you from my experience, you'll be disappointed.
reply
> Am I right in thinking this is a tiny model which has been trained well to reason, and that's it?

I remember Karpathy mentioning this on the Dwarkesh podcast. But is reasoning really possible without all the knowledge?

reply
Even Karpathy acknowledged that this would require some baseline of human knowledge. The idea wasn't pure logic/reasoning, but some subset to bootstrap from.
reply
If you are choosing between a model that can only "reason" and a model that has extensive knowledge and "reasoning", the latter will be undeniably better. Advanced reasoning requires cross-domain knowledge and superb pattern recognition, which can only be gained through the same mechanisms that give you a knowledgeable model.

Except for the most basic of tasks, such as "turn on my lights" or "cross-reference these two lists", I wouldn't trust a small model to be as conscientious and reliable as one with deep knowledge.

reply
Yeah, but don't you think that's an oversimplification with the metaphor, if we assume this model can do a smart human-level analysis and distillation of knowledge? I mean, if that were true (i.e. it's just like that), then yeah, there is no need for massive models, but I really doubt that.

Even recent massive models do not work anything like a smart human does at the moment so why are we assuming this can?

reply