> Even a mouse needs hundreds of millions of neurons to do what a mouse does.

Under the very light assumption that a mouse doesn’t have neurons it doesn’t need, a mouse needs whatever number of neurons it has to do what a mouse does, so that’s not saying much.

Reading https://en.wikipedia.org/wiki/List_of_animals_by_number_of_n..., an ant has only 250k neurons and many reptiles can do with around 10 million.

That page also says 71 million for the house mouse. So what does a mouse do, that reptiles do not, that requires so much larger a brain? Caring for its young?

by musebox35 15 hours ago

Thanks for posting a thorough and accurate summary of the historical picture. I think it is important to know the past trajectory to extrapolate to the future correctly.

For a bit more context: before 2012, most approaches were based on hand-crafted features plus SVMs, which achieved state-of-the-art performance on academic competitions such as Pascal VOC, and neural nets did not look competitive on the surface. Around 2010, Fei-Fei Li of Stanford University collected a comparatively large dataset and launched the ImageNet competition. AlexNet cut the error rate nearly in half in 2012, leading major labs to switch to deeper neural nets. The success seems to have come from a combination of a large enough dataset and GPUs that made training time reasonable. The architecture is a scaled-up version of Yann LeCun's ConvNets, which ties into the bitter lesson that scaling is more important than complexity.

by coppsilgold 15 hours ago

Comparing Deep Learning with neuroscience may turn out to be erroneous. They may be orthogonal.

The brain likely has more in common with Reservoir Computing (sans the actual learning algorithm) than Deep Learning.

Deep Learning relies on end-to-end loss optimization, something which is much more powerful than anything the brain can be doing. But the end-to-end requirement is restricting, and credit assignment is a big problem.

Consider how crazy the generative diffusion models are: we generate the output in its entirety with a fixed number of steps, and the complexity of the output is irrelevant. If only we could train a model to just use Photoshop directly, but we can't.

Interestingly, there are some attempts at a middle ground where a variable number of continuous variables describe an image: <https://visual-gen.github.io/semanticist/>

by ACCount37 2 hours ago

Modern systems like Nano Banana 2 and ChatGPT Images 2.0 are very close to "just use Photoshop directly" in concept, if not in execution.

They seem to use an agentic LLM with image inputs and outputs to produce, verify, refine and compose visual artifacts. Those operations appear to be learned functions, however, not an external tool like Photoshop.

This allows for "variable depth" in practice. Composition uses previous images, which may have been generated from scratch, or from previous images.

by jvanderbot 15 hours ago

If you think a 2-year-old is doing deep learning, you're probably wrong. But if you think natural selection was providing end-to-end loss optimization, you might be closer to right. An _awful lot_ of our brain structure and connectivity is innate rather than learned, and that goes for Mice and Men.

by ACCount37 2 hours ago

Why not both? A pre-trained LLM has an awful lot of structure, and during SFT, we're still doing deep learning to teach it further. Innate structure doesn't preclude deep learning at all.

There's an entire line of work that goes "brain is trying to approximate backprop with local rules, poorly", with some interesting findings to back it.

Now, it seems unlikely that the brain has a single neat "loss function" that could account for all of the learning behaviors across it. But that doesn't preclude deep learning either. If the brain's "loss" is an interplay of many local and global objectives of varying complexity, it can still be a deep learning system at its core: still doing a form of gradient descent, with non-backpropagation credit assignment and all. Just not the kind of deep learning system any sane engineer would design.

by imtringued 4 hours ago

I don't know what you mean by end-to-end loss optimization in particular, but if you mean something that involves global propagation of errors, e.g. backpropagation, you are dead wrong.

Predictive coding is more biologically plausible because it uses local information from neighbouring neurons only.

by roenxi 11 hours ago

> If only we could train a model to just use Photoshop directly, but we can't.

It is probably coming. I get the impression, just from following the trend of progress, that internal world models are the hardest part. I was playing with Gemma 4 and it seemed to have a remarkable amount of trouble with the idea of going from its house to another house, collecting something and returning, starting part-way through where it was already at house #2. It figured it out, but it seemed to be working very hard with the concept, to a degree that was really a bit comical.

It looks like that issue is solving itself as text & image models start to unify and they get more video-based data that makes the object-oriented nature of physical reality obvious. Understanding spatial layouts seems like it might be a prerequisite to being able to consistently set up a scene in Photoshop. It is a bit weird that it seems pulling an image fully formed from the aether is statistically easier than putting it together piece by piece.

by antonvs 8 hours ago

> If only we could train a model to just use Photoshop directly, but we can't.

What kind of sadist would wish this on an intelligent entity?

by cdavid 11 hours ago

Indeed. I would add a third factor to compute and datasets: the lego-like aspect of NNs that enabled scalable OSS DL frameworks.

I did some ML in the mid-2000s, and it was a PITA to reuse other people's code (when available at all). You had some well-known libraries for SVM, for HMM you had to use HTK, which had a weird license, and otherwise looking at experiments required you to reimplement stuff yourself.

From the late 2000s on there was a lot of practical innovation that democratized ML: Theano and then tf/keras/pytorch for DL, scikit-learn for ML, etc. That ended up being important because you need a lot of tricks to make this work on top of a "textbook" implementation. E.g. if you implement the EM algorithm for GMM, you need to do it in log space to avoid underflow, and the same goes for DL (Glorot et al. initialization, etc.).
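To make the log-space point concrete, here is a minimal sketch of the E-step for a 1-D two-component GMM. The function names (`log_gaussian`, `logsumexp`, `responsibilities`) and the toy parameters are made up for illustration and are not taken from any particular library; the only trick shown is subtracting the maximum before exponentiating.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian, computed without exponentiating."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    """Stable log(sum(exp(v))): subtract the max before exponentiating."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def responsibilities(x, weights, means, variances):
    """E-step for one point: posterior probability of each component."""
    log_joint = [
        math.log(w) + log_gaussian(x, mu, var)
        for w, mu, var in zip(weights, means, variances)
    ]
    log_norm = logsumexp(log_joint)
    return [math.exp(lj - log_norm) for lj in log_joint]

# Hypothetical toy mixture and a point far from both components.
weights, means, variances = [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]
x = 60.0

# Textbook version: both weighted densities underflow to 0.0,
# so normalizing them would mean dividing 0 by 0.
naive = [
    w * math.exp(log_gaussian(x, mu, var))
    for w, mu, var in zip(weights, means, variances)
]

# Log-space version: still yields a valid posterior over components.
resp = responsibilities(x, weights, means, variances)
print(naive)
print(resp)
```

The naive densities are exactly zero in double precision here, while the log-space responsibilities still sum to one and favor the closer component.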

by alasdair_ 9 hours ago

I think your post may have more acronyms than any other post I have ever read on HN. Do you have a guide to which specific things you are talking about with each acronym? Deep Learning and Machine Learning are obvious, but some of the others I can’t follow at all; they could be so many different things.

by AgentMatt 7 hours ago

NN - neural networks

OSS DL frameworks - open source deep learning frameworks

PITA - pain in the ass

SVM - support vector machines

HMM - hidden Markov model

EM - expectation maximization

GMM - Gaussian mixture model

HTK - Hidden Markov Model Toolkit

by ButlerianJihad 9 hours ago

I think he maintains pinball machines and jukeboxes for a chain of Greek restaurants

by jesseab 10 hours ago

Remember watching Alec Radford's Theano tutorial and feeling like I had found literal gold.

by Sohakes 14 hours ago

> but I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning.

I feel like you are downplaying the importance of architecture. I never read the bitter lesson, but I have always heard it described more as a comment on embedding knowledge into models instead of just letting them scale with data. We know algorithmic improvement is very important for scaling NNs (see https://www.semanticscholar.org/paper/Measuring-the-Algorith...). You can't scale an architecture that has catastrophic forgetting embedded in it. It is not really a matter of tradeoffs; some architectures are simply worse in all aspects. What I agree with is just that architectures that scale better with data and compute do better. And sure, you can say that smaller architectures are better for smaller problems, but then the framing with the bitter lesson makes less sense.

by hodgehog11 14 hours ago

> Intelligence, even of a limited sort, seems to emerge only after crossing a high threshold of compute capacity. Probably this has to do with the need for a lot of parameters to deal with the intrinsic complexity of a complex learning environment.

Real intelligence deals with information over a ludicrous number of size scales. Simple models effectively blur over these scales and fail to pull them apart. However, extra compute is not enough to do this effectively, as nonparametric models have demonstrated.

The key is injecting a sensible inductive bias into the model. Nonparametric models require this to be done explicitly, but this is almost impossible unless you're God. A better way is to express the bias as a "post-hoc query" in terms of the trained model and its interaction with the data. The only way to train such a model is iteratively, as it needs to update its bias retroactively. This can only be accomplished by a nonlinear (in parameters) parametric model that is dense in function space and possesses parameter counts proportional to the data size. Every model we know of that does this is called "a neural network".

by getnormality 15 hours ago

> I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning.

Is this a practical viewpoint? Can you remove any of the specific architectural tricks used in Transformers and expect them to work about equally well?

by musebox35 15 hours ago

I think this question is one of the more concrete and practical ways to attack the problem of understanding transformers. Empirically, the current architecture is the one that converges best when trained by gradient descent. Potentially, a different form might be possible and even beneficial once the core learning task is completed. Also, the requirements of iterated and continuous learning might lead to a completely different approach.

by etiam 15 hours ago

Did you see this one?

https://news.ycombinator.com/item?id=41732853

by tbrownaw 15 hours ago

> The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.

I'd thought it was some issue with training where older math didn't play nice with having too many layers.

by etiam 15 hours ago

Sigmoid-type activation functions were popular, probably for the bounded activity and some measure of analogy to biological neuron responses. They work, but the gradient feedback scales badly once units move outside the steep middle of the curve.

My understanding of the development is that persistent layer-wise pretraining with RBMs or autoencoders created an initialization state from which the optimization could cope even with more layers. Then, once it was proven that it could work, analysis of why led to some changes, such as new initialization heuristics, rectified linear activation, and eventually normalizations, so that the pretraining was usually not needed any more.

One finding was that supervised training with the old arrangement often does work on its own, if you let it run much longer than people could reasonably afford to wait on pure speculation, against what CPU computations in the 80s through 00s had shown. It first has to work its way to a reasonably optimizable state through a chain of poorly scaled gradients, though.
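As a rough illustration of the scaling problem with sigmoid-type units, the sketch below multiplies the local activation derivatives along a single path through 20 layers. It deliberately ignores the weight matrices, which also scale the gradient in a real network, so treat the numbers as an intuition aid rather than a model of actual training. The helper name `chain_factor` and the chosen pre-activation values are made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; it peaks at 0.25 when z == 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for active units, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chain_factor(grad_fn, pre_activations):
    """Product of local derivatives along one path through the layers."""
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor

depth = 20
centered = [0.0] * depth    # every unit at the sigmoid's steepest point
saturated = [4.0] * depth   # every unit pushed toward the flat tail

best_case_sigmoid = chain_factor(sigmoid_grad, centered)
saturated_sigmoid = chain_factor(sigmoid_grad, saturated)
active_relu = chain_factor(relu_grad, saturated)

print(best_case_sigmoid, saturated_sigmoid, active_relu)
```

Even in the best case the sigmoid path shrinks the signal by a factor of 0.25 per layer, while an active ReLU path passes it through unchanged, which is one way to read why rectified linear units and better initialization made the pretraining step unnecessary.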

by mystraline 14 hours ago

I've yet to see a model that trains AND applies the trained data in real time. That's basically every living being, from bacteria to plants to mammals.

Even PID loops have a training phase separate from the recitation phase.

by seanhunter 6 hours ago

That’s not a meaningful technical obstacle. If you wanted to, you could just take the output of the model and use it at each iteration of the training phase to perform (badly) whatever task the model is intended to do.

The reason no one does this is that you don’t have to, and you’ll get much better results if you first fully train and then apply the best model you have to whatever problem. Biological systems don’t have that luxury.
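A minimal sketch of what "use the output at each iteration of training" could look like, assuming a toy 1-D linear model trained by per-example gradient steps. The function `online_linear_fit`, the learning rate, and the synthetic data stream are all invented for illustration; the point is only that each example is first predicted with the current weights and then used to update them.

```python
import random

def online_linear_fit(stream, lr=0.05):
    """Predict each example with the current model, then learn from it."""
    w, b = 0.0, 0.0
    errors = []
    for x, y in stream:
        prediction = w * x + b    # apply the model as it currently is
        err = prediction - y
        errors.append(abs(err))   # record how well it did while in use
        w -= lr * err * x         # then take one gradient step on it
        b -= lr * err
    return w, b, errors

# Hypothetical noiseless data stream: y = 3x + 1.
rng = random.Random(0)
stream = []
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    stream.append((x, 3.0 * x + 1.0))

w, b, errors = online_linear_fit(stream)
early = sum(errors[:50]) / 50    # average error over the first predictions
late = sum(errors[-50:]) / 50    # average error once the model has adapted
print(w, b, early, late)
```

The early predictions are poor, which matches the "(badly)" above: the model is being used before it has converged, and the error only falls as the stream goes on.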