This is called mechanistic interpretability. There are already lots of fascinating insights, since you can do basically anything, down to the level of individual neurons or weights, and repeat it thousands of times. The human brain is many orders of magnitude harder to make sense of.
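Here is a minimal sketch of a neuron-level intervention. It is not any lab's actual tooling. It knocks out (zero-ablates) each hidden neuron of a toy network in turn, over many inputs, and measures how much the output moves. The two-layer ReLU network, its random weights, and the helper names `forward` and `ablation_effects` are hypothetical stand-ins for a real model.

```python
import random

# Hypothetical toy 2-layer MLP with random weights; a stand-in for a real model.
random.seed(0)
N_IN, N_HIDDEN = 4, 8
W1 = [[random.gauss(0, 1) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
w2 = [random.gauss(0, 1) for _ in range(N_HIDDEN)]

def forward(x, ablate=None):
    """Scalar output of the toy MLP; if `ablate` is a hidden index, that unit is forced to 0."""
    out = 0.0
    for j in range(N_HIDDEN):
        h = max(0.0, sum(w * xi for w, xi in zip(W1[j], x)))  # ReLU activation of neuron j
        if j == ablate:
            h = 0.0  # zero-ablation: knock out this neuron
        out += w2[j] * h
    return out

def ablation_effects(inputs):
    """Mean absolute change in output when each hidden neuron is ablated, over a batch of inputs."""
    effects = []
    for j in range(N_HIDDEN):
        diffs = [abs(forward(x) - forward(x, ablate=j)) for x in inputs]
        effects.append(sum(diffs) / len(diffs))
    return effects

# Repeat the intervention over 1000 random inputs and rank neurons by how much they matter.
inputs = [[random.gauss(0, 1) for _ in range(N_IN)] for _ in range(1000)]
effects = ablation_effects(inputs)
ranking = sorted(range(N_HIDDEN), key=lambda j: effects[j], reverse=True)
print("neurons ranked by ablation effect:", ranking)
```

Zeroing is only one choice. Replacing a unit with its mean activation over a dataset is a common alternative, because it keeps the network's inputs closer to their usual range.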

by sometimelurker 22 hours ago

Well, it's actually called ablation, and it's one way to do mech interp. Anthropic has a bunch of work on mech interp here: https://transformer-circuits.pub/, like SAEs (sparse autoencoders) and NLAs.
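For readers unfamiliar with SAEs, here is a minimal sketch of the forward pass and loss of a sparse autoencoder, in one common formulation used in dictionary-learning interpretability work. It encodes an activation vector into a wider, mostly-zero feature vector, decodes it back, and penalizes both reconstruction error and total feature activation. The sizes, the L1 coefficient, and all names are assumptions. Training (gradient descent on this loss over many real model activations) is omitted, so the random weights here do not correspond to meaningful features.

```python
import random

# Assumed sizes: activations of width D are decomposed into F > D features (overcomplete dictionary).
D, F = 4, 16
L1_COEFF = 1e-3  # assumed sparsity penalty weight

random.seed(1)
W_enc = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(F)]  # F x D encoder
b_enc = [0.0] * F
W_dec = [[random.gauss(0, 0.5) for _ in range(F)] for _ in range(D)]  # D x F decoder (dictionary)
b_dec = [0.0] * D

def encode(x):
    """Feature activations f = ReLU(W_enc (x - b_dec) + b_enc)."""
    xc = [xi - bi for xi, bi in zip(x, b_dec)]
    return [max(0.0, sum(w * v for w, v in zip(row, xc)) + b) for row, b in zip(W_enc, b_enc)]

def decode(f):
    """Reconstruction x_hat = W_dec f + b_dec: each active feature adds its dictionary direction."""
    return [sum(w * fi for w, fi in zip(row, f)) + b for row, b in zip(W_dec, b_dec)]

def sae_loss(x):
    """Squared reconstruction error plus an L1 penalty that pushes most features to zero."""
    f = encode(x)
    x_hat = decode(f)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return mse + L1_COEFF * sum(abs(v) for v in f)

x = [random.gauss(0, 1) for _ in range(D)]
print("active features:", sum(v > 0 for v in encode(x)), "of", F)
print("loss:", sae_loss(x))
```

The L1 term is what makes the dictionary interpretable in principle. Each input is pushed to be explained by only a few active features, so individual features have a chance of lining up with human-recognizable concepts.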

by Cantinflas 21 hours ago

by mdp2021 18 hours ago

Of course, tampering with chunks or nodes in a neural network is one way to study the configuration "spawned" by training (gradient descent etc.) and to "reverse-engineer the black box" toward "AI transparency".

Anthropic published some important work on this about a year and a half ago.

by mdp2021 12 hours ago

> Anthropic published some important work on this about a year and a half ago

> #Tracing the thoughts of a large language model#

https://www.anthropic.com/research/tracing-thoughts-language...

https://news.ycombinator.com/item?id=43495617 (27 March 2025)

by Computer0 22 hours ago

Reminds me of Golden Gate Claude (https://www.anthropic.com/news/golden-gate-claude)