Huh okay, there was a major gap in my mental model. Thanks for helping to clear it up.
Well, to be fair, the fact that they "can" doesn't mean models necessarily do it. You'd need some interp research to see whether they actually do meaningfully "do other computations" when processing low-perplexity tokens. But since, by the computational graph, the architecture should be capable of it, _not_ doing this is leaving loss on the table, so hopefully the optimizer would force the model to learn to do so.
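
As a concrete first step for that kind of interp research, here's a rough sketch of locating the low-perplexity positions you'd then probe. It's a minimal sketch, assuming HuggingFace `transformers` is available; the choice of `gpt2` and the 2-nat "low surprisal" threshold are arbitrary illustrations, not anything principled:

```python
# Compute per-token surprisal with a small pretrained LM so that
# low-perplexity positions can be located before probing their activations.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

with torch.no_grad():
    logits = model(input_ids).logits  # (1, seq_len, vocab)

# Surprisal of token t is -log p(token_t | tokens_<t), so shift by one.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = input_ids[:, 1:]
surprisal = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# Flag positions the model finds "easy" -- the places where any spare
# compute in the forward pass would have to be spent on something else.
for tok, s in zip(tokenizer.convert_ids_to_tokens(targets[0].tolist()),
                  surprisal[0]):
    marker = "  <- low surprisal" if s < 2.0 else ""
    print(f"{tok!r}: {s:.2f} nats{marker}")
```

From there the actual interp work would be comparing activations (residual stream, attention patterns) at the flagged positions against high-surprisal ones.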