When the context is auto-generated and may include irrelevant data.

This is a common AI benchmark, and was one for years before GPT-2 even existed. LLMs need to avoid getting distracted by irrelevant facts, and there are tests that measure exactly this. It's also the motivation for attention mechanisms, the breakthrough that enabled LLMs to scale up: learning to weight the relevant parts of the input and down-weight the rest.
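The tests in question usually work by injecting an irrelevant sentence and checking whether the answer changes. A minimal sketch of that idea, where `ask_model` is a hypothetical stand-in for a real LLM call (here it's a deliberately distractible toy that just grabs the last number it sees):

```python
import re

def ask_model(prompt: str) -> str:
    # Toy stand-in for an LLM: naively returns the last number in the prompt,
    # so appended irrelevant facts containing numbers will "distract" it.
    nums = re.findall(r"\d+", prompt)
    return nums[-1] if nums else ""

def distraction_rate(question: str, distractors: list[str]) -> float:
    """Fraction of injected irrelevant facts that flip the model's answer."""
    baseline = ask_model(question)
    flipped = sum(
        1 for d in distractors
        if ask_model(f"{question} {d}") != baseline  # append irrelevant fact
    )
    return flipped / len(distractors)

q = "Alice has 3 apples and buys 2 more. How many apples does she have?"
print(distraction_rate(q, ["Bob's cat weighs 7 kg.", "The sky is blue."]))
```

A robust model would score near 0.0; the toy one above gets distracted by the first fact (it contains a number) but not the second.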

An example is translation. I machine-translated some text recently where the name of a (fictional) city was rendered about a dozen different ways: sometimes a calque, sometimes a transliteration (including several wrong ones). Ironically, "dumb" MT systems are often much more consistent about this than LLMs.
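Part of why "dumb" pipelines win here is that they typically enforce a glossary: every known variant of a proper noun gets forced to one canonical rendering in a post-processing pass. A minimal sketch of that trick; the city name "Silverbrook" and its variants are made up for illustration:

```python
import re

# Hypothetical glossary: canonical rendering -> variant renderings seen in output.
GLOSSARY = {
    "Silverbrook": ["Silver Brook", "Ginsawa", "Ginzawa", "Silvercreek"],
}

def enforce_glossary(text: str) -> str:
    """Replace every glossary variant with its canonical form."""
    for canonical, variants in GLOSSARY.items():
        for v in variants:
            text = re.sub(r"\b" + re.escape(v) + r"\b", canonical, text)
    return text

print(enforce_glossary("They left Ginsawa and rode toward Silver Brook."))
```

An LLM, by contrast, re-decides the rendering on every pass through the name, which is where the dozen different spellings come from.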