The set of 11-digit numbers with any given failure mode (or even successful output) has no discernible pattern, merely whatever randomness the training process baked into the model.
You can't predict ahead of time when they will fail spectacularly, nor draw a clear boundary around the failure cases. An early major example of this was the "glitch tokens" introduced into most LLMs by training on Reddit data.
But there is an "in general"/"average failure rate across all inputs of a given size" answer: LLM performance drops off a cliff once the input gets too complex (a "┐"-shaped curve). This is in contrast to humans: ask a child to add two N-digit numbers and the error rate will be roughly linear in N.
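One way to put numbers on that curve (a sketch of my own, assuming access to some `model` callable that maps a prompt string to an answer string; the `toy_model` below is a made-up stand-in, not any real model): sample random N-digit addition problems and measure the error rate per N.

```python
import random

def error_rate(model, n_digits, trials=100, seed=0):
    """Fraction of random n-digit addition problems the model gets wrong."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        answer = model(f"{a} + {b} =").strip()
        if answer != str(a + b):
            errors += 1
    return errors / trials

# Toy stand-in "model": exact below 8 digits, garbage above (mimics a cliff).
def toy_model(prompt):
    a, b = [int(t) for t in prompt.rstrip(" =").split(" + ")]
    return str(a + b) if max(a, b) < 10 ** 7 else "96913456789"

for n in (4, 8, 12):
    print(n, error_rate(toy_model, n))  # → 0.0 at 4 digits, 1.0 at 8 and 12
```

Swapping in a real model and sweeping N would show whether the curve is cliff-shaped or linear.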
Also, there exist autistic savants who prove that a human brain can perform rote calculations on large numbers much faster than a typical human with a calculator can.
For instance, the current high-score model (311 params [0]), when given 12345678900 + 1, responds with 96913456789.
An interesting experiment would be: what's the minimum number of parameters required to handle unbounded addition (without offloading it to tool calls)?
Of course, memory constraints would preclude a truly unbounded experiment. So a sensible proxy would be: what kind of neural-net architecture and training would allow a model to handle number lengths it hasn't been trained on? I suspect this may not be possible.
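For context on why length generalization is at least plausible in principle (my own aside, not a claim about any existing model): long addition is computed by a constant-size machine, a single carry bit scanning the digits right to left. Any network that learned this automaton exactly would generalize to arbitrary lengths; the open question is whether gradient training ever finds it.

```python
def add_by_carry(a: str, b: str) -> str:
    """Add two decimal strings using one carry bit of state,
    regardless of how long the inputs are."""
    a, b = a.zfill(len(b)), b.zfill(len(a))  # pad to equal length
    carry, out = 0, []
    for da, db in zip(reversed(a), reversed(b)):  # right to left
        s = int(da) + int(db) + carry
        out.append(str(s % 10))
        carry = s // 10
    if carry:
        out.append("1")
    return "".join(reversed(out))

print(add_by_carry("12345678900", "1"))  # → 12345678901
```

The per-digit state is just the carry, so the "parameters" needed are constant in input length; what blows up in practice is getting a learned model to execute this loop faithfully.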