My guess is that it's a known problem, and that it's what steered the frontier models toward a preference for bullet points.
To be fair, as you can see in the clip, the two models handled the prompt slightly differently. The pxpipe variant gave the right count initially but needed a quick follow-up to output the ledger balance on a single line. The standard model, on the other hand, nailed the formatting on its first try. We've completely solved readability here on Fable; the only real hurdle left is getting the models to follow formatting constraints perfectly on the very first reply.
Of course, this was just rewritten by another LLM.