This is also probably partly due to the extended prompt that disallowed coding. Gemini always does calculations with a little Python snippet because that is deterministic and accurate.
Flash 3.5 fails exactly as in your sample: https://gemini.google.com/share/97521a8752d9
Flash 3.1 Lite, however, initially fails but then corrects itself: https://gemini.google.com/share/dc0889ec85ba