Enough that the prompt is different at a token-level, but not enough that the meaning changes.
It would be very difficult for them to catch that, especially if the prompts were not made public.
Run the variations enough times per day, and you'd get some statistical significance.
The guess the fuzzy part is judging the output.