set up a rendering profile and preconditions that generate a minimal snippet of images/video using a predefined GPU profile.
then test for either a pixel-perfect reproduction of the correct behaviour or, if it doesn't reproduce deterministically, for the properties you're looking for.
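a rough sketch of what those two kinds of test could look like, in Python. everything here is made up for illustration: `render_frame` is a hypothetical stand-in for whatever the real renderer does under the fixed GPU profile, and `GOLDEN` stands in for a reference frame you'd capture once and check in.

```python
from typing import List

Frame = List[List[int]]  # rows of 8-bit grayscale pixel values


def render_frame(width: int, height: int) -> Frame:
    """Hypothetical stand-in for the real renderer: a horizontal gradient."""
    return [[(x * 255) // (width - 1) for x in range(width)] for _ in range(height)]


def max_abs_diff(a: Frame, b: Frame) -> int:
    """Largest per-pixel difference between two frames of the same size."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("frames must have identical dimensions")
    return max(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def rows_are_monotonic(frame: Frame) -> bool:
    """Property check: brightness never decreases from left to right."""
    return all(row[i] <= row[i + 1] for row in frame for i in range(len(row) - 1))


# Assumed reference output, captured once from a known-good run.
GOLDEN: Frame = [[0, 85, 170, 255], [0, 85, 170, 255]]


def test_pixel_perfect() -> None:
    # Use this when the output reproduces deterministically.
    assert max_abs_diff(render_frame(4, 2), GOLDEN) == 0


def test_property() -> None:
    # Use this when exact pixels vary but a property must still hold.
    assert rows_are_monotonic(render_frame(4, 2))
```

you'd write the failing version first (red), fix the renderer until it passes (green), then refactor with the test as a safety net.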
this is one way. I also subscribe to the view that if the type system is made stricter in such a way that it fails reliably in the presence of this type of bug, that is also good enough.
some people might argue that these aren't "strictly" TDD by some definition, but they set out a path to follow red-green-refactor and confer identical benefits, so my view is: who gives a duck?
I don't have enough domain expertise to know which variant of these approaches is best, but I'm enough of a TDD expert to know that what you're implying isn't possible is actually something you would probably derive a lot of value from if you did it.