performance is easy. you can craft a test suite that will allow a ralph loop to iterate until it hits the metrics.
the hard part is style/feel/usability. LLMs still suck at that stuff, and crafting tests that turn those qualities into metrics is nigh impossible.
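
for the performance half, here's a rough sketch of the kind of check a loop like that could iterate against. everything in it is an assumption for illustration: `hot_path` is a hypothetical stand-in for whatever code is being optimized, and the 50 ms budget and 5 runs are made-up numbers, not anything from a real project.

```python
import statistics
import time
from typing import Callable, Tuple


def median_runtime(fn: Callable[[], object], runs: int = 5) -> float:
    """Call fn `runs` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def meets_budget(
    fn: Callable[[], object], budget_seconds: float, runs: int = 5
) -> Tuple[bool, float]:
    """Return (passed, measured median) for a simple time budget."""
    measured = median_runtime(fn, runs)
    return measured <= budget_seconds, measured


def hot_path() -> int:
    # hypothetical stand-in for the code under optimization
    return sum(i * i for i in range(10_000))


def test_hot_path_within_budget() -> None:
    # assumed budget of 50 ms; a failing assert is the signal to keep iterating
    ok, measured = meets_budget(hot_path, budget_seconds=0.05)
    assert ok, f"median {measured:.6f}s is over budget"
```

the median is used instead of a single run so one noisy sample doesn't flip the result. there's no equivalent function you can write for "feels right", which is the whole problem.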