Essentially they can't do clean room anything!
You might as well hire the entire former mid-level of a business's programming team and claim it's clean room work.
https://www.itprotoday.com/server-virtualization/windows-nt-...
In any case, an interesting experiment.
In fact, this would make for an interesting benchmark - writing entire non-trivial apps based on the same prompt. Each model might be expected to write and use its own test cases, but then all could be judged based on a common set of test cases provided as part of the benchmark suite.
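A rough sketch of what the judging side might look like, just to make the idea concrete. Everything here is hypothetical: each generated "app" is reduced to a single callable, and `score_submissions`, `App`, and `TestCase` are names made up for illustration, not part of any existing benchmark.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical simplification: each model's app is one callable from input to output.
App = Callable[[str], str]
TestCase = Tuple[str, str]  # (input, expected output)


def score_submissions(apps: Dict[str, App], common_tests: List[TestCase]) -> Dict[str, float]:
    """Run every app against the same shared test cases and return its pass rate."""
    scores: Dict[str, float] = {}
    for name, app in apps.items():
        passed = 0
        for given, expected in common_tests:
            try:
                if app(given) == expected:
                    passed += 1
            except Exception:
                # A crash counts as a failed test case rather than aborting the run.
                pass
        scores[name] = passed / len(common_tests) if common_tests else 0.0
    return scores


# Toy usage: two stand-in "apps" judged on the same suite.
apps = {"model_a": str.upper, "model_b": lambda s: s}
common_tests = [("abc", "ABC"), ("X", "X")]
scores = score_submissions(apps, common_tests)
```

The point is only that the models' own tests never enter the score; a real suite would presumably drive full applications rather than single functions.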