I’m working on a project, a password manager, where I have full end-to-end test harnesses: the CLI client makes changes, syncs them to the server, and then I observe the data in the iOS app running in the emulator. More than once I noticed Codex had just hard-coded the expected values from the test harness directly into the UI layout of the iOS app to make the test pass.
Similar issues in the crypto layer: the tests were written first, then the code. During review I noticed the code was made to just pass the tests, checking whether a signature value exists instead of checking whether the signature is valid. An LLM can help with code review as well, but it has to be told specifically what to look for. This is with the Codex 5.4 model.
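To make the signature problem concrete, here is a rough sketch of the difference, not the real code. It uses HMAC from the Python standard library as a stand-in for whatever signature scheme the actual crypto layer uses, and every name in it (`sign`, `verify_presence_only`, `verify_signature`, `tampered`) is made up for illustration.

```python
import hashlib
import hmac


def sign(key: bytes, payload: bytes) -> bytes:
    # Stand-in "signature": an HMAC-SHA256 tag over the payload.
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify_presence_only(record: dict) -> bool:
    # The anti-pattern: passes as long as *some* signature value is present.
    return bool(record.get("signature"))


def verify_signature(key: bytes, record: dict) -> bool:
    # The real check: recompute the tag and compare in constant time.
    signature = record.get("signature")
    if not signature:
        return False
    expected = sign(key, record["payload"])
    return hmac.compare_digest(signature, expected)


def tampered(record: dict) -> dict:
    # Copy of the record with a modified payload and the old signature.
    return {**record, "payload": record["payload"] + b"!"}
```

A test that only feeds in correctly signed records cannot tell these two checks apart. A negative case can: a record built with `tampered(...)` should be rejected by `verify_signature`, while `verify_presence_only` still accepts it. That is the kind of case the reviewer (human or LLM) has to be told to look for.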
I would find it a bit tricky to write a full test suite for a product without any code, though. You'd need to understand the architecture a bit, and you'd likely end up assuming, or mocking, whatever helpers, classes, config, etc. will be built.