It wrote an extensive test suite on just fake data and then said the app is perfectly working as all tests passed.
This is a model that was supposed to match sonnet 4.5 in benchmarks. I don't think sonnet would be that dumb.
I use LLMs a lot to code, but these chinese models don't match anthropic and openai in being able to decide stuff for themselves. They work well if you give them explicit instructions that leaves little for it to mess up, but we are slowly approaching where OpenAI and anthropic models will make the right decisions on their own
Just now it added some stuff to a file starting at L30 and I said "that one line L30 will do remove the rest", it interpreted 'the rest' as the file, and not what it added.