I think a bunch of these harnesses are open source so it surprises me that there can be such a gulf between them.
About 8 times out of 10 that I've used it, it goes into loops and never completes the task.
I haven't tried 3.1 yet, but 3 is just incompetent at tool use. In particular, when editing chunks of text in files, it gets very confused and goes into loops.
The model also does this thing where it degrades into loops of nonsense thought patterns over time.
For shorter sessions where it's more analysis than execution, it is a strong model.
We'll see about 3.1. I don't know why it's not showing as available in my Gemini CLI yet.