1) How do you compare accuracy? by checking if the answer is in any of the returned grep/bm25/semble snippets?
2) How do you measure token use without the agent, prompt, and tools?
e.g. agents often run `grep -m 5 "QUERY"` with different queries, instead of one big grep for all items.
I guess the point we’re trying to make is that you need fewer semble queries to achieve the same outcome, compared to grep+readfile calls.