upvote
Token counts alone tell you nothing about correctness, latency, or developer ergonomics. Run a deterministic test suite that exercises representative MCP calls against both native MCP and mcp2cli while recording token usage, wall time, error rate, and output fidelity.

Measure fidelity with exact diffs and embedding similarity, and include streaming behavior, schema-change resilience, and rate-limit fallbacks in the cases you care about. Check the repo for a runnable benchmark, archived fixtures captured with vcrpy or WireMock, and a clear test harness that reproduces the claimed 96 to 99 percent savings.

reply
The AI prose is getting so tiring to read

"We measured this. Not estimates — actual token counts using the cl100k_base tokenizer against real schemas, verified by an automated test suite."

reply