I looked superficially at your site/repo and based on that initial impression:
- Your approach of comparing the different parts of the "black box" that affect agent behaviour (harness, foundation model, skills, context (in your case, the context loaded from AGENTS.md)) is closely aligned with how I both think and operate.
- You're tackling both the "regression" problem and the "answer a hypothesis easily" problem.
> Stalking your profile (sorry..) I see you're pretty deep in the eval space, so I'm super curious what your approach has been to being rigorous for things like skill changes?
It depends on the level of automation and the risk profile. For skills I use this framework of thinking [1] and encourage building evals/ground truth as soon as possible, so that you have automatic feedback loops for both the markdown part and the deterministic part (scripts). Once you have the eval/ground truth pair, you're almost doing TDD, or Eval Driven Development (which is quite hard the first few times you try it, because you realize you actually need to think about intent). The scripts should definitely have their own unit tests, so that a "skill iteration" stays safe when you want to change them to cover new behaviour or fix wrong behaviour.
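A minimal sketch of the ground truth idea for the deterministic part. The `slugify` helper and the cases are hypothetical, standing in for whatever script your skill ships:

```python
import re

def slugify(title: str) -> str:
    """Hypothetical skill script: turn a title into a URL slug."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))

# Ground truth pairs: (input, expected output). Written before changing the script.
GROUND_TRUTH = [
    ("Hello, World!", "hello-world"),
    ("  Evals 101  ", "evals-101"),
    ("", ""),
]

def failing_cases(fn, cases):
    """Return (input, expected, actual) for every case the function gets wrong."""
    return [
        (given, expected, fn(given))
        for given, expected in cases
        if fn(given) != expected
    ]

if __name__ == "__main__":
    failures = failing_cases(slugify, GROUND_TRUTH)
    print("PASS" if not failures else f"FAIL: {failures}")
```

When you want new behaviour, you add the new pair first, watch it fail, then change the script.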
On Agent Skills, it may seem tempting to want more "openness" for the AI to solve the problem creatively but, more often than not, you've described a repeatable workflow and you want predictability and stability instead of novelty. So it's really about: 1) How can I freeze it so it stays good enough for as long as possible? 2) How can I know if something happened somewhere that changed the black box (e.g. the coding harness's automatic model picking screws things up)? 3) How can I make the skill itself ETC (Easy To Change), to keep control? Local models can be a great tool for stability in some scenarios.
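For point 2, one possible sketch: record a fingerprint of every piece of the black box next to each eval run, so a silent change shows up as a diff. The keys and values here are made up for illustration; use whatever your harness actually exposes:

```python
import hashlib
import json

def fingerprint(config: dict) -> str:
    """Stable hash of everything that can change agent behaviour."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def changed_keys(baseline: dict, current: dict) -> list[str]:
    """Names of the black-box parts that differ between two runs."""
    keys = set(baseline) | set(current)
    return sorted(k for k in keys if baseline.get(k) != current.get(k))

# Hypothetical values: harness version, model id, hashes of skill and context files.
baseline = {
    "harness": "example-harness 1.0",
    "model": "model-a",
    "skill_sha": "abc123",
    "context_sha": "def456",
}
current = dict(baseline, model="model-b")  # e.g. auto model picking changed it

drift = changed_keys(baseline, current)

if __name__ == "__main__":
    if fingerprint(baseline) != fingerprint(current):
        print("black box changed:", drift)
```

If an eval starts failing and the fingerprint moved, you at least know where to look first.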
In particular, I prefer pass/fail (binary) outcomes over scoring, which doesn't help with regression decisions. The definition of "good enough" should be very clear. Flakiness is not something to accept if the outcomes are consequential.
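Roughly what a binary gate could look like, as a sketch. The exact-match check, the `repeats` count and the stub skills are all assumptions; your own "good enough" check goes in `passes`:

```python
from typing import Callable

def passes(output: str, expected: str) -> bool:
    """Binary check. Replace with your own definition of good enough."""
    return output.strip() == expected.strip()

def gate(run: Callable[[str], str], cases: list[tuple[str, str]], repeats: int = 3) -> bool:
    """True only if every case passes on every repeat; one flaky miss fails the gate."""
    return all(
        passes(run(prompt), expected)
        for prompt, expected in cases
        for _ in range(repeats)
    )

def stable_skill(prompt: str) -> str:
    """Stub standing in for a skill invocation that behaves the same every time."""
    return prompt.upper()

def make_flaky_skill() -> Callable[[str], str]:
    """Stub that returns a wrong answer on every third call."""
    calls = {"n": 0}
    def run(prompt: str) -> str:
        calls["n"] += 1
        return prompt.upper() if calls["n"] % 3 else "oops"
    return run

CASES = [("a", "A"), ("b", "B")]

if __name__ == "__main__":
    print("stable:", gate(stable_skill, CASES))
    print("flaky:", gate(make_flaky_skill(), CASES))
```

The gate returns a single yes/no, so the regression decision is never a debate about whether 0.83 is worse than 0.85.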
Anything actually risky should be enforced by solid RBAC/policy that doesn't depend on the LLM at all.
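As a sketch of that separation: the check below runs in plain code before any tool call is executed, so no prompt can talk its way past it. The roles, tool names and `ToolCall` shape are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical role-to-permission table, maintained outside any prompt.
ROLE_PERMISSIONS = {
    "viewer": frozenset({"read_report"}),
    "operator": frozenset({"read_report", "restart_service"}),
}

@dataclass(frozen=True)
class ToolCall:
    name: str               # tool the agent wants to invoke
    requested_by_role: str  # role of the human or system the agent acts for

def is_allowed(call: ToolCall) -> bool:
    """Deterministic check; unknown roles and unknown tools are denied."""
    allowed = ROLE_PERMISSIONS.get(call.requested_by_role, frozenset())
    return call.name in allowed

if __name__ == "__main__":
    for call in (ToolCall("restart_service", "viewer"), ToolCall("restart_service", "operator")):
        print(call, "->", "allow" if is_allowed(call) else "deny")
```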
I set up a site, ai-evals.io, to build a community around this, but I didn't manage to get it visible on HN. I've since interacted with a few people, developed further insights and given some private talks, but I need to get back to publishing publicly and reaching out to more people interested in this space, because it's absolutely critical. There's a lot of nuance in how different environments think about the eval problem: for some it's all about tracing and course correcting after launch, for others it's about simulations, sandboxing, security, automatic eval generation, etc.
In any case, I'll try to be more present from now on, and especially from June onwards to try to exchange insights in the open with people who are exploring different solutions in this space.
[1] - https://alexhans.github.io/posts/series/evals/building-agent...