Hamel’s framework - observe - judge - improve
- Observe - telemetry and tracing (open-source vs closed-source tooling)
- Judge - domain experts, annotation tools, code-based evaluators, and LLM-as-judge (benchmark the judge against domain experts’ annotations so you can put confidence intervals on its agreement)
- Improve - specification updates (stating an instruction more clearly in the prompt, or adding a boolean flag check, fixes the problem directly) vs generalization updates (writing net-new instructions in your prompt, or adding/removing nodes in your agent graph; these change system behavior and require re-evaluation)
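The judge step above can be sketched in code. A minimal example (all labels hypothetical) of benchmarking an LLM judge against domain-expert annotations: compute the agreement rate, then a bootstrap 95% confidence interval around it.

```python
import random

def judge_agreement_ci(expert_labels, judge_labels, n_boot=2000, seed=0):
    """Agreement rate between an LLM judge and domain-expert labels,
    with a bootstrap 95% confidence interval."""
    rng = random.Random(seed)
    matches = [e == j for e, j in zip(expert_labels, judge_labels)]
    point = sum(matches) / len(matches)
    boots = []
    for _ in range(n_boot):
        # Resample the per-example agreements with replacement.
        sample = [rng.choice(matches) for _ in matches]
        boots.append(sum(sample) / len(sample))
    boots.sort()
    lo = boots[int(0.025 * n_boot)]
    hi = boots[int(0.975 * n_boot)]
    return point, (lo, hi)

# Hypothetical pass/fail annotations on the same 10 traces.
experts = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge   = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]
rate, (lo, hi) = judge_agreement_ci(experts, judge)
# rate == 0.8; (lo, hi) brackets it
```

With only 10 examples the interval will be wide, which is exactly the point: it tells you how much expert-labeled data you need before trusting the judge.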
Ability to measure datasets against different configurations of your app:
- single turn vs multi turn
- single LLM call vs complex agentic system
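A sketch of what "measuring a dataset against different configurations" can look like. Everything here is a hypothetical stand-in: `toy_app` plays the role of your app (a single LLM call vs a multi-step pipeline), and the harness just scores exact-match accuracy per configuration.

```python
def run_eval(dataset, app, configs):
    """Score the same dataset under each named app configuration.
    `app` is any callable: app(input, config) -> output."""
    results = {}
    for name, config in configs.items():
        scores = [1 if app(ex["input"], config) == ex["expected"] else 0
                  for ex in dataset]
        results[name] = sum(scores) / len(scores)
    return results

# Hypothetical stand-in for the real app.
def toy_app(text, config):
    out = text.upper() if config.get("uppercase") else text
    if config.get("strip"):
        out = out.strip()
    return out

dataset = [
    {"input": " hi ", "expected": "HI"},
    {"input": "ok", "expected": "OK"},
]
results = run_eval(dataset, toy_app, {
    "single_call": {"uppercase": True},
    "pipeline": {"uppercase": True, "strip": True},
})
# e.g. results == {"single_call": 0.5, "pipeline": 1.0}
```

The same shape extends to multi-turn: make each dataset example a conversation and have `app` return the final turn.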
How to judge when the backend is a complex chain of many steps, i.e. agentic:
- Agentic traces
- measure every step in the chain - process oriented
- measure only the human input and the final AI output - output oriented
- failure funnels (quote Hamel)
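A failure funnel can be sketched from agentic traces directly. Assuming each trace records the first step at which it failed (the step names and traces below are hypothetical), count how many traces survive each step of the pipeline:

```python
def failure_funnel(traces, steps):
    """Count how many traces are still succeeding after each step.
    Each trace records the first step at which it failed (or None)."""
    funnel = {}
    alive = len(traces)
    for step in steps:
        failed_here = sum(1 for t in traces if t["first_failure"] == step)
        alive -= failed_here
        funnel[step] = alive
    return funnel

# Hypothetical traces from a retrieve -> plan -> act agent.
traces = [
    {"first_failure": None},
    {"first_failure": "plan"},
    {"first_failure": "retrieve"},
    {"first_failure": None},
    {"first_failure": "act"},
]
funnel = failure_funnel(traces, ["retrieve", "plan", "act"])
# funnel == {"retrieve": 4, "plan": 3, "act": 2}
```

Reading the funnel tells you which step loses the most traces, so you know where a process-oriented eval will pay off most.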