Hamel’s framework - observe - judge - improve
- Observe - telemetry and tracing (open-source vs closed-source tooling)
- Judge - domain experts, annotation tools, code-based evaluators, and LLM-as-judge (benchmark the judge against domain experts’ annotations so you can put confidence intervals on its agreement)
- Improve - specification updates (stating an instruction more clearly in the prompt, or adding a boolean flag check, fixes the problem directly) vs generalization updates (writing net-new instructions in your prompt, or adding/removing nodes in your agent graph; these change system behavior and require re-evaluation)
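The judge step above can be sketched in code. A minimal example (all labels hypothetical) of benchmarking an LLM judge against domain-expert annotations: compute the agreement rate, then a bootstrap 95% confidence interval around it.

```python
import random

def judge_agreement_ci(expert_labels, judge_labels, n_boot=2000, seed=0):
    """Agreement rate between an LLM judge and domain-expert labels,
    with a bootstrap 95% confidence interval."""
    rng = random.Random(seed)
    matches = [e == j for e, j in zip(expert_labels, judge_labels)]
    point = sum(matches) / len(matches)
    boots = []
    for _ in range(n_boot):
        # Resample the per-example agreements with replacement.
        sample = [rng.choice(matches) for _ in matches]
        boots.append(sum(sample) / len(sample))
    boots.sort()
    lo = boots[int(0.025 * n_boot)]
    hi = boots[int(0.975 * n_boot)]
    return point, (lo, hi)

# Hypothetical pass/fail annotations on the same 10 traces.
experts = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge   = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]
rate, (lo, hi) = judge_agreement_ci(experts, judge)
# rate == 0.8; (lo, hi) brackets it
```

With only 10 examples the interval will be wide, which is exactly the point: it tells you how much expert-labeled data you need before trusting the judge.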
Ability to measure datasets against different configurations of your app:
- single turn vs multi turn
- single LLM call vs complex agentic system
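A sketch of what "measuring a dataset against different configurations" can look like. Everything here is a hypothetical stand-in: `toy_app` plays the role of your app (a single LLM call vs a multi-step pipeline), and the harness just scores exact-match accuracy per configuration.

```python
def run_eval(dataset, app, configs):
    """Score the same dataset under each named app configuration.
    `app` is any callable: app(input, config) -> output."""
    results = {}
    for name, config in configs.items():
        scores = [1 if app(ex["input"], config) == ex["expected"] else 0
                  for ex in dataset]
        results[name] = sum(scores) / len(scores)
    return results

# Hypothetical stand-in for the real app.
def toy_app(text, config):
    out = text.upper() if config.get("uppercase") else text
    if config.get("strip"):
        out = out.strip()
    return out

dataset = [
    {"input": " hi ", "expected": "HI"},
    {"input": "ok", "expected": "OK"},
]
results = run_eval(dataset, toy_app, {
    "single_call": {"uppercase": True},
    "pipeline": {"uppercase": True, "strip": True},
})
# e.g. results == {"single_call": 0.5, "pipeline": 1.0}
```

The same shape extends to multi-turn: make each dataset example a conversation and have `app` return the final turn.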
How to judge when the backend is a complex chain of many steps, i.e. agentic:
- Agentic traces
- measure every step in the chain - process oriented
- measure only the human input and the final AI output - output oriented
- failure funnels (quote Hamel)
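A failure funnel can be sketched from agentic traces directly. Assuming each trace records the first step at which it failed (the step names and traces below are hypothetical), count how many traces survive each step of the pipeline:

```python
def failure_funnel(traces, steps):
    """Count how many traces are still succeeding after each step.
    Each trace records the first step at which it failed (or None)."""
    funnel = {}
    alive = len(traces)
    for step in steps:
        failed_here = sum(1 for t in traces if t["first_failure"] == step)
        alive -= failed_here
        funnel[step] = alive
    return funnel

# Hypothetical traces from a retrieve -> plan -> act agent.
traces = [
    {"first_failure": None},
    {"first_failure": "plan"},
    {"first_failure": "retrieve"},
    {"first_failure": None},
    {"first_failure": "act"},
]
funnel = failure_funnel(traces, ["retrieve", "plan", "act"])
# funnel == {"retrieve": 4, "plan": 3, "act": 2}
```

Reading the funnel tells you which step loses the most traces, so you know where a process-oriented eval will pay off most.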