Ideas are fragile things and evaluation is a crushingly heavy burden
- when working on early demos, vibe-based evaluation is good enough, i.e. dogfooding your own tools with your own interactions and fixing bugs in real time; one specific thing that helps a lot is an AI simulation harness - it forces you to describe your target consumer in meticulous detail, which sheds clarity on who you're building this system for, and it gives you a way to generate interactions at scale (high temperature, meaningfully varied key pointers); a minimal sketch of such a harness follows this list
- when working on a proven product and taking it to alpha/beta releases, a full-fledged evaluation with actual domain experts is what these stages call for
- when the app is deployed, you have users, and you're trying to fish out and fix failures from real-world usage, use TRELLIS - this is more LLM-powered-DA-style evaluation than hardcore AI evals, and it keeps your system safe and well functioning while giving you a clear pulse on actual usage (a rough sketch of this style of failure mining also follows below)
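A minimal sketch of the simulation-harness idea from the first bullet, in Python. The personas, prompt wording, temperature value, and the `call_llm` / `app_reply_fn` helpers are all assumptions standing in for whatever model client and app you actually have:

```python
import json
import random

# Placeholder for whatever LLM client you actually use; wire up your own.
def call_llm(prompt: str, temperature: float = 1.0) -> str:
    raise NotImplementedError("swap in your real model client here")

# Describing the target consumer in meticulous detail is the point:
# each persona is a forcing function for clarity about who you're building for.
PERSONAS = [
    {"role": "first-time founder", "goal": "draft an investor update", "tone": "rushed, informal"},
    {"role": "ops lead at a 200-person company", "goal": "automate weekly reporting", "tone": "precise, skeptical"},
    {"role": "solo consultant", "goal": "summarise client calls", "tone": "chatty, detail-hungry"},
]

def simulate_interaction(persona: dict, app_reply_fn, turns: int = 4) -> list[dict]:
    """Drive a multi-turn conversation between a simulated user and your app."""
    transcript = []
    user_msg = call_llm(
        f"You are a {persona['role']} whose goal is to {persona['goal']}. "
        f"Tone: {persona['tone']}. Write your opening message to the assistant.",
        temperature=1.2,  # high temperature => meaningfully varied openings
    )
    for _ in range(turns):
        app_msg = app_reply_fn(user_msg)  # your actual app under test
        transcript.append({"user": user_msg, "app": app_msg})
        user_msg = call_llm(
            f"Persona: {json.dumps(persona)}\nConversation so far: {json.dumps(transcript)}\n"
            "Write the user's next message. Stay in character; push on anything unclear.",
            temperature=1.2,
        )
    return transcript

def run_harness(app_reply_fn, n_per_persona: int = 5) -> list[list[dict]]:
    """Generate interactions at scale for vibe-checking and later replay."""
    return [
        simulate_interaction(random.choice(PERSONAS), app_reply_fn)
        for _ in range(n_per_persona * len(PERSONAS))
    ]
```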
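And a rough sketch of the LLM-powered-DA style of failure mining from the deployed-app bullet. This is not TRELLIS itself; the failure taxonomy, prompt wording, and `call_llm` stub are illustrative assumptions:

```python
import json
from collections import Counter

# Placeholder model call; swap in your own client.
def call_llm(prompt: str, temperature: float = 0.0) -> str:
    raise NotImplementedError("swap in your real model client here")

FAILURE_TAXONOMY = ["hallucination", "refusal", "tool_error", "formatting", "off_topic", "none"]

def triage_trace(trace: dict) -> str:
    """Ask an LLM to label one production trace with a failure mode (or 'none')."""
    label = call_llm(
        "You are reviewing a production trace of an AI app.\n"
        f"Trace: {json.dumps(trace)[:4000]}\n"
        f"Pick exactly one label from {FAILURE_TAXONOMY} and output only that label."
    ).strip()
    return label if label in FAILURE_TAXONOMY else "unparsed"

def weekly_pulse(traces: list[dict]) -> Counter:
    """Aggregate labels into a failure-mode histogram: the pulse on actual usage."""
    return Counter(triage_trace(t) for t in traces)
```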
do hardcore evaluation for every model switch, every prompt change, and every agent architecture change
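One hedged sketch of what that gate can look like: a frozen golden set replayed before any such change ships. The file path, JSONL format, checker names, and pass-rate threshold are assumptions, not a prescription:

```python
import json
from pathlib import Path

GOLDEN_SET = Path("evals/golden_set.jsonl")  # hypothetical location/format
MIN_PASS_RATE = 0.95                         # arbitrary illustrative threshold

def contains_all(output: str, required: list[str]) -> bool:
    return all(r.lower() in output.lower() for r in required)

CHECKERS = {"contains_all": contains_all}

def run_gate(app_reply_fn) -> bool:
    """Replay the frozen golden set against the current model/prompt/architecture."""
    cases = [json.loads(line) for line in GOLDEN_SET.read_text().splitlines() if line.strip()]
    assert cases, "golden set is empty"
    passed = 0
    for case in cases:
        output = app_reply_fn(case["prompt"])
        checker = CHECKERS[case["checker"]]
        if checker(output, case["args"]):
            passed += 1
    pass_rate = passed / len(cases)
    print(f"pass rate: {pass_rate:.2%} over {len(cases)} cases")
    return pass_rate >= MIN_PASS_RATE  # block the change if this is False
```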
Embed an algo-judge into your app only if you have very high confidence in the algo-judge's precision/recall
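A sketch of what that can mean concretely: a deterministic judge, its precision/recall measured against human labels first, and only then wired inline as a retry gate. The regex rule here is a placeholder for whatever check fits your domain:

```python
import re

# A deliberately simple algorithmic judge: the output passes only if it contains
# a well-formed ISO date. Replace with the deterministic rule your domain needs.
def algo_judge(output: str) -> bool:
    return bool(re.search(r"\b\d{4}-\d{2}-\d{2}\b", output))

def precision_recall(labelled: list[tuple[str, bool]]) -> tuple[float, float]:
    """Measure the judge against human 'pass' labels before trusting it in-app."""
    tp = sum(1 for out, human_pass in labelled if algo_judge(out) and human_pass)
    fp = sum(1 for out, human_pass in labelled if algo_judge(out) and not human_pass)
    fn = sum(1 for out, human_pass in labelled if not algo_judge(out) and human_pass)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def guarded_reply(app_reply_fn, user_msg: str, max_attempts: int = 3) -> str:
    """Once precision/recall are high enough, use the judge inline as a retry gate."""
    reply = ""
    for _ in range(max_attempts):
        reply = app_reply_fn(user_msg)
        if algo_judge(reply):
            return reply
    return reply  # still failing after retries; consider flagging for human review
```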