Ideas are fragile things and evaluation is a crushingly heavy burden
- when working on early demos, vibe-based evaluation is good enough, i.e. dogfooding your own tools with your own interactions and fixing bugs in real time; one specific thing that helps a lot is an AI simulation harness - it forces you to describe your target consumer in meticulous detail, which sheds clarity on who you're building this system for, and it gives you a way to generate interactions at scale (high temperature, meaningfully varied key pointers); a minimal sketch of such a harness follows this list
- when working on a proven product and taking it to alpha/beta releases, a full-fledged evaluation with actual domain experts is what these stages call for
- when the app is deployed, you have users, and you're trying to fish out and fix failures from real-world usage, use TRELLIS - this is more LLM-powered-DA-style evaluation than hardcore AI evals, and it keeps your system safe and well functioning while giving you a clear pulse on actual usage (a rough sketch of this style of failure mining also follows below)
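A minimal sketch of the simulation-harness idea from the first bullet, in Python. The personas, prompt wording, temperature value, and the `call_llm` / `app_reply_fn` helpers are all assumptions standing in for whatever model client and app you actually have:

```python
import json
import random

# Placeholder for whatever LLM client you actually use; wire up your own.
def call_llm(prompt: str, temperature: float = 1.0) -> str:
    raise NotImplementedError("swap in your real model client here")

# Describing the target consumer in meticulous detail is the point:
# each persona is a forcing function for clarity about who you're building for.
PERSONAS = [
    {"role": "first-time founder", "goal": "draft an investor update", "tone": "rushed, informal"},
    {"role": "ops lead at a 200-person company", "goal": "automate weekly reporting", "tone": "precise, skeptical"},
    {"role": "solo consultant", "goal": "summarise client calls", "tone": "chatty, detail-hungry"},
]

def simulate_interaction(persona: dict, app_reply_fn, turns: int = 4) -> list[dict]:
    """Drive a multi-turn conversation between a simulated user and your app."""
    transcript = []
    user_msg = call_llm(
        f"You are a {persona['role']} whose goal is to {persona['goal']}. "
        f"Tone: {persona['tone']}. Write your opening message to the assistant.",
        temperature=1.2,  # high temperature => meaningfully varied openings
    )
    for _ in range(turns):
        app_msg = app_reply_fn(user_msg)  # your actual app under test
        transcript.append({"user": user_msg, "app": app_msg})
        user_msg = call_llm(
            f"Persona: {json.dumps(persona)}\nConversation so far: {json.dumps(transcript)}\n"
            "Write the user's next message. Stay in character; push on anything unclear.",
            temperature=1.2,
        )
    return transcript

def run_harness(app_reply_fn, n_per_persona: int = 5) -> list[list[dict]]:
    """Generate interactions at scale for vibe-checking and later replay."""
    return [
        simulate_interaction(random.choice(PERSONAS), app_reply_fn)
        for _ in range(n_per_persona * len(PERSONAS))
    ]
```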
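And a rough sketch of the LLM-powered-DA style of failure mining from the deployed-app bullet. This is not TRELLIS itself; the failure taxonomy, prompt wording, and `call_llm` stub are illustrative assumptions:

```python
import json
from collections import Counter

# Placeholder model call; swap in your own client.
def call_llm(prompt: str, temperature: float = 0.0) -> str:
    raise NotImplementedError("swap in your real model client here")

FAILURE_TAXONOMY = ["hallucination", "refusal", "tool_error", "formatting", "off_topic", "none"]

def triage_trace(trace: dict) -> str:
    """Ask an LLM to label one production trace with a failure mode (or 'none')."""
    label = call_llm(
        "You are reviewing a production trace of an AI app.\n"
        f"Trace: {json.dumps(trace)[:4000]}\n"
        f"Pick exactly one label from {FAILURE_TAXONOMY} and output only that label."
    ).strip()
    return label if label in FAILURE_TAXONOMY else "unparsed"

def weekly_pulse(traces: list[dict]) -> Counter:
    """Aggregate labels into a failure-mode histogram: the pulse on actual usage."""
    return Counter(triage_trace(t) for t in traces)
```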
do hardcore evaluation for every model switch, every prompt change, and every agent architecture change
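One hedged sketch of what that gate can look like: a frozen golden set replayed before any such change ships. The file path, JSONL format, checker names, and pass-rate threshold are assumptions, not a prescription:

```python
import json
from pathlib import Path

GOLDEN_SET = Path("evals/golden_set.jsonl")  # hypothetical location/format
MIN_PASS_RATE = 0.95                         # arbitrary illustrative threshold

def contains_all(output: str, required: list[str]) -> bool:
    return all(r.lower() in output.lower() for r in required)

CHECKERS = {"contains_all": contains_all}

def run_gate(app_reply_fn) -> bool:
    """Replay the frozen golden set against the current model/prompt/architecture."""
    cases = [json.loads(line) for line in GOLDEN_SET.read_text().splitlines() if line.strip()]
    assert cases, "golden set is empty"
    passed = 0
    for case in cases:
        output = app_reply_fn(case["prompt"])
        checker = CHECKERS[case["checker"]]
        if checker(output, case["args"]):
            passed += 1
    pass_rate = passed / len(cases)
    print(f"pass rate: {pass_rate:.2%} over {len(cases)} cases")
    return pass_rate >= MIN_PASS_RATE  # block the change if this is False
```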
Embed an algo-judge into your app only if you have very high confidence in the algo-judge's precision/recall
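A sketch of what that can mean concretely: a deterministic judge, its precision/recall measured against human labels first, and only then wired inline as a retry gate. The regex rule here is a placeholder for whatever check fits your domain:

```python
import re

# A deliberately simple algorithmic judge: the output passes only if it contains
# a well-formed ISO date. Replace with the deterministic rule your domain needs.
def algo_judge(output: str) -> bool:
    return bool(re.search(r"\b\d{4}-\d{2}-\d{2}\b", output))

def precision_recall(labelled: list[tuple[str, bool]]) -> tuple[float, float]:
    """Measure the judge against human 'pass' labels before trusting it in-app."""
    tp = sum(1 for out, human_pass in labelled if algo_judge(out) and human_pass)
    fp = sum(1 for out, human_pass in labelled if algo_judge(out) and not human_pass)
    fn = sum(1 for out, human_pass in labelled if not algo_judge(out) and human_pass)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def guarded_reply(app_reply_fn, user_msg: str, max_attempts: int = 3) -> str:
    """Once precision/recall are high enough, use the judge inline as a retry gate."""
    reply = ""
    for _ in range(max_attempts):
        reply = app_reply_fn(user_msg)
        if algo_judge(reply):
            return reply
    return reply  # still failing after retries; consider flagging for human review
```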