- PM should design the product roadmap with AI evaluations accounted for, help hire domain experts, and facilitate interactions between the domain experts, the AI product, and the annotation tool
- AIE should own observability, building algo-judges and improvements to the agentic system
- DA should own measuring qualities from annotations and algo-judges, and benchmarking algo-judges for confidence (see the judge-benchmarking sketch after this list)
- DE should tackle data safety, storage, transfer and transformations into various systems
- Eng ensures dev, stage, and prod environments are isolated, including the data flowing through each system
- Domain experts should be assigned annotation tasks with overlapping data points, so inter-annotator agreement can be measured and no single expert's bias dominates the results (see the overlap-assignment sketch after this list)
- Be careful when generating data from the AI app for judgement:
  - Dogfooding works when the team has high overlap with the target customer profile.
  - Using domain experts to generate data can be counterproductive, because it does not directly signal customer satisfaction.
  - AI-simulated interactions can be great in the initial stages, but AI is not trained to interact like humans on chat systems: it writes long messages with perfect spelling, great grammar, and far above a 5th-grade reading-writing level (see the simulated-user sketch after this list).
  - You might need to look into contracting actual target consumers for alpha and beta releases of your app.
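
A minimal sketch of how the DA might benchmark an algo-judge against domain-expert annotations. The pass/fail label scheme and the dict shapes are illustrative assumptions, not a prescribed format; it reports raw agreement plus Cohen's kappa so chance agreement is accounted for.

```python
from collections import Counter

def benchmark_judge(expert_labels: dict[str, str], judge_labels: dict[str, str]) -> dict:
    """Compare algo-judge verdicts to domain-expert labels on the same data points.

    Both dicts map data-point id -> "pass" / "fail" (assumed label scheme).
    """
    shared = sorted(set(expert_labels) & set(judge_labels))
    if not shared:
        raise ValueError("no overlapping data points to benchmark on")
    pairs = [(expert_labels[i], judge_labels[i]) for i in shared]

    # Raw agreement: fraction of data points where judge and expert match.
    agreement = sum(e == j for e, j in pairs) / len(pairs)

    # Cohen's kappa: agreement corrected for what would be expected by chance.
    expert_counts = Counter(e for e, _ in pairs)
    judge_counts = Counter(j for _, j in pairs)
    labels = set(expert_counts) | set(judge_counts)
    expected = sum(
        (expert_counts[l] / len(pairs)) * (judge_counts[l] / len(pairs)) for l in labels
    )
    kappa = (agreement - expected) / (1 - expected) if expected < 1 else 1.0

    return {"n": len(pairs), "agreement": agreement, "kappa": kappa}
```

The kappa score is the confidence signal here: only scale the judge across the full dataset once it agrees with experts well beyond chance on a held-out slice.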
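
A sketch of one way to assign tasks with overlapping data points: each item goes to k experts chosen round-robin, so every pair of experts shares some items and pairwise agreement can be computed. The k=2 overlap and the round-robin scheme are illustrative choices, not requirements.

```python
from itertools import combinations

def assign_with_overlap(item_ids: list[str], experts: list[str], k: int = 2) -> dict[str, list[str]]:
    """Round-robin assignment where every item is labeled by k experts.

    Returns expert -> list of item ids; with k >= 2 every item gets a second
    opinion, which is what makes agreement checks possible.
    """
    assignments = {e: [] for e in experts}
    for idx, item in enumerate(item_ids):
        for offset in range(k):
            assignments[experts[(idx + offset) % len(experts)]].append(item)
    return assignments

def pairwise_agreement(labels: dict[str, dict[str, str]]) -> dict[tuple[str, str], float]:
    """labels maps expert -> {item_id: label}; returns agreement per expert pair
    on the items both annotated."""
    scores = {}
    for a, b in combinations(sorted(labels), 2):
        shared = set(labels[a]) & set(labels[b])
        if shared:
            scores[(a, b)] = sum(labels[a][i] == labels[b][i] for i in shared) / len(shared)
    return scores
```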
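
If you do lean on AI-simulated interactions early on, one mitigation is to steer the simulator toward how real users actually type. A sketch assuming a generic `call_llm(prompt) -> str` helper (hypothetical, standing in for whichever model client you use); the persona traits are illustrative:

```python
import random

# Hypothetical persona constraints that push generated messages closer to real chat behavior.
PERSONA_TRAITS = [
    "writes short messages, often under 15 words",
    "makes occasional typos and skips punctuation",
    "writes at roughly a 5th-grade reading level",
    "asks vague or underspecified questions before clarifying",
]

def simulated_user_message(task: str, call_llm) -> str:
    """Generate one simulated customer message for the given task."""
    traits = random.sample(PERSONA_TRAITS, k=2)
    prompt = (
        "You are role-playing a real customer chatting with a support app.\n"
        f"Customer traits: {'; '.join(traits)}.\n"
        f"They want to: {task}\n"
        "Write only the customer's next chat message."
    )
    return call_llm(prompt)
```

Even with persona constraints, treat simulated traffic as a stopgap; contracted alpha and beta users remain the stronger signal.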
The entire evals measurement process can take weeks of effort if not well set up, and days even when implemented properly. Improvements will generally need weeks too.
All of the above can bog down product development, especially when working on fresh ideas that are yet to be proven out. Because of this, refer to Stages of Evaluation.