Every AI app deployed in healthcare should have a -
- simulation testing harness that works as unit tests
- algo-judges that judge real-world usage at scale on all known failure modes
- human-domain-experts that judge fraction of real-world usage and benchmark algo-judges and real-world performance of you AI tool
- LLM-DAs to judge real-world usage intents and edge cases
- observability and monitoring platform that informs you of operational failures - like high latency timeouts, tool call failures etc.
- ability to run evals before any critical system configuration update and measure against versions