This is the landing page for a series on evaluating AI deployments in healthcare, specifically focused on safety.
Read the Series
- AI Safety is a Hard Problem (Introduction)
- AI Evaluation Framework
- Ideal Safety Spec
- Stages of Evaluation
- AI Evaluation in Practice
- building demos with AI is incredibly accelerated; building production systems is exponentially hard.
- quick review of some safety tools available out in the field - like hail break, hallucination etc. and why they cannot be trusted blindly - because the quality you care about is subjective and no one out there has your definition of good; because these are brittle poorly trained on out of distribution data; because these could be LLMs-as-judge of LLMs that don’t align well to your niche problems and of which you can’t change prompts
- quick review of LLMs as judge - what is this? quickest way to build one for yourself? and why this guarantee safety for your AI product (tl;dr because you cannot perfectly spec out all possible unsafe behaviors; because models are not perfect at following all instructions; because even if both the previous is true you cannot capture the confidence levels of the LLM-as-judge’s judgement)
- drive home 3 points -
- only you know the definition of “good” - so someone else’s black box algo cannot judge what matters to your AI tool
- commercial safety tools are out of distribution for your use-case, black box and unconfigurable - and their corporate appeal will make them useless trophy systems
- LLMs-as-judge are promising but not guaranteed to work and you’re left in the dark to figure out reliability of the system - hope is not a strategy