LLM model scoring hides AI agent performance
Date
May 13, 2025
Category
CareOps
Author
Thomas Vande Casteele
As “Agents” crop up everywhere, so do evaluations of those agents. On closer inspection, most of these evaluations track how the underlying model is doing, not the agent itself - let alone how multiple agents collaborate.
In this article we argue why model metrics miss the point, which metrics we should actually focus on when evaluating AI agents for healthcare, and how the two are connected.
Where AI agents actually show up today
Before we worry about metrics, let’s define what an AI agent is and map the user‑interface patterns that dominate current deployments.
Whereas the software industry prefers to talk about agentic systems, the term gaining wider adoption in healthcare is AI Agents. An informal definition of such an AI agent could be:
AI agent: a digital teammate, powered by an LLM (or other AI model), that observes its environment, uses the model to plan next steps, and executes by calling the right tools - cycling through a “see-think-do” loop governed by guardrails until the job is done or is handed off to a human.
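In code, that loop is little more than a bounded cycle with explicit exits. Here is a minimal sketch, assuming a hypothetical `Plan` object and `plan_next` model call - the names are ours, not any particular framework’s API:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical "plan" object the model returns each cycle; names are ours.
@dataclass
class Plan:
    tool: str                  # which tool to call next
    args: dict                 # arguments for that tool
    needs_human: bool = False  # model requests escalation
    is_done: bool = False      # model declares the job finished

def run_agent(task: str,
              tools: dict[str, Callable[..., str]],
              plan_next: Callable[[str, list], Plan],
              max_steps: int = 20) -> str:
    """See-think-do loop with two guardrails: a step budget and an
    explicit hand-off path to a human."""
    history: list = []
    for _ in range(max_steps):
        plan = plan_next(task, history)          # THINK: model proposes the next step
        if plan.needs_human:                     # guardrail: hand off instead of guessing
            return f"handed off to human after {len(history)} steps"
        result = tools[plan.tool](**plan.args)   # DO: execute the chosen tool
        history.append((plan.tool, result))      # SEE: the result feeds the next cycle
        if plan.is_done:
            return result
    return "handed off to human: step budget exhausted"  # guardrail: hard cap
```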
Next, let’s define a taxonomy of UX patterns for AI agents:
UX Interaction Pattern | Description | Healthcare Use Cases |
---|---|---|
Copilot-Style Interfaces (Embedded Assistants) | AI integrated into the user’s workflow, providing real-time assistance and suggestions within existing tools. The user actively guides and supervises the AI. | - Ambient scribes / clinical documentation: AI drafts notes, letters and reports for the clinician to edit. - Medical coding: suggests billing codes or structured data from text. - Decision support: in-app prompts for diagnoses or next steps during patient care (e.g. differential diagnosis hints). - Workflow design & coding: AI helps create care pathways or software logic from natural-language descriptions. |
Conversational Interfaces (Chatbots & Voice Agents) | AI agent interacts through natural-language dialogue (text chat or voice), engaging in back-and-forth conversation with users. Can handle free-form queries and responses with contextual memory. | - Patient engagement: chat or voice bots for symptom triage, answering FAQs, health coaching, medication adherence, etc. - Patient outreach: automated calls or chats for appointment reminders, preventive-screening outreach, post-discharge check-ins. - Clinician Q&A: virtual assistant answers clinicians’ questions or retrieves info (medical knowledge, patient data) via chat/voice. - Operational support: chatbots for scheduling appointments via text, or voice assistants routing calls (automating call-center tasks). |
Background Automation Agents (Workflow Automation, both unattended and Human-in-the-Loop) | AI operates autonomously on tasks behind the scenes, without continuous user input. Often event-driven or scheduled; interacts with systems and possibly users via notifications, queries or escalations. Focus on completing processes end-to-end. | - Scheduling & reminders: automatically booking appointments, sending reminders or follow-ups without human schedulers. - Data processing: reading and summarizing documents (faxes, PDFs), entering data into the EHR (patient info, labs). - Insurance and billing: handling prior-auth submissions, generating insurance appeals, coding audits and claim processing. - Care coordination: multi-step workflows (e.g. a discharge follow-up series or chronic-disease outreach program) executed by AI agents that communicate with patients and update care teams as needed. - Monitoring & alerts: continuous monitoring of data (vitals, queues, inboxes), delivering summary reports or alerting staff when criteria are met. |
The majority of these agents, including the background automation ones, operate in silos. That is not how care is delivered - and definitely not how good care should be delivered. The target UX pattern is one of cross-functional collaboration, hand-offs, and the achievement of a joint goal.
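One way to make such hand-offs testable is to treat the baton as an explicit, validated record rather than free text. A minimal sketch, with field names invented for illustration:

```python
from dataclasses import dataclass

# Hypothetical "baton" record passed between agents (or to a human);
# the schema is ours, invented for illustration.
@dataclass
class Handoff:
    patient_id: str
    goal: str                   # the joint goal both parties work toward
    completed_steps: list[str]  # what the sending agent already did
    open_tasks: list[str]       # what the receiver must do next
    context: dict               # e.g. current meds list, latest vitals

REQUIRED_CONTEXT = ("meds_list", "latest_vitals")

def validate_handoff(h: Handoff) -> list[str]:
    """Return a list of problems; an empty list means the baton arrived intact."""
    problems = [f"missing context: {k}" for k in REQUIRED_CONTEXT if k not in h.context]
    if not h.open_tasks:
        problems.append("no open tasks: receiver cannot act")
    return problems
```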
Why model metrics miss the point
Most evaluation suites judge the LLM inside one of the glass boxes above. They score tokens for accuracy or style and call it a day, or run the model through a standard benchmark that hardly represents reality. That’s barely adequate for copilots working in isolation; it’s lethal for care delivery, where success hinges on three messy questions:
Did the baton make it to the next human or agent intact?
Did noise—typos, missing vitals, conflicting meds—throw the system off?
Did the agent ask for help only when it really had to?
Generic scores ignore all three. They tell you the flashlight is bright, not whether the runway is clear. These problems are widely discussed:
Industry baselines are domain-agnostic. Amazon Bedrock and Google Vertex AI both default to helpfulness and faithfulness scores, with no notion of escalation quality or patient-safety impact (AWS Documentation; Google Cloud).
Clinical researchers keep flagging blind spots. A May 2025 review in Digital Health found that typical LLM benchmarks “rarely reflect workflow risk or downstream cost” and called for simulation-based studies of hand-offs and escalation (ScienceDirect).
Simulation papers echo the same gap. Nature Medicine authors argue that only high-fidelity clinical simulations can reveal where agents break care-coordination chains (PMC).
JAMIA’s conversational-agent study stresses stakeholder legibility. Its proposed rubric swaps BLEU for clinician-rated usefulness and workload impact (Oxford Academic).
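To make the contrast concrete, here is a toy example in which a generic text score looks perfect while the workflow check fails. Both functions and the transcripts are made up for illustration:

```python
# Toy contrast: a fluency-style score can be perfect while the workflow fails.
draft_note = "Patient stable. Follow up in 2 weeks."  # reads well
handed_off_context = {"follow_up": "2 weeks"}         # but the meds list was dropped

def token_level_score(text: str) -> float:
    # Stand-in for a generic fluency/faithfulness metric: only inspects the text.
    return 1.0 if text and text[0].isupper() and text.endswith(".") else 0.5

def workflow_check(context: dict, required=("follow_up", "meds_list")) -> bool:
    # Inspects what actually crossed the hand-off boundary.
    return all(k in context for k in required)

print(token_level_score(draft_note))        # 1.0 -> model metric says "great"
print(workflow_check(handed_off_context))   # False -> the baton was dropped
```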

The hidden cost of blind spots
Dropped context = duplicated work. If an agent forgets to pass the new meds list, the nurse re‑checks the chart and the benefit evaporates.
False alarms = burnout. Raise the wrong flag often enough and clinicians mute the channel - right when a real emergency hits.
Dirty data = readmissions. If vitals stream in half‑broken and the agent shrugs, the patient ends up back in the ED.
These are workflow failures, not language errors. Yet they rarely decide whether an AI agent becomes a contract or a post‑mortem, because today’s evaluations focus on the underlying models.
Four lenses for a workflow‑first test
We pressure‑tested these lenses during a recent hackathon at Awell. The use case we picked was post-discharge follow-up. Picture a crew of AI sidekicks juggling noisy hypertension data, human hand-offs and edge‑case escalations - could they land the plane in a safe and predictable way?
Lens | Question | KPI |
---|---|---|
Scope Resilience | How many subtasks can the agent handle before it cracks? | Unassisted‑success % |
Coordination Quality | Does context survive every baton pass? | Baton‑loss rate |
Noise Tolerance | How gracefully does performance degrade as data gets messy? | Graceful‑degradation curve |
Escalation Precision | When the agent panics, is it right? | Cry‑Wolf ratio |
This bundle could be called an Agentic Evaluation Framework (AEF). It borrows the best of academia (MultiAgentBench’s milestone logic, Med‑HELM datasets) but translates outputs into metrics an ops leader can approve.
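As a minimal sketch, here is how the four KPIs might be computed from simulation run logs. The log schema is invented for illustration, not part of any existing benchmark:

```python
# Each simulated run is a dict; the schema below is hypothetical.
runs = [
    {"subtasks": 8, "completed_unassisted": 7, "handoffs": 3, "handoffs_context_intact": 3,
     "noise_level": 0.1, "success": True,  "escalated": False, "escalation_needed": False},
    {"subtasks": 8, "completed_unassisted": 5, "handoffs": 3, "handoffs_context_intact": 2,
     "noise_level": 0.4, "success": False, "escalated": True,  "escalation_needed": False},
]

# Scope resilience: share of subtasks finished without human help.
unassisted_success = (sum(r["completed_unassisted"] for r in runs)
                      / sum(r["subtasks"] for r in runs))

# Coordination quality: share of hand-offs where context was lost.
baton_loss = 1 - (sum(r["handoffs_context_intact"] for r in runs)
                  / sum(r["handoffs"] for r in runs))

# Noise tolerance: success rate bucketed by noise level.
buckets: dict[float, list[bool]] = {}
for r in runs:
    buckets.setdefault(r["noise_level"], []).append(r["success"])
degradation = {lvl: sum(ok) / len(ok) for lvl, ok in buckets.items()}

# Escalation precision: of all escalations, how many were unnecessary ("cried wolf")?
escalations = [r for r in runs if r["escalated"]]
cry_wolf = sum(not r["escalation_needed"] for r in escalations) / max(len(escalations), 1)

print(unassisted_success, baton_loss, degradation, cry_wolf)
```

The graceful-degradation “curve” is just that success rate bucketed by noise level; plot it and look for the cliff.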
What business teams gain
Trust you can see. Dashboards turn green only when pyjama time drops and hand‑offs close without a second call.
ROI you can forecast. A 10‑point dip in baton‑loss isn’t an abstract curve; it’s X minutes saved per discharge and Y readmissions avoided per 1,000 patients - numbers finance can drop into next quarter’s budget (a back-of-the-envelope sketch follows this list).
Evidence regulators recognise. AEF logs slot straight into the FDA’s SaMD lifecycle template and mirror Joint Commission hand‑off requirements, lightening the compliance lift.
Strategic agility. Because KPIs live at the workflow layer, it becomes possible to swap models, prompts, or even vendors without losing the baseline the board cares about.
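Here is a back-of-the-envelope version of that forecast, where every input is a placeholder to replace with your own baselines:

```python
# All inputs are placeholders, not real benchmarks.
baton_loss_drop  = 0.10   # 10-point improvement in baton-loss rate
handoffs_per_dc  = 3      # hand-offs per discharge
minutes_per_loss = 12     # staff minutes to recover one dropped hand-off
discharges_month = 400    # monthly discharge volume

minutes_saved = baton_loss_drop * handoffs_per_dc * minutes_per_loss * discharges_month
print(f"~{minutes_saved:.0f} staff minutes saved per month")  # -> ~1440
```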
References
“Automation moves data; orchestration moves care” article for voice alignment - Awell
AWS Bedrock evaluation docs (token-level faithfulness/coherence) - AWS Documentation
Google Vertex AI generative-AI evaluation guide - Google Cloud
JAMA Internal Medicine study on omissions in LLM discharge notes - JAMA Network
JAMA Network Open cohort on early-warning false alerts - JAMA Network
Med-HELM leaderboard scope (single-turn focus) - Med-HELM
MultiAgentBench milestone/coordination benchmark - arXiv
Mixed-methods ED hand-off study (context-loss impact) - MDPI
Agent-based ED simulation review (cascade effects) - ScienceDirect
FDA Feb 2025 SaMD draft calling for lifecycle risk logs - fda.gov
Galileo.ai blog on multi-agent benchmarks - Galileo AI