Evaluation Metrics

Success, reliability, and efficiency of agent workflows
End-to-end task success
Percentage of workflows completed correctly (missed dose, BP trend, glucose spike, follow-up)
Tool execution success
Percentage of tool calls that succeeded (D1 operations, RAG retrieval, model calls, etc.)
Workflows run
Tool calls
Avg steps
Avg tool calls
Avg latency
Wall-clock time to complete a workflow
Avg token estimate
Rough proxy for cost-to-success
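As an illustration of how the metrics above could be derived from per-run records, here is a minimal sketch. The record layout (WorkflowRun, ToolCall, and all field names) is an assumption made for this example and is not the dashboard's actual schema.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class ToolCall:
    name: str  # hypothetical tool label, e.g. "rag_retrieve"
    ok: bool   # True if the call succeeded


@dataclass
class WorkflowRun:
    workflow: str          # hypothetical label, e.g. "missed_dose"
    success: bool          # end-to-end task success
    steps: int
    latency_ms: float      # wall-clock time to complete the workflow
    token_estimate: int
    tool_calls: list = field(default_factory=list)  # list of ToolCall


def summarize(runs):
    """Aggregate run records into the headline metrics listed above."""
    if not runs:
        return {"workflows_run": 0, "tool_calls": 0}
    calls = [c for r in runs for c in r.tool_calls]
    return {
        "workflows_run": len(runs),
        "tool_calls": len(calls),
        "task_success_pct": 100 * sum(r.success for r in runs) / len(runs),
        # Undefined when no tool calls were made at all.
        "tool_success_pct": (
            100 * sum(c.ok for c in calls) / len(calls) if calls else None
        ),
        "avg_steps": mean(r.steps for r in runs),
        "avg_tool_calls": len(calls) / len(runs),
        "avg_latency_ms": mean(r.latency_ms for r in runs),
        "avg_token_estimate": mean(r.token_estimate for r in runs),
    }
```

Calling summarize on a list of WorkflowRun records returns one dictionary per reporting window. Dividing the average token estimate by the task success rate would give one possible cost-to-success figure, though that exact formula is also an assumption.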
Reliability notes
For consistency testing, run the same workflow multiple times with different seeds or inputs, then compare the success rate and the failure_mode distribution across runs.
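The repeated-run comparison described above might be sketched as follows. The run_workflow callable, its (success, failure_mode) return shape, and the failure-mode labels are hypothetical stand-ins, not part of the actual system.

```python
from collections import Counter


def consistency_report(run_workflow, seeds):
    """Run one workflow once per seed and summarize the outcomes.

    run_workflow is assumed to take a seed and return a tuple
    (success, failure_mode), with failure_mode None on success.
    """
    results = [run_workflow(seed) for seed in seeds]
    n = len(results)
    failures = Counter(mode for ok, mode in results if not ok)
    return {
        "runs": n,
        "success_rate": sum(ok for ok, _ in results) / n if n else None,
        # Share of all runs that ended in each failure mode.
        "failure_modes": {mode: count / n for mode, count in failures.items()},
    }


def fake_workflow(seed):
    """Deterministic stand-in for a real workflow run (illustration only)."""
    if seed % 5 == 0:
        return (False, "tool_timeout")
    if seed % 5 == 1:
        return (False, "wrong_answer")
    return (True, None)


report = consistency_report(fake_workflow, range(10))
```

Two reports produced from different seed sets or input variants can then be compared side by side. A large shift in success_rate, or a failure mode that appears in only one report, suggests the workflow is not behaving consistently.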