Evaluation Metrics
Success, reliability, and efficiency of agent workflows
End-to-end task success
—
Percentage of workflows completed correctly (missed dose, BP trend, glucose spike, follow-up)
Tool execution success
—
Percentage of tool calls that succeeded (D1 operations, RAG retrieval, model calls, etc.)
Workflows run
—
Tool calls
—
Avg steps
—
Avg tool calls
—
Avg latency
—
Wall-clock time to complete a workflow
Avg token estimate
—
Rough proxy for the cost of reaching a successful result
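The metrics above could be derived from per-run records along the following lines. This is a minimal sketch, not the dashboard's actual implementation: the `WorkflowRun` and `ToolCall` record shapes, their field names, and the `summarize` helper are all hypothetical and chosen only to mirror the metric labels shown here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import mean


@dataclass
class ToolCall:
    # Hypothetical record for one tool invocation (D1 op, RAG retrieve, model call, ...).
    name: str
    ok: bool


@dataclass
class WorkflowRun:
    # Hypothetical record for one end-to-end workflow run.
    workflow: str          # e.g. "missed_dose", "bp_trend"
    success: bool          # did the workflow complete correctly?
    steps: int             # number of agent steps taken
    latency_s: float       # wall-clock time to complete, in seconds
    token_estimate: int    # rough token count for the run
    tool_calls: list[ToolCall] = field(default_factory=list)


def summarize(runs: list[WorkflowRun]) -> dict:
    """Aggregate run records into the dashboard-style metrics."""
    if not runs:
        return {}
    calls = [c for r in runs for c in r.tool_calls]
    return {
        "workflows_run": len(runs),
        "task_success_pct": 100.0 * sum(r.success for r in runs) / len(runs),
        "tool_calls": len(calls),
        # Undefined when no tool calls were made, so report None instead of dividing by zero.
        "tool_success_pct": 100.0 * sum(c.ok for c in calls) / len(calls) if calls else None,
        "avg_steps": mean(r.steps for r in runs),
        "avg_tool_calls": len(calls) / len(runs),
        "avg_latency_s": mean(r.latency_s for r in runs),
        "avg_token_estimate": mean(r.token_estimate for r in runs),
    }


# Two made-up runs, for illustration only.
example_runs = [
    WorkflowRun("missed_dose", True, 4, 2.0, 1000,
                [ToolCall("rag_retrieve", True), ToolCall("model_call", True)]),
    WorkflowRun("bp_trend", False, 6, 4.0, 3000,
                [ToolCall("d1_query", False), ToolCall("model_call", True)]),
]
summary = summarize(example_runs)
```

Keeping tool calls nested under each run lets the same records feed both the per-workflow averages and the tool-level success rate.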
Reliability notes
To test consistency, run the same workflow multiple times with different seeds or inputs, then compare the success rate and the distribution of failure_mode values across runs.
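A repeated-run consistency check might be sketched as below. It assumes a workflow can be invoked as a function that takes a seed and returns a `(success, failure_mode)` pair. That interface, the `consistency_report` helper, and the `stub_workflow` stand-in with its failure mode names are hypothetical; only the `failure_mode` field name comes from the note above.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical interface: a run returns (success, failure_mode);
# failure_mode is None when the run succeeded.
RunResult = Tuple[bool, Optional[str]]


def consistency_report(run_workflow: Callable[[int], RunResult],
                       seeds: Iterable[int]) -> dict:
    """Run the same workflow once per seed and summarize the outcomes."""
    outcomes = [run_workflow(seed) for seed in seeds]
    n = len(outcomes)
    successes = sum(1 for ok, _ in outcomes if ok)
    # Count how often each failure mode occurred among the failed runs.
    failure_modes = Counter(mode for ok, mode in outcomes if not ok)
    return {
        "runs": n,
        "success_rate": successes / n if n else None,
        "failure_modes": dict(failure_modes),
    }


def stub_workflow(seed: int) -> RunResult:
    """Deterministic stand-in for a real agent run (illustration only)."""
    if seed % 5 == 0:
        return (False, "tool_timeout")
    if seed % 5 == 1:
        return (False, "wrong_recommendation")
    return (True, None)


report = consistency_report(stub_workflow, range(10))
```

Comparing `failure_modes` between two batches (for example, before and after a prompt change) shows whether failures shifted from one mode to another even when the overall success rate looks unchanged.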