Evaluation Metrics

Success, reliability, and efficiency of agent workflows


End-to-end task success

—

% workflows completed correctly (missed dose / BP trend / glucose spike / follow-up)

Tool execution success

—

% tool calls that succeeded (D1 ops, RAG retrieve, model call, etc.)

Workflows run

—

Tool calls

—

Avg steps

—

Avg tool calls

—

Avg latency

—

Wall-clock time to complete a workflow

Avg token estimate

—

Rough proxy for cost-to-success
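The metrics above could be computed from per-run records roughly as follows. This is a minimal sketch, not the dashboard's actual implementation: the `WorkflowRun` and `ToolCall` record shapes, their field names, and the `summarize` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class ToolCall:
    # Hypothetical record of one tool invocation (e.g. a D1 op or RAG retrieve).
    name: str
    ok: bool


@dataclass
class WorkflowRun:
    # Hypothetical record of one end-to-end workflow run.
    workflow: str          # e.g. "missed_dose", "bp_trend"
    success: bool          # did the workflow complete correctly?
    steps: int             # number of agent steps taken
    latency_s: float       # wall-clock time to complete, in seconds
    token_estimate: int    # rough token count for the whole run
    tool_calls: list[ToolCall] = field(default_factory=list)


def summarize(runs: list[WorkflowRun]) -> dict:
    """Aggregate per-run records into the dashboard-style metrics."""
    if not runs:
        return {}
    calls = [c for r in runs for c in r.tool_calls]
    return {
        "workflows_run": len(runs),
        "tool_calls": len(calls),
        "task_success_pct": 100 * sum(r.success for r in runs) / len(runs),
        # Undefined when no tool calls were made, so report None instead of 0.
        "tool_success_pct": (
            100 * sum(c.ok for c in calls) / len(calls) if calls else None
        ),
        "avg_steps": mean(r.steps for r in runs),
        "avg_tool_calls": len(calls) / len(runs),
        "avg_latency_s": mean(r.latency_s for r in runs),
        "avg_token_estimate": mean(r.token_estimate for r in runs),
    }
```

Reporting `None` for tool execution success when there are no tool calls keeps an empty sample from being read as a 0% success rate.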

Reliability notes

To test consistency, run the same workflow multiple times with different seeds or inputs, then compare the success rate and the failure_mode distribution across runs.
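One way to summarize repeated runs is sketched below. It assumes each run yields a `(success, failure_mode)` pair; the `consistency_report` helper and the failure-mode labels in the usage example are hypothetical, not names taken from the system itself.

```python
from collections import Counter
from typing import Optional


def consistency_report(results: list[tuple[bool, Optional[str]]]) -> dict:
    """Summarize repeated runs of one workflow.

    Each item is (success, failure_mode); failure_mode is ignored for
    successful runs and defaults to "unknown" for unlabeled failures.
    """
    if not results:
        raise ValueError("need at least one run")
    successes = sum(1 for ok, _ in results if ok)
    modes = Counter(mode or "unknown" for ok, mode in results if not ok)
    return {
        "runs": len(results),
        "success_rate": successes / len(results),
        "failure_modes": dict(modes),
    }
```

For example, five runs of the same workflow might be summarized like this:

```python
report = consistency_report([
    (True, None),
    (False, "tool_timeout"),
    (True, None),
    (False, "tool_timeout"),
    (False, "wrong_alert"),
])
# report["failure_modes"] counts each label among the failed runs only
```

A workflow whose failures concentrate in a single failure_mode is usually easier to fix than one whose failures are spread across many modes, so the distribution is worth comparing alongside the raw success rate.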
