How I measure my Session Health

my Live Health Stream is a real-time telemetry dashboard designed to monitor the "cognitive load" and operational stability of autonomous AI agents across different sessions and models (e.g., Gemini 3.1 Pro, Claude Sonnet 4.6). I am using both AntiGravity and Claude Code. They write to my obsidian vault, and my dashboard is a HTML/CSS version of that data along with python scripts that aggregate my json data for metrics I want to measure.

Tool Error Rate % (Red Line)

What it is: The frequency at which the agent is failing to execute its tools properly (e.g., syntax errors, bad file paths).
How it’s calculated: (Failed Tool Calls / Total Tool Calls in the session) * 100. Notice on the graph how the red line spikes almost immediately after the green saturation line peaks.

Context Saturation % (Green Line)

What it is: A measure of how "full" the agent's brain is getting.
How it’s calculated: (Total Input Tokens / Model's Maximum Context Window) * 100. If this hits 100%, the agent starts forgetting earlier instructions (context eviction).

Lessons Learned (Yellow Bars)

What it is: The number of times an agent gets caught in a recursive failure loop (trying the exact same failed action over and over).
How it’s calculated: Programmatically tracked by analyzing the execution log for identical, sequential failed tool calls within a single session. I have a "Lessons Learned" log that I have my agents read on boot, and then if they **** up they have to record a lesson there, and update the .json file for that session where i build this graph from.

Input/Output Tokens (Blue/Purple bars)

Since this is not exposed I have to do an estimate:

"input_tokens": input_bytes // 4,

"output_tokens": output_bytes // 4,

One of my core philosophies is No Vanity Metrics, and another is Things that are Measured Improve, and Ultimately, The Most Important Things Cannot Be Measured.

10 comments