
Python: Aggregate token usage across tool-call loop iterations in invoke_agent span#4739

Open
eavanvalkenburg wants to merge 6 commits into microsoft:main from eavanvalkenburg:agent/fix-4062-1

Conversation

@eavanvalkenburg
Member

Motivation and Context

When an agent performs multiple LLM round-trips during function invocation (e.g., tool calls followed by a final response), the invoke_agent telemetry span only reported token usage from the last chat completion instead of the cumulative total across all iterations. This caused misleading observability data for cost tracking and debugging.

Fixes #4062

Description

The root cause was in FunctionInvocationLayer._get_response() in _tools.py, which ran multiple super_get_response calls in a loop but returned only the last ChatResponse without accumulating usage_details. The fix introduces an aggregated_usage variable that uses the existing add_usage_details() helper to sum token counts after each chat completion, then assigns the aggregate back to response.usage_details before returning. Two regression tests verify correct aggregation for multi-call tool invocation and single-call scenarios.
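
The described fix can be sketched as follows. This is a hypothetical, simplified version of the pattern, not the actual FunctionInvocationLayer code: merge_usage, call_model, and run_tool_loop are placeholder names standing in for the framework's add_usage_details() helper and tool-call loop.

```python
# Hypothetical sketch of the aggregation fix described above; names are
# placeholders, not the agent_framework API.
from typing import Callable, Optional


def merge_usage(total: Optional[dict], new: Optional[dict]) -> Optional[dict]:
    """Sum integer token counts, tolerating None on either side."""
    if new is None:
        return total
    if total is None:
        return dict(new)  # copy, so the aggregate never aliases a per-call dict
    merged = dict(total)
    for key, value in new.items():
        merged[key] = merged.get(key, 0) + value
    return merged


def run_tool_loop(call_model: Callable[[], dict], max_iterations: int = 3) -> dict:
    """Each iteration may be a tool-call round-trip; only the final response
    is returned, so its usage must carry the aggregate, not the last call's."""
    aggregated_usage = None
    response: dict = {}
    for _ in range(max_iterations):
        response = call_model()
        aggregated_usage = merge_usage(aggregated_usage, response.get("usage"))
        if not response.get("tool_calls"):
            break
    response["usage"] = aggregated_usage  # aggregate replaces last-call usage
    return response
```

Before the fix, the returned response would carry only the final iteration's usage; with the accumulator, a two-round-trip run reports the sum of both calls.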

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

Note: PR autogenerated by eavanvalkenburg's agent

Copilot and others added 3 commits March 17, 2026 11:55
…osoft#4062)

The FunctionInvocationLayer._get_response() loop was overwriting the
response on each iteration, so usage_details only reflected the last
chat completion call. Now tracks aggregated_usage across all iterations
using add_usage_details() and sets it on the returned response.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eavanvalkenburg eavanvalkenburg self-assigned this Mar 17, 2026
Copilot AI review requested due to automatic review settings March 17, 2026 12:00
Contributor

Copilot AI left a comment


Pull request overview

Fixes token usage reporting in the invoke_agent telemetry span by aggregating usage across all LLM round-trips in the tool-call loop, instead of only reporting the last call's usage.

Changes:

  • Accumulate token usage via add_usage_details() after each chat completion in FunctionInvocationLayer._get_response() and assign the aggregate back before returning
  • Add two regression tests verifying correct aggregation for multi-call and single-call scenarios

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Files changed:

  • python/packages/core/agent_framework/_tools.py: Introduces the aggregated_usage accumulator across loop iterations and assigns it to response.usage_details before return
  • python/packages/core/tests/core/test_observability.py: Adds two tests for multi-call aggregation and single-call passthrough


@markwallace-microsoft
Member

markwallace-microsoft commented Mar 17, 2026

Python Test Coverage

Python Test Coverage Report

File                                       Stmts   Miss  Cover
packages/core/agent_framework/_tools.py      934     83    91%
TOTAL                                      24011   2638    89%

Missing in _tools.py: 186–187, 362, 364, 377, 406–408, 416, 434, 448, 455, 462, 479, 481, 488, 496, 531, 580, 584, 627–629, 631, 637, 682–684, 686, 709, 735, 777–779, 783, 805, 917–923, 959, 971, 973, 975, 978–981, 1002, 1006, 1010, 1024–1026, 1367, 1389, 1476–1482, 1611, 1615, 1661, 1780, 1810, 1830, 1832, 1888, 1951, 2181–2182, 2237, 2310–2311, 2373, 2378, 2385

Python Unit Test Overview

Tests: 5239 | Skipped: 20 | Failures: 0 | Errors: 0 | Time: 1m 23s

Contributor

@moonbox3 moonbox3 left a comment


Automated Code Review

Reviewers: 4 | Confidence: 89%

✓ Correctness

The diff correctly adds usage aggregation across multiple LLM round-trips in FunctionInvocationLayer's non-streaming path. The add_usage_details helper is correctly used: it handles None inputs safely, creates new dicts when summing, and the aggregated usage is properly set on the final ChatResponse before returning. All exit paths in the loop ("return" action, loop exhaustion + final call) correctly aggregate and assign usage. The ChatTelemetryLayer captures individual per-call usage in its own spans before FunctionInvocationLayer mutates the response, so individual chat spans retain their original values. Tests are well-structured and validate both multi-call and single-call scenarios correctly.

✓ Security Reliability

The diff adds token usage aggregation across multiple LLM round-trips in the function invocation loop. The add_usage_details helper correctly handles None inputs for both arguments, and the aggregated usage is properly assigned back to the response before returning. The UsageDetails type is a TypedDict with integer fields, so there are no injection or deserialization risks. The test coverage exercises both the multi-call and single-call paths. No security or reliability concerns found.

✗ Test Coverage

The diff adds usage aggregation across multiple LLM calls in the non-streaming function invocation loop and includes two well-structured tests: one for multi-round-trip tool calls and one for a single call. Both tests make meaningful assertions on both individual chat spans and the aggregate invoke_agent span. However, test coverage has gaps: there is no test exercising the max-iterations/loop-exhaustion code path (diff lines 2268-2269) where aggregation also occurs, no test for the case where one or more ChatResponse objects return usage_details=None, and no test covering the streaming path (which notably does NOT receive the aggregation logic — this may be intentional but is worth a test to document expected behavior).

✓ Design Approach

The fix correctly addresses a real bug where only the last LLM call's usage_details was surfaced on the returned ChatResponse when the function-invocation loop made multiple round-trips. The accumulation logic is sound: usage is added to aggregated_usage immediately after each super_get_response call, and is stamped onto the response before every exit point (early return inside the loop, and the post-loop fallback call). The streaming path does not need the same fix because it yields UsageContent items from every inner stream, which ChatResponse.from_updates_process_update already sums correctly. One minor structural issue exists: the from ._types import add_usage_details import is placed inside the nested _get_response() closure rather than alongside the existing deferred from ._types import (...) block at the top of the outer get_response() method, where it would be more consistent with the surrounding code style.
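
The streaming argument above can be illustrated with a minimal sketch, assuming each streamed update carries its own usage item that is summed at collection time. The dict-based updates and collect_stream function here are illustrative placeholders, not the framework's actual update types or ChatResponse.from_updates logic.

```python
# Minimal sketch of why per-update usage summing already yields the
# aggregate on the streaming path; types are placeholders, not the
# framework's ChatResponseUpdate/UsageContent classes.
def collect_stream(updates: list[dict]) -> dict:
    """Fold every update's usage counts into one total."""
    total: dict = {}
    for update in updates:
        for key, value in update.get("usage", {}).items():
            total[key] = total.get(key, 0) + value
    return total


updates = [
    {"text": "calling tool", "usage": {"output_tokens": 4}},   # round-trip 1
    {"text": "final answer", "usage": {"output_tokens": 9}},   # round-trip 2
]
```

Because every inner stream contributes its usage items, the collector sees the full total without any extra aggregation step.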

Flagged Issues

  • No test covers the loop-exhaustion / max_iterations code path (diff lines ~2268-2269) where usage is also aggregated. This is a distinct branch in _get_response that exits after the for-loop with tool_choice='none', and the aggregation there is completely untested.

Suggestions

  • Add a test where one or more ChatResponse objects have usage_details=None to verify graceful handling through the aggregation path at the integration level.
  • Consider adding a test confirming that streaming responses do not aggregate usage separately (since the streaming path already sums via UsageContent items), to document the expected asymmetry between streaming and non-streaming paths.
  • When add_usage_details(None, usage2) is called it returns the same usage2 object rather than a copy. This is harmless today because aggregated_usage is never mutated in-place, but a defensive copy (e.g., UsageDetails(**usage2)) would prevent subtle aliasing bugs if future code changes that assumption.
  • Move from ._types import add_usage_details out of the nested _get_response() closure and into the existing from ._types import (ChatResponse, ChatResponseUpdate, ResponseStream) block at the top of get_response(), for consistency with the surrounding deferred-import pattern.
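
The aliasing concern in the third suggestion can be demonstrated with a stand-in helper that mirrors the described None-handling behavior; this is not the real add_usage_details implementation, only a sketch of the hazard and the defensive-copy remedy.

```python
# Stand-in for the described behavior of add_usage_details: when one side
# is None it returns the other object itself, not a copy.
from typing import Optional


def add_usage(first: Optional[dict], second: Optional[dict]) -> Optional[dict]:
    if first is None:
        return second
    if second is None:
        return first
    merged = dict(first)
    for key, value in second.items():
        merged[key] = merged.get(key, 0) + value
    return merged


per_call = {"input_tokens": 10}
aggregate = add_usage(None, per_call)
# Same object: any in-place mutation of the aggregate would leak back
# into the per-call usage dict.
assert aggregate is per_call

# Defensive copy breaks the aliasing at the cost of one dict allocation.
safe = dict(per_call)
assert safe == per_call and safe is not per_call
```

This is harmless as long as the aggregate is never mutated in place, which is exactly the assumption the suggestion recommends making explicit.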

Automated review by moonbox3's agents

@eavanvalkenburg eavanvalkenburg added this pull request to the merge queue Mar 18, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 18, 2026
@eavanvalkenburg eavanvalkenburg added this pull request to the merge queue Mar 18, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 18, 2026

Development

Successfully merging this pull request may close these issues.

Python: [Bug]: invoke_agent span reports last LLM call's token usage instead of aggregate total
