
Python: Aggregate token usage across tool-call loop iterations in invoke_agent span#4739

Open
eavanvalkenburg wants to merge 6 commits into microsoft:main from eavanvalkenburg:agent/fix-4062-1

Conversation

@eavanvalkenburg
Member

Motivation and Context

When an agent performs multiple LLM round-trips during function invocation (e.g., tool calls followed by a final response), the invoke_agent telemetry span only reported token usage from the last chat completion instead of the cumulative total across all iterations. This caused misleading observability data for cost tracking and debugging.

Fixes #4062

Description

The root cause was in FunctionInvocationLayer._get_response() in _tools.py, which ran multiple super_get_response calls in a loop but returned only the last ChatResponse without accumulating usage_details. The fix introduces an aggregated_usage variable that uses the existing add_usage_details() helper to sum token counts after each chat completion, then assigns the aggregate back to response.usage_details before returning. Two regression tests verify correct aggregation for multi-call tool invocation and single-call scenarios.
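
The described fix can be sketched as follows. This is a hypothetical, simplified version of the pattern, not the actual FunctionInvocationLayer code: merge_usage, call_model, and run_tool_loop are placeholder names standing in for the framework's add_usage_details() helper and tool-call loop.

```python
# Hypothetical sketch of the aggregation fix described above; names are
# placeholders, not the agent_framework API.
from typing import Callable, Optional


def merge_usage(total: Optional[dict], new: Optional[dict]) -> Optional[dict]:
    """Sum integer token counts, tolerating None on either side."""
    if new is None:
        return total
    if total is None:
        return dict(new)  # copy, so the aggregate never aliases a per-call dict
    merged = dict(total)
    for key, value in new.items():
        merged[key] = merged.get(key, 0) + value
    return merged


def run_tool_loop(call_model: Callable[[], dict], max_iterations: int = 3) -> dict:
    """Each iteration may be a tool-call round-trip; only the final response
    is returned, so its usage must carry the aggregate, not the last call's."""
    aggregated_usage = None
    response: dict = {}
    for _ in range(max_iterations):
        response = call_model()
        aggregated_usage = merge_usage(aggregated_usage, response.get("usage"))
        if not response.get("tool_calls"):
            break
    response["usage"] = aggregated_usage  # aggregate replaces last-call usage
    return response
```

Before the fix, the returned response would carry only the final iteration's usage; with the accumulator, a two-round-trip run reports the sum of both calls.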

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

Note: PR autogenerated by eavanvalkenburg's agent

Copilot and others added 3 commits March 17, 2026 11:55
…osoft#4062)

The FunctionInvocationLayer._get_response() loop was overwriting the
response on each iteration, so usage_details only reflected the last
chat completion call. Now tracks aggregated_usage across all iterations
using add_usage_details() and sets it on the returned response.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eavanvalkenburg eavanvalkenburg self-assigned this Mar 17, 2026
Copilot AI review requested due to automatic review settings March 17, 2026 12:00
Contributor

Copilot AI left a comment


Pull request overview

Fixes token usage reporting in the invoke_agent telemetry span by aggregating usage across all LLM round-trips in the tool-call loop, instead of only reporting the last call's usage.

Changes:

  • Accumulate token usage via add_usage_details() after each chat completion in FunctionInvocationLayer._get_response() and assign the aggregate back before returning
  • Add two regression tests verifying correct aggregation for multi-call and single-call scenarios

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Files changed:

  • python/packages/core/agent_framework/_tools.py: Introduces the aggregated_usage accumulator across loop iterations and assigns it to response.usage_details before return
  • python/packages/core/tests/core/test_observability.py: Adds two tests for multi-call aggregation and single-call passthrough


@markwallace-microsoft
Member

markwallace-microsoft commented Mar 17, 2026

Python Test Coverage

Python Test Coverage Report

File                                       Stmts   Miss  Cover
packages/core/agent_framework/_tools.py      934     83    91%
TOTAL                                      24011   2638    89%

Missing in _tools.py: 186–187, 362, 364, 377, 406–408, 416, 434, 448, 455, 462, 479, 481, 488, 496, 531, 580, 584, 627–629, 631, 637, 682–684, 686, 709, 735, 777–779, 783, 805, 917–923, 959, 971, 973, 975, 978–981, 1002, 1006, 1010, 1024–1026, 1367, 1389, 1476–1482, 1611, 1615, 1661, 1780, 1810, 1830, 1832, 1888, 1951, 2181–2182, 2237, 2310–2311, 2373, 2378, 2385

Python Unit Test Overview

Tests: 5239 | Skipped: 20 | Failures: 0 | Errors: 0 | Time: 1m 23s

Contributor

@moonbox3 moonbox3 left a comment


Automated Code Review

Reviewers: 4 | Confidence: 89%

✓ Correctness

The diff correctly adds usage aggregation across multiple LLM round-trips in FunctionInvocationLayer's non-streaming path. The add_usage_details helper is correctly used: it handles None inputs safely, creates new dicts when summing, and the aggregated usage is properly set on the final ChatResponse before returning. All exit paths in the loop ("return" action, loop exhaustion + final call) correctly aggregate and assign usage. The ChatTelemetryLayer captures individual per-call usage in its own spans before FunctionInvocationLayer mutates the response, so individual chat spans retain their original values. Tests are well-structured and validate both multi-call and single-call scenarios correctly.

✓ Security Reliability

The diff adds token usage aggregation across multiple LLM round-trips in the function invocation loop. The add_usage_details helper correctly handles None inputs for both arguments, and the aggregated usage is properly assigned back to the response before returning. The UsageDetails type is a TypedDict with integer fields, so there are no injection or deserialization risks. The test coverage exercises both the multi-call and single-call paths. No security or reliability concerns found.

✗ Test Coverage

The diff adds usage aggregation across multiple LLM calls in the non-streaming function invocation loop and includes two well-structured tests: one for multi-round-trip tool calls and one for a single call. Both tests make meaningful assertions on both individual chat spans and the aggregate invoke_agent span. However, test coverage has gaps: there is no test exercising the max-iterations/loop-exhaustion code path (diff lines 2268-2269) where aggregation also occurs, no test for the case where one or more ChatResponse objects return usage_details=None, and no test covering the streaming path (which notably does NOT receive the aggregation logic — this may be intentional but is worth a test to document expected behavior).

✓ Design Approach

The fix correctly addresses a real bug where only the last LLM call's usage_details was surfaced on the returned ChatResponse when the function-invocation loop made multiple round-trips. The accumulation logic is sound: usage is added to aggregated_usage immediately after each super_get_response call, and is stamped onto the response before every exit point (early return inside the loop, and the post-loop fallback call). The streaming path does not need the same fix because it yields UsageContent items from every inner stream, which ChatResponse.from_updates_process_update already sums correctly. One minor structural issue exists: the from ._types import add_usage_details import is placed inside the nested _get_response() closure rather than alongside the existing deferred from ._types import (...) block at the top of the outer get_response() method, where it would be more consistent with the surrounding code style.
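
The streaming argument above can be illustrated with a minimal sketch, assuming each streamed update carries its own usage item that is summed at collection time. The dict-based updates and collect_stream function here are illustrative placeholders, not the framework's actual update types or ChatResponse.from_updates logic.

```python
# Minimal sketch of why per-update usage summing already yields the
# aggregate on the streaming path; types are placeholders, not the
# framework's ChatResponseUpdate/UsageContent classes.
def collect_stream(updates: list[dict]) -> dict:
    """Fold every update's usage counts into one total."""
    total: dict = {}
    for update in updates:
        for key, value in update.get("usage", {}).items():
            total[key] = total.get(key, 0) + value
    return total


updates = [
    {"text": "calling tool", "usage": {"output_tokens": 4}},   # round-trip 1
    {"text": "final answer", "usage": {"output_tokens": 9}},   # round-trip 2
]
```

Because every inner stream contributes its usage items, the collector sees the full total without any extra aggregation step.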

Flagged Issues

  • No test covers the loop-exhaustion / max_iterations code path (diff lines ~2268-2269) where usage is also aggregated. This is a distinct branch in _get_response that exits after the for-loop with tool_choice='none', and the aggregation there is completely untested.

Suggestions

  • Add a test where one or more ChatResponse objects have usage_details=None to verify graceful handling through the aggregation path at the integration level.
  • Consider adding a test confirming that streaming responses do not aggregate usage separately (since the streaming path already sums via UsageContent items), to document the expected asymmetry between streaming and non-streaming paths.
  • When add_usage_details(None, usage2) is called it returns the same usage2 object rather than a copy. This is harmless today because aggregated_usage is never mutated in-place, but a defensive copy (e.g., UsageDetails(**usage2)) would prevent subtle aliasing bugs if future code changes that assumption.
  • Move from ._types import add_usage_details out of the nested _get_response() closure and into the existing from ._types import (ChatResponse, ChatResponseUpdate, ResponseStream) block at the top of get_response(), for consistency with the surrounding deferred-import pattern.
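
The aliasing concern in the third suggestion can be demonstrated with a stand-in helper that mirrors the described None-handling behavior; this is not the real add_usage_details implementation, only a sketch of the hazard and the defensive-copy remedy.

```python
# Stand-in for the described behavior of add_usage_details: when one side
# is None it returns the other object itself, not a copy.
from typing import Optional


def add_usage(first: Optional[dict], second: Optional[dict]) -> Optional[dict]:
    if first is None:
        return second
    if second is None:
        return first
    merged = dict(first)
    for key, value in second.items():
        merged[key] = merged.get(key, 0) + value
    return merged


per_call = {"input_tokens": 10}
aggregate = add_usage(None, per_call)
# Same object: any in-place mutation of the aggregate would leak back
# into the per-call usage dict.
assert aggregate is per_call

# Defensive copy breaks the aliasing at the cost of one dict allocation.
safe = dict(per_call)
assert safe == per_call and safe is not per_call
```

This is harmless as long as the aggregate is never mutated in place, which is exactly the assumption the suggestion recommends making explicit.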

Automated review by moonbox3's agents

@eavanvalkenburg eavanvalkenburg added this pull request to the merge queue Mar 18, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 18, 2026
@eavanvalkenburg eavanvalkenburg added this pull request to the merge queue Mar 18, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 18, 2026

Development

Successfully merging this pull request may close these issues.

Python: [Bug]: invoke_agent span reports last LLM call's token usage instead of aggregate total
