Python: Aggregate token usage across tool-call loop iterations in invoke_agent span#4739
eavanvalkenburg wants to merge 6 commits into microsoft:main from
Conversation
…osoft#4062) The `FunctionInvocationLayer._get_response()` loop was overwriting the response on each iteration, so `usage_details` only reflected the last chat completion call. Now tracks `aggregated_usage` across all iterations using `add_usage_details()` and sets it on the returned response.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Fixes token usage reporting in the invoke_agent telemetry span by aggregating usage across all LLM round-trips in the tool-call loop, instead of only reporting the last call's usage.
Changes:
- Accumulate token usage via `add_usage_details()` after each chat completion in `FunctionInvocationLayer._get_response()` and assign the aggregate back before returning
- Add two regression tests verifying correct aggregation for multi-call and single-call scenarios
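The accumulation pattern described above can be sketched as follows. This is a simplified stand-in, not the framework's actual code: `ChatResponse`, `add_usage_details`, and the loop shape are minimal assumptions modeled on the behavior this PR describes, and the usage-dict field name is illustrative.

```python
# Minimal sketch of the fix: aggregate usage_details across loop
# iterations instead of keeping only the last response's value.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatResponse:
    usage_details: Optional[dict] = None


def add_usage_details(a: Optional[dict], b: Optional[dict]) -> Optional[dict]:
    """None-safe summing of two usage dicts, mirroring the described helper."""
    if a is None:
        return b
    if b is None:
        return a
    return {k: a.get(k, 0) + b.get(k, 0) for k in {*a, *b}}


def get_response(responses: list[ChatResponse]) -> ChatResponse:
    """Drain simulated LLM round-trips, stamping aggregated usage on the result."""
    aggregated_usage: Optional[dict] = None
    response = responses[0]
    for response in responses:  # each iteration models one LLM round-trip
        aggregated_usage = add_usage_details(aggregated_usage, response.usage_details)
    # Before the fix, response.usage_details held only the last call's usage.
    response.usage_details = aggregated_usage
    return response
```

A multi-call run then reports the cumulative total, while a single call passes its usage through unchanged.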
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `python/packages/core/agent_framework/_tools.py` | Introduces an `aggregated_usage` accumulator across loop iterations and assigns it to `response.usage_details` before return |
| `python/packages/core/tests/core/test_observability.py` | Adds two tests for multi-call aggregation and single-call passthrough |
moonbox3 left a comment
Automated Code Review
Reviewers: 4 | Confidence: 89%
✓ Correctness
The diff correctly adds usage aggregation across multiple LLM round-trips in FunctionInvocationLayer's non-streaming path. The add_usage_details helper is correctly used: it handles None inputs safely, creates new dicts when summing, and the aggregated usage is properly set on the final ChatResponse before returning. All exit paths in the loop ("return" action, loop exhaustion + final call) correctly aggregate and assign usage. The ChatTelemetryLayer captures individual per-call usage in its own spans before FunctionInvocationLayer mutates the response, so individual chat spans retain their original values. Tests are well-structured and validate both multi-call and single-call scenarios correctly.
✓ Security & Reliability
The diff adds token usage aggregation across multiple LLM round-trips in the function invocation loop. The `add_usage_details` helper correctly handles `None` inputs for both arguments, and the aggregated usage is properly assigned back to the response before returning. The `UsageDetails` type is a TypedDict with integer fields, so there are no injection or deserialization risks. The test coverage exercises both the multi-call and single-call paths. No security or reliability concerns found.
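The None-safe behavior the reviewers describe can be illustrated with a hypothetical helper of the same shape. The `UsageDetails` field names below are assumptions for the sketch, not the framework's actual TypedDict definition.

```python
# Hypothetical sketch of a None-safe usage accumulator, modeled on the
# add_usage_details behavior described in this review. Field names are
# illustrative, not the real UsageDetails TypedDict.
from typing import Optional, TypedDict


class UsageDetails(TypedDict, total=False):
    input_token_count: int
    output_token_count: int
    total_token_count: int


def add_usage_details(
    a: Optional[UsageDetails], b: Optional[UsageDetails]
) -> Optional[UsageDetails]:
    """Sum two optional usage dicts; safe when either (or both) is None."""
    if a is None:
        return b
    if b is None:
        return a
    merged: UsageDetails = dict(a)  # new dict, so neither input is mutated
    for key, value in b.items():
        merged[key] = merged.get(key, 0) + value
    return merged
```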
✗ Test Coverage
The diff adds usage aggregation across multiple LLM calls in the non-streaming function invocation loop and includes two well-structured tests: one for multi-round-trip tool calls and one for a single call. Both tests make meaningful assertions on both individual chat spans and the aggregate invoke_agent span. However, test coverage has gaps: there is no test exercising the max-iterations/loop-exhaustion code path (diff lines 2268-2269) where aggregation also occurs, no test for the case where one or more ChatResponse objects return usage_details=None, and no test covering the streaming path (which notably does NOT receive the aggregation logic — this may be intentional but is worth a test to document expected behavior).
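A regression test for the `usage_details=None` gap noted above might look like the following sketch. It uses a simplified inline accumulator and an invented test name rather than the framework's real fixtures, which are not shown in this review.

```python
# Sketch of a test covering the case where some round-trips report no usage.
# The accumulation logic here is a stand-in for the real aggregation path.
def test_aggregation_handles_none_usage():
    per_call_usage = [
        {"total_token_count": 12},
        None,  # a round-trip that reported no usage
        {"total_token_count": 5},
    ]
    aggregated = None
    for usage in per_call_usage:
        if usage is None:
            continue  # None entries must not break or reset the total
        if aggregated is None:
            aggregated = dict(usage)
        else:
            for key, value in usage.items():
                aggregated[key] = aggregated.get(key, 0) + value
    assert aggregated == {"total_token_count": 17}
```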
✓ Design Approach
The fix correctly addresses a real bug where only the last LLM call's `usage_details` was surfaced on the returned `ChatResponse` when the function-invocation loop made multiple round-trips. The accumulation logic is sound: usage is added to `aggregated_usage` immediately after each `super_get_response` call, and is stamped onto the response before every exit point (early return inside the loop, and the post-loop fallback call). The streaming path does not need the same fix because it yields `UsageContent` items from every inner stream, which `ChatResponse.from_updates` → `_process_update` already sums correctly. One minor structural issue exists: the `from ._types import add_usage_details` import is placed inside the nested `_get_response()` closure rather than alongside the existing deferred `from ._types import (...)` block at the top of the outer `get_response()` method, where it would be more consistent with the surrounding code style.
Flagged Issues
- No test covers the loop-exhaustion / max_iterations code path (diff lines ~2268-2269) where usage is also aggregated. This is a distinct branch in _get_response that exits after the for-loop with tool_choice='none', and the aggregation there is completely untested.
Suggestions
- Add a test where one or more ChatResponse objects have usage_details=None to verify graceful handling through the aggregation path at the integration level.
- Consider adding a test confirming that streaming responses do not aggregate usage separately (since the streaming path already sums via UsageContent items), to document the expected asymmetry between streaming and non-streaming paths.
- When add_usage_details(None, usage2) is called it returns the same usage2 object rather than a copy. This is harmless today because aggregated_usage is never mutated in-place, but a defensive copy (e.g., UsageDetails(**usage2)) would prevent subtle aliasing bugs if future code changes that assumption.
- Move `from ._types import add_usage_details` out of the nested `_get_response()` closure and into the existing `from ._types import (ChatResponse, ChatResponseUpdate, ResponseStream)` block at the top of `get_response()`, for consistency with the surrounding deferred-import pattern.
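The aliasing concern from the third suggestion can be demonstrated concretely. The helper below mirrors the described `add_usage_details` behavior but is an illustrative stand-in, not the framework's actual implementation.

```python
# When the accumulator starts as None, a helper that returns its second
# argument directly shares that object with the response; in-place mutation
# of the result would then leak back. A defensive copy breaks the aliasing.
def add_usage_details(a, b):
    if a is None:
        return b  # returns b itself, not a copy -- the aliasing in question
    if b is None:
        return a
    return {k: a.get(k, 0) + b.get(k, 0) for k in {*a, *b}}


response_usage = {"total_token_count": 5}
aggregated = add_usage_details(None, response_usage)
assert aggregated is response_usage  # same object, not a copy

# Defensive variant from the suggestion: copy instead of aliasing.
defensive = dict(response_usage)
assert defensive is not response_usage
assert defensive == response_usage
```

This is harmless today because, as the review notes, `aggregated_usage` is never mutated in place; the copy only guards future changes to that assumption.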
Automated review by moonbox3's agents
Motivation and Context
When an agent performs multiple LLM round-trips during function invocation (e.g., tool calls followed by a final response), the `invoke_agent` telemetry span only reported token usage from the last chat completion instead of the cumulative total across all iterations. This caused misleading observability data for cost tracking and debugging.

Fixes #4062
Description
The root cause was in
FunctionInvocationLayer._get_response()in_tools.py, which ran multiplesuper_get_responsecalls in a loop but returned only the lastChatResponsewithout accumulatingusage_details. The fix introduces anaggregated_usagevariable that uses the existingadd_usage_details()helper to sum token counts after each chat completion, then assigns the aggregate back toresponse.usage_detailsbefore returning. Two regression tests verify correct aggregation for multi-call tool invocation and single-call scenarios.Contribution Checklist