Feat/resilience plugin #4537 (Closed, +771 −0, 15 commits)
Commits:
- f53a5d8 feat(plugins): add LlmResiliencePlugin with retry/backoff and model f…
- 77c3aa8 fix(plugins): use CallbackContext directly (no private attrs) in LlmR…
- 7f1cab2 test(plugins): stabilize LlmResiliencePlugin tests; support Invocatio…
- 0344186 docs(samples): add resilient_agent.py sample demonstrating LlmResilie…
- 2364f4c chore(plugins): export LlmResiliencePlugin in plugins package __init__
- 971984b fix(plugins): remove InvocationContext import to avoid circular impor…
- 679f7ba docs: add PR_BODY.md describing LlmResiliencePlugin motivation, desig…
- c9875a0 fix(plugins): duck-typed InvocationContext resolution to avoid NameEr…
- a216a41 fix(samples): use valid agent name (underscores) in resilient_agent.py
- 2179788 feat(plugins): add LlmResiliencePlugin with retries and fallbacks (chillum-codeX)
- b3a6bac Merge branch 'main' into feat/resilience-plugin (chillum-codeX)
- 20a7e71 fix: address PR review comments (chillum-codeX)
- 8261659 fix(sample): use ClassVar for shared state across model instances (chillum-codeX)
- 39a2277 fix: narrow exception handlers to (ImportError, AttributeError) (chillum-codeX)
- f962117 Merge branch 'main' into feat/resilience-plugin (chillum-codeX)
File: LlmResiliencePlugin contribution note (new file, 92 lines)

LlmResiliencePlugin Contribution Note
=====================================

What we implemented
-------------------
1) New plugin:
   - Added src/google/adk/plugins/llm_resilience_plugin.py
   - Provides retry + backoff + jitter + optional model fallbacks for LLM errors.

2) Plugin export:
   - Updated src/google/adk/plugins/__init__.py
   - Exported LlmResiliencePlugin in __all__ for discoverability.

3) Unit tests:
   - Added tests/unittests/plugins/test_llm_resilience_plugin.py
   - Covered:
     - retry success on same model
     - fallback model after retries
     - non-transient errors bubbling correctly

4) Usage sample:
   - Added samples/resilient_agent.py
   - Demonstrates plugin setup and recovery behavior.

5) PR narrative and testing evidence:
   - Updated PR_BODY.md to match the repository PR template:
     - issue/description
     - testing plan
     - manual E2E output
     - checklist

Why this contribution is meaningful
-----------------------------------
1) Solves a real reliability gap:
   Production agents frequently face transient failures (timeouts, HTTP 429, 5xx).
   This change centralizes resilience behavior and removes repeated ad-hoc retry code.

2) Low-risk architecture:
   The feature is plugin-based and opt-in; existing users are unaffected unless they
   configure the plugin.

3) Practical for maintainers and users:
   Includes tests and a runnable sample, reducing review friction and easing adoption.

4) Aligns with ADK extensibility:
   Keeps resilience logic at the plugin layer without changing core runner/flow behavior.
Key design reasons
------------------
1) on_model_error_callback hook:
   Best fit for intercepting model failures and deciding retry/fallback behavior.

2) Exponential backoff with jitter:
   Reduces retry storms and aligns with standard distributed-system reliability practices.
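The backoff policy described in item 2 can be sketched as follows. This is a minimal illustration of the technique, not the plugin's actual implementation; the parameter names (backoff_initial, backoff_multiplier, jitter) mirror those used in the sample further down.

```python
import random


def backoff_delay(
    attempt: int,
    backoff_initial: float = 0.1,
    backoff_multiplier: float = 2.0,
    jitter: float = 0.1,
) -> float:
    """Exponential backoff with additive jitter (illustrative sketch).

    attempt is 0-based: attempt 0 waits ~backoff_initial, attempt 1 waits
    ~backoff_initial * backoff_multiplier, and so on. The random jitter
    spreads out retries from many clients to avoid synchronized retry storms.
    """
    base = backoff_initial * (backoff_multiplier ** attempt)
    return base + random.uniform(0.0, jitter)


# Deterministic part grows geometrically: 0.1, 0.2, 0.4, ...
```

A retry loop would sleep for `backoff_delay(attempt)` seconds between attempts.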
3) Model fallback support:
   Improves the chance of successful completion when a single provider/model is degraded.

4) Robust provider response handling:
   Supports async-generator and coroutine style returns to handle provider differences.

5) Type-safety/cycle-safe update:
   Added a TYPE_CHECKING import pattern for InvocationContext to avoid runtime issues.
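The provider-response handling in item 4 can be illustrated with a small normalization helper. This is a hedged sketch of the general technique, not the plugin's code: some providers return an async generator of chunks, others a coroutine resolving to a single response, and a retry layer must consume both uniformly.

```python
import asyncio
import inspect
from typing import Any, AsyncIterator


async def iterate_responses(result: Any) -> AsyncIterator[Any]:
    """Yield responses whether result is an async generator or an awaitable."""
    if inspect.isasyncgen(result):
        async for item in result:
            yield item
    elif inspect.isawaitable(result):
        yield await result
    else:
        yield result  # already a plain response object


async def demo() -> list:
    # Hypothetical provider return styles used only for this demo.
    async def gen_style():
        yield "chunk-1"
        yield "chunk-2"

    async def coro_style():
        return "single"

    collected = []
    async for r in iterate_responses(gen_style()):
        collected.append(r)
    async for r in iterate_responses(coro_style()):
        collected.append(r)
    return collected


print(asyncio.run(demo()))  # prints ['chunk-1', 'chunk-2', 'single']
```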
Validation performed
--------------------
1) Formatting:
   - isort applied to changed Python files
   - pyink applied to changed Python files

2) Unit tests:
   - Command:
     .venv/Scripts/python -m pytest tests/unittests/plugins/test_llm_resilience_plugin.py -v
   - Result: 3 passed

3) Manual E2E sample run:
   - Command:
     .venv/Scripts/python samples/resilient_agent.py
   - Observed:
     LLM retry attempt 1 failed: TimeoutError('Simulated transient failure')
     Collected 1 events
     MODEL: Recovered on retry!

Scope and limitations
---------------------
- This PR focuses on LLM call resilience only.
- Live bidirectional streaming paths are out of scope for this change.
- Future enhancements can add per-exception policies and circuit-breaker style controls.
File: PR_BODY.md (new file, 90 lines)
# feat(plugins): LlmResiliencePlugin – configurable retries/backoff and model fallbacks

### Link to Issue or Description of Change

**1. Link to an existing issue (if applicable):**

- Closes: N/A
- Related: #1214
- Related: #2561
- Related discussions: #2292, #3199

**2. Or, if no issue exists, describe the change:**

**Problem:**
Production agents need first-class resilience to transient LLM/API failures
(timeouts, HTTP 429/5xx). Today, retry/fallback logic is often ad-hoc and
duplicated across projects.

**Solution:**
Introduce an opt-in plugin, `LlmResiliencePlugin`, that handles transient LLM
errors with configurable retries (exponential backoff + jitter) and optional
model fallbacks, without modifying core runner/flow logic.
### Summary

- Added `src/google/adk/plugins/llm_resilience_plugin.py`.
- Exported `LlmResiliencePlugin` in `src/google/adk/plugins/__init__.py`.
- Added unit tests in `tests/unittests/plugins/test_llm_resilience_plugin.py`:
  - `test_retry_success_on_same_model`
  - `test_fallback_model_used_after_retries`
  - `test_non_transient_error_bubbles`
- Added `samples/resilient_agent.py` demo.

### Testing Plan

**Unit Tests:**

- [x] I have added or updated unit tests for my change.
- [x] All unit tests pass locally.

Command run:

```shell
.venv/Scripts/python -m pytest tests/unittests/plugins/test_llm_resilience_plugin.py -v
```

Result summary:

```text
collected 3 items
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_fallback_model_used_after_retries PASSED
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_non_transient_error_bubbles PASSED
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_retry_success_on_same_model PASSED
3 passed
```

**Manual End-to-End (E2E) Tests:**

Run sample:

```shell
.venv/Scripts/python samples/resilient_agent.py
```

Observed output:

```text
LLM retry attempt 1 failed: TimeoutError('Simulated transient failure')
Collected 1 events
MODEL: Recovered on retry!
```

### Checklist

- [x] I have read the [CONTRIBUTING.md](https://github.com/google/adk-python/blob/main/CONTRIBUTING.md) document.
- [x] I have performed a self-review of my own code.
- [x] I have commented my code, particularly in hard-to-understand areas.
- [x] I have added tests that prove my fix is effective or that my feature works.
- [x] New and existing unit tests pass locally with my changes.
- [x] I have manually tested my changes end-to-end.
- [x] Any dependent changes have been merged and published in downstream modules. (N/A; no dependent changes)

### Additional context

- Non-breaking: users opt in via `Runner(..., plugins=[LlmResiliencePlugin(...)])`.
- Transient detection currently targets common HTTP/timeouts and can be extended
  in follow-ups (e.g., per-exception policy, circuit breaking).
- Live bidirectional streaming paths are out of scope for this PR.
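The "transient detection" mentioned in the additional context can be sketched as a small classifier. This is an illustrative assumption about how such detection typically works (timeouts and connection drops, plus HTTP 429/5xx status codes), not the plugin's actual logic; `FakeApiError` is a hypothetical stand-in for a provider exception carrying a status code.

```python
TRANSIENT_STATUS_CODES = {429, 500, 502, 503, 504}


def is_transient(exc: BaseException) -> bool:
    """Heuristic: retry on timeouts/connection errors and rate-limit/server HTTP codes."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    # Many client libraries attach an HTTP status code to their exceptions;
    # the attribute name varies, so probe a couple of common ones.
    status = getattr(exc, "status_code", None) or getattr(exc, "code", None)
    return status in TRANSIENT_STATUS_CODES


class FakeApiError(Exception):
    """Hypothetical provider error exposing an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


print(is_transient(TimeoutError("deadline")))  # True
print(is_transient(FakeApiError(429)))         # True
print(is_transient(ValueError("bad input")))   # False
```

Non-transient errors (like the `ValueError` above) should bubble up unchanged, matching the `test_non_transient_error_bubbles` behavior described in the summary.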
File: samples/resilient_agent.py (new file, 105 lines)
```python
# Sample: Using LlmResiliencePlugin for robust model calls
#
# Run with:
#   PYTHONPATH=$(pwd)/src python samples/resilient_agent.py
#
# This demonstrates:
#   - Configuring LlmResiliencePlugin for retries and fallbacks
#   - Running a minimal in-memory agent with a mocked model

from __future__ import annotations

import asyncio
from typing import ClassVar

from google.adk.agents.llm_agent import LlmAgent
from google.adk.artifacts.in_memory_artifact_service import InMemoryArtifactService
from google.adk.memory.in_memory_memory_service import InMemoryMemoryService
from google.adk.models.base_llm import BaseLlm
from google.adk.models.llm_request import LlmRequest
from google.adk.models.llm_response import LlmResponse
from google.adk.models.registry import LLMRegistry
from google.adk.plugins.llm_resilience_plugin import LlmResiliencePlugin
from google.adk.runners import Runner
from google.adk.sessions.in_memory_session_service import InMemorySessionService
from google.genai import types


class DemoFailThenSucceedModel(BaseLlm):
  model: str = "demo-fail-succeed"
  attempts: ClassVar[int] = 0  # Class variable for shared state across instances

  @classmethod
  def supported_models(cls) -> list[str]:
    return ["demo-fail-succeed"]

  async def generate_content_async(
      self, llm_request: LlmRequest, stream: bool = False
  ):
    # Fail on the first attempt, then succeed
    DemoFailThenSucceedModel.attempts += 1
    if DemoFailThenSucceedModel.attempts < 2:
      raise TimeoutError("Simulated transient failure")
    yield LlmResponse(
        content=types.Content(
            role="model",
            parts=[types.Part.from_text(text="Recovered on retry!")],
        ),
        partial=False,
    )


# Register the demo model
LLMRegistry.register(DemoFailThenSucceedModel)


async def main():
  # Agent with the failing-then-succeed model
  agent = LlmAgent(name="resilient_agent", model="demo-fail-succeed")

  # Build services and runner in-memory
  artifact_service = InMemoryArtifactService()
  session_service = InMemorySessionService()
  memory_service = InMemoryMemoryService()

  runner = Runner(
      app_name="resilience_demo",
      agent=agent,
      artifact_service=artifact_service,
      session_service=session_service,
      memory_service=memory_service,
      plugins=[
          LlmResiliencePlugin(
              max_retries=2,
              backoff_initial=0.1,
              backoff_multiplier=2.0,
              jitter=0.1,
              fallback_models=["mock"],  # Demonstration; not used here
          )
      ],
  )

  # Create a session and run once
  session = await session_service.create_session(
      app_name="resilience_demo", user_id="demo"
  )
  events = []
  async for ev in runner.run_async(
      user_id=session.user_id,
      session_id=session.id,
      new_message=types.Content(
          role="user", parts=[types.Part.from_text(text="hello")]
      ),
  ):
    events.append(ev)

  print("Collected", len(events), "events")
  for e in events:
    if e.content and e.content.parts and e.content.parts[0].text:
      print("MODEL:", e.content.parts[0].text.strip())


if __name__ == "__main__":
  asyncio.run(main())
```
Review comment (on samples/resilient_agent.py):
The stateful logic in `DemoFailThenSucceedModel` relies on the `attempts` counter
being incremented across calls. However, because the agent is configured with the
model name as a string (`model="demo-fail-succeed"`), a new instance of
`DemoFailThenSucceedModel` is created for the initial call and for each retry
attempt. This resets `self.attempts` to 0 for each new instance, preventing the
model from succeeding after a failure as intended in this demo. To ensure the
state is shared across these distinct instances, `attempts` should be a true
class variable, accessed via the class itself.
Reply:
What changed:
- Used `ClassVar[int]` from the typing module to properly declare a class variable
- Accessed the counter via `DemoFailThenSucceedModel.attempts` instead of `self._attempts`
- This ensures the counter is shared across all instances created during retries
Verified:
- Sample runs correctly: "Recovered on retry!"
- Pushed to PR
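The fix discussed in this thread can be demonstrated with a self-contained snippet, independent of ADK: an ordinary instance attribute would reset with every new instance, while a `ClassVar` mutated through the class itself is shared across all instances created during retries.

```python
from typing import ClassVar


class Counter:
    attempts: ClassVar[int] = 0  # shared across all instances

    def record(self) -> int:
        Counter.attempts += 1  # mutate via the class, not self
        return Counter.attempts


# Each retry constructs a fresh instance, yet the count persists.
print(Counter().record())  # 1
print(Counter().record())  # 2
```

Had `record` written to `self.attempts` instead, each fresh instance would have shadowed the class attribute and the counter would never advance past 1.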