92 changes: 92 additions & 0 deletions CONTRIBUTION_NOTE.txt
@@ -0,0 +1,92 @@
LlmResiliencePlugin Contribution Note
=====================================

What we implemented
-------------------
1) New plugin:
- Added src/google/adk/plugins/llm_resilience_plugin.py
- Provides retry + backoff + jitter + optional model fallbacks for LLM errors.

2) Plugin export:
- Updated src/google/adk/plugins/__init__.py
- Exported LlmResiliencePlugin in __all__ for discoverability.

3) Unit tests:
- Added tests/unittests/plugins/test_llm_resilience_plugin.py
- Covered:
- retry success on same model
- fallback model after retries
- non-transient errors bubbling correctly

4) Usage sample:
- Added samples/resilient_agent.py
- Demonstrates plugin setup and recovery behavior.

5) PR narrative and testing evidence:
- Updated PR_BODY.md to match repository PR template:
- issue/description
- testing plan
- manual E2E output
- checklist


Why this contribution is meaningful
-----------------------------------
1) Solves a real reliability gap:
Production agents frequently face transient failures (timeouts, 429, 5xx).
This change centralizes resilience behavior and removes repeated ad-hoc retry code.

2) Low-risk architecture:
The feature is plugin-based and opt-in.
Existing users are unaffected unless they configure the plugin.

3) Practical for maintainers and users:
Includes tests and a runnable sample, reducing review friction and making adoption easier.

4) Aligns with ADK extensibility:
Keeps resilience logic at the plugin layer without changing core runner/flow behavior.


Key design reasons
------------------
1) on_model_error_callback hook:
Best fit for intercepting model failures and deciding retry/fallback behavior.

2) Exponential backoff with jitter:
Reduces retry storms and aligns with standard distributed-system reliability practices (delay calculation sketched after this list).

3) Model fallback support:
Improves chance of successful completion when a single provider/model is degraded.

4) Robust provider response handling:
Supports async-generator and coroutine style returns to handle provider differences (normalization sketched after this list).

5) Type-safety/cycle-safe update:
Added TYPE_CHECKING import pattern for InvocationContext to avoid a runtime circular import.
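
The backoff calculation (item 2) and response normalization (item 4) are sketched
below. These are illustrative pseudocode only, not the plugin's actual
implementation, and the helper names are made up:

    # Exponential backoff with jitter: delay before retry attempt n (0-based).
    import random

    def next_delay(attempt, initial=0.1, multiplier=2.0, jitter=0.1):
      return initial * (multiplier ** attempt) + random.uniform(0, jitter)

    # Accept either an async-generator or a coroutine return from a provider
    # call (return shapes assumed for illustration).
    import inspect

    async def collect_responses(result):
      if inspect.isasyncgen(result):
        return [resp async for resp in result]
      value = await result
      return list(value) if isinstance(value, (list, tuple)) else [value]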


Validation performed
--------------------
1) Formatting:
- isort applied to changed Python files
- pyink applied to changed Python files

2) Unit tests:
- Command:
.venv/Scripts/python -m pytest tests/unittests/plugins/test_llm_resilience_plugin.py -v
- Result: 3 passed

3) Manual E2E sample run:
- Command:
.venv/Scripts/python samples/resilient_agent.py
- Observed:
LLM retry attempt 1 failed: TimeoutError('Simulated transient failure')
Collected 1 events
MODEL: Recovered on retry!


Scope and limitations
---------------------
- This PR focuses on LLM call resilience only.
- Live bidirectional streaming paths are out of scope for this change.
- Future enhancements can add per-exception policies and circuit-breaker style controls.
90 changes: 90 additions & 0 deletions PR_BODY.md
@@ -0,0 +1,90 @@
# feat(plugins): LlmResiliencePlugin – configurable retries/backoff and model fallbacks

### Link to Issue or Description of Change

**1. Link to an existing issue (if applicable):**

- Closes: N/A
- Related: #1214
- Related: #2561
- Related discussions: #2292, #3199

**2. Or, if no issue exists, describe the change:**

**Problem:**
Production agents need first-class resilience to transient LLM/API failures
(timeouts, HTTP 429/5xx). Today, retry/fallback logic is often ad-hoc and
duplicated across projects.

**Solution:**
Introduce an opt-in plugin, `LlmResiliencePlugin`, that handles transient LLM
errors with configurable retries (exponential backoff + jitter) and optional
model fallbacks, without modifying core runner/flow logic.
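
A minimal opt-in setup sketch (imports and parameter names mirror
`samples/resilient_agent.py` in this PR; the model names and the comment glosses
on each parameter are illustrative):

```python
from google.adk.agents.llm_agent import LlmAgent
from google.adk.artifacts.in_memory_artifact_service import InMemoryArtifactService
from google.adk.memory.in_memory_memory_service import InMemoryMemoryService
from google.adk.plugins.llm_resilience_plugin import LlmResiliencePlugin
from google.adk.runners import Runner
from google.adk.sessions.in_memory_session_service import InMemorySessionService

agent = LlmAgent(name="my_agent", model="gemini-2.0-flash")  # model name is illustrative

runner = Runner(
    app_name="my_app",
    agent=agent,
    artifact_service=InMemoryArtifactService(),
    session_service=InMemorySessionService(),
    memory_service=InMemoryMemoryService(),
    plugins=[
        LlmResiliencePlugin(
            max_retries=2,            # retries against the same model
            backoff_initial=0.1,      # first delay, in seconds
            backoff_multiplier=2.0,   # exponential growth factor
            jitter=0.1,               # random jitter added to each delay
            fallback_models=["gemini-2.0-flash-lite"],  # tried after retries
        )
    ],
)
```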

### Summary

- Added `src/google/adk/plugins/llm_resilience_plugin.py`.
- Exported `LlmResiliencePlugin` in `src/google/adk/plugins/__init__.py`.
- Added unit tests in
`tests/unittests/plugins/test_llm_resilience_plugin.py`:
- `test_retry_success_on_same_model`
- `test_fallback_model_used_after_retries`
- `test_non_transient_error_bubbles`
- Added `samples/resilient_agent.py` demo.

### Testing Plan

**Unit Tests:**

- [x] I have added or updated unit tests for my change.
- [x] All unit tests pass locally.

Command run:

```shell
.venv/Scripts/python -m pytest tests/unittests/plugins/test_llm_resilience_plugin.py -v
```

Result summary:

```text
collected 3 items
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_fallback_model_used_after_retries PASSED
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_non_transient_error_bubbles PASSED
tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_retry_success_on_same_model PASSED
3 passed
```

**Manual End-to-End (E2E) Tests:**

Run sample:

```shell
.venv/Scripts/python samples/resilient_agent.py
```

Observed output:

```text
LLM retry attempt 1 failed: TimeoutError('Simulated transient failure')
Collected 1 events
MODEL: Recovered on retry!
```

### Checklist

- [x] I have read the [CONTRIBUTING.md](https://github.com/google/adk-python/blob/main/CONTRIBUTING.md) document.
- [x] I have performed a self-review of my own code.
- [x] I have commented my code, particularly in hard-to-understand areas.
- [x] I have added tests that prove my fix is effective or that my feature works.
- [x] New and existing unit tests pass locally with my changes.
- [x] I have manually tested my changes end-to-end.
- [x] Any dependent changes have been merged and published in downstream modules. (N/A; no dependent changes)

### Additional context

- Non-breaking: users opt in via
`Runner(..., plugins=[LlmResiliencePlugin(...)])`.
- Transient detection currently targets common HTTP/timeouts and can be extended
  in follow-ups (e.g., per-exception policy, circuit breaking); a sketch of the
  kind of check involved is below.
- Live bidirectional streaming paths are out of scope for this PR.
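
For reference, the kind of check meant by "common HTTP/timeouts" could look
roughly like the sketch below. The exception types and status-code attributes are
assumptions; the authoritative classification lives in
`src/google/adk/plugins/llm_resilience_plugin.py`.

```python
# Illustrative only; not the plugin's actual implementation.
def looks_transient(error: Exception) -> bool:
  if isinstance(error, (TimeoutError, ConnectionError)):
    return True
  status = getattr(error, "status_code", None) or getattr(error, "code", None)
  return status in (429, 500, 502, 503, 504)
```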
105 changes: 105 additions & 0 deletions samples/resilient_agent.py
@@ -0,0 +1,105 @@
# Sample: Using LlmResiliencePlugin for robust model calls
#
# Run with:
# PYTHONPATH=$(pwd)/src python samples/resilient_agent.py
#
# This demonstrates:
# - Configuring LlmResiliencePlugin for retries and fallbacks
# - Running a minimal in-memory agent with a mocked model

from __future__ import annotations

import asyncio
from typing import ClassVar

from google.adk.agents.llm_agent import LlmAgent
from google.adk.artifacts.in_memory_artifact_service import InMemoryArtifactService
from google.adk.memory.in_memory_memory_service import InMemoryMemoryService
from google.adk.models.base_llm import BaseLlm
from google.adk.models.llm_request import LlmRequest
from google.adk.models.llm_response import LlmResponse
from google.adk.models.registry import LLMRegistry
from google.adk.plugins.llm_resilience_plugin import LlmResiliencePlugin
from google.adk.runners import Runner
from google.adk.sessions.in_memory_session_service import InMemorySessionService
from google.genai import types


class DemoFailThenSucceedModel(BaseLlm):
  model: str = "demo-fail-succeed"
  attempts: ClassVar[int] = (
      0  # Class variable for shared state across instances
  )

  @classmethod
  def supported_models(cls) -> list[str]:
    return ["demo-fail-succeed"]

  async def generate_content_async(
      self, llm_request: LlmRequest, stream: bool = False
  ):
    # Fail for the first attempt, then succeed
    DemoFailThenSucceedModel.attempts += 1
    if DemoFailThenSucceedModel.attempts < 2:
      raise TimeoutError("Simulated transient failure")
    yield LlmResponse(
        content=types.Content(
            role="model",
            parts=[types.Part.from_text(text="Recovered on retry!")],
        ),
        partial=False,
    )
Comment on lines 28 to 51 (Contributor, severity: medium):

The stateful logic in DemoFailThenSucceedModel relies on the attempts counter being incremented across calls. However, because the agent is configured with the model name as a string (model="demo-fail-succeed"), a new instance of DemoFailThenSucceedModel is created for the initial call and for each retry attempt. This resets self.attempts to 0 for each new instance, preventing the model from succeeding after a failure as intended in this demo.

To ensure the state is shared across these distinct instances, attempts should be a true class variable, accessed via the class itself.

class DemoFailThenSucceedModel(BaseLlm):
  model: str = "demo-fail-succeed"
  _attempts: int = 0

  @classmethod
  def supported_models(cls) -> list[str]:
    return ["demo-fail-succeed"]

  async def generate_content_async(
      self, llm_request: LlmRequest, stream: bool = False
  ):
    # Fail for the first attempt, then succeed
    DemoFailThenSucceedModel._attempts += 1
    if DemoFailThenSucceedModel._attempts < 2:
      raise TimeoutError("Simulated transient failure")
    yield LlmResponse(
        content=types.Content(
            role="model",
            parts=[types.Part.from_text(text="Recovered on retry!")],
        ),
        partial=False,
    )

Author reply:
What changed:
• Used ClassVar[int] from typing module to properly declare a class variable
• Access the counter via DemoFailThenSucceedModel.attempts instead of self._attempts
• This ensures the counter is shared across all instances created during retries

Verified:
• ✅ Sample runs correctly: Recovered on retry!
• ✅ Pushed to PR



# Register test models
LLMRegistry.register(DemoFailThenSucceedModel)


async def main():
  # Agent with the failing-then-succeed model
  agent = LlmAgent(name="resilient_agent", model="demo-fail-succeed")

  # Build services and runner in-memory
  artifact_service = InMemoryArtifactService()
  session_service = InMemorySessionService()
  memory_service = InMemoryMemoryService()

  runner = Runner(
      app_name="resilience_demo",
      agent=agent,
      artifact_service=artifact_service,
      session_service=session_service,
      memory_service=memory_service,
      plugins=[
          LlmResiliencePlugin(
              max_retries=2,
              backoff_initial=0.1,
              backoff_multiplier=2.0,
              jitter=0.1,
              fallback_models=["mock"],  # Demonstration; not used here
          )
      ],
  )

  # Create a session and run once
  session = await session_service.create_session(
      app_name="resilience_demo", user_id="demo"
  )
  events = []
  async for ev in runner.run_async(
      user_id=session.user_id,
      session_id=session.id,
      new_message=types.Content(
          role="user", parts=[types.Part.from_text(text="hello")]
      ),
  ):
    events.append(ev)

  print("Collected", len(events), "events")
  for e in events:
    if e.content and e.content.parts and e.content.parts[0].text:
      print("MODEL:", e.content.parts[0].text.strip())


if __name__ == "__main__":
  asyncio.run(main())
2 changes: 2 additions & 0 deletions src/google/adk/plugins/__init__.py
@@ -14,6 +14,7 @@

from .base_plugin import BasePlugin
from .debug_logging_plugin import DebugLoggingPlugin
from .llm_resilience_plugin import LlmResiliencePlugin
from .logging_plugin import LoggingPlugin
from .plugin_manager import PluginManager
from .reflect_retry_tool_plugin import ReflectAndRetryToolPlugin
Expand All @@ -24,4 +25,5 @@
'LoggingPlugin',
'PluginManager',
'ReflectAndRetryToolPlugin',
'LlmResiliencePlugin',
]