Description
Background
The current evaluation pipeline is tightly coupled to the legacy `src/` architecture (AgentHive, ReAct+Reflexion, the `reactxen` library). The new `mcp/plan_execute/` runner produces `OrchestratorResult` objects with rich traces but has no path to grading or MLflow assessment. Meanwhile, `aobench/scenario-server/` has the right pluggable-grader design but is disconnected from `mcp/`.
Current state
| Layer | Location | Status |
|---|---|---|
| Scenario loading | `src/assetopsbench/core/scenarios.py` | Legacy, no MCP integration |
| Grading strategies | `aobench/scenario-server/grading/graders.py` | Good design; depends on `reactxen` for the `evaluation_agent` strategy |
| Grading pipeline | `aobench/scenario-server/grading/grading.py` | Async, MLflow-ready; not wired to `mcp/` |
| EvaluationAgent | `src/evaluation_agent/agent.py` | Uses `reactxen` WatsonX wrapper (legacy LLM interface) |
| MLflow tracking | `aobench/scenario-server/grading/grading.py` | Implemented but unreachable from `mcp/` |
| Benchmark runner | `benchmark/cods_track{1,2}/run_track_*.py` | Hardcoded to AgentHive workflows |
| MCP runner output | `mcp/plan_execute/models.py` (`OrchestratorResult`) | Has plan + history; no grading hookup |
What needs to change
1. Evaluation runner for mcp/
Add `mcp/evaluation/` (or `mcp/benchmark/`) that:
- Loads scenarios from HuggingFace (`ibm-research/AssetOpsBench`) using the existing `Scenario` model
- Feeds each scenario's `text` through `PlanExecuteRunner.run()`
- Collects `OrchestratorResult` (answer + full step history as trace)
- Passes results to the grading pipeline
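The loop described above might look like the following sketch. The `Scenario` and `OrchestratorResult` classes here are minimal stand-ins for the real models, and `evaluate_scenarios` is a hypothetical name, not code from the repository:

```python
# Hypothetical sketch of an mcp/evaluation/ runner; Scenario and
# OrchestratorResult below are minimal stand-ins, not the real models.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Scenario:
    id: str
    text: str

@dataclass
class OrchestratorResult:
    answer: str
    history: list = field(default_factory=list)  # full step trace

def evaluate_scenarios(
    scenarios: list[Scenario],
    run: Callable[[str], OrchestratorResult],    # e.g. PlanExecuteRunner.run
    grade: Callable[[str, Scenario], Any],       # grading-pipeline entry point
) -> dict[str, Any]:
    """Feed each scenario's text through the runner, then grade the answer."""
    grades: dict[str, Any] = {}
    for scenario in scenarios:
        result = run(scenario.text)
        grades[scenario.id] = grade(result.answer, scenario)
    return grades
```

Injecting `run` and `grade` as callables keeps the runner trivially mockable for the unit tests listed in the acceptance criteria.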
2. MLflow tracing in PlanExecuteRunner
Instrument `mcp/plan_execute/runner.py` (or `executor.py`) to emit MLflow traces tagged with `scenario_id`, matching the schema that `aobench/scenario-server/grading/grading.py` already expects.
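The tagging pattern could look like the stand-in below. The `Trace` class and `traced_run` helper exist only for this sketch; the real instrumentation would open an MLflow trace/span inside `PlanExecuteRunner` instead, so the exact API calls are left as comments rather than guessed:

```python
# Stand-in illustration of tagging each run with scenario_id; the real
# version would use MLflow's tracing API in place of this local Trace class.
import time
from contextlib import contextmanager

class Trace:
    def __init__(self, tags):
        self.tags, self.spans = tags, []

@contextmanager
def traced_run(scenario_id: str):
    # Real version: open an MLflow trace here and tag it with scenario_id,
    # so the grading pipeline can join grades back to runs by that tag.
    trace = Trace(tags={"scenario_id": scenario_id})
    start = time.monotonic()
    try:
        yield trace
    finally:
        trace.tags["duration_s"] = round(time.monotonic() - start, 3)

def run_with_trace(scenario_id: str, step_fn):
    with traced_run(scenario_id) as trace:
        trace.spans.append(step_fn())  # one span per plan-execute step
    return trace
```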
3. Decouple EvaluationAgent from reactxen
`aobench/scenario-server/grading/graders.py::evaluation_agent()` instantiates `EvaluationAgent` (from `reactxen`), which hard-codes a WatsonX LLM wrapper. Replace it with the `LLMBackend` abstraction already in `mcp/llm/` so the grader works with any LiteLLM-compatible model.
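The decoupling might be sketched like this. The `LLMBackend` name follows the issue, but the `complete()` signature, the prompt, and `FakeBackend` are illustrative assumptions:

```python
# Sketch of decoupling the LLM-judged grader from reactxen via a protocol.
# Method signature and prompt are assumptions, not the real mcp/llm/ API.
from typing import Protocol

class LLMBackend(Protocol):
    def complete(self, prompt: str) -> str: ...

def evaluation_agent_grade(actual: str, expected: str, backend: LLMBackend) -> bool:
    """LLM-judged equivalence; any LiteLLM-compatible backend can be injected."""
    prompt = (
        "Answer YES or NO: is the actual answer equivalent to the expected one?\n"
        f"Actual: {actual}\nExpected: {expected}"
    )
    return backend.complete(prompt).strip().upper().startswith("YES")

class FakeBackend:
    """Test double; a real LiteLLMBackend would call the LiteLLM API instead."""
    def __init__(self, reply: str):
        self.reply = reply
    def complete(self, prompt: str) -> str:
        return self.reply
```

Because the grader only sees the protocol, unit tests can pass a `FakeBackend` and never touch a real model.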
4. Connect grading strategies to OrchestratorResult
The three graders (`exact_string_match`, `numeric_match`, `evaluation_agent`) expect `(actual, expected, ...)`. Map:
- `actual` ← `OrchestratorResult.answer`
- `trace` ← serialized `OrchestratorResult.history`
- grading strategy selected by `Scenario.deterministic` / `Scenario.type`
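The mapping above can be sketched as follows. The grader bodies are simplified stand-ins, and the `"numeric"` type value and `grade_result` helper are assumptions for illustration:

```python
# Sketch of mapping OrchestratorResult fields onto the grader signature
# and selecting a strategy; grader bodies are simplified stand-ins.
import json
from dataclasses import dataclass

def exact_string_match(actual: str, expected: str, **_) -> bool:
    return actual.strip() == expected.strip()

def numeric_match(actual: str, expected: str, tol: float = 1e-6, **_) -> bool:
    return abs(float(actual) - float(expected)) <= tol

@dataclass
class Scenario:
    deterministic: bool
    type: str  # e.g. "numeric"; value is an assumption for this sketch

def grade_result(answer: str, history: list, scenario: Scenario, expected: str):
    trace = json.dumps(history)  # serialized history, passed along for logging
    if not scenario.deterministic:
        raise NotImplementedError("evaluation_agent grader goes here")
    grader = numeric_match if scenario.type == "numeric" else exact_string_match
    return grader(answer, expected, trace=trace)
```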
5. Update benchmark entry points
Replace (or wrap) `benchmark/cods_track{1,2}/run_track_*.py` with MCP-based runners that invoke plan-execute and feed results through the new grading pipeline.
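A minimal front end for the `mcp-evaluate` console script named in the acceptance criteria could start from an argparse sketch like this; the `--mlflow-uri` flag is a guess, only `--scenario-type` comes from the issue:

```python
# Hypothetical CLI skeleton for the mcp-evaluate entry point;
# flag names beyond --scenario-type are assumptions.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mcp-evaluate")
    parser.add_argument("--scenario-type", default=None,
                        help="run only scenarios of this type, e.g. 'iot'")
    parser.add_argument("--mlflow-uri", default=None,
                        help="optional MLflow tracking URI for grade logging")
    return parser
```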
Out of scope
- Changes to the `aobench/scenario-server` REST API
- Changes to the legacy `src/` code (keep as-is for reproducibility)
- Deferred grading / PostgreSQL backend (can be added later)
Acceptance criteria
- `uv run mcp-evaluate --scenario-type iot` runs all IoT scenarios through `PlanExecuteRunner` and prints per-scenario grades
- Grading results are logged to MLflow (traces tagged with `scenario_id`, 6-dimensional assessments as feedbacks)
- The `evaluation_agent` grader uses `LiteLLMBackend`; no `reactxen` dependency in `mcp/`
- Unit tests for the new evaluation runner (mocked `PlanExecuteRunner` + mocked graders)