
refactor: adapt evaluation pipeline to MCP architecture #175

@ShuxinLin

Description

Background

The current evaluation pipeline is tightly coupled to the legacy src/ architecture (AgentHive, ReAct+Reflexion, reactxen library). The new mcp/plan_execute/ runner produces OrchestratorResult objects with rich traces but has no path to grading or MLflow assessment. Meanwhile, aobench/scenario-server/ has the right pluggable-grader design but is disconnected from mcp/.

Current state

| Layer | Location | Status |
| --- | --- | --- |
| Scenario loading | `src/assetopsbench/core/scenarios.py` | Legacy, no MCP integration |
| Grading strategies | `aobench/scenario-server/grading/graders.py` | Good design; depends on reactxen for the `evaluation_agent` strategy |
| Grading pipeline | `aobench/scenario-server/grading/grading.py` | Async, MLflow-ready; not wired to `mcp/` |
| EvaluationAgent | `src/evaluation_agent/agent.py` | Uses the reactxen WatsonX wrapper (legacy LLM interface) |
| MLflow tracking | `aobench/scenario-server/grading/grading.py` | Implemented but unreachable from `mcp/` |
| Benchmark runner | `benchmark/cods_track{1,2}/run_track_*.py` | Hardcoded to AgentHive workflows |
| MCP runner output | `mcp/plan_execute/models.py` (`OrchestratorResult`) | Has plan + history; no grading hookup |

What needs to change

1. Evaluation runner for mcp/

Add mcp/evaluation/ (or mcp/benchmark/) that:

  • Loads scenarios from HuggingFace (ibm-research/AssetOpsBench) using the existing Scenario model
  • Feeds each scenario's text through PlanExecuteRunner.run()
  • Collects OrchestratorResult (answer + full step history as trace)
  • Passes results to the grading pipeline
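The loop described above can be sketched as follows. `Scenario`, `OrchestratorResult`, and `PlanExecuteRunner` are minimal stand-ins here, since the sketch only illustrates the wiring, not the real `mcp/` classes:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """Stand-in for the existing Scenario model loaded from HuggingFace."""
    id: str
    text: str
    expected: str

@dataclass
class OrchestratorResult:
    """Stand-in for mcp/plan_execute/models.py::OrchestratorResult."""
    answer: str
    history: list = field(default_factory=list)

class PlanExecuteRunner:
    """Stand-in runner; the real one drives the plan-execute loop."""
    def run(self, text: str) -> OrchestratorResult:
        return OrchestratorResult(answer=text, history=[("plan", text)])

def evaluate(scenarios, runner, grader):
    """Feed each scenario through the runner and grade the answer."""
    grades = {}
    for s in scenarios:
        result = runner.run(s.text)                       # OrchestratorResult
        grades[s.id] = grader(result.answer, s.expected)  # answer -> grader
    return grades
```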

2. MLflow tracing in PlanExecuteRunner

Instrument mcp/plan_execute/runner.py (or executor.py) to emit MLflow traces tagged with scenario_id, matching the schema that aobench/scenario-server/grading/grading.py already expects.
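A minimal way to instrument the runner, assuming the grading pipeline keys traces on a `scenario_id` span attribute (the exact schema should be confirmed against `grading.py` before adopting it). The guarded import keeps the runner usable when MLflow is not installed:

```python
try:
    import mlflow
except ImportError:  # keep the runner usable without MLflow installed
    mlflow = None

def run_traced(runner, scenario_id: str, text: str):
    """Run one scenario, wrapping the call in an MLflow span when available."""
    if mlflow is None:
        return runner.run(text)
    with mlflow.start_span(name="plan_execute") as span:
        span.set_attributes({"scenario_id": scenario_id})
        return runner.run(text)
```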

3. Decouple EvaluationAgent from reactxen

aobench/scenario-server/grading/graders.py::evaluation_agent() instantiates EvaluationAgent (from reactxen), which hard-codes a WatsonX LLM wrapper. Replace it with the LLMBackend abstraction already in mcp/llm/ so the grader works with any LiteLLM-compatible model.
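One possible shape for the decoupling, assuming the `mcp/llm/` abstraction exposes a single completion call (the real `LLMBackend` interface may differ); `FakeBackend` stands in for a LiteLLM-backed implementation:

```python
from typing import Protocol

class LLMBackend(Protocol):
    """Assumed shape of the mcp/llm abstraction: one completion call."""
    def complete(self, prompt: str) -> str: ...

class EvaluationAgent:
    """Judge-style grader that depends only on LLMBackend, not reactxen."""
    def __init__(self, backend: LLMBackend):
        self.backend = backend

    def grade(self, actual: str, expected: str) -> bool:
        prompt = (
            "Answer YES or NO: does the actual answer match the expected one?\n"
            f"Actual: {actual}\nExpected: {expected}"
        )
        return self.backend.complete(prompt).strip().upper().startswith("YES")

class FakeBackend:
    """Stand-in for a LiteLLM-backed implementation, useful in tests."""
    def complete(self, prompt: str) -> str:
        return "YES"
```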

4. Connect grading strategies to OrchestratorResult

The three graders (exact_string_match, numeric_match, evaluation_agent) expect (actual, expected, ...). Map:

  • actual ← OrchestratorResult.answer
  • trace ← serialized OrchestratorResult.history
  • grading strategy selected by Scenario.deterministic / Scenario.type
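The mapping above might look like this; the grader signatures and the `GRADERS` registry are illustrative stand-ins, not the actual `graders.py` code:

```python
import json
from dataclasses import dataclass, field

@dataclass
class OrchestratorResult:
    answer: str
    history: list = field(default_factory=list)

def exact_string_match(actual, expected, trace=None):
    return 1.0 if actual.strip() == expected.strip() else 0.0

def numeric_match(actual, expected, trace=None, tol=1e-6):
    try:
        return 1.0 if abs(float(actual) - float(expected)) <= tol else 0.0
    except ValueError:
        return 0.0

GRADERS = {"exact": exact_string_match, "numeric": numeric_match}

def grade_result(result, expected, scenario_type="exact"):
    """Map OrchestratorResult fields onto the graders' (actual, expected, trace)."""
    actual = result.answer                # actual <- OrchestratorResult.answer
    trace = json.dumps(result.history)    # trace <- serialized step history
    grader = GRADERS.get(scenario_type, exact_string_match)
    return grader(actual, expected, trace=trace)
```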

5. Update benchmark entry points

Replace (or wrap) benchmark/cods_track{1,2}/run_track_*.py with MCP-based runners that invoke plan-execute and feed results through the new grading pipeline.
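A hypothetical argparse skeleton for the `mcp-evaluate` entry point mentioned in the acceptance criteria; only `--scenario-type` comes from that criterion, and the `--limit` flag is an invented example:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI skeleton matching `uv run mcp-evaluate --scenario-type iot`."""
    parser = argparse.ArgumentParser(prog="mcp-evaluate")
    parser.add_argument("--scenario-type", required=True,
                        help="scenario family to run, e.g. iot")
    parser.add_argument("--limit", type=int, default=None,
                        help="optional cap on scenario count (assumed flag)")
    return parser
```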

Out of scope

  • Changes to aobench/ scenario-server REST API
  • Changes to the src/ legacy code (keep as-is for reproducibility)
  • Deferred grading / PostgreSQL backend (can be added later)

Acceptance criteria

  • uv run mcp-evaluate --scenario-type iot runs all IoT scenarios through PlanExecuteRunner and prints per-scenario grades
  • Grading results are logged to MLflow (traces tagged with scenario_id, six-dimensional assessments attached as feedback)
  • evaluation_agent grader uses LiteLLMBackend — no reactxen dependency in mcp/
  • Unit tests for the new evaluation runner (mocked PlanExecuteRunner + mocked graders)
