
refactor: adapt evaluation pipeline to MCP architecture #175

@ShuxinLin

Description

Background

The current evaluation pipeline is tightly coupled to the legacy src/ architecture (AgentHive, ReAct+Reflexion, reactxen library). The new mcp/plan_execute/ runner produces OrchestratorResult objects with rich traces but has no path to grading or MLflow assessment. Meanwhile, aobench/scenario-server/ has the right pluggable-grader design but is disconnected from mcp/.

Current state

| Layer | Location | Status |
| --- | --- | --- |
| Scenario loading | `src/assetopsbench/core/scenarios.py` | Legacy, no MCP integration |
| Grading strategies | `aobench/scenario-server/grading/graders.py` | Good design; depends on reactxen for the `evaluation_agent` strategy |
| Grading pipeline | `aobench/scenario-server/grading/grading.py` | Async, MLflow-ready; not wired to `mcp/` |
| EvaluationAgent | `src/evaluation_agent/agent.py` | Uses the reactxen WatsonX wrapper (legacy LLM interface) |
| MLflow tracking | `aobench/scenario-server/grading/grading.py` | Implemented but unreachable from `mcp/` |
| Benchmark runner | `benchmark/cods_track{1,2}/run_track_*.py` | Hardcoded to AgentHive workflows |
| MCP runner output | `mcp/plan_execute/models.py` (`OrchestratorResult`) | Has plan + history; no grading hookup |

What needs to change

1. Evaluation runner for mcp/

Add mcp/evaluation/ (or mcp/benchmark/) that:

  • Loads scenarios from HuggingFace (ibm-research/AssetOpsBench) using the existing Scenario model
  • Feeds each scenario's text through PlanExecuteRunner.run()
  • Collects OrchestratorResult (answer + full step history as trace)
  • Passes results to the grading pipeline
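The loop described above can be sketched as follows. `Scenario`, `OrchestratorResult`, and `PlanExecuteRunner` are minimal stand-ins here, since the sketch only illustrates the wiring, not the real `mcp/` classes:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """Stand-in for the existing Scenario model loaded from HuggingFace."""
    id: str
    text: str
    expected: str

@dataclass
class OrchestratorResult:
    """Stand-in for mcp/plan_execute/models.py::OrchestratorResult."""
    answer: str
    history: list = field(default_factory=list)

class PlanExecuteRunner:
    """Stand-in runner; the real one drives the plan-execute loop."""
    def run(self, text: str) -> OrchestratorResult:
        return OrchestratorResult(answer=text, history=[("plan", text)])

def evaluate(scenarios, runner, grader):
    """Feed each scenario through the runner and grade the answer."""
    grades = {}
    for s in scenarios:
        result = runner.run(s.text)                       # OrchestratorResult
        grades[s.id] = grader(result.answer, s.expected)  # answer -> grader
    return grades
```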

2. MLflow tracing in PlanExecuteRunner

Instrument mcp/plan_execute/runner.py (or executor.py) to emit MLflow traces tagged with scenario_id, matching the schema that aobench/scenario-server/grading/grading.py already expects.
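A minimal way to instrument the runner, assuming the grading pipeline keys traces on a `scenario_id` span attribute (the exact schema should be confirmed against `grading.py` before adopting it). The guarded import keeps the runner usable when MLflow is not installed:

```python
try:
    import mlflow
except ImportError:  # keep the runner usable without MLflow installed
    mlflow = None

def run_traced(runner, scenario_id: str, text: str):
    """Run one scenario, wrapping the call in an MLflow span when available."""
    if mlflow is None:
        return runner.run(text)
    with mlflow.start_span(name="plan_execute") as span:
        span.set_attributes({"scenario_id": scenario_id})
        return runner.run(text)
```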

3. Decouple EvaluationAgent from reactxen

aobench/scenario-server/grading/graders.py::evaluation_agent() instantiates EvaluationAgent (from reactxen), which hard-codes a WatsonX LLM wrapper. Replace it with the LLMBackend abstraction already in mcp/llm/ so the grader works with any LiteLLM-compatible model.
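One possible shape for the decoupling, assuming the `mcp/llm/` abstraction exposes a single completion call (the real `LLMBackend` interface may differ); `FakeBackend` stands in for a LiteLLM-backed implementation:

```python
from typing import Protocol

class LLMBackend(Protocol):
    """Assumed shape of the mcp/llm abstraction: one completion call."""
    def complete(self, prompt: str) -> str: ...

class EvaluationAgent:
    """Judge-style grader that depends only on LLMBackend, not reactxen."""
    def __init__(self, backend: LLMBackend):
        self.backend = backend

    def grade(self, actual: str, expected: str) -> bool:
        prompt = (
            "Answer YES or NO: does the actual answer match the expected one?\n"
            f"Actual: {actual}\nExpected: {expected}"
        )
        return self.backend.complete(prompt).strip().upper().startswith("YES")

class FakeBackend:
    """Stand-in for a LiteLLM-backed implementation, useful in tests."""
    def complete(self, prompt: str) -> str:
        return "YES"
```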

4. Connect grading strategies to OrchestratorResult

The three graders (exact_string_match, numeric_match, evaluation_agent) expect (actual, expected, ...). Map:

  • actual ← OrchestratorResult.answer
  • trace ← serialized OrchestratorResult.history
  • grading strategy selected by Scenario.deterministic / Scenario.type
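The mapping above might look like this; the grader signatures and the `GRADERS` registry are illustrative stand-ins, not the actual `graders.py` code:

```python
import json
from dataclasses import dataclass, field

@dataclass
class OrchestratorResult:
    answer: str
    history: list = field(default_factory=list)

def exact_string_match(actual, expected, trace=None):
    return 1.0 if actual.strip() == expected.strip() else 0.0

def numeric_match(actual, expected, trace=None, tol=1e-6):
    try:
        return 1.0 if abs(float(actual) - float(expected)) <= tol else 0.0
    except ValueError:
        return 0.0

GRADERS = {"exact": exact_string_match, "numeric": numeric_match}

def grade_result(result, expected, scenario_type="exact"):
    """Map OrchestratorResult fields onto the graders' (actual, expected, trace)."""
    actual = result.answer                # actual <- OrchestratorResult.answer
    trace = json.dumps(result.history)    # trace <- serialized step history
    grader = GRADERS.get(scenario_type, exact_string_match)
    return grader(actual, expected, trace=trace)
```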

5. Update benchmark entry points

Replace (or wrap) benchmark/cods_track{1,2}/run_track_*.py with MCP-based runners that invoke plan-execute and feed results through the new grading pipeline.
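A hypothetical argparse skeleton for the `mcp-evaluate` entry point mentioned in the acceptance criteria; only `--scenario-type` comes from that criterion, and the `--limit` flag is an invented example:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI skeleton matching `uv run mcp-evaluate --scenario-type iot`."""
    parser = argparse.ArgumentParser(prog="mcp-evaluate")
    parser.add_argument("--scenario-type", required=True,
                        help="scenario family to run, e.g. iot")
    parser.add_argument("--limit", type=int, default=None,
                        help="optional cap on scenario count (assumed flag)")
    return parser
```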

Out of scope

  • Changes to aobench/ scenario-server REST API
  • Changes to the src/ legacy code (keep as-is for reproducibility)
  • Deferred grading / PostgreSQL backend (can be added later)

Acceptance criteria

  • uv run mcp-evaluate --scenario-type iot runs all IoT scenarios through PlanExecuteRunner and prints per-scenario grades
  • Grading results are logged to MLflow (traces tagged with scenario_id, six-dimensional assessments attached as feedback)
  • evaluation_agent grader uses LiteLLMBackend — no reactxen dependency in mcp/
  • Unit tests for the new evaluation runner (mocked PlanExecuteRunner + mocked graders)
