Feat/agent test run eval @W-21482725@ #350

Merged
WillieRuemmele merged 12 commits into main from feat/agent-test-run-eval
Mar 6, 2026

Conversation

@WillieRuemmele
Contributor

@WillieRuemmele WillieRuemmele commented Mar 6, 2026

What does this PR do?

- Brings along Tanner's fork PR and the new command
- Adds NUTs
- Upgrades to use standard libraries
- Uses OCLIF stdin

What issues does this PR fix or reference?

@W-21482725@

Tanner McGrath and others added 6 commits March 3, 2026 17:22
Add a new command that runs evaluation tests against Agentforce agents
using the Einstein Eval Labs API. Complements `sf agent test run` by
supporting direct JSON payloads with no org metadata deployment step.

Features:
- 8+ evaluator types (string/JSON assertions, text alignment, etc.)
- Smart payload normalization (field correction, shorthand refs, defaults)
- Agent ID resolution from DeveloperName via --agent-api-name
- Batch execution (max 5 tests per API request)
- CI/CD output formats (human, JSON, JUnit XML, TAP)
- Exit code 1 on failures
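The batching rule in the feature list (at most 5 tests per API request) can be illustrated with a small chunking helper. This is only a sketch: `chunk` and `MAX_TESTS_PER_REQUEST` are hypothetical names chosen for illustration, not identifiers taken from this PR, and the real command may split requests differently.

```typescript
// Hypothetical constant: the limit of 5 comes from the commit message,
// the name does not.
const MAX_TESTS_PER_REQUEST = 5;

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Twelve placeholder test names, split into batches of 5, 5 and 2.
const testNames = Array.from({ length: 12 }, (_, i) => `test-${i + 1}`);
const testBatches = chunk(testNames, MAX_TESTS_PER_REQUEST);
console.log(testBatches.map((batch) => batch.length)); // [ 5, 5, 2 ]
```

Each batch would then be sent as its own request and the results merged before the summary is printed.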

Accept YAML test specs (same format as `sf agent generate test-spec`)
and translate them to Einstein Eval Labs API calls. The same YAML spec
that works with `sf agent test run` now also works with `run-eval`,
gaining access to richer evaluators (topic/action assertions,
bot_response_rating, string/numeric assertions, text alignment).

- New `--spec` flag replaces `--payload` (accepts YAML or JSON)
- Auto-detects format by content (testCases+subjectName = YAML)
- Auto-infers `--agent-api-name` from YAML spec's `subjectName`
- Smart `get_state` optimization (only when evaluators need it)
- Translates `$.generatedData.*` JSONPaths to Eval API refs
- Maps customEvaluations (string_comparison, numeric_comparison)
- 39 new unit tests for the translator (197 total passing)
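The content-based detection described above (a spec with both `testCases` and `subjectName` is treated as a YAML test spec) might look roughly like the following. It is a sketch under assumptions: the function and type names are hypothetical, and the input is assumed to be already parsed into a plain object, since the parsing library used by the PR is not shown here.

```typescript
// Hypothetical labels for the two input shapes.
type SpecFormat = 'yaml-test-spec' | 'eval-api-payload';

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Treat the input as a YAML test spec only when both marker fields are present.
function detectSpecFormat(parsed: unknown): SpecFormat {
  if (
    isRecord(parsed) &&
    Array.isArray(parsed['testCases']) &&
    typeof parsed['subjectName'] === 'string'
  ) {
    return 'yaml-test-spec';
  }
  return 'eval-api-payload';
}

// Fall back to the spec's subjectName when no agent API name was supplied.
function inferAgentApiName(parsed: unknown, fromFlag?: string): string | undefined {
  if (fromFlag !== undefined) return fromFlag;
  if (isRecord(parsed) && typeof parsed['subjectName'] === 'string') {
    return parsed['subjectName'];
  }
  return undefined;
}

console.log(detectSpecFormat({ subjectName: 'My_Agent', testCases: [] })); // yaml-test-spec
```

Checking the parsed content rather than the file extension means a JSON file that follows the test-spec shape is handled the same way as a YAML one.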

The scenario agent and MCP tools generate test payloads using MCP
shorthand format (`type: "evaluator"` + `evaluator_type`) instead of
the raw Eval API format (`type: "evaluator.planner_topic_assertion"`).

Add `normalizeMcpShorthand` as the first normalization pass:
- Merges `type: "evaluator"` + `evaluator_type: "xxx"` → `type: "evaluator.xxx"`
- Converts `field: "gs1.planner_state.topic"` → `actual: "{gs1.response...}"`
- Maps MCP field paths to Eval API JSONPaths
- Auto-generates missing `id` fields on evaluator steps
- 11 new unit tests (208 total passing)
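Two of the rules listed above, merging `type: "evaluator"` with `evaluator_type` and filling in missing `id` fields, can be sketched as below. This is not the PR's `normalizeMcpShorthand`: the function name, the `Step` type and the `eval_N` id scheme are assumptions made for illustration, and the field-path mapping is left out because the commit message elides its target format.

```typescript
// Hypothetical loose step shape; the real payload types are not shown in the PR text.
type Step = Record<string, unknown>;

function normalizeShorthandSketch(steps: readonly Step[]): Step[] {
  let generated = 0;
  return steps.map((step) => {
    const out: Step = { ...step };

    // Merge `type: "evaluator"` + `evaluator_type: "xxx"` into `type: "evaluator.xxx"`.
    const evaluatorType = out['evaluator_type'];
    if (out['type'] === 'evaluator' && typeof evaluatorType === 'string') {
      out['type'] = `evaluator.${evaluatorType}`;
      delete out['evaluator_type'];
    }

    // Give evaluator steps an id when none was provided (id format is made up).
    const type = out['type'];
    if (typeof type === 'string' && type.startsWith('evaluator.') && out['id'] === undefined) {
      generated += 1;
      out['id'] = `eval_${generated}`;
    }
    return out;
  });
}

const normalized = normalizeShorthandSketch([
  { type: 'evaluator', evaluator_type: 'planner_topic_assertion' },
]);
console.log(normalized[0]['type']); // evaluator.planner_topic_assertion
```

Running this pass first means later normalization steps only ever see the raw Eval API form of `type`.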
@WillieRuemmele WillieRuemmele requested a review from a team as a code owner March 6, 2026 18:22
@salesforce-cla
salesforce-cla bot commented Mar 6, 2026

Thanks for the contribution! Unfortunately we can't verify the commit author(s): Tanner McGrath <t***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

@WillieRuemmele WillieRuemmele changed the title from "Feat/agent test run eval" to "Feat/agent test run eval @W-21482725@" Mar 6, 2026
Contributor

@shetzel shetzel left a comment

We'll definitely refactor some things before this can go GA, but very nice beta addition!


// Set exit code to 1 if any tests failed or errored
if (summary.failed > 0 || summary.errors > 0) {
  process.exitCode = 1;
}

We'll want to match apex test run and the other agent test command for exit codes.

@WillieRuemmele WillieRuemmele merged commit 0eb4a6d into main Mar 6, 2026
14 of 15 checks passed
@WillieRuemmele WillieRuemmele deleted the feat/agent-test-run-eval branch March 6, 2026 22:39