
Add agent experience testing framework, expand .claude #561

Open
muhsinking wants to merge 8 commits into main from agent-tests

Conversation


muhsinking (Collaborator) commented Mar 20, 2026

Agent experience testing framework

Summary

Adds a lightweight framework for testing documentation quality by having AI coding agents attempt real-world tasks using only the docs. Tests reveal documentation gaps by simulating what happens when a user asks "how do I deploy a vLLM endpoint?" without any prior context.

Philosophy

Tests are intentionally hard to pass. Each test is a single sentence—no hints, no steps, no doc references. If the docs are good, an agent can figure it out. If not, the test reveals exactly where users get stuck.

How it works

  1. Define tests as rows in a table (tests/TESTS.md)
  2. Run tests via natural language in Claude Code: Run the vllm-deploy test
  3. Agent searches docs, attempts the goal, cleans up resources
  4. Agent writes a report identifying documentation gaps and suggesting improvements
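The first step above means the runner has to read test definitions out of a markdown table. A minimal sketch of how a row of TESTS.md could be parsed, assuming the three-column layout shown in the sample row under "Test format" (the function name is hypothetical, not part of this PR):

```python
import re

def parse_test_row(line):
    """Parse one TESTS.md table row, e.g.
    '| vllm-deploy | Deploy a vLLM endpoint | Hard |',
    into an (id, goal, difficulty) tuple. Returns None for
    blank lines and for the header separator row."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    if len(cells) != 3:
        return None
    # Skip the '---' separator row under the header.
    if all(re.fullmatch(r":?-{3,}:?", c) for c in cells):
        return None
    return tuple(cells)
```

Keeping the parser this strict (exactly three cells) doubles as a lint: any malformed row simply fails to parse instead of silently producing a broken test.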

Two doc source modes

| Mode | Command | Use case |
| --- | --- | --- |
| Published | Run the vllm-deploy test | Test live documentation |
| Local | Run the vllm-deploy test using local docs | Validate changes before publishing |

Local mode reads .mdx files directly from the repo, letting you test doc changes on a branch before merging.
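As a sketch of what that local file discovery could look like (assuming the docs are plain .mdx files anywhere under the repo root; the function name is hypothetical and not part of this PR):

```python
from pathlib import Path

def find_local_docs(repo_root):
    """Collect all .mdx doc files under the repo, sorted for
    deterministic ordering. Hidden directories such as .git
    are skipped."""
    root = Path(repo_root)
    return sorted(
        p for p in root.rglob("*.mdx")
        if not any(part.startswith(".")
                   for part in p.relative_to(root).parts)
    )
```

Sorting makes the agent's reading order reproducible across runs, which matters when comparing reports between branches.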

Test coverage

~85 tests across 13 product areas:

  • Flash SDK (13 tests): Deploying Python functions to GPUs
  • Serverless Endpoints (18 tests): Creating endpoints, scaling, streaming, webhooks
  • vLLM (6 tests): LLM deployment and OpenAI compatibility
  • Pods (9 tests): GPU instances, SSH, storage
  • Storage (11 tests): Network volumes, S3 API, file transfer
  • Templates (6 tests): Creating and using templates
  • Instant Clusters (4 tests): Multi-node deployments
  • SDKs & APIs (8 tests): Python, JavaScript, GraphQL
  • CLI (6 tests): runpodctl operations
  • Integrations (4 tests): Cursor, Vercel AI, SkyPilot
  • Tutorials (9 tests): End-to-end workflows

Test format

Tests are minimal by design:

| serverless-serve-qwen | Create an endpoint to serve a Qwen model | Hard |

That's it. The agent must figure out everything else from the docs.

Report output

After each test, reports are saved to tests/reports/ (gitignored):

# Test Report: Create a serverless endpoint

**Date:** 2026-03-19 20:16:07
**Status:** PASS

## What Happened
Successfully created a serverless endpoint...

## Where I Got Stuck
Finding the templateId was not obvious...

## Documentation Gaps
1. No "list templates" step in endpoint creation docs
2. Template ID discovery is buried...

## Suggestions
1. Add a "Prerequisites" section...

Files changed

tests/
├── README.md          # Quick reference
├── TESTS.md           # All test definitions (single file)
└── reports/           # Test reports (gitignored)

.claude/
└── testing.md         # Agent instructions for running tests

README.md              # Added "Agent experience testing" section
.gitignore             # Added tests/reports/

Requirements

  • Claude Code with MCP servers configured:
    claude mcp add runpod -e RUNPOD_API_KEY=your_key -- npx -y @runpod/mcp-server@latest
    claude mcp add runpod-docs --transport http https://docs.runpod.io/mcp

Safety

  • All test resources use doc_test_ prefix
  • Cleanup runs after each test
  • Tests only create/delete their own resources


runpod-Henrik left a comment


PR #561 — Add agent experience testing framework, expand .claude

No prior reviews found — this is a first-time review.


1. MCP tool name typo — .claude/testing.md

Issue: Line 413 references mcp__runpod-dops__search_runpod_documentation. The MCP server is registered as runpod-docs, so the tool name should be mcp__runpod-docs__search_runpod_documentation. The misspelling will cause all Published Docs mode test runs to fail — the tool won't resolve.


2. Test table format doesn't match documented spec

Issue: tests/README.md and .claude/testing.md both say each test has three fields: ID, Goal, and Cleanup. The actual tables in TESTS.md have columns ID | Goal | Difficulty — no Cleanup column. An agent reading a test definition won't know what resources to clean up from the table; it has to infer from the bottom section. Either update the spec to say cleanup rules are global (not per-test), or add the Cleanup column back to the tables.


3. Port limit accuracy — runpodctl-create-pod.mdx

Question: The original said "Maximum of 1 HTTP port and 1 TCP port allowed." The new text says "up to 10 HTTP ports and multiple TCP ports." Is that backed by actual runpodctl behavior? If the original limit still applies to the CLI (even if the REST API allows more), this would mislead users into configurations that fail.


4. Framework vs catalog

This is a well-conceived idea but the implementation is a catalog, not a framework. A few structural gaps will limit how useful it is in practice:

No automation layer. Tests are triggered by a human typing natural language to Claude Code. There's no runner, no CI hook, no batch mode. 85 tests that require manual one-by-one triggering will never get run systematically.

Results are ephemeral. tests/reports/ is gitignored. There's no history, no trend tracking, no way to know which tests consistently fail across doc changes.

No smoke test tier. Many tests require live GPU deploys. There's no defined fast subset (10–15 tests) suitable for running before every merge. Without that, the full suite is too expensive to run regularly.

No success criteria. Difficulty (Easy/Hard) isn't a pass condition. The agent decides what PASS means, which will be inconsistent across runs. A brief expected outcome per test (e.g. "endpoint responds with 200 to a /runsync request") would anchor the verdict.
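A pass condition like the one suggested could be made machine-checkable with a tiny predicate per test; a sketch, assuming a /runsync-style response carrying a status field (the function name and field values here are illustrative, not a proposal for specific API semantics):

```python
def runsync_passes(status_code, body):
    """Illustrative pass condition for an endpoint test:
    the /runsync call returned HTTP 200 and the job
    reports a completed status."""
    return status_code == 200 and body.get("status") == "COMPLETED"
```

With one such predicate recorded per test, PASS stops being the agent's judgment call and becomes reproducible across runs.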

No cleanup safety net. If a test crashes mid-run, doc_test_ resources are orphaned. A cleanup script (e.g. delete all resources matching doc_test_*) would prevent cost surprises.
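A sketch of such a safety net, with the prefix filter kept separate from the (platform-specific) delete call so the selection logic is easy to audit; the function name and resource-listing shape are hypothetical:

```python
DOC_TEST_PREFIX = "doc_test_"

def select_test_resources(resource_names):
    """Return only resources owned by the test suite, so a bulk
    delete can never touch anything outside the doc_test_
    namespace."""
    return [n for n in resource_names if n.startswith(DOC_TEST_PREFIX)]

# The actual cleanup would loop over
# select_test_resources(<names from the platform's list API>)
# and call the corresponding delete API for each match.
```

Because the filter is a pure function, it can be unit-tested against fake name lists before ever being pointed at a live account.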

The local-docs mode for pre-merge validation is the most immediately useful feature here. The published-docs batch testing vision is worth pursuing but needs the automation layer first.


Nits

  • .gitignore is missing a trailing newline.
  • Double --- separator before the Cleanup Rules section in TESTS.md — looks like a copy-paste artifact.

Verdict

NEEDS WORK — The MCP tool name typo (#1) silently breaks Published Docs mode. The Cleanup column mismatch (#2) creates ambiguity for running agents. The port limit change (#3) needs factual verification. Section 4 is not a blocker but worth discussing before the suite grows further.

🤖 Reviewed by Henrik's AI-Powered Bug Finder
