
Add agent experience testing framework, expand .claude #561

Open
muhsinking wants to merge 8 commits into main from agent-tests

Conversation


muhsinking (Collaborator) commented Mar 20, 2026

Agent experience testing framework

Summary

Adds a lightweight framework for testing documentation quality by having AI coding agents attempt real-world tasks using only the docs. Tests reveal documentation gaps by simulating what happens when a user asks "how do I deploy a vLLM endpoint?" without any prior context.

Philosophy

Tests are intentionally hard to pass. Each test is a single sentence—no hints, no steps, no doc references. If the docs are good, an agent can figure it out. If not, the test reveals exactly where users get stuck.

How it works

  1. Define tests as rows in a table (tests/TESTS.md)
  2. Run tests via natural language in Claude Code: Run the vllm-deploy test
  3. Agent searches docs, attempts the goal, cleans up resources
  4. Agent writes a report identifying documentation gaps and suggesting improvements
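The first step above means the runner has to read test definitions out of a markdown table. A minimal sketch of how a row of TESTS.md could be parsed, assuming the three-column layout shown in the sample row under "Test format" (the function name is hypothetical, not part of this PR):

```python
import re

def parse_test_row(line):
    """Parse one TESTS.md table row, e.g.
    '| vllm-deploy | Deploy a vLLM endpoint | Hard |',
    into an (id, goal, difficulty) tuple. Returns None for
    blank lines and for the header separator row."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    if len(cells) != 3:
        return None
    # Skip the '---' separator row under the header.
    if all(re.fullmatch(r":?-{3,}:?", c) for c in cells):
        return None
    return tuple(cells)
```

Keeping the parser this strict (exactly three cells) doubles as a lint: any malformed row simply fails to parse instead of silently producing a broken test.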

Two doc source modes

| Mode | Command | Use case |
| --- | --- | --- |
| Published | Run the vllm-deploy test | Test live documentation |
| Local | Run the vllm-deploy test using local docs | Validate changes before publishing |

Local mode reads .mdx files directly from the repo, letting you test doc changes on a branch before merging.
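As a sketch of what that local file discovery could look like (assuming the docs are plain .mdx files anywhere under the repo root; the function name is hypothetical and not part of this PR):

```python
from pathlib import Path

def find_local_docs(repo_root):
    """Collect all .mdx doc files under the repo, sorted for
    deterministic ordering. Hidden directories such as .git
    are skipped."""
    root = Path(repo_root)
    return sorted(
        p for p in root.rglob("*.mdx")
        if not any(part.startswith(".")
                   for part in p.relative_to(root).parts)
    )
```

Sorting makes the agent's reading order reproducible across runs, which matters when comparing reports between branches.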

Test coverage

~85 tests across 13 product areas:

  • Flash SDK (13 tests): Deploying Python functions to GPUs
  • Serverless Endpoints (18 tests): Creating endpoints, scaling, streaming, webhooks
  • vLLM (6 tests): LLM deployment and OpenAI compatibility
  • Pods (9 tests): GPU instances, SSH, storage
  • Storage (11 tests): Network volumes, S3 API, file transfer
  • Templates (6 tests): Creating and using templates
  • Instant Clusters (4 tests): Multi-node deployments
  • SDKs & APIs (8 tests): Python, JavaScript, GraphQL
  • CLI (6 tests): runpodctl operations
  • Integrations (4 tests): Cursor, Vercel AI, SkyPilot
  • Tutorials (9 tests): End-to-end workflows

Test format

Tests are minimal by design:

| serverless-serve-qwen | Create an endpoint to serve a Qwen model | Hard |

That's it. The agent must figure out everything else from the docs.

Report output

After each test, reports are saved to tests/reports/ (gitignored):

# Test Report: Create a serverless endpoint

**Date:** 2026-03-19 20:16:07
**Status:** PASS

## What Happened
Successfully created a serverless endpoint...

## Where I Got Stuck
Finding the templateId was not obvious...

## Documentation Gaps
1. No "list templates" step in endpoint creation docs
2. Template ID discovery is buried...

## Suggestions
1. Add a "Prerequisites" section...

Files changed

tests/
├── README.md          # Quick reference
├── TESTS.md           # All test definitions (single file)
└── reports/           # Test reports (gitignored)

.claude/
└── testing.md         # Agent instructions for running tests

README.md              # Added "Agent experience testing" section
.gitignore             # Added tests/reports/

Requirements

  • Claude Code with MCP servers configured:
    claude mcp add runpod -e RUNPOD_API_KEY=your_key -- npx -y @runpod/mcp-server@latest
    claude mcp add runpod-docs --transport http https://docs.runpod.io/mcp

Safety

  • All test resources use doc_test_ prefix
  • Cleanup runs after each test
  • Tests only create/delete their own resources


runpod-Henrik left a comment


PR #561 — Add agent experience testing framework, expand .claude

No prior reviews found — this is a first-time review.


1. MCP tool name typo — .claude/testing.md

Issue: Line 413 references mcp__runpod-dops__search_runpod_documentation. The MCP server is registered as runpod-docs, so the tool name should be mcp__runpod-docs__search_runpod_documentation. The misspelling will cause all Published Docs mode test runs to fail — the tool won't resolve.


2. Test table format doesn't match documented spec

Issue: tests/README.md and .claude/testing.md both say each test has three fields: ID, Goal, and Cleanup. The actual tables in TESTS.md have columns ID | Goal | Difficulty — no Cleanup column. An agent reading a test definition won't know what resources to clean up from the table; it has to infer from the bottom section. Either update the spec to say cleanup rules are global (not per-test), or add the Cleanup column back to the tables.


3. Port limit accuracy — runpodctl-create-pod.mdx

Question: The original said "Maximum of 1 HTTP port and 1 TCP port allowed." The new text says "up to 10 HTTP ports and multiple TCP ports." Is that backed by actual runpodctl behavior? If the original limit still applies to the CLI (even if the REST API allows more), this would mislead users into configurations that fail.


4. Framework vs catalog

This is a well-conceived idea but the implementation is a catalog, not a framework. A few structural gaps will limit how useful it is in practice:

No automation layer. Tests are triggered by a human typing natural language to Claude Code. There's no runner, no CI hook, no batch mode. 85 tests that require manual one-by-one triggering will never get run systematically.

Results are ephemeral. tests/reports/ is gitignored. There's no history, no trend tracking, no way to know which tests consistently fail across doc changes.

No smoke test tier. Many tests require live GPU deploys. There's no defined fast subset (10–15 tests) suitable for running before every merge. Without that, the full suite is too expensive to run regularly.

No success criteria. Difficulty (Easy/Hard) isn't a pass condition. The agent decides what PASS means, which will be inconsistent across runs. A brief expected outcome per test (e.g. "endpoint responds with 200 to a /runsync request") would anchor the verdict.
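A pass condition like the one suggested could be made machine-checkable with a tiny predicate per test; a sketch, assuming a /runsync-style response carrying a status field (the function name and field values here are illustrative, not a proposal for specific API semantics):

```python
def runsync_passes(status_code, body):
    """Illustrative pass condition for an endpoint test:
    the /runsync call returned HTTP 200 and the job
    reports a completed status."""
    return status_code == 200 and body.get("status") == "COMPLETED"
```

With one such predicate recorded per test, PASS stops being the agent's judgment call and becomes reproducible across runs.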

No cleanup safety net. If a test crashes mid-run, doc_test_ resources are orphaned. A cleanup script (e.g. delete all resources matching doc_test_*) would prevent cost surprises.
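A sketch of such a safety net, with the prefix filter kept separate from the (platform-specific) delete call so the selection logic is easy to audit; the function name and resource-listing shape are hypothetical:

```python
DOC_TEST_PREFIX = "doc_test_"

def select_test_resources(resource_names):
    """Return only resources owned by the test suite, so a bulk
    delete can never touch anything outside the doc_test_
    namespace."""
    return [n for n in resource_names if n.startswith(DOC_TEST_PREFIX)]

# The actual cleanup would loop over
# select_test_resources(<names from the platform's list API>)
# and call the corresponding delete API for each match.
```

Because the filter is a pure function, it can be unit-tested against fake name lists before ever being pointed at a live account.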

The local-docs mode for pre-merge validation is the most immediately useful feature here. The published-docs batch testing vision is worth pursuing but needs the automation layer first.


Nits

  • .gitignore is missing a trailing newline.
  • Double --- separator before the Cleanup Rules section in TESTS.md — looks like a copy-paste artifact.

Verdict

NEEDS WORK — The MCP tool name typo (#1) silently breaks Published Docs mode. The Cleanup column mismatch (#2) creates ambiguity for running agents. The port limit change (#3) needs factual verification. Section 4 is not a blocker but worth discussing before the suite grows further.

🤖 Reviewed by Henrik's AI-Powered Bug Finder
