Add agent experience testing framework, expand .claude #561
muhsinking wants to merge 8 commits into main from
Conversation
runpod-Henrik
left a comment
PR #561 — Add agent experience testing framework, expand .claude
No prior reviews found — this is a first-time review.
1. MCP tool name typo — .claude/testing.md
Issue: Line 413 references mcp__runpod-dops__search_runpod_documentation. The MCP server is registered as runpod-docs, so the tool name should be mcp__runpod-docs__search_runpod_documentation. The misspelling will cause all Published Docs mode test runs to fail — the tool won't resolve.
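The fix is a one-token rename in `.claude/testing.md` (both names taken directly from the issue above):

```diff
- mcp__runpod-dops__search_runpod_documentation
+ mcp__runpod-docs__search_runpod_documentation
```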
2. Test table format doesn't match documented spec
Issue: tests/README.md and .claude/testing.md both say each test has three fields: ID, Goal, and Cleanup. The actual tables in TESTS.md have columns ID | Goal | Difficulty — no Cleanup column. An agent reading a test definition won't know what resources to clean up from the table; it has to infer from the bottom section. Either update the spec to say cleanup rules are global (not per-test), or add the Cleanup column back to the tables.
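If the column is restored, the tables would look something like this — the Goal and Cleanup text here are hypothetical illustrations; only the `vllm-deploy` ID appears in the PR:

```
| ID          | Goal                     | Difficulty | Cleanup                      |
| ----------- | ------------------------ | ---------- | ---------------------------- |
| vllm-deploy | Deploy a vLLM endpoint   | Hard       | Delete the doc_test_ endpoint |
```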
3. Port limit accuracy — runpodctl-create-pod.mdx
Question: The original said "Maximum of 1 HTTP port and 1 TCP port allowed." The new text says "up to 10 HTTP ports and multiple TCP ports." Is that backed by actual runpodctl behavior? If the original limit still applies to the CLI (even if the REST API allows more), this would mislead users into configurations that fail.
4. Framework vs catalog
This is a well-conceived idea but the implementation is a catalog, not a framework. A few structural gaps will limit how useful it is in practice:
No automation layer. Tests are triggered by a human typing natural language to Claude Code. There's no runner, no CI hook, no batch mode. 85 tests that require manual one-by-one triggering will never get run systematically.
Results are ephemeral. tests/reports/ is gitignored. There's no history, no trend tracking, no way to know which tests consistently fail across doc changes.
No smoke test tier. Many tests require live GPU deploys. There's no defined fast subset (10–15 tests) suitable for running before every merge. Without that, the full suite is too expensive to run regularly.
No success criteria. Difficulty (Easy/Hard) isn't a pass condition. The agent decides what PASS means, which will be inconsistent across runs. A brief expected outcome per test (e.g. "endpoint responds with 200 to a /runsync request") would anchor the verdict.
No cleanup safety net. If a test crashes mid-run, doc_test_ resources are orphaned. A cleanup script (e.g. delete all resources matching doc_test_*) would prevent cost surprises.
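A minimal sketch of that safety net, assuming resource names can be listed via runpodctl or the RunPod API (only the name-filtering logic is shown; the listing/deletion wiring is left out):

```python
# Sketch of a cleanup safety net for orphaned doc_test_ resources.
# Assumption: resource names are obtainable from runpodctl or the
# RunPod API; only the prefix-matching step is implemented here.

def find_orphans(resource_names, prefix="doc_test_"):
    """Return names of resources left behind by crashed test runs."""
    return [name for name in resource_names if name.startswith(prefix)]

# Example: given a pod listing, keep only the test-created ones.
pods = ["doc_test_vllm", "prod-inference", "doc_test_comfyui"]
print(find_orphans(pods))  # → ['doc_test_vllm', 'doc_test_comfyui']
```

Anything this returns is safe to delete by construction, since the naming convention guarantees no production resource carries the prefix.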
The local-docs mode for pre-merge validation is the most immediately useful feature here. The published-docs batch testing vision is worth pursuing but needs the automation layer first.
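The automation layer could start as a thin wrapper over the existing natural-language triggers. A minimal sketch, assuming the Claude Code CLI's headless `claude -p "<prompt>"` mode; test IDs other than `vllm-deploy` would come from the catalog:

```python
# Sketch of a batch runner for the test catalog. Assumption: the
# Claude Code CLI accepts a one-shot prompt via `claude -p`.
import subprocess

def build_prompt(test_id, local_docs=False):
    """Compose the natural-language trigger each test expects."""
    suffix = " using local docs" if local_docs else ""
    return f"Run the {test_id} test{suffix}"

def run_suite(test_ids, local_docs=False):
    """Trigger each test in sequence; exit codes could feed a CI gate."""
    for test_id in test_ids:
        subprocess.run(["claude", "-p", build_prompt(test_id, local_docs)])

# Usage (not executed here): run_suite(["vllm-deploy"], local_docs=True)
```

Even this much would make the smoke-test tier practical: a curated list of IDs piped through `run_suite` before every merge.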
Nits
- `.gitignore` is missing a trailing newline.
- Double `---` separator before the Cleanup Rules section in `TESTS.md` — looks like a copy-paste artifact.
Verdict
NEEDS WORK — The MCP tool name typo (#1) silently breaks Published Docs mode. The Cleanup column mismatch (#2) creates ambiguity for running agents. The port limit change (#3) needs factual verification. Section 4 is not a blocker but worth discussing before the suite grows further.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
Agent experience testing framework
Summary
Adds a lightweight framework for testing documentation quality by having AI coding agents attempt real-world tasks using only the docs. Tests reveal documentation gaps by simulating what happens when a user asks "how do I deploy a vLLM endpoint?" without any prior context.
Philosophy
Tests are intentionally hard to pass. Each test is a single sentence—no hints, no steps, no doc references. If the docs are good, an agent can figure it out. If not, the test reveals exactly where users get stuck.
How it works
Tests are defined in a catalog (tests/TESTS.md). A test is triggered by prompting the agent in natural language, e.g. "Run the vllm-deploy test".
Two doc source modes
- Published docs: "Run the vllm-deploy test"
- Local docs: "Run the vllm-deploy test using local docs"
Local mode reads .mdx files directly from the repo, letting you test doc changes on a branch before merging.
Test coverage
~85 tests across 13 product areas:
Test format
Tests are minimal by design:
That's it. The agent must figure out everything else from the docs.
Report output
After each test, reports are saved to tests/reports/ (gitignored).
Files changed
Requirements
Safety
Test resources use the doc_test_ prefix