fix(swtbench): align docker workspace image building with SWE-bench #437

Closed
simonrosenberg wants to merge 3 commits into main from openhands/fix-swtbench-docker-image-436

@simonrosenberg simonrosenberg commented Feb 23, 2026

Summary

Fixes #436

The SWT-bench docker workspace mode was failing because it attempted to start a container from an image tag that doesn't exist locally or in GHCR. The root cause: SKIP_BUILD defaulted to 1, so the image was never built, and no documentation guided users to build images first (unlike SWE-bench's README "Step 1").

What changed

Auto-detect missing images (benchmarks/swtbench/run_infer.py):

When SKIP_BUILD is not explicitly set, the runner now checks whether the agent-server image exists in the local Docker daemon via docker image inspect. If it's missing, it builds automatically. This gives a zero-config experience:

| SKIP_BUILD value | Behavior |
| --- | --- |
| 1 / true / yes | Always skip build (user has pre-built images) |
| 0 / false / no | Always build (force rebuild) |
| Not set (default) | Auto-detect: build only if image is missing locally |
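The three-way behavior above can be sketched roughly as follows. This is an illustrative outline, not the code in run_infer.py: the names `local_docker_image_exists` and `should_build` are hypothetical, and raising on an unrecognized SKIP_BUILD value is an assumption the PR does not specify.

```python
import os
import subprocess


def local_docker_image_exists(image: str) -> bool:
    """Return True if `docker image inspect` finds the image locally."""
    try:
        result = subprocess.run(
            ["docker", "image", "inspect", image],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,  # a non-zero exit code just means "not found"
        )
    except FileNotFoundError:
        # The docker CLI itself is not installed; treat the image as missing.
        return False
    return result.returncode == 0


def should_build(image, env=None, image_exists=local_docker_image_exists) -> bool:
    """Decide whether to build, following the SKIP_BUILD table."""
    env = os.environ if env is None else env
    raw = env.get("SKIP_BUILD")
    if raw is None:
        # Not set: auto-detect, build only if the image is missing locally.
        return not image_exists(image)
    value = raw.strip().lower()
    if value in ("1", "true", "yes"):
        return False  # explicit skip, user has pre-built images
    if value in ("0", "false", "no"):
        return True  # explicit force rebuild
    # Assumption: reject anything else rather than guess the user's intent.
    raise ValueError(f"Unrecognized SKIP_BUILD value: {raw!r}")
```

Passing `env` and `image_exists` as parameters keeps the decision testable without a Docker daemon; a real runner would likely read `os.environ` directly.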

Structural improvements (from the original PR, preserved):

  • Import get_official_docker_image and extract_custom_tag from swebench.build_images instead of duplicating
  • Use shared build_image() from utils.build_utils for consistent build behavior
  • Remove DockerDevWorkspace — always use DockerWorkspace with pre-built or just-built images

Testing

  • 11 tests in tests/test_swtbench_run_infer.py (4 new for auto-detect behavior)
  • test_auto_builds_when_skip_build_unset_and_image_missing — verifies auto-build triggers
  • test_skips_build_when_skip_build_unset_and_image_exists_locally — verifies no rebuild when image exists
  • test_returns_true/false_when_image_exists/missing — unit tests for _local_docker_image_exists
  • 44 tests pass; the only failures are 2 pre-existing ones in the unrelated test_instance_timeout.py
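As a rough illustration of how the image-existence unit tests might be written without a running Docker daemon, the sketch below patches `subprocess.run`. The helper `image_exists` and the tag `example:tag` are stand-ins for illustration; the actual tests in tests/test_swtbench_run_infer.py may be structured differently.

```python
import subprocess
from unittest import mock


def image_exists(image: str) -> bool:
    """Hypothetical helper: True if `docker image inspect` exits with 0."""
    result = subprocess.run(
        ["docker", "image", "inspect", image],
        capture_output=True,
        check=False,
    )
    return result.returncode == 0


def test_returns_true_when_image_exists():
    # Simulate a successful `docker image inspect` call.
    completed = subprocess.CompletedProcess(args=[], returncode=0)
    with mock.patch("subprocess.run", return_value=completed) as run:
        assert image_exists("example:tag") is True
    run.assert_called_once()
    assert run.call_args.args[0] == ["docker", "image", "inspect", "example:tag"]


def test_returns_false_when_image_missing():
    # Simulate the non-zero exit code docker returns for an unknown image.
    completed = subprocess.CompletedProcess(args=[], returncode=1)
    with mock.patch("subprocess.run", return_value=completed):
        assert image_exists("example:tag") is False
```

Both functions follow pytest naming conventions, so they would be collected automatically if placed in a test module.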

Usage

# Just works — images are built automatically if missing
uv run swtbench-infer .llm_config/gpt.json --workspace docker --num-workers 1

# Or pre-build images in bulk first, then run with SKIP_BUILD=1
uv run python -m benchmarks.swtbench.build_images --dataset <dataset> --split <split>
SKIP_BUILD=1 uv run swtbench-infer .llm_config/gpt.json --workspace docker

The SWT-bench docker workspace mode was failing because it attempted to
start a container from a GHCR image tag that did not exist. The issue
was that SWT-bench's prepare_workspace method used DockerDevWorkspace
for image building (when SKIP_BUILD=0), while SWE-bench uses the
build_image() function from build_utils.

This commit aligns SWT-bench's prepare_workspace with SWE-bench's
implementation:

1. Import get_official_docker_image and extract_custom_tag from
   swebench.build_images instead of defining local versions
2. Import build_image from utils.build_utils for local image building
3. Remove DockerDevWorkspace import and usage
4. Use build_image() when SKIP_BUILD=0 (same as SWE-bench)
5. Always use DockerWorkspace (not DockerDevWorkspace)

The fix ensures that when running with --workspace docker and
SKIP_BUILD=0, the image is built locally using the standard SDK
build infrastructure, producing the correct image tag format.

Fixes #436

Co-authored-by: openhands <openhands@all-hands.dev>
When SKIP_BUILD is not explicitly set, detect whether the agent-server
image exists in the local Docker daemon. If it's missing, build it
automatically instead of failing with "image not found". This gives
users a zero-config experience with `--workspace docker`.

Behavior:
- SKIP_BUILD=1: always skip build (explicit opt-in, pre-built images)
- SKIP_BUILD=0: always build (explicit opt-in, force rebuild)
- SKIP_BUILD unset: auto-detect via `docker image inspect`

Fixes #436

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg marked this pull request as ready for review February 24, 2026 18:55
Resolve conflicts in benchmarks/swtbench/run_infer.py by taking main's
refactored version which already incorporates the PR's auto-build
feature via create_docker_workspace from image_utils. Update tests
to match the new API surface (local_image_exists in image_utils,
create_docker_workspace, IMAGE_TAG_PREFIX).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg
Collaborator Author

Closing because the issue was fixed by #456


Development

Successfully merging this pull request may close these issues.

SWT-bench docker workspace pulls non-existent GHCR image tag ghcr.io/openhands/eval-agent-server:<SDK_SHA>-sweb.eval...-source-minimal