Conversation
Add first-class Primus launcher support for Kubernetes and SLURM: infer backend and examples overlay from model/config, sanitize Job/Service/ ConfigMap/container names (k8s_names), and extend templates and mixin command generation. Docker: resolve repo-root build context for Primus Dockerfiles; align container run log filenames with build logs when the image is a full registry/repo:tag reference (container_runner_helpers). Docs: launchers and k8s-config examples README. Tests: integration (Docker context), unit coverage for execution helpers, K8s secrets/PVC/names, and consolidated Primus tests (test_primus).
There was a problem hiding this comment.
Pull request overview
Adds first-class Primus launcher support across local, Kubernetes, and SLURM deployments, plus supporting infra for K8s-safe naming and improved Docker context/log handling so Primus images/configs run consistently across environments.
Changes:
- Introduce Primus backend/config inference helpers and integrate Primus launcher command/env generation for Kubernetes and SLURM.
- Add Kubernetes name/label/container sanitization utilities and update K8s templates to use sanitized labels/container names.
- Improve Docker build/run ergonomics for Primus (repo-root build context; normalize run log naming for full image refs) and add corresponding tests/docs.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_primus.py | New unit coverage for Primus backend/overlay inference and K8s Primus command generation |
| tests/unit/test_k8s.py | Adds unit tests for K8s name/label/container sanitization utilities |
| tests/unit/test_execution.py | Extends log naming tests to cover full registry refs mapping to CI tag |
| tests/integration/test_docker_integration.py | Integration test ensuring Primus Docker builds use repo-root context |
| src/madengine/orchestration/run_orchestrator.py | Restores docker_env_vars from manifest context during local run |
| src/madengine/execution/docker_builder.py | Chooses repo-root Docker build context for Primus dockerfiles by hint |
| src/madengine/execution/container_runner_helpers.py | Normalizes docker image refs for run log filename consistency |
| src/madengine/execution/container_runner.py | Adds Primus env passthrough; resolves missing Primus images via fallback; preserves resolved image in results |
| src/madengine/deployment/templates/slurm/job.sh.j2 | Avoids printing MAD_MULTI_NODE_RUNNER for launchers that don’t use it |
| src/madengine/deployment/templates/kubernetes/service.yaml.j2 | Switches model label to sanitized model_label |
| src/madengine/deployment/templates/kubernetes/job.yaml.j2 | Switches model label to model_label; uses main_container_name; includes primus/torchtitan in distributed env block |
| src/madengine/deployment/templates/kubernetes/configmap.yaml.j2 | Switches model label to sanitized model_label |
| src/madengine/deployment/slurm.py | Adds Primus launcher env generation (PRIMUS_CONFIG_PATH/CLI_EXTRA/BACKEND inference) |
| src/madengine/deployment/primus_backend.py | New helper module: merges Primus config; infers BACKEND and examples overlay subdirs |
| src/madengine/deployment/kubernetes_launcher_mixin.py | Adds K8s Primus launcher env/command generation |
| src/madengine/deployment/kubernetes.py | Adds K8s name sanitization; bundles Primus examples overlay into ConfigMap; wires Primus launcher |
| src/madengine/deployment/k8s_names.py | New K8s naming utilities for object names, label values, and container names |
| src/madengine/deployment/common.py | Adds “primus” to valid launcher list |
| examples/k8s-configs/README.md | Documents Primus-on-K8s behavior, caveats, and example configs |
| docs/launchers.md | Adds Primus launcher documentation and updates launcher lists/tables |
Comments suppressed due to low confidence (1)
src/madengine/deployment/templates/kubernetes/service.yaml.j2:12
- service.yaml.j2 uses service_name as metadata.name and job.yaml.j2 uses the same value via pod.spec.subdomain/launcher DNS. Kubernetes Service names must be DNS labels (no dots, <=63). If service_name is derived from job_name that may contain dots (e.g. '1.7b') or exceed 63, the Service will be rejected and multi-node DNS (pod-0....) will fail. Ensure service_name is sanitized as a DNS label and keep subdomain/launcher DNS consistent with that sanitized value.
name: {{ service_name }}
namespace: {{ namespace }}
labels:
app: madengine
model: {{ model_label }}
spec:
clusterIP: None # Headless service for torchrun coordination
selector:
job-name: {{ job_name }}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| labels: | ||
| app: madengine | ||
| job-name: {{ job_name }} | ||
| model: {{ model_name }} | ||
| model: {{ model_label }} | ||
| spec: |
There was a problem hiding this comment.
The pod template sets label job-name: {{ job_name }} and the Service selector matches on this label. Label values are limited to 63 chars and cannot contain arbitrary characters; job_name can exceed this after sanitization. Introduce a dedicated short/sanitized label value (e.g. job_label = sanitize_k8s_label_value(job_name)) and use it consistently for pod labels and the Service selector.
| master_dns = ( | ||
| f"{self.job_name}-0.{self.job_name}.{self.namespace}.svc.cluster.local" | ||
| ) | ||
| lines.extend( | ||
| [ | ||
| "# Multi-node: Indexed Job + headless Service (pod-0 DNS as master)", | ||
| f'export MASTER_ADDR="{master_dns}"', | ||
| f"export MASTER_PORT={master_port}", | ||
| f"export NNODES={nnodes}", | ||
| "export NODE_RANK=${JOB_COMPLETION_INDEX}", | ||
| f"export GPUS_PER_NODE={nproc_per_node}", | ||
| f"export MAD_RUNTIME_NGPUS={nproc_per_node}", | ||
| ] |
There was a problem hiding this comment.
master_dns is constructed from self.job_name for both the pod hostname prefix and the Service/subdomain portion. If job_name contains dots or is >63 chars, the Service name portion will be invalid (DNS label) and the rendezvous address will not resolve. Use the same sanitized DNS-label value that is used for the headless Service name/subdomain (e.g. a dedicated self.service_name) when building master_dns.
| Reduce a Docker image reference to the same string ``DockerBuilder`` uses for tags. | ||
|
|
||
| Build logs use ``{model}_{dockerfile}.build.live.log`` where the image tag is | ||
| ``ci-{model_lower}_{dockerfile}``. Run may receive a full ref (e.g. | ||
| ``registry/namespace/name:ci-...``). Using the whole ref breaks pairing with | ||
| ``*.build.live.log`` and embeds ``/`` and ``:`` in filenames. | ||
|
|
||
| Rules: | ||
|
|
||
| - Strip digest (``@sha256:...``). | ||
| - If there is a ``/``, take the part after the last ``/`` (image name or | ||
| ``name:tag``). | ||
| - If that tail contains ``:``, return the tag (part after the first ``:`` in | ||
| the tail), which matches ``ci-...`` for normal ``repo/name:tag`` refs. | ||
| - Otherwise return the tail, or the whole string if there is no ``/``. | ||
|
|
||
| This matches short names like ``ci-model_ubuntu`` (no registry) unchanged. | ||
| """ | ||
| if not docker_image: | ||
| return docker_image | ||
| s = docker_image.strip() | ||
| if "@" in s: | ||
| s = s.split("@", 1)[0] | ||
| last_slash = s.rfind("/") | ||
| tail = s[last_slash + 1 :] if last_slash >= 0 else s | ||
| if ":" in tail: | ||
| return tail.split(":", 1)[1] | ||
| return tail |
There was a problem hiding this comment.
_docker_image_ref_for_log_naming() currently returns only the tag whenever the last path segment contains a ':'. This works for refs like repo/name:ci-..., but it breaks log naming for common refs like 'ubuntu:22.04' or 'myimage:latest' (it would reduce to just '22.04'/'latest', causing collisions and making dockerfile_part derivation incorrect). Consider only stripping to the tag when the tag itself matches the expected CI tag pattern (e.g. starts with 'ci-'), otherwise keep the name+tag (and just replace ':', '/', '@' with safe characters) so log filenames remain stable and unique.
| Reduce a Docker image reference to the same string ``DockerBuilder`` uses for tags. | |
| Build logs use ``{model}_{dockerfile}.build.live.log`` where the image tag is | |
| ``ci-{model_lower}_{dockerfile}``. Run may receive a full ref (e.g. | |
| ``registry/namespace/name:ci-...``). Using the whole ref breaks pairing with | |
| ``*.build.live.log`` and embeds ``/`` and ``:`` in filenames. | |
| Rules: | |
| - Strip digest (``@sha256:...``). | |
| - If there is a ``/``, take the part after the last ``/`` (image name or | |
| ``name:tag``). | |
| - If that tail contains ``:``, return the tag (part after the first ``:`` in | |
| the tail), which matches ``ci-...`` for normal ``repo/name:tag`` refs. | |
| - Otherwise return the tail, or the whole string if there is no ``/``. | |
| This matches short names like ``ci-model_ubuntu`` (no registry) unchanged. | |
| """ | |
| if not docker_image: | |
| return docker_image | |
| s = docker_image.strip() | |
| if "@" in s: | |
| s = s.split("@", 1)[0] | |
| last_slash = s.rfind("/") | |
| tail = s[last_slash + 1 :] if last_slash >= 0 else s | |
| if ":" in tail: | |
| return tail.split(":", 1)[1] | |
| return tail | |
| Reduce a Docker image reference to a stable filename-safe log naming component. | |
| Build logs use ``{model}_{dockerfile}.build.live.log`` where the image tag is | |
| ``ci-{model_lower}_{dockerfile}``. Run may receive a full ref (e.g. | |
| ``registry/namespace/name:ci-...``). For those CI-style refs, returning the | |
| tag preserves pairing with ``*.build.live.log``. For normal tagged refs such | |
| as ``ubuntu:22.04`` or ``myimage:latest``, collapsing to only the tag would | |
| cause collisions, so keep the name+tag and sanitize it for filenames. | |
| Rules: | |
| - Strip digest (``@sha256:...``) for naming purposes. | |
| - If the last path segment is ``name:tag`` and ``tag`` starts with ``ci-``, | |
| return only ``tag``. | |
| - Otherwise return the digest-free reference with ``:``, ``/``, and ``@`` | |
| replaced by ``_``. | |
| This preserves existing CI log pairing while keeping ordinary image | |
| references stable and unique in log filenames. | |
| """ | |
| if not docker_image: | |
| return docker_image | |
| s = docker_image.strip() | |
| ref_without_digest = s.split("@", 1)[0] | |
| last_slash = ref_without_digest.rfind("/") | |
| tail = ( | |
| ref_without_digest[last_slash + 1 :] | |
| if last_slash >= 0 | |
| else ref_without_digest | |
| ) | |
| if ":" in tail: | |
| _, tag = tail.split(":", 1) | |
| if tag.startswith("ci-"): | |
| return tag | |
| return ( | |
| ref_without_digest.replace("/", "_") | |
| .replace(":", "_") | |
| .replace("@", "_") | |
| ) |
| try: | ||
| self.console.sh(f"docker image inspect {docker_image} >/dev/null 2>&1") | ||
| return docker_image | ||
| except (subprocess.CalledProcessError, RuntimeError, Exception): | ||
| pass | ||
| if model_name.startswith("primus_pretrain/"): | ||
| fallback = "ci-primus_pretrain_primus.ubuntu.amd" | ||
| try: | ||
| self.console.sh(f"docker image inspect {fallback} >/dev/null 2>&1") | ||
| print( | ||
| f"ℹ️ Using shared Primus image (one build for all primus_pretrain configs): {fallback}" | ||
| ) | ||
| return fallback | ||
| except (subprocess.CalledProcessError, RuntimeError, Exception): | ||
| raise RuntimeError( | ||
| f"Docker image '{docker_image}' not found and fallback '{fallback}' not found. " | ||
| "Build the Primus image first: madengine build --tags primus_pretrain --additional-context-file <config>.json" | ||
| ) from None |
There was a problem hiding this comment.
_resolve_docker_image() interpolates docker_image directly into Console.sh() commands (shell=True). If docker_image contains spaces or shell metacharacters, this becomes a command-injection vector and can also fail on otherwise valid image refs. Use proper shell escaping (e.g. shlex.quote) or switch Console.sh invocations here to a subprocess call that passes argv as a list (shell=False).
| try: | |
| self.console.sh(f"docker image inspect {docker_image} >/dev/null 2>&1") | |
| return docker_image | |
| except (subprocess.CalledProcessError, RuntimeError, Exception): | |
| pass | |
| if model_name.startswith("primus_pretrain/"): | |
| fallback = "ci-primus_pretrain_primus.ubuntu.amd" | |
| try: | |
| self.console.sh(f"docker image inspect {fallback} >/dev/null 2>&1") | |
| print( | |
| f"ℹ️ Using shared Primus image (one build for all primus_pretrain configs): {fallback}" | |
| ) | |
| return fallback | |
| except (subprocess.CalledProcessError, RuntimeError, Exception): | |
| raise RuntimeError( | |
| f"Docker image '{docker_image}' not found and fallback '{fallback}' not found. " | |
| "Build the Primus image first: madengine build --tags primus_pretrain --additional-context-file <config>.json" | |
| ) from None | |
| def _docker_image_exists(image: str) -> bool: | |
| try: | |
| subprocess.run( | |
| ["docker", "image", "inspect", image], | |
| stdout=subprocess.DEVNULL, | |
| stderr=subprocess.DEVNULL, | |
| check=True, | |
| ) | |
| return True | |
| except (subprocess.CalledProcessError, OSError): | |
| return False | |
| if _docker_image_exists(docker_image): | |
| return docker_image | |
| if model_name.startswith("primus_pretrain/"): | |
| fallback = "ci-primus_pretrain_primus.ubuntu.amd" | |
| if _docker_image_exists(fallback): | |
| print( | |
| f"ℹ️ Using shared Primus image (one build for all primus_pretrain configs): {fallback}" | |
| ) | |
| return fallback | |
| raise RuntimeError( | |
| f"Docker image '{docker_image}' not found and fallback '{fallback}' not found. " | |
| "Build the Primus image first: madengine build --tags primus_pretrain --additional-context-file <config>.json" | |
| ) from None |
| multiple_results_file = (model_info.get("multiple_results") or "").strip() | ||
| if multiple_results_file: | ||
| try: | ||
| model_docker.sh( | ||
| f"cp {model_dir}/{multiple_results_file} . 2>/dev/null || true" | ||
| ) |
There was a problem hiding this comment.
The cp command uses multiple_results_file directly inside a docker exec bash -c string. If the path contains whitespace/quotes or shell metacharacters, it can break the copy or be abused for command injection. Quote/escape the path (and ideally use 'cp --' plus a safely-joined path) before passing it into the shell command.
| echo " NODE_RANK: ${NODE_RANK}" | ||
| echo " NNODES: ${NNODES}" | ||
| echo " NPROC_PER_NODE: ${GPUS_PER_NODE}" | ||
| {% if launcher_type in ['torchrun', 'deepspeed', 'megatron'] %} |
There was a problem hiding this comment.
This condition hides MAD_MULTI_NODE_RUNNER for 'torchtitan', but SlurmDeployment._generate_torchtitan_command() does export MAD_MULTI_NODE_RUNNER. Include 'torchtitan' in this list (or key off whether MAD_MULTI_NODE_RUNNER is set) so debug output stays accurate.
| {% if launcher_type in ['torchrun', 'deepspeed', 'megatron'] %} | |
| {% if launcher_type in ['torchrun', 'deepspeed', 'megatron', 'torchtitan'] %} |
| # Generate resource names (K8s compatible: no '/', spaces, etc.) | ||
| raw_model_name = model_info["name"] | ||
| self.job_name = sanitize_k8s_object_name("madengine", raw_model_name) | ||
| # Pod container names must be DNS labels (no dots); Job/PVC names may include dots. | ||
| self.main_container_name = sanitize_k8s_container_name(self.job_name) |
There was a problem hiding this comment.
job_name is now produced by sanitize_k8s_object_name(), which can retain dots and can be much longer than 63 chars. However, the headless Service name (service_name currently equals job_name), the pod subdomain, and the 'job-name' label value all have DNS-label/label-value constraints (no dots, <=63). As-is, model names like '...1.7b...' will generate invalid Service/labels and break multi-node rendezvous DNS. Consider generating separate values: (1) job_name for objects that allow subdomains, (2) service_name/subdomain as a DNS label (use sanitize_k8s_container_name or a dedicated sanitizer), and (3) a short job_name_label for selectors/labels (sanitize_k8s_label_value). Also update launcher DNS generation to use the sanitized service/subdomain value.
Add first-class Primus launcher support for Kubernetes and SLURM: infer backend and examples overlay from model/config, sanitize Job/Service/ ConfigMap/container names (k8s_names), and extend templates and mixin command generation.
Docker: resolve repo-root build context for Primus Dockerfiles; align container run log filenames with build logs when the image is a full registry/repo:tag reference (container_runner_helpers).
Docs: launchers and k8s-config examples README.
Tests: integration (Docker context), unit coverage for execution helpers, K8s secrets/PVC/names, and consolidated Primus tests (test_primus).