Skip to content

feat(deployment): Primus on local/K8s/SLURM#99

Open
coketaste wants to merge 3 commits intomainfrom
coketaste/v2-primus
Open

feat(deployment): Primus on local/K8s/SLURM#99
coketaste wants to merge 3 commits intomainfrom
coketaste/v2-primus

Conversation

@coketaste
Copy link
Copy Markdown
Collaborator

Add first-class Primus launcher support for Kubernetes and SLURM: infer backend and examples overlay from model/config, sanitize Job/Service/ ConfigMap/container names (k8s_names), and extend templates and mixin command generation.

Docker: resolve repo-root build context for Primus Dockerfiles; align container run log filenames with build logs when the image is a full registry/repo:tag reference (container_runner_helpers).

Docs: launchers and k8s-config examples README.

Tests: integration (Docker context), unit coverage for execution helpers, K8s secrets/PVC/names, and consolidated Primus tests (test_primus).

Add first-class Primus launcher support for Kubernetes and SLURM: infer
backend and examples overlay from model/config, sanitize Job/Service/
ConfigMap/container names (k8s_names), and extend templates and mixin
command generation.

Docker: resolve repo-root build context for Primus Dockerfiles; align
container run log filenames with build logs when the image is a full
registry/repo:tag reference (container_runner_helpers).

Docs: launchers and k8s-config examples README.

Tests: integration (Docker context), unit coverage for execution helpers,
K8s secrets/PVC/names, and consolidated Primus tests (test_primus).
@coketaste coketaste self-assigned this Apr 10, 2026
@coketaste coketaste changed the title feat(deployment): Primus on K8s/SLURM, K8s naming, and run log parity feat(deployment): Primus on local/K8s/SLURM Apr 10, 2026
Copilot AI review requested due to automatic review settings April 14, 2026 14:57
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class Primus launcher support across local, Kubernetes, and SLURM deployments, plus supporting infra for K8s-safe naming and improved Docker context/log handling so Primus images/configs run consistently across environments.

Changes:

  • Introduce Primus backend/config inference helpers and integrate Primus launcher command/env generation for Kubernetes and SLURM.
  • Add Kubernetes name/label/container sanitization utilities and update K8s templates to use sanitized labels/container names.
  • Improve Docker build/run ergonomics for Primus (repo-root build context; normalize run log naming for full image refs) and add corresponding tests/docs.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/unit/test_primus.py New unit coverage for Primus backend/overlay inference and K8s Primus command generation
tests/unit/test_k8s.py Adds unit tests for K8s name/label/container sanitization utilities
tests/unit/test_execution.py Extends log naming tests to cover full registry refs mapping to CI tag
tests/integration/test_docker_integration.py Integration test ensuring Primus Docker builds use repo-root context
src/madengine/orchestration/run_orchestrator.py Restores docker_env_vars from manifest context during local run
src/madengine/execution/docker_builder.py Chooses repo-root Docker build context for Primus dockerfiles by hint
src/madengine/execution/container_runner_helpers.py Normalizes docker image refs for run log filename consistency
src/madengine/execution/container_runner.py Adds Primus env passthrough; resolves missing Primus images via fallback; preserves resolved image in results
src/madengine/deployment/templates/slurm/job.sh.j2 Avoids printing MAD_MULTI_NODE_RUNNER for launchers that don’t use it
src/madengine/deployment/templates/kubernetes/service.yaml.j2 Switches model label to sanitized model_label
src/madengine/deployment/templates/kubernetes/job.yaml.j2 Switches model label to model_label; uses main_container_name; includes primus/torchtitan in distributed env block
src/madengine/deployment/templates/kubernetes/configmap.yaml.j2 Switches model label to sanitized model_label
src/madengine/deployment/slurm.py Adds Primus launcher env generation (PRIMUS_CONFIG_PATH/CLI_EXTRA/BACKEND inference)
src/madengine/deployment/primus_backend.py New helper module: merges Primus config; infers BACKEND and examples overlay subdirs
src/madengine/deployment/kubernetes_launcher_mixin.py Adds K8s Primus launcher env/command generation
src/madengine/deployment/kubernetes.py Adds K8s name sanitization; bundles Primus examples overlay into ConfigMap; wires Primus launcher
src/madengine/deployment/k8s_names.py New K8s naming utilities for object names, label values, and container names
src/madengine/deployment/common.py Adds “primus” to valid launcher list
examples/k8s-configs/README.md Documents Primus-on-K8s behavior, caveats, and example configs
docs/launchers.md Adds Primus launcher documentation and updates launcher lists/tables
Comments suppressed due to low confidence (1)

src/madengine/deployment/templates/kubernetes/service.yaml.j2:12

  • service.yaml.j2 uses service_name as metadata.name and job.yaml.j2 uses the same value via pod.spec.subdomain/launcher DNS. Kubernetes Service names must be DNS labels (no dots, <=63). If service_name is derived from job_name that may contain dots (e.g. '1.7b') or exceed 63, the Service will be rejected and multi-node DNS (pod-0....) will fail. Ensure service_name is sanitized as a DNS label and keep subdomain/launcher DNS consistent with that sanitized value.
  name: {{ service_name }}
  namespace: {{ namespace }}
  labels:
    app: madengine
    model: {{ model_label }}
spec:
  clusterIP: None  # Headless service for torchrun coordination
  selector:
    job-name: {{ job_name }}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 22 to 26
labels:
app: madengine
job-name: {{ job_name }}
model: {{ model_name }}
model: {{ model_label }}
spec:
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pod template sets label job-name: {{ job_name }} and the Service selector matches on this label. Label values are limited to 63 chars and cannot contain arbitrary characters; job_name can exceed this after sanitization. Introduce a dedicated short/sanitized label value (e.g. job_label = sanitize_k8s_label_value(job_name)) and use it consistently for pod labels and the Service selector.

Copilot uses AI. Check for mistakes.
Comment on lines +934 to +946
master_dns = (
f"{self.job_name}-0.{self.job_name}.{self.namespace}.svc.cluster.local"
)
lines.extend(
[
"# Multi-node: Indexed Job + headless Service (pod-0 DNS as master)",
f'export MASTER_ADDR="{master_dns}"',
f"export MASTER_PORT={master_port}",
f"export NNODES={nnodes}",
"export NODE_RANK=${JOB_COMPLETION_INDEX}",
f"export GPUS_PER_NODE={nproc_per_node}",
f"export MAD_RUNTIME_NGPUS={nproc_per_node}",
]
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

master_dns is constructed from self.job_name for both the pod hostname prefix and the Service/subdomain portion. If job_name contains dots or is >63 chars, the Service name portion will be invalid (DNS label) and the rendezvous address will not resolve. Use the same sanitized DNS-label value that is used for the headless Service name/subdomain (e.g. a dedicated self.service_name) when building master_dns.

Copilot uses AI. Check for mistakes.
Comment on lines +183 to +210
Reduce a Docker image reference to the same string ``DockerBuilder`` uses for tags.

Build logs use ``{model}_{dockerfile}.build.live.log`` where the image tag is
``ci-{model_lower}_{dockerfile}``. Run may receive a full ref (e.g.
``registry/namespace/name:ci-...``). Using the whole ref breaks pairing with
``*.build.live.log`` and embeds ``/`` and ``:`` in filenames.

Rules:

- Strip digest (``@sha256:...``).
- If there is a ``/``, take the part after the last ``/`` (image name or
``name:tag``).
- If that tail contains ``:``, return the tag (part after the first ``:`` in
the tail), which matches ``ci-...`` for normal ``repo/name:tag`` refs.
- Otherwise return the tail, or the whole string if there is no ``/``.

This matches short names like ``ci-model_ubuntu`` (no registry) unchanged.
"""
if not docker_image:
return docker_image
s = docker_image.strip()
if "@" in s:
s = s.split("@", 1)[0]
last_slash = s.rfind("/")
tail = s[last_slash + 1 :] if last_slash >= 0 else s
if ":" in tail:
return tail.split(":", 1)[1]
return tail
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_docker_image_ref_for_log_naming() currently returns only the tag whenever the last path segment contains a ':'. This works for refs like repo/name:ci-..., but it breaks log naming for common refs like 'ubuntu:22.04' or 'myimage:latest' (it would reduce to just '22.04'/'latest', causing collisions and making dockerfile_part derivation incorrect). Consider only stripping to the tag when the tag itself matches the expected CI tag pattern (e.g. starts with 'ci-'), otherwise keep the name+tag (and just replace ':', '/', '@' with safe characters) so log filenames remain stable and unique.

Suggested change
Reduce a Docker image reference to the same string ``DockerBuilder`` uses for tags.
Build logs use ``{model}_{dockerfile}.build.live.log`` where the image tag is
``ci-{model_lower}_{dockerfile}``. Run may receive a full ref (e.g.
``registry/namespace/name:ci-...``). Using the whole ref breaks pairing with
``*.build.live.log`` and embeds ``/`` and ``:`` in filenames.
Rules:
- Strip digest (``@sha256:...``).
- If there is a ``/``, take the part after the last ``/`` (image name or
``name:tag``).
- If that tail contains ``:``, return the tag (part after the first ``:`` in
the tail), which matches ``ci-...`` for normal ``repo/name:tag`` refs.
- Otherwise return the tail, or the whole string if there is no ``/``.
This matches short names like ``ci-model_ubuntu`` (no registry) unchanged.
"""
if not docker_image:
return docker_image
s = docker_image.strip()
if "@" in s:
s = s.split("@", 1)[0]
last_slash = s.rfind("/")
tail = s[last_slash + 1 :] if last_slash >= 0 else s
if ":" in tail:
return tail.split(":", 1)[1]
return tail
Reduce a Docker image reference to a stable filename-safe log naming component.
Build logs use ``{model}_{dockerfile}.build.live.log`` where the image tag is
``ci-{model_lower}_{dockerfile}``. Run may receive a full ref (e.g.
``registry/namespace/name:ci-...``). For those CI-style refs, returning the
tag preserves pairing with ``*.build.live.log``. For normal tagged refs such
as ``ubuntu:22.04`` or ``myimage:latest``, collapsing to only the tag would
cause collisions, so keep the name+tag and sanitize it for filenames.
Rules:
- Strip digest (``@sha256:...``) for naming purposes.
- If the last path segment is ``name:tag`` and ``tag`` starts with ``ci-``,
return only ``tag``.
- Otherwise return the digest-free reference with ``:``, ``/``, and ``@``
replaced by ``_``.
This preserves existing CI log pairing while keeping ordinary image
references stable and unique in log filenames.
"""
if not docker_image:
return docker_image
s = docker_image.strip()
ref_without_digest = s.split("@", 1)[0]
last_slash = ref_without_digest.rfind("/")
tail = (
ref_without_digest[last_slash + 1 :]
if last_slash >= 0
else ref_without_digest
)
if ":" in tail:
_, tag = tail.split(":", 1)
if tag.startswith("ci-"):
return tag
return (
ref_without_digest.replace("/", "_")
.replace(":", "_")
.replace("@", "_")
)

Copilot uses AI. Check for mistakes.
Comment on lines +717 to +734
try:
self.console.sh(f"docker image inspect {docker_image} >/dev/null 2>&1")
return docker_image
except (subprocess.CalledProcessError, RuntimeError, Exception):
pass
if model_name.startswith("primus_pretrain/"):
fallback = "ci-primus_pretrain_primus.ubuntu.amd"
try:
self.console.sh(f"docker image inspect {fallback} >/dev/null 2>&1")
print(
f"ℹ️ Using shared Primus image (one build for all primus_pretrain configs): {fallback}"
)
return fallback
except (subprocess.CalledProcessError, RuntimeError, Exception):
raise RuntimeError(
f"Docker image '{docker_image}' not found and fallback '{fallback}' not found. "
"Build the Primus image first: madengine build --tags primus_pretrain --additional-context-file <config>.json"
) from None
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_resolve_docker_image() interpolates docker_image directly into Console.sh() commands (shell=True). If docker_image contains spaces or shell metacharacters, this becomes a command-injection vector and can also fail on otherwise valid image refs. Use proper shell escaping (e.g. shlex.quote) or switch Console.sh invocations here to a subprocess call that passes argv as a list (shell=False).

Suggested change
try:
self.console.sh(f"docker image inspect {docker_image} >/dev/null 2>&1")
return docker_image
except (subprocess.CalledProcessError, RuntimeError, Exception):
pass
if model_name.startswith("primus_pretrain/"):
fallback = "ci-primus_pretrain_primus.ubuntu.amd"
try:
self.console.sh(f"docker image inspect {fallback} >/dev/null 2>&1")
print(
f"ℹ️ Using shared Primus image (one build for all primus_pretrain configs): {fallback}"
)
return fallback
except (subprocess.CalledProcessError, RuntimeError, Exception):
raise RuntimeError(
f"Docker image '{docker_image}' not found and fallback '{fallback}' not found. "
"Build the Primus image first: madengine build --tags primus_pretrain --additional-context-file <config>.json"
) from None
def _docker_image_exists(image: str) -> bool:
try:
subprocess.run(
["docker", "image", "inspect", image],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
check=True,
)
return True
except (subprocess.CalledProcessError, OSError):
return False
if _docker_image_exists(docker_image):
return docker_image
if model_name.startswith("primus_pretrain/"):
fallback = "ci-primus_pretrain_primus.ubuntu.amd"
if _docker_image_exists(fallback):
print(
f"ℹ️ Using shared Primus image (one build for all primus_pretrain configs): {fallback}"
)
return fallback
raise RuntimeError(
f"Docker image '{docker_image}' not found and fallback '{fallback}' not found. "
"Build the Primus image first: madengine build --tags primus_pretrain --additional-context-file <config>.json"
) from None

Copilot uses AI. Check for mistakes.
Comment on lines +1189 to +1194
multiple_results_file = (model_info.get("multiple_results") or "").strip()
if multiple_results_file:
try:
model_docker.sh(
f"cp {model_dir}/{multiple_results_file} . 2>/dev/null || true"
)
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cp command uses multiple_results_file directly inside a docker exec bash -c string. If the path contains whitespace/quotes or shell metacharacters, it can break the copy or be abused for command injection. Quote/escape the path (and ideally use 'cp --' plus a safely-joined path) before passing it into the shell command.

Copilot uses AI. Check for mistakes.
echo " NODE_RANK: ${NODE_RANK}"
echo " NNODES: ${NNODES}"
echo " NPROC_PER_NODE: ${GPUS_PER_NODE}"
{% if launcher_type in ['torchrun', 'deepspeed', 'megatron'] %}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition hides MAD_MULTI_NODE_RUNNER for 'torchtitan', but SlurmDeployment._generate_torchtitan_command() does export MAD_MULTI_NODE_RUNNER. Include 'torchtitan' in this list (or key off whether MAD_MULTI_NODE_RUNNER is set) so debug output stays accurate.

Suggested change
{% if launcher_type in ['torchrun', 'deepspeed', 'megatron'] %}
{% if launcher_type in ['torchrun', 'deepspeed', 'megatron', 'torchtitan'] %}

Copilot uses AI. Check for mistakes.
Comment on lines +286 to +290
# Generate resource names (K8s compatible: no '/', spaces, etc.)
raw_model_name = model_info["name"]
self.job_name = sanitize_k8s_object_name("madengine", raw_model_name)
# Pod container names must be DNS labels (no dots); Job/PVC names may include dots.
self.main_container_name = sanitize_k8s_container_name(self.job_name)
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

job_name is now produced by sanitize_k8s_object_name(), which can retain dots and can be much longer than 63 chars. However, the headless Service name (service_name currently equals job_name), the pod subdomain, and the 'job-name' label value all have DNS-label/label-value constraints (no dots, <=63). As-is, model names like '...1.7b...' will generate invalid Service/labels and break multi-node rendezvous DNS. Consider generating separate values: (1) job_name for objects that allow subdomains, (2) service_name/subdomain as a DNS label (use sanitize_k8s_container_name or a dedicated sanitizer), and (3) a short job_name_label for selectors/labels (sanitize_k8s_label_value). Also update launcher DNS generation to use the sanitized service/subdomain value.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants