
Add trust_remote_code cli option for mbridge distillation#934

Open
kevalmorabia97 wants to merge 1 commit into main from kmorabia/trust-remote-code-distill

Conversation

Collaborator

@kevalmorabia97 commented Feb 25, 2026

What does this PR do?

Nemotron 2/3 models need `AutoBridge(..., trust_remote_code=True)`, which was previously missing.
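
For illustration, a minimal sketch of how the flag is wired in, based on the walkthrough below; the import path, helper function, and arguments other than `--trust_remote_code` are assumptions for this sketch rather than code copied from `distill.py`:

```python
import argparse

# Assumed import path for Megatron Bridge's AutoBridge; adjust to match the actual package.
from megatron.bridge import AutoBridge


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Distillation example (illustrative sketch)")
    parser.add_argument("--teacher_hf_path", required=True, help="Hugging Face path of the teacher model")
    parser.add_argument("--student_hf_path", required=True, help="Hugging Face path of the student model")
    # Flag added by this PR: needed for models such as Nemotron 2/3 whose
    # Hugging Face repos ship custom modeling code.
    parser.add_argument("--trust_remote_code", action="store_true", help="Trust remote code")
    return parser.parse_args()


def load_bridge(hf_path: str, trust_remote_code: bool):
    # The flag is threaded into AutoBridge.from_hf_pretrained(), which controls
    # whether custom code from the Hugging Face repo may be executed.
    return AutoBridge.from_hf_pretrained(hf_path, trust_remote_code=trust_remote_code)


if __name__ == "__main__":
    args = parse_args()
    teacher_bridge = load_bridge(args.teacher_hf_path, args.trust_remote_code)
    student_bridge = load_bridge(args.student_hf_path, args.trust_remote_code)
```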

Testing

Verified that Nemotron-Nano-v2 can be distilled using tokenized Nemotron-Pretraining-SFT-v1 data.
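
A hypothetical invocation showing where the new flag goes, modeled on the README's torchrun example below; the model and data paths are placeholders, not the exact configuration used for this test:

```bash
# Placeholders: substitute real teacher/student HF paths and tokenized data prefixes.
torchrun --nnodes 1 --nproc_per_node 8 distill.py \
    --tp_size 8 \
    --teacher_hf_path <nemotron-nano-v2-teacher-hf-path> \
    --student_hf_path <student-hf-path> \
    --data_paths 1.0 tokenized_nemotron_sft/data1_text_document \
    --output_dir /output/nemotron_nano_v2_distill \
    --trust_remote_code
```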

Summary by CodeRabbit

  • New Features

    • Added CLI flag to control remote code trust when loading models
  • Documentation

    • Updated Megatron Bridge examples with relative data paths and multi-file dataset configuration
    • Clarified tokenization guidance for JSONL input files

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 requested a review from a team as a code owner February 25, 2026 14:00
@kevalmorabia97 requested review from AAnoosheh and ChenhanYu and removed request for ChenhanYu February 25, 2026 14:00
Contributor

coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

Documentation updates to the Megatron Bridge examples with path references and data configuration details. Added a `--trust_remote_code` CLI flag to the `distill.py` script, propagating it through AutoBridge model provider initialization.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Documentation Updates**<br>`examples/megatron_bridge/README.md` | Updated CLI examples with relative output paths, added explanatory blocks for tokenized data structure and `data_paths` configuration, clarified tokenization guidance, and expanded distillation command examples with updated data path references. |
| **CLI Flag Addition**<br>`examples/megatron_bridge/distill.py` | Added `--trust_remote_code` CLI argument and threaded it into the `AutoBridge.from_hf_pretrained()` call, updating the method signature to accept the `trust_remote_code` parameter for remote-code execution control. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly matches the main code change: adding a `trust_remote_code` CLI option for the Megatron Bridge distillation script (`distill.py`). This is the primary change described in the PR objectives. |


@kevalmorabia97 enabled auto-merge (squash) February 25, 2026 14:00
Contributor

coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/megatron_bridge/README.md (1)

145-162: ⚠️ Potential issue | 🟡 Minor

Document when to pass --trust_remote_code in distillation commands.

Since the script now supports this flag for models that require custom HF code (per PR objective), add a short note in the command section so users know when it is required.

Suggested README addition
 torchrun --nnodes 1 --nproc_per_node 8 distill.py \
     --tp_size 8 \
     --teacher_hf_path Qwen/Qwen3-8B \
     --student_hf_path Qwen/Qwen3-4B \
     --data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document \
@@
     --log_interval 10 \
     --output_dir /output/qwen3_8b_to_4b_distill
+
+# Add `--trust_remote_code` for models that require custom HF code loading (e.g., some Nemotron checkpoints).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/megatron_bridge/README.md` around lines 145 - 162, Add a short note
to the distillation command section explaining that the --trust_remote_code flag
must be passed to distill.py when either --teacher_hf_path or --student_hf_path
points to an HF model that uses custom/remote code (models that list
requires_remote_code or provide custom modeling files), and show the example
invocation by appending --trust_remote_code to the torchrun command; reference
the distill.py script and the --trust_remote_code flag so users know where and
when to include it.
🧹 Nitpick comments (1)
examples/megatron_bridge/distill.py (1)

69-69: Clarify the security implications in the CLI help text.

Line 69 adds a sensitive flag, but the current help text is too terse for a remote-code-execution toggle. Please make the help explicit that this should only be used for trusted model repos.

Suggested doc/help text tweak
-    parser.add_argument("--trust_remote_code", action="store_true", help="Trust remote code")
+    parser.add_argument(
+        "--trust_remote_code",
+        action="store_true",
+        help="Allow execution of custom code from Hugging Face model repos (use only with trusted sources).",
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/megatron_bridge/distill.py` at line 69, Update the CLI help for the
--trust_remote_code argparse flag to explicitly warn about remote code execution
risks: modify the parser.add_argument call that defines "--trust_remote_code"
(the parser variable and the flag name) so the help text clearly states this
enables executing model-provided code and should only be used with fully trusted
model repositories, and mention potential security consequences such as
executing arbitrary or untrusted code on the host.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e589ac8 and 8984ea0.

📒 Files selected for processing (2)
  • examples/megatron_bridge/README.md
  • examples/megatron_bridge/distill.py

-The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a while to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly.
+The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly via the `--jsonl_paths` argument.
Contributor

⚠️ Potential issue | 🟡 Minor

Fix wording: use “in parallel” instead of “parallelly”.

Line 133 has a minor text-quality issue.

Suggested wording fix
-The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly via the `--jsonl_paths` argument.
+The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them in parallel via the `--jsonl_paths` argument.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/megatron_bridge/README.md` at line 133, Replace the incorrect adverb
"parallelly" in the README sentence that describes tokenizing split .jsonl files
using the --jsonl_paths argument with the correct phrase "in parallel" (i.e.,
change "...tokenize them parallelly via the `--jsonl_paths` argument." to
"...tokenize them in parallel via the `--jsonl_paths` argument.").

--teacher_hf_path Qwen/Qwen3-8B \
--student_hf_path Qwen/Qwen3-4B \
---data_paths 1.0 /path/to/tokenized/data/qwen3 \
+--data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document \
Contributor

The weights don't need to sum to 1?
