Add trust_remote_code cli option for mbridge distillation #934

kevalmorabia97 wants to merge 1 commit into main
Conversation
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 Walkthrough

Documentation updates to Megatron Bridge examples with path references and data configuration details.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/megatron_bridge/README.md (1)
Lines 145-162: ⚠️ Potential issue | 🟡 Minor

Document when to pass `--trust_remote_code` in distillation commands.

Since the script now supports this flag for models that require custom HF code (per PR objective), add a short note in the command section so users know when it is required.
Suggested README addition
```diff
 torchrun --nnodes 1 --nproc_per_node 8 distill.py \
     --tp_size 8 \
     --teacher_hf_path Qwen/Qwen3-8B \
     --student_hf_path Qwen/Qwen3-4B \
     --data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document \
@@
     --log_interval 10 \
     --output_dir /output/qwen3_8b_to_4b_distill
+
+# Add `--trust_remote_code` for models that require custom HF code loading (e.g., some Nemotron checkpoints).
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/README.md` around lines 145 - 162, Add a short note to the distillation command section explaining that the --trust_remote_code flag must be passed to distill.py when either --teacher_hf_path or --student_hf_path points to an HF model that uses custom/remote code (models that list requires_remote_code or provide custom modeling files), and show the example invocation by appending --trust_remote_code to the torchrun command; reference the distill.py script and the --trust_remote_code flag so users know where and when to include it.
🧹 Nitpick comments (1)
examples/megatron_bridge/distill.py (1)
Line 69: Clarify the security implications in the CLI help text.

Line 69 adds a sensitive flag, but the current help text is too terse for a remote-code-execution toggle. Please make the help explicit that this should only be used for trusted model repos.
Suggested doc/help text tweak
```diff
-    parser.add_argument("--trust_remote_code", action="store_true", help="Trust remote code")
+    parser.add_argument(
+        "--trust_remote_code",
+        action="store_true",
+        help="Allow execution of custom code from Hugging Face model repos (use only with trusted sources).",
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/distill.py` at line 69, Update the CLI help for the --trust_remote_code argparse flag to explicitly warn about remote code execution risks: modify the parser.add_argument call that defines "--trust_remote_code" (the parser variable and the flag name) so the help text clearly states this enables executing model-provided code and should only be used with fully trusted model repositories, and mention potential security consequences such as executing arbitrary or untrusted code on the host.
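To make the suggestion concrete, here is a minimal, self-contained sketch of such a flag definition. The surrounding arguments are hypothetical stand-ins, not distill.py's actual parser; only `--trust_remote_code` mirrors the flag this PR adds:

```python
import argparse

# Hypothetical reconstruction of the relevant part of distill.py's parser.
parser = argparse.ArgumentParser(description="Distillation example (sketch)")
parser.add_argument("--teacher_hf_path", required=True)
parser.add_argument("--student_hf_path", required=True)
parser.add_argument(
    "--trust_remote_code",
    action="store_true",
    help=(
        "Allow execution of custom code shipped with Hugging Face model repos. "
        "This can run arbitrary code on the host; enable only for trusted sources."
    ),
)

# store_true means the flag defaults to False and flips to True when present.
args = parser.parse_args(
    ["--teacher_hf_path", "Qwen/Qwen3-8B", "--student_hf_path", "Qwen/Qwen3-4B"]
)
print(args.trust_remote_code)  # False when the flag is omitted
```

With `action="store_true"` the flag is safe-by-default: remote code is only trusted when a user opts in explicitly on the command line.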
```diff
-The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a while to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly.
+The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly via the `--jsonl_paths` argument.
```
Fix wording: use “in parallel” instead of “parallelly”.
Line 133 has a minor text-quality issue.
Suggested wording fix
```diff
-The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly via the `--jsonl_paths` argument.
+The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them in parallel via the `--jsonl_paths` argument.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/megatron_bridge/README.md` at line 133, Replace the incorrect adverb
"parallelly" in the README sentence that describes tokenizing split .jsonl files
using the --jsonl_paths argument with the correct phrase "in parallel" (i.e.,
change "...tokenize them parallelly via the `--jsonl_paths` argument." to
"...tokenize them in parallel via the `--jsonl_paths` argument.").
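The `split` invocation quoted in that README sentence can be exercised end-to-end on a toy file. This sketch uses 2 lines per chunk instead of the README's 10M, and the file names are illustrative; note that `-d` and `--additional-suffix` are GNU coreutils options:

```shell
# Create a tiny stand-in .jsonl (the real dataset is far larger).
printf '{"text":"a"}\n{"text":"b"}\n{"text":"c"}\n{"text":"d"}\n' > data.jsonl

# Same split invocation shape as the README, with -l 2 instead of -l 10000000.
split -l 2 -d --additional-suffix=.jsonl data.jsonl data_part

# Each part can then be tokenized independently, e.g. one process per file.
wc -l data_part00.jsonl data_part01.jsonl
```

Because the parts are independent `.jsonl` files, tokenization jobs can run on them concurrently and their outputs be blended back together via `--data_paths`.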
```diff
     --teacher_hf_path Qwen/Qwen3-8B \
     --student_hf_path Qwen/Qwen3-4B \
-    --data_paths 1.0 /path/to/tokenized/data/qwen3 \
+    --data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document \
```
The weights don't need to sum to 1?
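On that question: in Megatron-style blended datasets the per-corpus weights passed via `--data_paths` are typically treated as relative and normalized internally, so they need not sum to 1. A sketch of that assumed normalization (not the actual Megatron code):

```python
def normalize_blend(pairs):
    """pairs: list of (weight, prefix) as passed via --data_paths (assumed semantics)."""
    total = sum(weight for weight, _ in pairs)
    return [(weight / total, prefix) for weight, prefix in pairs]

blend = normalize_blend([
    (1.0, "tokenized_qwen3/data1_text_document"),
    (1.0, "tokenized_qwen3/data2_text_document"),
])
print(blend)  # each corpus ends up with relative weight 0.5
```

Under this reading, `1.0 ... 1.0 ...` simply means "sample both corpora equally", and `2.0 ... 1.0 ...` would mean a 2:1 mix.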
What does this PR do?
Nemotron2/3 need `AutoBridge(..., trust_remote_code=True)`, which was missing previously.

Testing
Nemotron-nano-v2 can be distilled using tokenized Nemotron-Pretraining-SFT-v1 data
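The fix can be sketched as forwarding the new CLI flag into both model loads. Here `load_bridge` is a hypothetical stand-in for the AutoBridge-based loading in distill.py, and the checkpoint names are illustrative:

```python
from types import SimpleNamespace

def build_teacher_and_student(args, load_bridge):
    # load_bridge stands in for AutoBridge-style construction; the PR's fix is
    # forwarding trust_remote_code so custom-code repos (e.g. Nemotron) can load.
    teacher = load_bridge(args.teacher_hf_path, trust_remote_code=args.trust_remote_code)
    student = load_bridge(args.student_hf_path, trust_remote_code=args.trust_remote_code)
    return teacher, student

# Demo with a recording stub instead of a real model load.
calls = []
def fake_load(path, trust_remote_code=False):
    calls.append((path, trust_remote_code))
    return path

args = SimpleNamespace(
    teacher_hf_path="nvidia/teacher-checkpoint",  # illustrative, not real repo ids
    student_hf_path="nvidia/student-checkpoint",
    trust_remote_code=True,
)
build_teacher_and_student(args, fake_load)
print(calls)  # both loads receive trust_remote_code=True
```

The point of the stub is only to show the flag being threaded through consistently to teacher and student, which is what was missing before this PR.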
Summary by CodeRabbit
New Features
Documentation