Add trust_remote_code cli option for mbridge distillation #934

kevalmorabia97 wants to merge 1 commit into main
Conversation
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 Walkthrough

Documentation updates to Megatron Bridge examples with path references and data configuration details.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/megatron_bridge/README.md (1)
Lines 145-162: ⚠️ Potential issue | 🟡 Minor

Document when to pass `--trust_remote_code` in distillation commands.

Since the script now supports this flag for models that require custom HF code (per PR objective), add a short note in the command section so users know when it is required.
Suggested README addition
```diff
 torchrun --nnodes 1 --nproc_per_node 8 distill.py \
     --tp_size 8 \
     --teacher_hf_path Qwen/Qwen3-8B \
     --student_hf_path Qwen/Qwen3-4B \
     --data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document \
@@
     --log_interval 10 \
     --output_dir /output/qwen3_8b_to_4b_distill
+
+# Add `--trust_remote_code` for models that require custom HF code loading (e.g., some Nemotron checkpoints).
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/README.md` around lines 145 - 162, Add a short note to the distillation command section explaining that the --trust_remote_code flag must be passed to distill.py when either --teacher_hf_path or --student_hf_path points to an HF model that uses custom/remote code (models that list requires_remote_code or provide custom modeling files), and show the example invocation by appending --trust_remote_code to the torchrun command; reference the distill.py script and the --trust_remote_code flag so users know where and when to include it.
🧹 Nitpick comments (1)
examples/megatron_bridge/distill.py (1)
Line 69: Clarify the security implications in the CLI help text.

Line 69 adds a sensitive flag, but the current help text is too terse for a remote-code-execution toggle. Please make the help explicit that this should only be used for trusted model repos.
Suggested doc/help text tweak
```diff
-    parser.add_argument("--trust_remote_code", action="store_true", help="Trust remote code")
+    parser.add_argument(
+        "--trust_remote_code",
+        action="store_true",
+        help="Allow execution of custom code from Hugging Face model repos (use only with trusted sources).",
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/distill.py` at line 69, Update the CLI help for the --trust_remote_code argparse flag to explicitly warn about remote code execution risks: modify the parser.add_argument call that defines "--trust_remote_code" (the parser variable and the flag name) so the help text clearly states this enables executing model-provided code and should only be used with fully trusted model repositories, and mention potential security consequences such as executing arbitrary or untrusted code on the host.
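To make the suggestion concrete, here is a minimal, self-contained sketch of such a flag definition. The surrounding arguments are hypothetical stand-ins, not distill.py's actual parser; only `--trust_remote_code` mirrors the flag this PR adds:

```python
import argparse

# Hypothetical reconstruction of the relevant part of distill.py's parser.
parser = argparse.ArgumentParser(description="Distillation example (sketch)")
parser.add_argument("--teacher_hf_path", required=True)
parser.add_argument("--student_hf_path", required=True)
parser.add_argument(
    "--trust_remote_code",
    action="store_true",
    help=(
        "Allow execution of custom code shipped with Hugging Face model repos. "
        "This can run arbitrary code on the host; enable only for trusted sources."
    ),
)

# store_true means the flag defaults to False and flips to True when present.
args = parser.parse_args(
    ["--teacher_hf_path", "Qwen/Qwen3-8B", "--student_hf_path", "Qwen/Qwen3-4B"]
)
print(args.trust_remote_code)  # False when the flag is omitted
```

With `action="store_true"` the flag is safe-by-default: remote code is only trusted when a user opts in explicitly on the command line.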
```diff
-The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a while to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly.
+The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly via the `--jsonl_paths` argument.
```
Fix wording: use “in parallel” instead of “parallelly”.
Line 133 has a minor text-quality issue.
Suggested wording fix
```diff
-The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly via the `--jsonl_paths` argument.
+The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them in parallel via the `--jsonl_paths` argument.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/megatron_bridge/README.md` at line 133, Replace the incorrect adverb
"parallelly" in the README sentence that describes tokenizing split .jsonl files
using the --jsonl_paths argument with the correct phrase "in parallel" (i.e.,
change "...tokenize them parallelly via the `--jsonl_paths` argument." to
"...tokenize them in parallel via the `--jsonl_paths` argument.").
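The `split` invocation quoted in that README sentence can be exercised end-to-end on a toy file. This sketch uses 2 lines per chunk instead of the README's 10M, and the file names are illustrative; note that `-d` and `--additional-suffix` are GNU coreutils options:

```shell
# Create a tiny stand-in .jsonl (the real dataset is far larger).
printf '{"text":"a"}\n{"text":"b"}\n{"text":"c"}\n{"text":"d"}\n' > data.jsonl

# Same split invocation shape as the README, with -l 2 instead of -l 10000000.
split -l 2 -d --additional-suffix=.jsonl data.jsonl data_part

# Each part can then be tokenized independently, e.g. one process per file.
wc -l data_part00.jsonl data_part01.jsonl
```

Because the parts are independent `.jsonl` files, tokenization jobs can run on them concurrently and their outputs be blended back together via `--data_paths`.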
```diff
     --teacher_hf_path Qwen/Qwen3-8B \
     --student_hf_path Qwen/Qwen3-4B \
-    --data_paths 1.0 /path/to/tokenized/data/qwen3 \
+    --data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document \
```
The weights don't need to sum to 1?
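On that question: in Megatron-style blended datasets the per-corpus weights passed via `--data_paths` are typically treated as relative and normalized internally, so they need not sum to 1. A sketch of that assumed normalization (not the actual Megatron code):

```python
def normalize_blend(pairs):
    """pairs: list of (weight, prefix) as passed via --data_paths (assumed semantics)."""
    total = sum(weight for weight, _ in pairs)
    return [(weight / total, prefix) for weight, prefix in pairs]

blend = normalize_blend([
    (1.0, "tokenized_qwen3/data1_text_document"),
    (1.0, "tokenized_qwen3/data2_text_document"),
])
print(blend)  # each corpus ends up with relative weight 0.5
```

Under this reading, `1.0 ... 1.0 ...` simply means "sample both corpora equally", and `2.0 ... 1.0 ...` would mean a 2:1 mix.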
What does this PR do?
Nemotron2/3 need `AutoBridge(..., trust_remote_code=True)`, which was missing previously.

Testing
Nemotron-nano-v2 can be distilled using tokenized Nemotron-Pretraining-SFT-v1 data
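The fix can be sketched as forwarding the new CLI flag into both model loads. Here `load_bridge` is a hypothetical stand-in for the AutoBridge-based loading in distill.py, and the checkpoint names are illustrative:

```python
from types import SimpleNamespace

def build_teacher_and_student(args, load_bridge):
    # load_bridge stands in for AutoBridge-style construction; the PR's fix is
    # forwarding trust_remote_code so custom-code repos (e.g. Nemotron) can load.
    teacher = load_bridge(args.teacher_hf_path, trust_remote_code=args.trust_remote_code)
    student = load_bridge(args.student_hf_path, trust_remote_code=args.trust_remote_code)
    return teacher, student

# Demo with a recording stub instead of a real model load.
calls = []
def fake_load(path, trust_remote_code=False):
    calls.append((path, trust_remote_code))
    return path

args = SimpleNamespace(
    teacher_hf_path="nvidia/teacher-checkpoint",  # illustrative, not real repo ids
    student_hf_path="nvidia/student-checkpoint",
    trust_remote_code=True,
)
build_teacher_and_student(args, fake_load)
print(calls)  # both loads receive trust_remote_code=True
```

The point of the stub is only to show the flag being threaded through consistently to teacher and student, which is what was missing before this PR.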
Summary by CodeRabbit
New Features
Documentation