Latent MOE & Repeated MTP support; fix KV cache quant export #768
Conversation
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
📝 Walkthrough
This pull request extends the modelopt export and quantization pipeline to support Megatron TE (Transformer Engine) / MTP (Multi-Token Prediction) configurations, MoE (Mixture of Experts) improvements, and enhanced KV-cache scaling for quantization. Changes introduce custom module mappings, reorganize import/export logic, refactor KV-cache handling, and add distributed synchronization for expert amax values.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## main #768 +/- ##
==========================================
- Coverage 74.62% 74.62% -0.01%
==========================================
Files 192 192
Lines 18989 18992 +3
==========================================
+ Hits 14171 14172 +1
- Misses 4818 4820 +2
☔ View full report in Codecov by Sentry.
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
| # Step 1: Sync amax across local experts in a SequentialMLP | ||
| for name, module in model.named_modules(): | ||
| if hasattr(module, "sync_moe_local_experts_amax"): | ||
| module.sync_moe_local_experts_amax() | ||
|
|
||
| # TODO just for testing | ||
| if "experts" in name and "weight_quantizer" in name: | ||
| assert child.amax is not None |
Can we move this before the distributed-sync check? It is not doing anything specific to distributed sync.
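If the comment refers to the Step 1 local-expert sync, a minimal sketch of that reordering — the helper signature and surrounding structure here are illustrative, not the actual model_calib.py code:

def max_calibrate(model, forward_loop=None, distributed_sync=True):
    # ... calibration forward passes populate per-quantizer amax ...

    # Step 1: Sync amax across local experts in a SequentialMLP.
    # This is purely local (no collectives), so it does not need to sit
    # behind the distributed-sync gate.
    for _name, module in model.named_modules():
        if hasattr(module, "sync_moe_local_experts_amax"):
            module.sync_moe_local_experts_amax()

    if not distributed_sync:
        return

    # Step 2: Sync amax across data parallelism
    # Step 3: Sync amax across tensor / expert parallelism (TP / EP)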
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
| # TODO double check if MOE forward will be implemented in MoELayer or TransformerLayer | ||
| # We do not need both layers to be patched | ||
|
|
||
| @QuantModuleRegistry.register({megatron_transformer_layer.TransformerLayer: "megatron_transformer_layer_TransformerLayer"}) |
TODO maybe remove this since MOE forward will not be removed in MLM main?
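For context on what this registration buys, a hedged sketch of a calibration-time forward patch — the base class, calibration flag, and router attribute names below are assumptions for illustration, not the actual implementation:

@QuantModuleRegistry.register(
    {megatron_transformer_layer.TransformerLayer: "megatron_transformer_layer_TransformerLayer"}
)
class _QuantTransformerLayer(QuantModule):  # base class assumed for illustration
    def forward(self, *args, **kwargs):
        # During PTQ calibration, temporarily route tokens to every expert so each
        # expert's quantizers observe activations and collect a valid amax.
        router = getattr(getattr(self, "mlp", None), "router", None)
        if router is None or not getattr(self, "_calibration_mode", False):  # flag name assumed
            return super().forward(*args, **kwargs)
        original_topk = router.topk  # router attribute names assumed
        router.topk = router.num_experts
        try:
            return super().forward(*args, **kwargs)
        finally:
            router.topk = original_topk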
| "input_layernorm": NameRemapping("backbone.layers.{}.norm."), | ||
| "linear_qkv": QKVSlicing("backbone.layers.{}.mixer."), | ||
| "linear_proj": NameRemapping("backbone.layers.{}.mixer.o_proj."), | ||
| "core_attention": SelfAttentionScaling("backbone.layers.{}.mixer."), |
Double-check that this is only needed for export.
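Roughly, each rule above turns a Megatron submodule into HF-checkpoint keys built from the prefix template; a toy stand-in for the NameRemapping idea (not the real class, just to show what the prefix strings drive):

class NameRemappingSketch:
    """Toy illustration of a remapping rule: copy module weights into HF keys."""

    def __init__(self, prefix_template: str):
        # e.g. "backbone.layers.{}.mixer.o_proj."
        self.prefix_template = prefix_template

    def __call__(self, module, layer_id: int, state_dict: dict):
        prefix = self.prefix_template.format(layer_id)
        state_dict[prefix + "weight"] = module.weight.detach().cpu()
        if getattr(module, "bias", None) is not None:
            state_dict[prefix + "bias"] = module.bias.detach().cpu()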
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
| ] | ||
|
|
||
|
|
||
| # This path uses output_quantizer for KV cache quantization. |
I hope to see more proof before we remove this.
Also, TRT-LLM right now by default still uses 1 as the KV cache scale, ignoring the values we set here.
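As a toy illustration of what that scale means for FP8 KV cache (not TRT-LLM's actual kernel code): the runtime divides by the scale before casting to FP8 and multiplies back on dequant, so forcing scale = 1 clips anything outside the e4m3 range, while an amax-derived scale keeps it representable:

import torch

FP8_E4M3_MAX = 448.0

def fp8_roundtrip(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Emulate a saturating FP8 quant/dequant: scale down, clamp to the e4m3 range,
    # cast to FP8, then cast back up and rescale.
    q = torch.clamp(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q.to(x.dtype) * scale

kv = torch.tensor([0.3, 2.0, 600.0])
calibrated = torch.tensor(600.0 / FP8_E4M3_MAX)  # amax-derived scale keeps the outlier
unit_scale = torch.tensor(1.0)                   # default scale of 1: 600.0 saturates at 448
print(fp8_roundtrip(kv, calibrated))
print(fp8_roundtrip(kv, unit_scale))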
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
| layer.mlp.router, layer_id, dtype=self.moe_router_dtype | ||
| ) | ||
| if hasattr(layer.mlp, "fc1_latent_proj") and layer.mlp.fc1_latent_proj is not None: | ||
| self.rules["fc1_latent_proj"](layer.mlp.fc1_latent_proj, layer_id) |
latent MOE import added here
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/torch/quantization/model_calib.py (1)
110-152: Inconsistent step numbering in comments.
Line 110 introduces "Step 3: TP sync" but line 152 contains a comment saying "Step 2: Sync amax across relevant parallelism". This is confusing since the code at line 152 is part of the TP synchronization logic introduced by the Step 3 comment.
Proposed fix
- # Step 2: Sync amax across relevant parallelism (such as TP / EP)
+ # Sync amax across relevant parallelism (such as TP / EP)
Alternatively, remove the duplicate comment entirely since it's redundant with the Step 3 header.
modelopt/torch/export/quant_utils.py (1)
363-415: Preserve affine detection when choosing KV-cache dtype.
is_affine is computed in get_kv_cache_dtype but then ignored; _compute_kv_cache_dtype always treats (2, 1) as affine. That changes export semantics and can mislabel non-affine NVFP4 as NVFP4_AFFINE.
is_affineis computed inget_kv_cache_dtypebut then ignored;_compute_kv_cache_dtypealways treats(2,1)as affine. That changes export semantics and can mislabel non‑affine NVFP4 as NVFP4_AFFINE.🛠️ Suggested fix
-    return _compute_kv_cache_dtype(num_bits_list)
+    return _compute_kv_cache_dtype(num_bits_list, is_affine)

-def _compute_kv_cache_dtype(num_bits_list: list[int]) -> str | None:
+def _compute_kv_cache_dtype(num_bits_list: list[int], is_affine: bool) -> str | None:
 @@
-    is_affine = True
-
     if (4, 3) in num_bits_list:
         return KV_CACHE_FP8
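One possible shape for the helper once is_affine is threaded through (the NVFP4 constant names are assumptions mirroring the ones referenced above):

def _compute_kv_cache_dtype(num_bits_list: list[int], is_affine: bool) -> str | None:
    # (4, 3) maps to FP8 KV cache; (2, 1) maps to NVFP4, with the affine variant
    # chosen only when affine quantization was actually detected upstream.
    if (4, 3) in num_bits_list:
        return KV_CACHE_FP8
    if (2, 1) in num_bits_list:
        return KV_CACHE_NVFP4_AFFINE if is_affine else KV_CACHE_NVFP4
    return None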
🤖 Fix all issues with AI agents
In `@modelopt/torch/export/plugins/mcore_nemotron.py`:
- Line 100: The inline stale TODO comment "# TODO ADD MTP export" in
mcore_nemotron.py is now misleading because MTP export mappings are present
further down; remove that TODO or replace it with an accurate note describing
any remaining specific work (e.g., "TODO: verify MTP mapping completeness" or
similar). Locate the comment text "# TODO ADD MTP export" in the module
(mcore_nemotron.py) and either delete it or update it to a precise, current TODO
referencing the actual remaining task.
In `@modelopt/torch/export/plugins/megatron_importer.py`:
- Around line 531-646: The repeated‑MTP flag (is_mtp) isn't forwarded from
_import_transformer_layer to most rule calls (layernorms, MLP/router/expert
weights), causing tensors to resolve to backbone prefixes; update all rule
invocations inside _import_transformer_layer that handle per-layer weights —
e.g., "input_layernorm", "q_layernorm", "k_layernorm", "pre_mlp_layernorm",
"linear_fc1", "linear_fc2", "router", "fc1_latent_proj", "fc2_latent_proj",
"shared_experts.linear_fc1", "shared_experts.linear_fc2",
"local_experts.linear_fc1", "local_experts.linear_fc2",
"local_experts.linear_fc1_etp/ep", "local_experts.linear_fc2_etp/ep", and
"experts.linear_fc1/linear_fc2" — to pass the is_mtp flag (e.g., is_mtp=is_mtp)
so repeated‑MTP layers resolve to MTP prefixes correctly.
In `@modelopt/torch/export/quant_utils.py`:
- Around line 347-359: The return type of get_kv_cache_scaling_factor is wrong:
it returns a list of scaling factors (possibly [None, None]) not a torch.Tensor;
update the signature of get_kv_cache_scaling_factor to return a typed container
such as List[Optional[torch.Tensor]] or Tuple[Optional[torch.Tensor],
Optional[torch.Tensor]] and import the corresponding typing names (List/Tuple
and Optional), and ensure callers expecting a torch.Tensor are adjusted if
necessary; reference function get_kv_cache_scaling_factor and attributes
k_bmm_quantizer, v_bmm_quantizer and helper get_scaling_factor when making the
change.
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 497-499: The call to _get_mtp_state_dict() is unconditional and
will raise when no pretrained_model_name_or_path or no MTP module exists; update
the code to only call _get_mtp_state_dict() when
self.pretrained_model_name_or_path is truthy and the MTP module exists (e.g.,
check getattr(self, "mtp", None) or hasattr(self, "mtp") before calling). Apply
the same guard to the other MTP-related export block (the later section that
currently calls _get_mtp_state_dict/_export for lines ~591-619) so MTP export is
skipped when no pretrained path or MTP is present.
- Around line 787-789: The condition treating qformat as None is wrong because
get_quantization_format returns QUANTIZATION_NONE (an enum/const) not None;
update the branch that currently checks "if qformat is None and 'norm' not in
prefix:" to instead detect unquantized format (e.g., qformat ==
QUANTIZATION_NONE or qformat in {None, QUANTIZATION_NONE}) and then append
prefix to self.exclude_modules so hf_quant_config correctly excludes unquantized
layers; locate qformat usage near get_quantization_format and
get_weight_block_size and amend the conditional accordingly.
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 103-104: The comment "Step 2:Sync amax across data parallelism"
has a missing space after the colon; locate the comment in model_calib.py (the
line containing "Step 2:Sync amax across data parallelism") and update it to
"Step 2: Sync amax across data parallelism" so the formatting is consistent.
🧹 Nitpick comments (2)
modelopt/torch/export/unified_export_megatron.py (1)
129-134: Avoid always‑on debug prints in export paths.
These prints will spam logs for large models. Consider gating them behind a debug flag or using a logger at debug level.
Also applies to: 856-891
modelopt/torch/export/plugins/megatron_importer.py (1)
265-289: Grouped MLP import ignores quantized weight scales.
The TODO leaves weight scales unhandled, which will break quantized checkpoints. Consider loading {weight_scale_name} per expert or explicitly documenting that quantized TEGroupedMLP import isn't supported yet.
If you want, I can draft the scale-loading logic for this path.
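If it helps, one possible shape for that scale-loading path — the checkpoint key template, accessor, and attribute below are guesses, so treat this as a starting point rather than the intended implementation:

def load_grouped_expert_weight_scales(grouped_fc, layer_id, num_local_experts, get_safetensor):
    """Hypothetical sketch: gather per-expert weight scales for a TEGroupedMLP import."""
    scales = []
    for expert_id in range(num_local_experts):
        # Key layout assumed; real checkpoints may name this differently.
        key = f"model.layers.{layer_id}.mlp.experts.{expert_id}.down_proj.weight_scale"
        scales.append(get_safetensor(key))
    # Stash for later seeding of the grouped weight_quantizer (attribute name illustrative).
    grouped_fc.imported_weight_scales = scales
    return scales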
|
|
||
| } | ||
|
|
||
| # TODO ADD MTP export |
Stale TODO comment.
The code now includes MTP export mappings below, so this TODO is misleading. Please remove or update it to reflect the remaining work.
| def _import_transformer_layer(self, layer, layer_id, layer_pbar, is_mtp: bool = False): | ||
| if not isinstance(layer.input_layernorm, IdentityOp): | ||
| self.rules["input_layernorm"](layer.input_layernorm, layer_id) | ||
|
|
||
| attention = layer.self_attention | ||
| if not isinstance(attention, IdentityOp): | ||
| if "MLASelfAttention" in str(type(attention)): | ||
| if hasattr(attention, "linear_q_proj"): | ||
| layer_pbar.set_description("Importing MLA (without q LoRA)") | ||
| self.rules["linear_q_proj"](attention.linear_q_proj, layer_id) | ||
| else: | ||
| layer_pbar.set_description("Importing MLA (with q LoRA)") | ||
| self.rules["linear_q_down_proj"](attention.linear_q_down_proj, layer_id) | ||
| self.rules["linear_q_layernorm"](attention.q_layernorm, layer_id) | ||
| self.rules["linear_q_up_proj"](attention.linear_q_up_proj, layer_id) | ||
| self.rules["linear_kv_down_proj"](attention.linear_kv_down_proj, layer_id) | ||
| self.rules["linear_kv_layernorm"](attention.kv_layernorm, layer_id) | ||
| self.rules["linear_kv_up_proj"](attention.linear_kv_up_proj, layer_id) | ||
| self.rules["linear_proj"](attention.linear_proj, layer_id) | ||
| else: | ||
| layer_pbar.set_description("Importing GQA/MHA") | ||
| if attention.q_layernorm is not None and not isinstance( | ||
| attention.q_layernorm, (IdentityOp, L2Norm) | ||
| ): | ||
| self.rules["q_layernorm"](attention.q_layernorm, layer_id) | ||
| self.rules["k_layernorm"](attention.k_layernorm, layer_id) | ||
| self.rules["linear_qkv"](attention.linear_qkv, layer_id, is_mtp=is_mtp) | ||
| self.rules["linear_proj"](attention.linear_proj, layer_id, is_mtp=is_mtp) | ||
| if getattr(attention.core_attention, "softmax_offset", None) is not None: | ||
| self.rules["softmax_offset"]( | ||
| attention.core_attention.softmax_offset, layer_id | ||
| ) | ||
|
|
||
| if not isinstance(layer.pre_mlp_layernorm, IdentityOp): | ||
| self.rules["pre_mlp_layernorm"](layer.pre_mlp_layernorm, layer_id) | ||
|
|
||
| if not isinstance(layer.mlp, IdentityOp): | ||
| if "MoE" in str(type(layer.mlp)): | ||
| layer_pbar.set_description(f"Importing MoE with moe_router_dtype: {self.moe_router_dtype}") | ||
| self.rules["router"]( | ||
| layer.mlp.router, layer_id, dtype=self.moe_router_dtype | ||
| ) | ||
| if hasattr(layer.mlp, "fc1_latent_proj") and layer.mlp.fc1_latent_proj is not None: | ||
| self.rules["fc1_latent_proj"](layer.mlp.fc1_latent_proj, layer_id) | ||
| if hasattr(layer.mlp, "fc2_latent_proj") and layer.mlp.fc2_latent_proj is not None: | ||
| self.rules["fc2_latent_proj"](layer.mlp.fc2_latent_proj, layer_id) | ||
|
|
||
| if ( | ||
| hasattr(layer.mlp, "shared_experts") | ||
| and layer.mlp.shared_experts is not None | ||
| ): | ||
| layer_pbar.set_description("Importing MoE shared experts") | ||
| fc1 = layer.mlp.shared_experts.linear_fc1 | ||
| fc2 = layer.mlp.shared_experts.linear_fc2 | ||
| self.rules["shared_experts.linear_fc1"](fc1, layer_id) | ||
| self.rules["shared_experts.linear_fc2"](fc2, layer_id) | ||
| if not self.rules.get("use_packed_local_experts", False): # Import local experts | ||
| experts = layer.mlp.experts | ||
| if hasattr(experts, "local_experts"): | ||
| for local_expert_id, expert in tqdm( | ||
| enumerate(layer.mlp.experts.local_experts), | ||
| desc="Importing MoE local experts", | ||
| leave=False, | ||
| disable=self.disable_tqdm, | ||
| ): | ||
| expert_id = layer.mlp.local_expert_indices[local_expert_id] | ||
| fc1 = expert.linear_fc1 | ||
| fc2 = expert.linear_fc2 | ||
| self.rules["local_experts.linear_fc1"](fc1, layer_id, expert_id) | ||
| self.rules["local_experts.linear_fc2"](fc2, layer_id, expert_id) | ||
| else: # Slice TEGroupedMLP | ||
| layer_pbar.set_description("Importing MoE grouped local experts") | ||
| num_local_experts = experts.num_local_experts | ||
| num_global_experts = experts.config.num_moe_experts | ||
| assert num_local_experts == num_global_experts, "num_local_experts must be equal to num_global_experts during MoE import" | ||
|
|
||
| ''' | ||
| if parallel_config is not None: | ||
| etp_size = get_expert_tensor_parallel_world_size() | ||
| # etp_rank = get_expert_tensor_parallel_rank() # this gives group rank | ||
| etp_rank = dist.get_rank() | ||
| print(f"etp_size: {etp_size}") | ||
| print(f"etp_rank: {etp_rank}") | ||
| assert num_local_experts * etp_size == num_global_experts | ||
| init_index = etp_rank * num_local_experts | ||
| ''' | ||
| init_index = 0 | ||
|
|
||
| self.rules["experts.linear_fc1"](experts.linear_fc1, layer_id, init_expert_id=init_index, num_local_experts=num_local_experts) | ||
| self.rules["experts.linear_fc2"](experts.linear_fc2, layer_id, init_expert_id=init_index, num_local_experts=num_local_experts) | ||
|
|
||
| # We only support either EP or ETP for now | ||
| elif get_expert_tensor_parallel_world_size() > 1: | ||
| # ETP supports for packed MoE | ||
| # ETP is not supported for gpt-oss model | ||
| if self.arch in ["GptOssForCausalLM"]: | ||
| raise ValueError("ETP is not supported for gpt-oss model") | ||
| self.rules["local_experts.linear_fc1_etp"]( | ||
| layer.mlp.experts.local_experts, layer_id | ||
| ) | ||
| self.rules["local_experts.linear_fc2_etp"]( | ||
| layer.mlp.experts.local_experts, layer_id | ||
| ) | ||
| else: | ||
| # EP supports for packed MoE | ||
| self.rules["local_experts.linear_fc1_ep"]( | ||
| layer.mlp.experts.local_experts, layer_id | ||
| ) | ||
| self.rules["local_experts.linear_fc2_ep"]( | ||
| layer.mlp.experts.local_experts, layer_id | ||
| ) | ||
| else: | ||
| layer_pbar.set_description("Importing MLP") | ||
| self.rules["linear_fc1"](layer.mlp.linear_fc1, layer_id) | ||
| self.rules["linear_fc2"](layer.mlp.linear_fc2, layer_id) | ||
|
|
Propagate is_mtp to all rule calls in repeated‑MTP layers.
Only linear_qkv and linear_proj receive is_mtp, so layernorm/MLP/router/expert weights still resolve to backbone prefixes. That will import the wrong tensors for repeated MTP.
🛠️ Suggested fix
def _import_transformer_layer(self, layer, layer_id, layer_pbar, is_mtp: bool = False):
+ mtp_kwargs = {"is_mtp": is_mtp} if is_mtp else {}
if not isinstance(layer.input_layernorm, IdentityOp):
- self.rules["input_layernorm"](layer.input_layernorm, layer_id)
+ self.rules["input_layernorm"](layer.input_layernorm, layer_id, **mtp_kwargs)
@@
- self.rules["linear_qkv"](attention.linear_qkv, layer_id, is_mtp=is_mtp)
- self.rules["linear_proj"](attention.linear_proj, layer_id, is_mtp=is_mtp)
+ self.rules["linear_qkv"](attention.linear_qkv, layer_id, **mtp_kwargs)
+ self.rules["linear_proj"](attention.linear_proj, layer_id, **mtp_kwargs)
@@
if not isinstance(layer.pre_mlp_layernorm, IdentityOp):
- self.rules["pre_mlp_layernorm"](layer.pre_mlp_layernorm, layer_id)
+ self.rules["pre_mlp_layernorm"](layer.pre_mlp_layernorm, layer_id, **mtp_kwargs)
@@
- self.rules["router"](layer.mlp.router, layer_id, dtype=self.moe_router_dtype)
+ self.rules["router"](layer.mlp.router, layer_id, dtype=self.moe_router_dtype, **mtp_kwargs)
@@
- self.rules["shared_experts.linear_fc1"](fc1, layer_id)
- self.rules["shared_experts.linear_fc2"](fc2, layer_id)
+ self.rules["shared_experts.linear_fc1"](fc1, layer_id, **mtp_kwargs)
+ self.rules["shared_experts.linear_fc2"](fc2, layer_id, **mtp_kwargs)
@@
- self.rules["local_experts.linear_fc1"](fc1, layer_id, expert_id)
- self.rules["local_experts.linear_fc2"](fc2, layer_id, expert_id)
+ self.rules["local_experts.linear_fc1"](fc1, layer_id, expert_id, **mtp_kwargs)
+ self.rules["local_experts.linear_fc2"](fc2, layer_id, expert_id, **mtp_kwargs)
@@
- self.rules["linear_fc1"](layer.mlp.linear_fc1, layer_id)
- self.rules["linear_fc2"](layer.mlp.linear_fc2, layer_id)
+ self.rules["linear_fc1"](layer.mlp.linear_fc1, layer_id, **mtp_kwargs)
+ self.rules["linear_fc2"](layer.mlp.linear_fc2, layer_id, **mtp_kwargs)🤖 Prompt for AI Agents
In `@modelopt/torch/export/plugins/megatron_importer.py` around lines 531 - 646,
The repeated‑MTP flag (is_mtp) isn't forwarded from _import_transformer_layer to
most rule calls (layernorms, MLP/router/expert weights), causing tensors to
resolve to backbone prefixes; update all rule invocations inside
_import_transformer_layer that handle per-layer weights — e.g.,
"input_layernorm", "q_layernorm", "k_layernorm", "pre_mlp_layernorm",
"linear_fc1", "linear_fc2", "router", "fc1_latent_proj", "fc2_latent_proj",
"shared_experts.linear_fc1", "shared_experts.linear_fc2",
"local_experts.linear_fc1", "local_experts.linear_fc2",
"local_experts.linear_fc1_etp/ep", "local_experts.linear_fc2_etp/ep", and
"experts.linear_fc1/linear_fc2" — to pass the is_mtp flag (e.g., is_mtp=is_mtp)
so repeated‑MTP layers resolve to MTP prefixes correctly.
| def get_kv_cache_scaling_factor(self_attention_module: nn.Module) -> torch.Tensor: | ||
| """ | ||
| Returns the k and v BMM scaling factors if BMM quantizers are set in the self attention module. | ||
| Else returns None by default. | ||
| """ | ||
| if not hasattr(self_attention_module, "k_bmm_quantizer") or not hasattr(self_attention_module, "v_bmm_quantizer"): | ||
| return [None, None] | ||
|
|
||
| scaling_factors = [ | ||
| get_scaling_factor(getattr(kv_module, quantizer)) | ||
| get_scaling_factor(getattr(self_attention_module, quantizer)) | ||
| for quantizer in ("k_bmm_quantizer", "v_bmm_quantizer") | ||
| ] | ||
|
|
||
| # For FP8, we recommend default kv cache scaling factor to be 1. | ||
| if get_kv_cache_dtype(kv_module) == KV_CACHE_FP8: | ||
| for i, factor in enumerate(scaling_factors): | ||
| if factor.item() > 0.5: | ||
| warn( | ||
| f"Warning: Large KV activation detected: {factor.item()}, " | ||
| "Quantized KV cache may lead to higher accuracy drop." | ||
| ) | ||
| scaling_factors[i] = torch.max( | ||
| factor, torch.tensor([1.0], dtype=torch.float, device=factor.device) | ||
| ) | ||
|
|
||
| return scaling_factors |
Fix return type annotation for KV‑cache scaling.
This function returns a list (e.g., [k_scale, v_scale]), but the signature says torch.Tensor. That mismatch will confuse callers and type checkers.
🛠️ Suggested fix
-def get_kv_cache_scaling_factor(self_attention_module: nn.Module) -> torch.Tensor:
+def get_kv_cache_scaling_factor(
+ self_attention_module: nn.Module,
+) -> list[torch.Tensor | None]:
| # MTP layer | ||
| self._get_mtp_state_dict() | ||
|
|
Guard MTP export when no pretrained path (or no MTP).
_get_mtp_state_dict() is called unconditionally and will error when pretrained_model_name_or_path is None (or when no MTP module exists).
🛠️ Suggested fix
- # MTP layer
- self._get_mtp_state_dict()
+ # MTP layer
+ if hasattr(model, "mtp") and self._hf_pretrained_model_name:
+ self._get_mtp_state_dict()
Also applies to: 591-619
| if qformat is None and "norm" not in prefix: # Add exclude layers for hf_quant_config | ||
| self.exclude_modules.append(prefix) | ||
| block_size = get_weight_block_size(module) |
Exclude‑modules list never populates for unquantized layers.
get_quantization_format returns QUANTIZATION_NONE, so qformat is None is never true and exclude_modules stays empty. That makes hf_quant_config incorrect for mixed/partial quantization.
🛠️ Suggested fix
- if qformat is None and "norm" not in prefix: # Add exclude layers for hf_quant_config
+ if qformat == QUANTIZATION_NONE and "norm" not in prefix: # Add exclude layers for hf_quant_config
self.exclude_modules.append(prefix)
|
|
||
| # Step 2:Sync amax across data parallelism |
Minor formatting issue: missing space after colon.
Proposed fix
-
- # Step 2:Sync amax across data parallelism
+
+ # Step 2: Sync amax across data parallelism
What does this PR do?
Type of change: New feature
Overview:
New nemotron models use `TransformerLayer.forward()` instead of `MoELayer.forward()` for MoE. This is a breaking change to our quantization implementation for Nano3, which relied on patching `MoELayer.forward()` to force tokens to be routed to all experts during calibration.
- Patch `TransformerLayer.forward()`, which forces tokens to be routed to all experts during PTQ calibration
- Remove `qkv_layer.output_quantizer` export & replace with proper `k/v_bmm_quantizer` logic
- ? TODO: Potentially remove MoELayer quant config if all MOEs in the future will use TransformerLayer instead?
Usage
# Add a code snippet demonstrating how to use this
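Since the snippet above was left as the template placeholder, here is a hedged sketch of the intended flow; the quantization config choice, calibration data, and the Megatron export entry point named below are assumptions rather than the exact API surface of this PR:

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_mcore_gpt_to_hf  # entry point name assumed

def forward_loop(model):
    # Run a handful of calibration batches through the (Megatron-Core) model.
    for batch in calib_dataloader:  # user-provided dataloader, assumed to exist
        model(**batch)

# PTQ calibration; the patched TransformerLayer.forward() routes tokens to all
# experts here so every expert collects a valid amax.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Unified export of the HF-style checkpoint, including MTP layers and KV-cache scales.
export_mcore_gpt_to_hf(model, pretrained_model_name_or_path, export_dir="/tmp/exported")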
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Improvements