
Latent MOE & Repeated MTP support; fix KV cache quant export #768

Closed
jenchen13 wants to merge 13 commits into main from jennifchen/superv3

Conversation

jenchen13 (Contributor) commented Jan 13, 2026

What does this PR do?

Type of change: New feature

Overview:
New Nemotron models use TransformerLayer.forward() instead of MoELayer.forward() for MOE. This breaks our quantization implementation for Nano3, which relied on patching MoELayer.forward() to force tokens to be routed to all experts during calibration.

  • Add a patch for TransformerLayer.forward() that forces tokens to be routed to all experts during PTQ calibration (a minimal sketch follows this list)
  • Enable latent MOE modules during Megatron import/export
  • Fix KV cache quantization export: remove the old qkv_layer.output_quantizer export and replace it with proper k/v_bmm_quantizer logic
  • Remove KV quantization scale clamping to 1
  • Improve EP amax sync
  • Support repeated MTP import/export for NemotronH models
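
Expanding on the first bullet above, here is a minimal sketch of the "route to all experts" calibration idea. It is written against a generic Megatron-style MoE router rather than the actual TransformerLayer.forward() patch in this PR; the moe_layer.router attribute path and its (probs, routing_map) return convention are illustrative assumptions.

import torch
from contextlib import contextmanager

@contextmanager
def route_to_all_experts(moe_layer, num_experts):
    """Temporarily make the router dispatch every token to every expert.

    During PTQ calibration this ensures each expert's quantizers observe
    activations, even for experts the router would normally skip.
    `moe_layer.router` and its return convention are assumptions here.
    """
    router = moe_layer.router
    original_forward = router.forward

    def forward_all_experts(hidden_states):
        probs, routing_map = original_forward(hidden_states)
        probs = torch.full_like(probs, 1.0 / num_experts)             # uniform expert weights
        routing_map = torch.ones_like(routing_map, dtype=torch.bool)  # every expert active
        return probs, routing_map

    router.forward = forward_all_experts
    try:
        yield
    finally:
        router.forward = original_forward

Calibration batches can then be run inside "with route_to_all_experts(layer.mlp, num_experts):" so that amax statistics are collected for every expert.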

TODO: Potentially remove the MoELayer quant config if all future MOEs use TransformerLayer instead?

Usage

# Add a code snippet demonstrating how to use this
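
Since the snippet above was left as a template placeholder, here is a hedged sketch of the intended flow. mtq.quantize and mtq.FP8_DEFAULT_CFG are standard ModelOpt entry points; the export call name and its arguments are assumptions based on the files this PR touches (unified_export_megatron.py), not an API confirmed by the PR.

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_mcore_gpt_to_hf  # assumed export entry point

def quantize_and_export(megatron_model, calib_dataloader, export_dir):
    """PTQ-calibrate a Megatron model and export an HF-style checkpoint."""

    def forward_loop(model):
        # Run a handful of calibration batches; amax observers record ranges here,
        # and the TransformerLayer patch from this PR routes tokens to all experts.
        for batch in calib_dataloader:
            model(**batch)

    model = mtq.quantize(megatron_model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Export, including the k/v BMM KV-cache scaling factors fixed in this PR.
    export_mcore_gpt_to_hf(
        model,
        pretrained_model_name_or_path="nvidia/NemotronH-base",  # hypothetical model id
        export_dir=export_dir,
    )
    return model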

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced support for Nemotron models including latent Mixture-of-Experts and grouped local expert handling
    • Added attention scaling optimization for improved quantized model inference
    • Integrated Transformer Engine support for advanced quantization
  • Improvements

    • Refined KV cache quantization and scaling for better inference performance
    • Optimized quantization calibration across distributed training environments
    • Enhanced model export with improved quantization configuration compatibility


Signed-off-by: jenchen13 <jennifchen@nvidia.com>
copy-pr-bot bot commented Jan 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai bot (Contributor) commented Jan 13, 2026

📝 Walkthrough


This pull request extends the modelopt export and quantization pipeline to support Megatron TE (Transformer Engine) and MTP (Multi-Token Prediction) configurations, MoE (Mixture of Experts) improvements, and enhanced KV-cache scaling for quantization. The changes introduce custom module mappings, reorganize import/export logic, refactor KV-cache handling, and add distributed synchronization for expert amax values.

Changes

Cohort / File(s) Summary
Custom Module Mappings
modelopt/torch/export/plugins/mcore_custom.py
Added two new CustomModuleMapping classes: GroupedMLPMerging for merging grouped MLP projections and SelfAttentionScaling for self-attention scaling.
Nemotron Model Configuration
modelopt/torch/export/plugins/mcore_nemotron.py
Extended import/export mappings with new latent MoE, MTP (repeated) modules, and grouped local experts support; integrated GroupedMLPMerging and SelfAttentionScaling custom mappings.
Megatron Importer Logic
modelopt/torch/export/plugins/megatron_importer.py
Major refactoring: added _grouped_mlp_merging method, introduced is_mtp flag propagation, modularized layer imports via _import_mamba_layer and _import_transformer_layer helpers, rewrote MoE import flow for local/shared/TE-grouped cases, expanded MTP handling for single and repeated configurations, switched to dist alias for torch.distributed calls, and added rank-based gating for progress output.
KV-Cache Quantization Utilities
modelopt/torch/export/quant_utils.py
Refactored get_kv_cache_scaling_factor parameter naming (kv_module → self_attention_module) and return type annotation, removed FP8 runtime warnings and threshold clamping, introduced _compute_kv_cache_dtype helper to centralize KV cache dtype computation logic.
Megatron Export Pipeline
modelopt/torch/export/unified_export_megatron.py
Replaced standalone KV-cache scaling with new _self_attention_scaling method, introduced exclude_modules collection across ranks for HF quant config, added NVFP4 handling in quantization config, added MTP safetensors loading path, updated moe_router_dtype parameter to string type, extended _get_quantized_state with prefix parameter for per-layer quantization handling, and added config.json augmentation with converted quant format.
Quantization Synchronization
modelopt/torch/quantization/model_calib.py
Reorganized amax synchronization steps: added Step 1 for MoE local experts sync, relabeled the DP sync to Step 2, and added Step 3 for TP-aware synchronization across input and weight quantizers based on the parallel config (a generic max-reduction sketch follows this table).
Megatron Quantization Plugins
modelopt/torch/quantization/plugins/megatron.py
Added _QuantTEMCoreLinear class registration for Transformer Engine linear quantization, imported megatron.core.transformer.transformer_layer and torch.distributed as dist, removed the local barrier in sync_moe_local_experts_amax, and adjusted forward calls to assign the output variable in quantized MoE paths.
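
To make the Quantization Synchronization entry above concrete, here is a generic sketch of max-reducing quantizer amax buffers across a process group with torch.distributed. The weight_quantizer / input_quantizer attribute names follow common ModelOpt conventions, but the grouping and ordering logic in the real model_calib.py is more involved.

import torch.distributed as dist

def sync_quantizer_amax(model, process_group=None):
    """All-reduce each quantizer's amax with MAX across a parallel group.

    Simplified illustration of the DP/TP sync steps summarized above;
    the quantizer attribute names are assumptions.
    """
    for module in model.modules():
        for attr in ("weight_quantizer", "input_quantizer"):
            quantizer = getattr(module, attr, None)
            amax = getattr(quantizer, "amax", None)
            if amax is None:
                continue
            # Every rank must reach this call with the same group and order.
            dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=process_group)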

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly captures the three main objectives: latent MOE support, repeated MTP support, and KV cache quantization export fixes.



codecov bot commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.62%. Comparing base (5104513) to head (4471c03).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/quantization/model_calib.py 40.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #768      +/-   ##
==========================================
- Coverage   74.62%   74.62%   -0.01%     
==========================================
  Files         192      192              
  Lines       18989    18992       +3     
==========================================
+ Hits        14171    14172       +1     
- Misses       4818     4820       +2     


Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
jenchen13 changed the title from "Latent MOE support and fix MOE amax sync" to "Latent MOE support and patch TransformerLayer forward for MOE" on Jan 13, 2026
Comment on lines 98 to 105
# Step 1: Sync amax across local experts in a SequentialMLP
for name, module in model.named_modules():
if hasattr(module, "sync_moe_local_experts_amax"):
module.sync_moe_local_experts_amax()

# TODO just for testing
if "experts" in name and "weight_quantizer" in name:
assert child.amax is not None
Contributor:

Can we move this before the distributed sync check? This is not doing anything specific to distributed sync.

Signed-off-by: jenchen13 <jennifchen@nvidia.com>
# TODO double check if MOE forward will be implemented in MoELayer or TransformerLayer
# We do not need both layers to be patched

@QuantModuleRegistry.register({megatron_transformer_layer.TransformerLayer: "megatron_transformer_layer_TransformerLayer"})
Contributor Author:

TODO maybe remove this since MOE forward will not be removed in MLM main?

"input_layernorm": NameRemapping("backbone.layers.{}.norm."),
"linear_qkv": QKVSlicing("backbone.layers.{}.mixer."),
"linear_proj": NameRemapping("backbone.layers.{}.mixer.o_proj."),
"core_attention": SelfAttentionScaling("backbone.layers.{}.mixer."),
Contributor Author:

Double-check that this is only needed for export.

Signed-off-by: jenchen13 <jennifchen@nvidia.com>
jenchen13 changed the title from "Latent MOE support and patch TransformerLayer forward for MOE" to "Latent MOE support and fix KV cache quant export" on Jan 20, 2026
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
jenchen13 changed the title from "Latent MOE support and fix KV cache quant export" to "Latent MOE & Repeated MTP support; fix KV cache quant export" on Jan 22, 2026
]


# This path uses output_quantizer for KV cache quantization.
Collaborator:

I hope to see more proof before we remove this.

Also, TRTLLM right now by default still uses 1 as the KV cache scale (ignoring the values we set here).

Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
layer.mlp.router, layer_id, dtype=self.moe_router_dtype
)
if hasattr(layer.mlp, "fc1_latent_proj") and layer.mlp.fc1_latent_proj is not None:
self.rules["fc1_latent_proj"](layer.mlp.fc1_latent_proj, layer_id)
Contributor Author:

latent MOE import added here

Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
jenchen13 marked this pull request as ready for review January 29, 2026 16:53
jenchen13 requested a review from a team as a code owner January 29, 2026 16:53
jenchen13 requested a review from a team as a code owner January 29, 2026 16:53
jenchen13 requested a review from jingyu-ml January 29, 2026 16:53
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
modelopt/torch/quantization/model_calib.py (1)

110-152: Inconsistent step numbering in comments.

Line 110 introduces "Step 3: TP sync" but line 152 contains a comment saying "Step 2: Sync amax across relevant parallelism". This is confusing since the code at line 152 is part of the TP synchronization logic introduced by the Step 3 comment.

Proposed fix
-    # Step 2: Sync amax across relevant parallelism (such as TP / EP)
+    # Sync amax across relevant parallelism (such as TP / EP)

Alternatively, remove the duplicate comment entirely since it's redundant with the Step 3 header.

modelopt/torch/export/quant_utils.py (1)

363-415: Preserve affine detection when choosing KV‑cache dtype.
is_affine is computed in get_kv_cache_dtype but then ignored; _compute_kv_cache_dtype always treats (2,1) as affine. That changes export semantics and can mislabel non‑affine NVFP4 as NVFP4_AFFINE.

🛠️ Suggested fix
-    return _compute_kv_cache_dtype(num_bits_list)
+    return _compute_kv_cache_dtype(num_bits_list, is_affine)

-def _compute_kv_cache_dtype(num_bits_list: list[int]) -> str | None:
+def _compute_kv_cache_dtype(num_bits_list: list[int], is_affine: bool) -> str | None:
@@
-    is_affine = True
-
     if (4, 3) in num_bits_list:
         return KV_CACHE_FP8
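
For readers unfamiliar with the num_bits convention used here, a hedged sketch of what a _compute_kv_cache_dtype helper taking the is_affine flag could look like. The constant values are placeholders and the real mapping in quant_utils.py may differ; the key fact is that num_bits of (4, 3) corresponds to FP8 E4M3 and (2, 1) to NVFP4 E2M1.

# Placeholder values; the real constants live in ModelOpt's export utilities.
KV_CACHE_FP8 = "FP8"
KV_CACHE_NVFP4 = "NVFP4"
KV_CACHE_NVFP4_AFFINE = "NVFP4_AFFINE"

def _compute_kv_cache_dtype(num_bits_list, is_affine):
    """Map observed quantizer num_bits values to a KV-cache dtype label."""
    if (4, 3) in num_bits_list:  # FP8 E4M3 quantizers present
        return KV_CACHE_FP8
    if (2, 1) in num_bits_list:  # NVFP4 E2M1 quantizers present
        return KV_CACHE_NVFP4_AFFINE if is_affine else KV_CACHE_NVFP4
    return None
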
🤖 Fix all issues with AI agents
In `@modelopt/torch/export/plugins/mcore_nemotron.py`:
- Line 100: The inline stale TODO comment "# TODO ADD MTP export" in
mcore_nemotron.py is now misleading because MTP export mappings are present
further down; remove that TODO or replace it with an accurate note describing
any remaining specific work (e.g., "TODO: verify MTP mapping completeness" or
similar). Locate the comment text "# TODO ADD MTP export" in the module
(mcore_nemotron.py) and either delete it or update it to a precise, current TODO
referencing the actual remaining task.

In `@modelopt/torch/export/plugins/megatron_importer.py`:
- Around line 531-646: The repeated‑MTP flag (is_mtp) isn't forwarded from
_import_transformer_layer to most rule calls (layernorms, MLP/router/expert
weights), causing tensors to resolve to backbone prefixes; update all rule
invocations inside _import_transformer_layer that handle per-layer weights —
e.g., "input_layernorm", "q_layernorm", "k_layernorm", "pre_mlp_layernorm",
"linear_fc1", "linear_fc2", "router", "fc1_latent_proj", "fc2_latent_proj",
"shared_experts.linear_fc1", "shared_experts.linear_fc2",
"local_experts.linear_fc1", "local_experts.linear_fc2",
"local_experts.linear_fc1_etp/ep", "local_experts.linear_fc2_etp/ep", and
"experts.linear_fc1/linear_fc2" — to pass the is_mtp flag (e.g., is_mtp=is_mtp)
so repeated‑MTP layers resolve to MTP prefixes correctly.

In `@modelopt/torch/export/quant_utils.py`:
- Around line 347-359: The return type of get_kv_cache_scaling_factor is wrong:
it returns a list of scaling factors (possibly [None, None]) not a torch.Tensor;
update the signature of get_kv_cache_scaling_factor to return a typed container
such as List[Optional[torch.Tensor]] or Tuple[Optional[torch.Tensor],
Optional[torch.Tensor]] and import the corresponding typing names (List/Tuple
and Optional), and ensure callers expecting a torch.Tensor are adjusted if
necessary; reference function get_kv_cache_scaling_factor and attributes
k_bmm_quantizer, v_bmm_quantizer and helper get_scaling_factor when making the
change.

In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 497-499: The call to _get_mtp_state_dict() is unconditional and
will raise when no pretrained_model_name_or_path or no MTP module exists; update
the code to only call _get_mtp_state_dict() when
self.pretrained_model_name_or_path is truthy and the MTP module exists (e.g.,
check getattr(self, "mtp", None) or hasattr(self, "mtp") before calling). Apply
the same guard to the other MTP-related export block (the later section that
currently calls _get_mtp_state_dict/_export for lines ~591-619) so MTP export is
skipped when no pretrained path or MTP is present.
- Around line 787-789: The condition treating qformat as None is wrong because
get_quantization_format returns QUANTIZATION_NONE (an enum/const) not None;
update the branch that currently checks "if qformat is None and 'norm' not in
prefix:" to instead detect unquantized format (e.g., qformat ==
QUANTIZATION_NONE or qformat in {None, QUANTIZATION_NONE}) and then append
prefix to self.exclude_modules so hf_quant_config correctly excludes unquantized
layers; locate qformat usage near get_quantization_format and
get_weight_block_size and amend the conditional accordingly.

In `@modelopt/torch/quantization/model_calib.py`:
- Around line 103-104: The comment "Step 2:Sync amax across data parallelism"
has a missing space after the colon; locate the comment in model_calib.py (the
line containing "Step 2:Sync amax across data parallelism") and update it to
"Step 2: Sync amax across data parallelism" so the formatting is consistent.
🧹 Nitpick comments (2)
modelopt/torch/export/unified_export_megatron.py (1)

129-134: Avoid always‑on debug prints in export paths.
These prints will spam logs for large models. Consider gating them behind a debug flag or using a logger at debug level (a generic gating sketch follows these nitpick comments).

Also applies to: 856-891

modelopt/torch/export/plugins/megatron_importer.py (1)

265-289: Grouped MLP import ignores quantized weight scales.
The TODO leaves weight scales unhandled, which will break quantized checkpoints. Consider loading {weight_scale_name} per expert or explicitly documenting that quantized TEGroupedMLP import isn’t supported yet.

If you want, I can draft the scale‑loading logic for this path.
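
Regarding the first nitpick above about always-on prints, here is a generic rank-0 / DEBUG-level gating sketch; the helper name and where it would live in unified_export_megatron.py are assumptions, not the PR's actual logging setup.

import logging
import torch.distributed as dist

logger = logging.getLogger(__name__)

def log_export_progress(message: str) -> None:
    """Emit export progress only from rank 0, and only when DEBUG logging is enabled."""
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        return
    logger.debug(message)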


}

# TODO ADD MTP export
Contributor:

⚠️ Potential issue | 🟡 Minor

Stale TODO comment.
The code now includes MTP export mappings below, so this TODO is misleading. Please remove or update it to reflect the remaining work.

🤖 Prompt for AI Agents
In `@modelopt/torch/export/plugins/mcore_nemotron.py` at line 100, The inline
stale TODO comment "# TODO ADD MTP export" in mcore_nemotron.py is now
misleading because MTP export mappings are present further down; remove that
TODO or replace it with an accurate note describing any remaining specific work
(e.g., "TODO: verify MTP mapping completeness" or similar). Locate the comment
text "# TODO ADD MTP export" in the module (mcore_nemotron.py) and either delete
it or update it to a precise, current TODO referencing the actual remaining
task.

Comment on lines +531 to +646
def _import_transformer_layer(self, layer, layer_id, layer_pbar, is_mtp: bool = False):
if not isinstance(layer.input_layernorm, IdentityOp):
self.rules["input_layernorm"](layer.input_layernorm, layer_id)

attention = layer.self_attention
if not isinstance(attention, IdentityOp):
if "MLASelfAttention" in str(type(attention)):
if hasattr(attention, "linear_q_proj"):
layer_pbar.set_description("Importing MLA (without q LoRA)")
self.rules["linear_q_proj"](attention.linear_q_proj, layer_id)
else:
layer_pbar.set_description("Importing MLA (with q LoRA)")
self.rules["linear_q_down_proj"](attention.linear_q_down_proj, layer_id)
self.rules["linear_q_layernorm"](attention.q_layernorm, layer_id)
self.rules["linear_q_up_proj"](attention.linear_q_up_proj, layer_id)
self.rules["linear_kv_down_proj"](attention.linear_kv_down_proj, layer_id)
self.rules["linear_kv_layernorm"](attention.kv_layernorm, layer_id)
self.rules["linear_kv_up_proj"](attention.linear_kv_up_proj, layer_id)
self.rules["linear_proj"](attention.linear_proj, layer_id)
else:
layer_pbar.set_description("Importing GQA/MHA")
if attention.q_layernorm is not None and not isinstance(
attention.q_layernorm, (IdentityOp, L2Norm)
):
self.rules["q_layernorm"](attention.q_layernorm, layer_id)
self.rules["k_layernorm"](attention.k_layernorm, layer_id)
self.rules["linear_qkv"](attention.linear_qkv, layer_id, is_mtp=is_mtp)
self.rules["linear_proj"](attention.linear_proj, layer_id, is_mtp=is_mtp)
if getattr(attention.core_attention, "softmax_offset", None) is not None:
self.rules["softmax_offset"](
attention.core_attention.softmax_offset, layer_id
)

if not isinstance(layer.pre_mlp_layernorm, IdentityOp):
self.rules["pre_mlp_layernorm"](layer.pre_mlp_layernorm, layer_id)

if not isinstance(layer.mlp, IdentityOp):
if "MoE" in str(type(layer.mlp)):
layer_pbar.set_description(f"Importing MoE with moe_router_dtype: {self.moe_router_dtype}")
self.rules["router"](
layer.mlp.router, layer_id, dtype=self.moe_router_dtype
)
if hasattr(layer.mlp, "fc1_latent_proj") and layer.mlp.fc1_latent_proj is not None:
self.rules["fc1_latent_proj"](layer.mlp.fc1_latent_proj, layer_id)
if hasattr(layer.mlp, "fc2_latent_proj") and layer.mlp.fc2_latent_proj is not None:
self.rules["fc2_latent_proj"](layer.mlp.fc2_latent_proj, layer_id)

if (
hasattr(layer.mlp, "shared_experts")
and layer.mlp.shared_experts is not None
):
layer_pbar.set_description("Importing MoE shared experts")
fc1 = layer.mlp.shared_experts.linear_fc1
fc2 = layer.mlp.shared_experts.linear_fc2
self.rules["shared_experts.linear_fc1"](fc1, layer_id)
self.rules["shared_experts.linear_fc2"](fc2, layer_id)
if not self.rules.get("use_packed_local_experts", False): # Import local experts
experts = layer.mlp.experts
if hasattr(experts, "local_experts"):
for local_expert_id, expert in tqdm(
enumerate(layer.mlp.experts.local_experts),
desc="Importing MoE local experts",
leave=False,
disable=self.disable_tqdm,
):
expert_id = layer.mlp.local_expert_indices[local_expert_id]
fc1 = expert.linear_fc1
fc2 = expert.linear_fc2
self.rules["local_experts.linear_fc1"](fc1, layer_id, expert_id)
self.rules["local_experts.linear_fc2"](fc2, layer_id, expert_id)
else: # Slice TEGroupedMLP
layer_pbar.set_description("Importing MoE grouped local experts")
num_local_experts = experts.num_local_experts
num_global_experts = experts.config.num_moe_experts
assert num_local_experts == num_global_experts, "num_local_experts must be equal to num_global_experts during MoE import"

'''
if parallel_config is not None:
etp_size = get_expert_tensor_parallel_world_size()
# etp_rank = get_expert_tensor_parallel_rank() # this gives group rank
etp_rank = dist.get_rank()
print(f"etp_size: {etp_size}")
print(f"etp_rank: {etp_rank}")
assert num_local_experts * etp_size == num_global_experts
init_index = etp_rank * num_local_experts
'''
init_index = 0

self.rules["experts.linear_fc1"](experts.linear_fc1, layer_id, init_expert_id=init_index, num_local_experts=num_local_experts)
self.rules["experts.linear_fc2"](experts.linear_fc2, layer_id, init_expert_id=init_index, num_local_experts=num_local_experts)

# We only support either EP or ETP for now
elif get_expert_tensor_parallel_world_size() > 1:
# ETP supports for packed MoE
# ETP is not supported for gpt-oss model
if self.arch in ["GptOssForCausalLM"]:
raise ValueError("ETP is not supported for gpt-oss model")
self.rules["local_experts.linear_fc1_etp"](
layer.mlp.experts.local_experts, layer_id
)
self.rules["local_experts.linear_fc2_etp"](
layer.mlp.experts.local_experts, layer_id
)
else:
# EP supports for packed MoE
self.rules["local_experts.linear_fc1_ep"](
layer.mlp.experts.local_experts, layer_id
)
self.rules["local_experts.linear_fc2_ep"](
layer.mlp.experts.local_experts, layer_id
)
else:
layer_pbar.set_description("Importing MLP")
self.rules["linear_fc1"](layer.mlp.linear_fc1, layer_id)
self.rules["linear_fc2"](layer.mlp.linear_fc2, layer_id)

Contributor:

⚠️ Potential issue | 🟠 Major

Propagate is_mtp to all rule calls in repeated‑MTP layers.
Only linear_qkv and linear_proj receive is_mtp, so layernorm/MLP/router/expert weights still resolve to backbone prefixes. That will import the wrong tensors for repeated MTP.

🛠️ Suggested fix
     def _import_transformer_layer(self, layer, layer_id, layer_pbar, is_mtp: bool = False):
+        mtp_kwargs = {"is_mtp": is_mtp} if is_mtp else {}
         if not isinstance(layer.input_layernorm, IdentityOp):
-            self.rules["input_layernorm"](layer.input_layernorm, layer_id)
+            self.rules["input_layernorm"](layer.input_layernorm, layer_id, **mtp_kwargs)
@@
-                self.rules["linear_qkv"](attention.linear_qkv, layer_id, is_mtp=is_mtp)
-                self.rules["linear_proj"](attention.linear_proj, layer_id, is_mtp=is_mtp)
+                self.rules["linear_qkv"](attention.linear_qkv, layer_id, **mtp_kwargs)
+                self.rules["linear_proj"](attention.linear_proj, layer_id, **mtp_kwargs)
@@
         if not isinstance(layer.pre_mlp_layernorm, IdentityOp):
-            self.rules["pre_mlp_layernorm"](layer.pre_mlp_layernorm, layer_id)
+            self.rules["pre_mlp_layernorm"](layer.pre_mlp_layernorm, layer_id, **mtp_kwargs)
@@
-                self.rules["router"](layer.mlp.router, layer_id, dtype=self.moe_router_dtype)
+                self.rules["router"](layer.mlp.router, layer_id, dtype=self.moe_router_dtype, **mtp_kwargs)
@@
-                    self.rules["shared_experts.linear_fc1"](fc1, layer_id)
-                    self.rules["shared_experts.linear_fc2"](fc2, layer_id)
+                    self.rules["shared_experts.linear_fc1"](fc1, layer_id, **mtp_kwargs)
+                    self.rules["shared_experts.linear_fc2"](fc2, layer_id, **mtp_kwargs)
@@
-                            self.rules["local_experts.linear_fc1"](fc1, layer_id, expert_id)
-                            self.rules["local_experts.linear_fc2"](fc2, layer_id, expert_id)
+                            self.rules["local_experts.linear_fc1"](fc1, layer_id, expert_id, **mtp_kwargs)
+                            self.rules["local_experts.linear_fc2"](fc2, layer_id, expert_id, **mtp_kwargs)
@@
-                self.rules["linear_fc1"](layer.mlp.linear_fc1, layer_id)
-                self.rules["linear_fc2"](layer.mlp.linear_fc2, layer_id)
+                self.rules["linear_fc1"](layer.mlp.linear_fc1, layer_id, **mtp_kwargs)
+                self.rules["linear_fc2"](layer.mlp.linear_fc2, layer_id, **mtp_kwargs)
🤖 Prompt for AI Agents
In `@modelopt/torch/export/plugins/megatron_importer.py` around lines 531 - 646,
The repeated‑MTP flag (is_mtp) isn't forwarded from _import_transformer_layer to
most rule calls (layernorms, MLP/router/expert weights), causing tensors to
resolve to backbone prefixes; update all rule invocations inside
_import_transformer_layer that handle per-layer weights — e.g.,
"input_layernorm", "q_layernorm", "k_layernorm", "pre_mlp_layernorm",
"linear_fc1", "linear_fc2", "router", "fc1_latent_proj", "fc2_latent_proj",
"shared_experts.linear_fc1", "shared_experts.linear_fc2",
"local_experts.linear_fc1", "local_experts.linear_fc2",
"local_experts.linear_fc1_etp/ep", "local_experts.linear_fc2_etp/ep", and
"experts.linear_fc1/linear_fc2" — to pass the is_mtp flag (e.g., is_mtp=is_mtp)
so repeated‑MTP layers resolve to MTP prefixes correctly.

Comment on lines +347 to 359
def get_kv_cache_scaling_factor(self_attention_module: nn.Module) -> torch.Tensor:
"""
Returns the k and v BMM scaling factors if BMM quantizers are set in the self attention module.
Else returns None by default.
"""
if not hasattr(self_attention_module, "k_bmm_quantizer") or not hasattr(self_attention_module, "v_bmm_quantizer"):
return [None, None]

scaling_factors = [
get_scaling_factor(getattr(kv_module, quantizer))
get_scaling_factor(getattr(self_attention_module, quantizer))
for quantizer in ("k_bmm_quantizer", "v_bmm_quantizer")
]

# For FP8, we recommend default kv cache scaling factor to be 1.
if get_kv_cache_dtype(kv_module) == KV_CACHE_FP8:
for i, factor in enumerate(scaling_factors):
if factor.item() > 0.5:
warn(
f"Warning: Large KV activation detected: {factor.item()}, "
"Quantized KV cache may lead to higher accuracy drop."
)
scaling_factors[i] = torch.max(
factor, torch.tensor([1.0], dtype=torch.float, device=factor.device)
)

return scaling_factors
Contributor:

⚠️ Potential issue | 🟡 Minor

Fix return type annotation for KV‑cache scaling.
This function returns a list (e.g., [k_scale, v_scale]), but the signature says torch.Tensor. That mismatch will confuse callers and type checkers.

🛠️ Suggested fix
-def get_kv_cache_scaling_factor(self_attention_module: nn.Module) -> torch.Tensor:
+def get_kv_cache_scaling_factor(
+    self_attention_module: nn.Module,
+) -> list[torch.Tensor | None]:
🤖 Prompt for AI Agents
In `@modelopt/torch/export/quant_utils.py` around lines 347 - 359, The return type
of get_kv_cache_scaling_factor is wrong: it returns a list of scaling factors
(possibly [None, None]) not a torch.Tensor; update the signature of
get_kv_cache_scaling_factor to return a typed container such as
List[Optional[torch.Tensor]] or Tuple[Optional[torch.Tensor],
Optional[torch.Tensor]] and import the corresponding typing names (List/Tuple
and Optional), and ensure callers expecting a torch.Tensor are adjusted if
necessary; reference function get_kv_cache_scaling_factor and attributes
k_bmm_quantizer, v_bmm_quantizer and helper get_scaling_factor when making the
change.

Comment on lines +497 to +499
# MTP layer
self._get_mtp_state_dict()

Contributor:

⚠️ Potential issue | 🟠 Major

Guard MTP export when no pretrained path (or no MTP).
_get_mtp_state_dict() is called unconditionally and will error when pretrained_model_name_or_path is None (or when no MTP module exists).

🛠️ Suggested fix
-        # MTP layer
-        self._get_mtp_state_dict()
+        # MTP layer
+        if hasattr(model, "mtp") and self._hf_pretrained_model_name:
+            self._get_mtp_state_dict()

Also applies to: 591-619

🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_megatron.py` around lines 497 - 499, The
call to _get_mtp_state_dict() is unconditional and will raise when no
pretrained_model_name_or_path or no MTP module exists; update the code to only
call _get_mtp_state_dict() when self.pretrained_model_name_or_path is truthy and
the MTP module exists (e.g., check getattr(self, "mtp", None) or hasattr(self,
"mtp") before calling). Apply the same guard to the other MTP-related export
block (the later section that currently calls _get_mtp_state_dict/_export for
lines ~591-619) so MTP export is skipped when no pretrained path or MTP is
present.

Comment on lines +787 to 789
if qformat is None and "norm" not in prefix: # Add exclude layers for hf_quant_config
self.exclude_modules.append(prefix)
block_size = get_weight_block_size(module)
Contributor:

⚠️ Potential issue | 🟠 Major

Exclude‑modules list never populates for unquantized layers.
get_quantization_format returns QUANTIZATION_NONE, so qformat is None is never true and exclude_modules stays empty. That makes hf_quant_config incorrect for mixed/partial quantization.

🛠️ Suggested fix
-        if qformat is None and "norm" not in prefix:  # Add exclude layers for hf_quant_config
+        if qformat == QUANTIZATION_NONE and "norm" not in prefix:  # Add exclude layers for hf_quant_config
             self.exclude_modules.append(prefix)
🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_megatron.py` around lines 787 - 789, The
condition treating qformat as None is wrong because get_quantization_format
returns QUANTIZATION_NONE (an enum/const) not None; update the branch that
currently checks "if qformat is None and 'norm' not in prefix:" to instead
detect unquantized format (e.g., qformat == QUANTIZATION_NONE or qformat in
{None, QUANTIZATION_NONE}) and then append prefix to self.exclude_modules so
hf_quant_config correctly excludes unquantized layers; locate qformat usage near
get_quantization_format and get_weight_block_size and amend the conditional
accordingly.

Comment on lines +103 to +104

# Step 2:Sync amax across data parallelism
Contributor:

⚠️ Potential issue | 🟡 Minor

Minor formatting issue: missing space after colon.

Proposed fix
-    
-    # Step 2:Sync amax across data parallelism
+
+    # Step 2: Sync amax across data parallelism
🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/model_calib.py` around lines 103 - 104, The
comment "Step 2:Sync amax across data parallelism" has a missing space after the
colon; locate the comment in model_calib.py (the line containing "Step 2:Sync
amax across data parallelism") and update it to "Step 2: Sync amax across data
parallelism" so the formatting is consistent.

jenchen13 closed this Jan 29, 2026
jenchen13 deleted the jennifchen/superv3 branch January 29, 2026 17:52
