
Latent MOE & Repeated MTP support; fix KV cache quant export #768

Closed
jenchen13 wants to merge 13 commits into main from jennifchen/superv3

Conversation

jenchen13 (Contributor) commented Jan 13, 2026

What does this PR do?

Type of change: New feature

Overview:
New Nemotron models use TransformerLayer.forward() instead of MoELayer.forward() for MOE. This breaks our quantization implementation for Nano3, which relied on patching MoELayer.forward() to force tokens to be routed to all experts during calibration.

  • Add a patch for TransformerLayer.forward() that forces tokens to be routed to all experts during PTQ calibration (a minimal sketch follows this list)
  • Enable latent MOE modules during Megatron import/export
  • Fix KV cache quantization export: remove the old qkv_layer.output_quantizer export and replace it with proper k/v_bmm_quantizer logic
  • Remove KV quantization scale clamping to 1
  • Improve EP amax sync
  • Support repeated MTP import/export for NemotronH models
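
Expanding on the first bullet above, here is a minimal sketch of the "route to all experts" calibration idea. It is written against a generic Megatron-style MoE router rather than the actual TransformerLayer.forward() patch in this PR; the moe_layer.router attribute path and its (probs, routing_map) return convention are illustrative assumptions.

import torch
from contextlib import contextmanager

@contextmanager
def route_to_all_experts(moe_layer, num_experts):
    """Temporarily make the router dispatch every token to every expert.

    During PTQ calibration this ensures each expert's quantizers observe
    activations, even for experts the router would normally skip.
    `moe_layer.router` and its return convention are assumptions here.
    """
    router = moe_layer.router
    original_forward = router.forward

    def forward_all_experts(hidden_states):
        probs, routing_map = original_forward(hidden_states)
        probs = torch.full_like(probs, 1.0 / num_experts)             # uniform expert weights
        routing_map = torch.ones_like(routing_map, dtype=torch.bool)  # every expert active
        return probs, routing_map

    router.forward = forward_all_experts
    try:
        yield
    finally:
        router.forward = original_forward

Calibration batches can then be run inside "with route_to_all_experts(layer.mlp, num_experts):" so that amax statistics are collected for every expert.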

TODO: Potentially remove the MoELayer quant config if all future MOEs use TransformerLayer instead?

Usage

# Add a code snippet demonstrating how to use this
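
Since the snippet above was left as a template placeholder, here is a hedged sketch of the intended flow. mtq.quantize and mtq.FP8_DEFAULT_CFG are standard ModelOpt entry points; the export call name and its arguments are assumptions based on the files this PR touches (unified_export_megatron.py), not an API confirmed by the PR.

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_mcore_gpt_to_hf  # assumed export entry point

def quantize_and_export(megatron_model, calib_dataloader, export_dir):
    """PTQ-calibrate a Megatron model and export an HF-style checkpoint."""

    def forward_loop(model):
        # Run a handful of calibration batches; amax observers record ranges here,
        # and the TransformerLayer patch from this PR routes tokens to all experts.
        for batch in calib_dataloader:
            model(**batch)

    model = mtq.quantize(megatron_model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Export, including the k/v BMM KV-cache scaling factors fixed in this PR.
    export_mcore_gpt_to_hf(
        model,
        pretrained_model_name_or_path="nvidia/NemotronH-base",  # hypothetical model id
        export_dir=export_dir,
    )
    return model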

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced support for Nemotron models including latent Mixture-of-Experts and grouped local expert handling
    • Added attention scaling optimization for improved quantized model inference
    • Integrated Transformer Engine support for advanced quantization
  • Improvements

    • Refined KV cache quantization and scaling for better inference performance
    • Optimized quantization calibration across distributed training environments
    • Enhanced model export with improved quantization configuration compatibility


Signed-off-by: jenchen13 <jennifchen@nvidia.com>
copy-pr-bot bot commented Jan 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai bot (Contributor) commented Jan 13, 2026

📝 Walkthrough


This pull request extends the modelopt export and quantization pipeline to support Megatron TE (Transformer Engine) and MTP (Multi-Token Prediction) configurations, MoE (Mixture of Experts) improvements, and enhanced KV-cache scaling for quantization. The changes introduce custom module mappings, reorganize import/export logic, refactor KV-cache handling, and add distributed synchronization for expert amax values.

Changes

Cohort / File(s) Summary
Custom Module Mappings
modelopt/torch/export/plugins/mcore_custom.py
Added two new CustomModuleMapping classes: GroupedMLPMerging for merging grouped MLP projections and SelfAttentionScaling for self-attention scaling.
Nemotron Model Configuration
modelopt/torch/export/plugins/mcore_nemotron.py
Extended import/export mappings with new latent MoE, MTP (repeated) modules, and grouped local experts support; integrated GroupedMLPMerging and SelfAttentionScaling custom mappings.
Megatron Importer Logic
modelopt/torch/export/plugins/megatron_importer.py
Major refactoring: added _grouped_mlp_merging method, introduced is_mtp flag propagation, modularized layer imports via _import_mamba_layer and _import_transformer_layer helpers, rewrote MoE import flow for local/shared/TE-grouped cases, expanded MTP handling for single and repeated configurations, switched to dist alias for torch.distributed calls, and added rank-based gating for progress output.
KV-Cache Quantization Utilities
modelopt/torch/export/quant_utils.py
Refactored get_kv_cache_scaling_factor parameter naming (kv_module → self_attention_module) and return type annotation, removed FP8 runtime warnings and threshold clamping, introduced _compute_kv_cache_dtype helper to centralize KV cache dtype computation logic.
Megatron Export Pipeline
modelopt/torch/export/unified_export_megatron.py
Replaced standalone KV-cache scaling with new _self_attention_scaling method, introduced exclude_modules collection across ranks for HF quant config, added NVFP4 handling in quantization config, added MTP safetensors loading path, updated moe_router_dtype parameter to string type, extended _get_quantized_state with prefix parameter for per-layer quantization handling, and added config.json augmentation with converted quant format.
Quantization Synchronization
modelopt/torch/quantization/model_calib.py
Reorganized amax synchronization steps: added Step 1 for MoE local experts sync, relabeled the DP sync to Step 2, and added Step 3 for TP-aware synchronization across input and weight quantizers based on the parallel config (a generic max-reduction sketch follows this table).
Megatron Quantization Plugins
modelopt/torch/quantization/plugins/megatron.py
Added _QuantTEMCoreLinear class registration for Transformer Engine linear quantization, imported megatron.core.transformer.transformer_layer and torch.distributed as dist, removed the local barrier in sync_moe_local_experts_amax, and adjusted forward calls to assign the output variable in quantized MoE paths.
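
To make the Quantization Synchronization entry above concrete, here is a generic sketch of max-reducing quantizer amax buffers across a process group with torch.distributed. The weight_quantizer / input_quantizer attribute names follow common ModelOpt conventions, but the grouping and ordering logic in the real model_calib.py is more involved.

import torch.distributed as dist

def sync_quantizer_amax(model, process_group=None):
    """All-reduce each quantizer's amax with MAX across a parallel group.

    Simplified illustration of the DP/TP sync steps summarized above;
    the quantizer attribute names are assumptions.
    """
    for module in model.modules():
        for attr in ("weight_quantizer", "input_quantizer"):
            quantizer = getattr(module, attr, None)
            amax = getattr(quantizer, "amax", None)
            if amax is None:
                continue
            # Every rank must reach this call with the same group and order.
            dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=process_group)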

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly captures the three main objectives: latent MOE support, repeated MTP support, and KV cache quantization export fixes.



codecov bot commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.62%. Comparing base (5104513) to head (4471c03).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/quantization/model_calib.py 40.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #768      +/-   ##
==========================================
- Coverage   74.62%   74.62%   -0.01%     
==========================================
  Files         192      192              
  Lines       18989    18992       +3     
==========================================
+ Hits        14171    14172       +1     
- Misses       4818     4820       +2     


Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
jenchen13 changed the title from "Latent MOE support and fix MOE amax sync" to "Latent MOE support and patch TransformerLayer forward for MOE" on Jan 13, 2026
Comment on lines 98 to 105
# Step 1: Sync amax across local experts in a SequentialMLP
for name, module in model.named_modules():
if hasattr(module, "sync_moe_local_experts_amax"):
module.sync_moe_local_experts_amax()

# TODO just for testing
if "experts" in name and "weight_quantizer" in name:
assert child.amax is not None
Contributor:

Can we move this before the distributed sync check? This is not doing anything specific to distributed sync.

Signed-off-by: jenchen13 <jennifchen@nvidia.com>
# TODO double check if MOE forward will be implemented in MoELayer or TransformerLayer
# We do not need both layers to be patched

@QuantModuleRegistry.register({megatron_transformer_layer.TransformerLayer: "megatron_transformer_layer_TransformerLayer"})
Contributor Author:

TODO maybe remove this since MOE forward will not be removed in MLM main?

"input_layernorm": NameRemapping("backbone.layers.{}.norm."),
"linear_qkv": QKVSlicing("backbone.layers.{}.mixer."),
"linear_proj": NameRemapping("backbone.layers.{}.mixer.o_proj."),
"core_attention": SelfAttentionScaling("backbone.layers.{}.mixer."),
Contributor Author:

Double-check that this is only needed for export.

Signed-off-by: jenchen13 <jennifchen@nvidia.com>
jenchen13 changed the title from "Latent MOE support and patch TransformerLayer forward for MOE" to "Latent MOE support and fix KV cache quant export" on Jan 20, 2026
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
jenchen13 changed the title from "Latent MOE support and fix KV cache quant export" to "Latent MOE & Repeated MTP support; fix KV cache quant export" on Jan 22, 2026
]


# This path uses output_quantizer for KV cache quantization.
Collaborator:

I hope to see more proof before we remove this.

Also, TRTLLM right now by default still uses 1 as the KV cache scale (ignoring the values we set here).

Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
layer.mlp.router, layer_id, dtype=self.moe_router_dtype
)
if hasattr(layer.mlp, "fc1_latent_proj") and layer.mlp.fc1_latent_proj is not None:
self.rules["fc1_latent_proj"](layer.mlp.fc1_latent_proj, layer_id)
Contributor Author:

latent MOE import added here

Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
jenchen13 marked this pull request as ready for review January 29, 2026 16:53
jenchen13 requested a review from a team as a code owner January 29, 2026 16:53
jenchen13 requested a review from a team as a code owner January 29, 2026 16:53
jenchen13 requested a review from jingyu-ml January 29, 2026 16:53
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
modelopt/torch/quantization/model_calib.py (1)

110-152: Inconsistent step numbering in comments.

Line 110 introduces "Step 3: TP sync" but line 152 contains a comment saying "Step 2: Sync amax across relevant parallelism". This is confusing since the code at line 152 is part of the TP synchronization logic introduced by the Step 3 comment.

Proposed fix
-    # Step 2: Sync amax across relevant parallelism (such as TP / EP)
+    # Sync amax across relevant parallelism (such as TP / EP)

Alternatively, remove the duplicate comment entirely since it's redundant with the Step 3 header.

modelopt/torch/export/quant_utils.py (1)

363-415: Preserve affine detection when choosing KV‑cache dtype.
is_affine is computed in get_kv_cache_dtype but then ignored; _compute_kv_cache_dtype always treats (2,1) as affine. That changes export semantics and can mislabel non‑affine NVFP4 as NVFP4_AFFINE.

🛠️ Suggested fix
-    return _compute_kv_cache_dtype(num_bits_list)
+    return _compute_kv_cache_dtype(num_bits_list, is_affine)

-def _compute_kv_cache_dtype(num_bits_list: list[int]) -> str | None:
+def _compute_kv_cache_dtype(num_bits_list: list[int], is_affine: bool) -> str | None:
@@
-    is_affine = True
-
     if (4, 3) in num_bits_list:
         return KV_CACHE_FP8
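
For readers unfamiliar with the num_bits convention used here, a hedged sketch of what a _compute_kv_cache_dtype helper taking the is_affine flag could look like. The constant values are placeholders and the real mapping in quant_utils.py may differ; the key fact is that num_bits of (4, 3) corresponds to FP8 E4M3 and (2, 1) to NVFP4 E2M1.

# Placeholder values; the real constants live in ModelOpt's export utilities.
KV_CACHE_FP8 = "FP8"
KV_CACHE_NVFP4 = "NVFP4"
KV_CACHE_NVFP4_AFFINE = "NVFP4_AFFINE"

def _compute_kv_cache_dtype(num_bits_list, is_affine):
    """Map observed quantizer num_bits values to a KV-cache dtype label."""
    if (4, 3) in num_bits_list:  # FP8 E4M3 quantizers present
        return KV_CACHE_FP8
    if (2, 1) in num_bits_list:  # NVFP4 E2M1 quantizers present
        return KV_CACHE_NVFP4_AFFINE if is_affine else KV_CACHE_NVFP4
    return None
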
🤖 Fix all issues with AI agents
In `@modelopt/torch/export/plugins/mcore_nemotron.py`:
- Line 100: The inline stale TODO comment "# TODO ADD MTP export" in
mcore_nemotron.py is now misleading because MTP export mappings are present
further down; remove that TODO or replace it with an accurate note describing
any remaining specific work (e.g., "TODO: verify MTP mapping completeness" or
similar). Locate the comment text "# TODO ADD MTP export" in the module
(mcore_nemotron.py) and either delete it or update it to a precise, current TODO
referencing the actual remaining task.

In `@modelopt/torch/export/plugins/megatron_importer.py`:
- Around line 531-646: The repeated‑MTP flag (is_mtp) isn't forwarded from
_import_transformer_layer to most rule calls (layernorms, MLP/router/expert
weights), causing tensors to resolve to backbone prefixes; update all rule
invocations inside _import_transformer_layer that handle per-layer weights —
e.g., "input_layernorm", "q_layernorm", "k_layernorm", "pre_mlp_layernorm",
"linear_fc1", "linear_fc2", "router", "fc1_latent_proj", "fc2_latent_proj",
"shared_experts.linear_fc1", "shared_experts.linear_fc2",
"local_experts.linear_fc1", "local_experts.linear_fc2",
"local_experts.linear_fc1_etp/ep", "local_experts.linear_fc2_etp/ep", and
"experts.linear_fc1/linear_fc2" — to pass the is_mtp flag (e.g., is_mtp=is_mtp)
so repeated‑MTP layers resolve to MTP prefixes correctly.

In `@modelopt/torch/export/quant_utils.py`:
- Around line 347-359: The return type of get_kv_cache_scaling_factor is wrong:
it returns a list of scaling factors (possibly [None, None]) not a torch.Tensor;
update the signature of get_kv_cache_scaling_factor to return a typed container
such as List[Optional[torch.Tensor]] or Tuple[Optional[torch.Tensor],
Optional[torch.Tensor]] and import the corresponding typing names (List/Tuple
and Optional), and ensure callers expecting a torch.Tensor are adjusted if
necessary; reference function get_kv_cache_scaling_factor and attributes
k_bmm_quantizer, v_bmm_quantizer and helper get_scaling_factor when making the
change.

In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 497-499: The call to _get_mtp_state_dict() is unconditional and
will raise when no pretrained_model_name_or_path or no MTP module exists; update
the code to only call _get_mtp_state_dict() when
self.pretrained_model_name_or_path is truthy and the MTP module exists (e.g.,
check getattr(self, "mtp", None) or hasattr(self, "mtp") before calling). Apply
the same guard to the other MTP-related export block (the later section that
currently calls _get_mtp_state_dict/_export for lines ~591-619) so MTP export is
skipped when no pretrained path or MTP is present.
- Around line 787-789: The condition treating qformat as None is wrong because
get_quantization_format returns QUANTIZATION_NONE (an enum/const) not None;
update the branch that currently checks "if qformat is None and 'norm' not in
prefix:" to instead detect unquantized format (e.g., qformat ==
QUANTIZATION_NONE or qformat in {None, QUANTIZATION_NONE}) and then append
prefix to self.exclude_modules so hf_quant_config correctly excludes unquantized
layers; locate qformat usage near get_quantization_format and
get_weight_block_size and amend the conditional accordingly.

In `@modelopt/torch/quantization/model_calib.py`:
- Around line 103-104: The comment "Step 2:Sync amax across data parallelism"
has a missing space after the colon; locate the comment in model_calib.py (the
line containing "Step 2:Sync amax across data parallelism") and update it to
"Step 2: Sync amax across data parallelism" so the formatting is consistent.
🧹 Nitpick comments (2)
modelopt/torch/export/unified_export_megatron.py (1)

129-134: Avoid always‑on debug prints in export paths.
These prints will spam logs for large models. Consider gating them behind a debug flag or using a logger at debug level (a generic gating sketch follows these nitpick comments).

Also applies to: 856-891

modelopt/torch/export/plugins/megatron_importer.py (1)

265-289: Grouped MLP import ignores quantized weight scales.
The TODO leaves weight scales unhandled, which will break quantized checkpoints. Consider loading {weight_scale_name} per expert or explicitly documenting that quantized TEGroupedMLP import isn’t supported yet.

If you want, I can draft the scale‑loading logic for this path.
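
Regarding the first nitpick above about always-on prints, here is a generic rank-0 / DEBUG-level gating sketch; the helper name and where it would live in unified_export_megatron.py are assumptions, not the PR's actual logging setup.

import logging
import torch.distributed as dist

logger = logging.getLogger(__name__)

def log_export_progress(message: str) -> None:
    """Emit export progress only from rank 0, and only when DEBUG logging is enabled."""
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        return
    logger.debug(message)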


}

# TODO ADD MTP export
Contributor:

⚠️ Potential issue | 🟡 Minor

Stale TODO comment.
The code now includes MTP export mappings below, so this TODO is misleading. Please remove or update it to reflect the remaining work.

🤖 Prompt for AI Agents
In `@modelopt/torch/export/plugins/mcore_nemotron.py` at line 100, The inline
stale TODO comment "# TODO ADD MTP export" in mcore_nemotron.py is now
misleading because MTP export mappings are present further down; remove that
TODO or replace it with an accurate note describing any remaining specific work
(e.g., "TODO: verify MTP mapping completeness" or similar). Locate the comment
text "# TODO ADD MTP export" in the module (mcore_nemotron.py) and either delete
it or update it to a precise, current TODO referencing the actual remaining
task.

Comment on lines +531 to +646
def _import_transformer_layer(self, layer, layer_id, layer_pbar, is_mtp: bool = False):
if not isinstance(layer.input_layernorm, IdentityOp):
self.rules["input_layernorm"](layer.input_layernorm, layer_id)

attention = layer.self_attention
if not isinstance(attention, IdentityOp):
if "MLASelfAttention" in str(type(attention)):
if hasattr(attention, "linear_q_proj"):
layer_pbar.set_description("Importing MLA (without q LoRA)")
self.rules["linear_q_proj"](attention.linear_q_proj, layer_id)
else:
layer_pbar.set_description("Importing MLA (with q LoRA)")
self.rules["linear_q_down_proj"](attention.linear_q_down_proj, layer_id)
self.rules["linear_q_layernorm"](attention.q_layernorm, layer_id)
self.rules["linear_q_up_proj"](attention.linear_q_up_proj, layer_id)
self.rules["linear_kv_down_proj"](attention.linear_kv_down_proj, layer_id)
self.rules["linear_kv_layernorm"](attention.kv_layernorm, layer_id)
self.rules["linear_kv_up_proj"](attention.linear_kv_up_proj, layer_id)
self.rules["linear_proj"](attention.linear_proj, layer_id)
else:
layer_pbar.set_description("Importing GQA/MHA")
if attention.q_layernorm is not None and not isinstance(
attention.q_layernorm, (IdentityOp, L2Norm)
):
self.rules["q_layernorm"](attention.q_layernorm, layer_id)
self.rules["k_layernorm"](attention.k_layernorm, layer_id)
self.rules["linear_qkv"](attention.linear_qkv, layer_id, is_mtp=is_mtp)
self.rules["linear_proj"](attention.linear_proj, layer_id, is_mtp=is_mtp)
if getattr(attention.core_attention, "softmax_offset", None) is not None:
self.rules["softmax_offset"](
attention.core_attention.softmax_offset, layer_id
)

if not isinstance(layer.pre_mlp_layernorm, IdentityOp):
self.rules["pre_mlp_layernorm"](layer.pre_mlp_layernorm, layer_id)

if not isinstance(layer.mlp, IdentityOp):
if "MoE" in str(type(layer.mlp)):
layer_pbar.set_description(f"Importing MoE with moe_router_dtype: {self.moe_router_dtype}")
self.rules["router"](
layer.mlp.router, layer_id, dtype=self.moe_router_dtype
)
if hasattr(layer.mlp, "fc1_latent_proj") and layer.mlp.fc1_latent_proj is not None:
self.rules["fc1_latent_proj"](layer.mlp.fc1_latent_proj, layer_id)
if hasattr(layer.mlp, "fc2_latent_proj") and layer.mlp.fc2_latent_proj is not None:
self.rules["fc2_latent_proj"](layer.mlp.fc2_latent_proj, layer_id)

if (
hasattr(layer.mlp, "shared_experts")
and layer.mlp.shared_experts is not None
):
layer_pbar.set_description("Importing MoE shared experts")
fc1 = layer.mlp.shared_experts.linear_fc1
fc2 = layer.mlp.shared_experts.linear_fc2
self.rules["shared_experts.linear_fc1"](fc1, layer_id)
self.rules["shared_experts.linear_fc2"](fc2, layer_id)
if not self.rules.get("use_packed_local_experts", False): # Import local experts
experts = layer.mlp.experts
if hasattr(experts, "local_experts"):
for local_expert_id, expert in tqdm(
enumerate(layer.mlp.experts.local_experts),
desc="Importing MoE local experts",
leave=False,
disable=self.disable_tqdm,
):
expert_id = layer.mlp.local_expert_indices[local_expert_id]
fc1 = expert.linear_fc1
fc2 = expert.linear_fc2
self.rules["local_experts.linear_fc1"](fc1, layer_id, expert_id)
self.rules["local_experts.linear_fc2"](fc2, layer_id, expert_id)
else: # Slice TEGroupedMLP
layer_pbar.set_description("Importing MoE grouped local experts")
num_local_experts = experts.num_local_experts
num_global_experts = experts.config.num_moe_experts
assert num_local_experts == num_global_experts, "num_local_experts must be equal to num_global_experts during MoE import"

'''
if parallel_config is not None:
etp_size = get_expert_tensor_parallel_world_size()
# etp_rank = get_expert_tensor_parallel_rank() # this gives group rank
etp_rank = dist.get_rank()
print(f"etp_size: {etp_size}")
print(f"etp_rank: {etp_rank}")
assert num_local_experts * etp_size == num_global_experts
init_index = etp_rank * num_local_experts
'''
init_index = 0

self.rules["experts.linear_fc1"](experts.linear_fc1, layer_id, init_expert_id=init_index, num_local_experts=num_local_experts)
self.rules["experts.linear_fc2"](experts.linear_fc2, layer_id, init_expert_id=init_index, num_local_experts=num_local_experts)

# We only support either EP or ETP for now
elif get_expert_tensor_parallel_world_size() > 1:
# ETP supports for packed MoE
# ETP is not supported for gpt-oss model
if self.arch in ["GptOssForCausalLM"]:
raise ValueError("ETP is not supported for gpt-oss model")
self.rules["local_experts.linear_fc1_etp"](
layer.mlp.experts.local_experts, layer_id
)
self.rules["local_experts.linear_fc2_etp"](
layer.mlp.experts.local_experts, layer_id
)
else:
# EP supports for packed MoE
self.rules["local_experts.linear_fc1_ep"](
layer.mlp.experts.local_experts, layer_id
)
self.rules["local_experts.linear_fc2_ep"](
layer.mlp.experts.local_experts, layer_id
)
else:
layer_pbar.set_description("Importing MLP")
self.rules["linear_fc1"](layer.mlp.linear_fc1, layer_id)
self.rules["linear_fc2"](layer.mlp.linear_fc2, layer_id)

Contributor:

⚠️ Potential issue | 🟠 Major

Propagate is_mtp to all rule calls in repeated‑MTP layers.
Only linear_qkv and linear_proj receive is_mtp, so layernorm/MLP/router/expert weights still resolve to backbone prefixes. That will import the wrong tensors for repeated MTP.

🛠️ Suggested fix
     def _import_transformer_layer(self, layer, layer_id, layer_pbar, is_mtp: bool = False):
+        mtp_kwargs = {"is_mtp": is_mtp} if is_mtp else {}
         if not isinstance(layer.input_layernorm, IdentityOp):
-            self.rules["input_layernorm"](layer.input_layernorm, layer_id)
+            self.rules["input_layernorm"](layer.input_layernorm, layer_id, **mtp_kwargs)
@@
-                self.rules["linear_qkv"](attention.linear_qkv, layer_id, is_mtp=is_mtp)
-                self.rules["linear_proj"](attention.linear_proj, layer_id, is_mtp=is_mtp)
+                self.rules["linear_qkv"](attention.linear_qkv, layer_id, **mtp_kwargs)
+                self.rules["linear_proj"](attention.linear_proj, layer_id, **mtp_kwargs)
@@
         if not isinstance(layer.pre_mlp_layernorm, IdentityOp):
-            self.rules["pre_mlp_layernorm"](layer.pre_mlp_layernorm, layer_id)
+            self.rules["pre_mlp_layernorm"](layer.pre_mlp_layernorm, layer_id, **mtp_kwargs)
@@
-                self.rules["router"](layer.mlp.router, layer_id, dtype=self.moe_router_dtype)
+                self.rules["router"](layer.mlp.router, layer_id, dtype=self.moe_router_dtype, **mtp_kwargs)
@@
-                    self.rules["shared_experts.linear_fc1"](fc1, layer_id)
-                    self.rules["shared_experts.linear_fc2"](fc2, layer_id)
+                    self.rules["shared_experts.linear_fc1"](fc1, layer_id, **mtp_kwargs)
+                    self.rules["shared_experts.linear_fc2"](fc2, layer_id, **mtp_kwargs)
@@
-                            self.rules["local_experts.linear_fc1"](fc1, layer_id, expert_id)
-                            self.rules["local_experts.linear_fc2"](fc2, layer_id, expert_id)
+                            self.rules["local_experts.linear_fc1"](fc1, layer_id, expert_id, **mtp_kwargs)
+                            self.rules["local_experts.linear_fc2"](fc2, layer_id, expert_id, **mtp_kwargs)
@@
-                self.rules["linear_fc1"](layer.mlp.linear_fc1, layer_id)
-                self.rules["linear_fc2"](layer.mlp.linear_fc2, layer_id)
+                self.rules["linear_fc1"](layer.mlp.linear_fc1, layer_id, **mtp_kwargs)
+                self.rules["linear_fc2"](layer.mlp.linear_fc2, layer_id, **mtp_kwargs)
🤖 Prompt for AI Agents
In `@modelopt/torch/export/plugins/megatron_importer.py` around lines 531 - 646,
The repeated‑MTP flag (is_mtp) isn't forwarded from _import_transformer_layer to
most rule calls (layernorms, MLP/router/expert weights), causing tensors to
resolve to backbone prefixes; update all rule invocations inside
_import_transformer_layer that handle per-layer weights — e.g.,
"input_layernorm", "q_layernorm", "k_layernorm", "pre_mlp_layernorm",
"linear_fc1", "linear_fc2", "router", "fc1_latent_proj", "fc2_latent_proj",
"shared_experts.linear_fc1", "shared_experts.linear_fc2",
"local_experts.linear_fc1", "local_experts.linear_fc2",
"local_experts.linear_fc1_etp/ep", "local_experts.linear_fc2_etp/ep", and
"experts.linear_fc1/linear_fc2" — to pass the is_mtp flag (e.g., is_mtp=is_mtp)
so repeated‑MTP layers resolve to MTP prefixes correctly.

Comment on lines +347 to 359
def get_kv_cache_scaling_factor(self_attention_module: nn.Module) -> torch.Tensor:
"""
Returns the k and v BMM scaling factors if BMM quantizers are set in the self attention module.
Else returns None by default.
"""
if not hasattr(self_attention_module, "k_bmm_quantizer") or not hasattr(self_attention_module, "v_bmm_quantizer"):
return [None, None]

scaling_factors = [
get_scaling_factor(getattr(kv_module, quantizer))
get_scaling_factor(getattr(self_attention_module, quantizer))
for quantizer in ("k_bmm_quantizer", "v_bmm_quantizer")
]

# For FP8, we recommend default kv cache scaling factor to be 1.
if get_kv_cache_dtype(kv_module) == KV_CACHE_FP8:
for i, factor in enumerate(scaling_factors):
if factor.item() > 0.5:
warn(
f"Warning: Large KV activation detected: {factor.item()}, "
"Quantized KV cache may lead to higher accuracy drop."
)
scaling_factors[i] = torch.max(
factor, torch.tensor([1.0], dtype=torch.float, device=factor.device)
)

return scaling_factors
Contributor:

⚠️ Potential issue | 🟡 Minor

Fix return type annotation for KV‑cache scaling.
This function returns a list (e.g., [k_scale, v_scale]), but the signature says torch.Tensor. That mismatch will confuse callers and type checkers.

🛠️ Suggested fix
-def get_kv_cache_scaling_factor(self_attention_module: nn.Module) -> torch.Tensor:
+def get_kv_cache_scaling_factor(
+    self_attention_module: nn.Module,
+) -> list[torch.Tensor | None]:
🤖 Prompt for AI Agents
In `@modelopt/torch/export/quant_utils.py` around lines 347 - 359, The return type
of get_kv_cache_scaling_factor is wrong: it returns a list of scaling factors
(possibly [None, None]) not a torch.Tensor; update the signature of
get_kv_cache_scaling_factor to return a typed container such as
List[Optional[torch.Tensor]] or Tuple[Optional[torch.Tensor],
Optional[torch.Tensor]] and import the corresponding typing names (List/Tuple
and Optional), and ensure callers expecting a torch.Tensor are adjusted if
necessary; reference function get_kv_cache_scaling_factor and attributes
k_bmm_quantizer, v_bmm_quantizer and helper get_scaling_factor when making the
change.

Comment on lines +497 to +499
# MTP layer
self._get_mtp_state_dict()

Contributor:

⚠️ Potential issue | 🟠 Major

Guard MTP export when no pretrained path (or no MTP).
_get_mtp_state_dict() is called unconditionally and will error when pretrained_model_name_or_path is None (or when no MTP module exists).

🛠️ Suggested fix
-        # MTP layer
-        self._get_mtp_state_dict()
+        # MTP layer
+        if hasattr(model, "mtp") and self._hf_pretrained_model_name:
+            self._get_mtp_state_dict()

Also applies to: 591-619

🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_megatron.py` around lines 497 - 499, The
call to _get_mtp_state_dict() is unconditional and will raise when no
pretrained_model_name_or_path or no MTP module exists; update the code to only
call _get_mtp_state_dict() when self.pretrained_model_name_or_path is truthy and
the MTP module exists (e.g., check getattr(self, "mtp", None) or hasattr(self,
"mtp") before calling). Apply the same guard to the other MTP-related export
block (the later section that currently calls _get_mtp_state_dict/_export for
lines ~591-619) so MTP export is skipped when no pretrained path or MTP is
present.

Comment on lines +787 to 789
if qformat is None and "norm" not in prefix: # Add exclude layers for hf_quant_config
self.exclude_modules.append(prefix)
block_size = get_weight_block_size(module)
Contributor:

⚠️ Potential issue | 🟠 Major

Exclude‑modules list never populates for unquantized layers.
get_quantization_format returns QUANTIZATION_NONE, so qformat is None is never true and exclude_modules stays empty. That makes hf_quant_config incorrect for mixed/partial quantization.

🛠️ Suggested fix
-        if qformat is None and "norm" not in prefix:  # Add exclude layers for hf_quant_config
+        if qformat == QUANTIZATION_NONE and "norm" not in prefix:  # Add exclude layers for hf_quant_config
             self.exclude_modules.append(prefix)
🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_megatron.py` around lines 787 - 789, The
condition treating qformat as None is wrong because get_quantization_format
returns QUANTIZATION_NONE (an enum/const) not None; update the branch that
currently checks "if qformat is None and 'norm' not in prefix:" to instead
detect unquantized format (e.g., qformat == QUANTIZATION_NONE or qformat in
{None, QUANTIZATION_NONE}) and then append prefix to self.exclude_modules so
hf_quant_config correctly excludes unquantized layers; locate qformat usage near
get_quantization_format and get_weight_block_size and amend the conditional
accordingly.

Comment on lines +103 to +104

# Step 2:Sync amax across data parallelism
Contributor:

⚠️ Potential issue | 🟡 Minor

Minor formatting issue: missing space after colon.

Proposed fix
-    
-    # Step 2:Sync amax across data parallelism
+
+    # Step 2: Sync amax across data parallelism
🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/model_calib.py` around lines 103 - 104, The
comment "Step 2:Sync amax across data parallelism" has a missing space after the
colon; locate the comment in model_calib.py (the line containing "Step 2:Sync
amax across data parallelism") and update it to "Step 2: Sync amax across data
parallelism" so the formatting is consistent.

jenchen13 closed this Jan 29, 2026
jenchen13 deleted the jennifchen/superv3 branch January 29, 2026 17:52
