
support static NVFP4 HF export #858

Open
Fridah-nv wants to merge 9 commits into main from fridah/static-fp4-export

Conversation


@Fridah-nv Fridah-nv commented Feb 5, 2026

What does this PR do?

Type of change: new feature

Overview:
Supports exporting NVFP4StaticQuantizer in the unified Hugging Face checkpoint, as a deployment path for PTQ algorithms such as MSE.

Usage

# checkpoint generation
python examples/llm_ptq/hf_ptq.py --pyt_ckpt_path Qwen/Qwen3-8B  --qformat nvfp4_mse --export_path test-Qwen3-8B-Instruct-MSE-FP8-sweep-FP4 --kv_cache_qformat none --trust_remote_code 

Testing

Tested the generated Qwen3-8B checkpoint with trtllm serve and the nv_eval example in Model-Optimizer-Internal/examples/nv_eval.

NV eval results:

| Groups |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|--------|-------|------|------|-----------|---|-----:|---|-----:|
|mmlu_str|       |none  |      |exact_match|↑  |0.7186|±  |0.0036|

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features

    • Added support for static NVFP4 quantizers that utilize pre-computed calibration scales.
    • Introduced new NVFP4 W4A4 quantization configuration with optional FP8 scale sweep.
  • Performance Improvements

    • Static quantizers now skip unnecessary dynamic scaling factor recalculation.
    • Unified quantization handling for improved consistency and efficiency.


copy-pr-bot bot commented Feb 5, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 5, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Introduces support for NVFP4StaticQuantizer to handle pre-computed weight scales alongside existing dynamic quantizers. Changes distinguish between static and dynamic NVFP4 variants across export utilities, configuration definitions, and tensor scaling operations. Static quantizers bypass dynamic calibration; dynamic quantizers proceed normally.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Export utilities: modelopt/torch/export/quant_utils.py, modelopt/torch/export/unified_export_hf.py | Added recognition of NVFP4StaticQuantizer in import and weight scaling logic. Export utilities now conditionally skip dynamic scaling computation for static quantizers and use pre-computed scales instead. Config generation labels static NVFP4 as nvfp4_static, and fusion preprocessing unifies global_amax for static quantizers. |
| Quantization configuration: modelopt/torch/quantization/config.py | Introduced the new NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG constant with a static weight quantizer and a dynamic input quantizer, combined with MSE calibration and FP8 scale sweep enablement. |
| Tensor scaling logic: modelopt/torch/quantization/qtensor/nvfp4_tensor.py | Added dual-path quantizer handling: _is_static_quantizer detects static quantizers via global_amax, and the new get_weights_scaling_factor_from_quantizer method computes per-block scales for static quantizers from pre-computed values or delegates to the dynamic path, with optional FP8 quantization. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which meets the required 80.00% threshold. |
| Title check | ✅ Passed | The title 'support static NVFP4 HF export' accurately summarizes the main change, which is reflected across all modified files. |
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




Comment @coderabbitai help to get the list of available commands and usage tips.

@Fridah-nv Fridah-nv force-pushed the fridah/static-fp4-export branch from 9f69993 to df4e6a9 Compare February 5, 2026 19:02
Base automatically changed from asma/refactor-scale-sweep to main February 6, 2026 19:47
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
…FP4QTensor

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv Fridah-nv force-pushed the fridah/static-fp4-export branch from c1ea842 to e0606cb Compare February 6, 2026 22:56

codecov bot commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 19.23077% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.16%. Comparing base (d78797b) to head (9a0416c).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...odelopt/torch/quantization/qtensor/nvfp4_tensor.py | 16.00% | 21 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #858      +/-   ##
==========================================
- Coverage   73.09%   72.16%   -0.93%     
==========================================
  Files         205      207       +2     
  Lines       22301    22680     +379     
==========================================
+ Hits        16300    16367      +67     
- Misses       6001     6313     +312     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@Fridah-nv Fridah-nv marked this pull request as ready for review February 6, 2026 23:31
@Fridah-nv Fridah-nv requested review from a team as code owners February 6, 2026 23:31

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 509-520: The static NVFP4 path uses a scale computed from the
untransposed weight which will mismatch when is_bmm_expert_weight is True;
update the branch that checks is_nvfp4_static to either (1) recompute/reshape
weight_scale using the transposed weight by calling
NVFP4QTensor.get_weights_scaling_factor (or
get_weights_scaling_factor_from_quantizer) on the transposed weight before
calling to_quantized_weight, or (2) add an explicit guard that raises a clear
error when is_bmm_expert_weight and isinstance(weight_quantizer,
NVFP4StaticQuantizer) to prevent misuse; modify the code around the
is_nvfp4_static check (referencing is_bmm_expert_weight, weight_quantizer,
NVFP4StaticQuantizer, NVFP4QTensor.get_weights_scaling_factor[_from_quantizer],
and to_quantized_weight) accordingly.

In `@modelopt/torch/quantization/config.py`:
- Around line 391-411: The new NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG config is
defined but not added to the exported choices set; update the choices collection
(the variable named choices) to include "NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG"
alongside the other NVFP4 entries so it becomes discoverable by algorithms.py
(which expects all supported quantization format names in choices). Locate the
choices definition and append the new config name in the same style/ordering as
the other NVFP4_* entries.
🧹 Nitpick comments (2)
modelopt/torch/quantization/qtensor/nvfp4_tensor.py (2)

55-58: Duck-typing check vs isinstance — inconsistency with other call sites.

_is_static_quantizer uses duck typing (hasattr + is not None), but quant_utils.py and unified_export_hf.py use isinstance(weight_quantizer, NVFP4StaticQuantizer). If any non-NVFP4StaticQuantizer object happens to carry a global_amax attribute, this check could produce false positives. Consider aligning on isinstance for consistency, or documenting why duck typing is preferred here.

Proposed fix
+    from modelopt.torch.quantization.nn import NVFP4StaticQuantizer
+
     @classmethod
     def _is_static_quantizer(cls, weight_quantizer) -> bool:
         """Check if the weight quantizer is a static NVFP4 quantizer with pre-computed amax."""
-        return hasattr(weight_quantizer, "global_amax") and weight_quantizer.global_amax is not None
+        return isinstance(weight_quantizer, NVFP4StaticQuantizer) and weight_quantizer.global_amax is not None

110-130: Zero-scale handling differs between static and dynamic paths.

In the static path, per_block_scale[per_block_scale == 0] = 1.0 is applied before the 448.0 / per_block_scale_max normalization (Line 118 vs 127). In the dynamic path (get_weights_scaling_factor, Line 167), the same sentinel is applied after the normalization by weights_scaling_factor_2.

For all-zero blocks this is harmless (quantized values are 0, so the scale is irrelevant during dequant), but the resulting FP8 scale values will differ between the two paths for those blocks. This could complicate debugging or round-trip comparisons.

To align, move the zero guard after normalization:

Align zero-handling with dynamic path
             per_block_scale = per_block_amax / 6.0
-            per_block_scale[per_block_scale == 0] = 1.0
 
             # Reshape per_block_scale to match weight's block structure
             num_blocks_per_row = weight.shape[-1] // block_size
             expected_shape = (*weight.shape[:-1], num_blocks_per_row)
             per_block_scale = per_block_scale.view(expected_shape)
 
             # Quantize scales to FP8
             if not keep_high_precision:
                 per_block_scale = (per_block_scale * 448.0 / per_block_scale_max).to(
                     torch.float8_e4m3fn
                 )
+            per_block_scale_float = per_block_scale.float()
+            per_block_scale_float[per_block_scale_float == 0] = 1.0
+            per_block_scale = per_block_scale_float.to(per_block_scale.dtype)

Comment on lines 509 to 520

        # Check if this is a static NVFP4 quantizer (has pre-computed scales from MSE calibration)
        # For static NVFP4, weight_scale is already computed from static _amax values in get_weight_scaling_factor
        is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)

        if not is_nvfp4_static:
            # For dynamic NVFP4, compute scales from weights
            weight_scale = NVFP4QTensor.get_weights_scaling_factor(
                weight,
                block_size=block_size,
                weights_scaling_factor_2=weight_scale_2,
            )[0]

@coderabbitai coderabbitai bot Feb 6, 2026


⚠️ Potential issue | 🟠 Major

Potential shape mismatch for BMM-style expert weights with static NVFP4.

When is_bmm_expert_weight is True, the weight is transposed at Line 506–508 (e.g., from (E, in_dim, out_dim) to (E, out_dim, in_dim)). The dynamic path (Lines 516–520) correctly recomputes weight_scale from the transposed weight. However, the static path skips recomputation and uses the scale that was computed by get_weight_scaling_factor (Line 461) from the untransposed weight.

Since the static path in NVFP4QTensor.get_weights_scaling_factor_from_quantizer (nvfp4_tensor.py Line 121–123) reshapes the per-block scale using the weight's original shape, the scale would have shape (*untransposed_shape[:-1], num_blocks) which won't match the transposed weight layout expected by to_quantized_weight.

This would fail with a shape error if static NVFP4 quantizers are ever used with Llama4TextExperts or GptOssExperts. If that combination is currently not expected, a guard would prevent a confusing error later:

Proposed guard
         # Check if this is a static NVFP4 quantizer (has pre-computed scales from MSE calibration)
         # For static NVFP4, weight_scale is already computed from static _amax values in get_weight_scaling_factor
         is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)
 
+        if is_nvfp4_static and is_bmm_expert_weight:
+            raise NotImplementedError(
+                "Static NVFP4 quantization is not yet supported for BMM-style expert weights "
+                "(Llama4TextExperts, GptOssExperts). Use dynamic NVFP4 quantization instead."
+            )
+
         if not is_nvfp4_static:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

        # Check if this is a static NVFP4 quantizer (has pre-computed scales from MSE calibration)
        # For static NVFP4, weight_scale is already computed from static _amax values in get_weight_scaling_factor
        is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)

        if is_nvfp4_static and is_bmm_expert_weight:
            raise NotImplementedError(
                "Static NVFP4 quantization is not yet supported for BMM-style expert weights "
                "(Llama4TextExperts, GptOssExperts). Use dynamic NVFP4 quantization instead."
            )

        if not is_nvfp4_static:
            # For dynamic NVFP4, compute scales from weights
            weight_scale = NVFP4QTensor.get_weights_scaling_factor(
                weight,
                block_size=block_size,
                weights_scaling_factor_2=weight_scale_2,
            )[0]
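To make the shape concern above concrete, here is a toy NumPy illustration. The dimensions are invented for illustration, and the NVFP4 block size of 16 is assumed.

```python
import numpy as np

BLOCK = 16
E, in_dim, out_dim = 4, 64, 32  # invented expert-weight dims

w = np.zeros((E, in_dim, out_dim))

# Per-block scale shape derived from the untransposed weight (what the static path reuses):
stale_shape = (*w.shape[:-1], w.shape[-1] // BLOCK)        # (4, 64, 2)

# Per-block scale shape the transposed BMM layout actually needs:
w_t = np.swapaxes(w, 1, 2)                                 # (4, 32, 64)
needed_shape = (*w_t.shape[:-1], w_t.shape[-1] // BLOCK)   # (4, 32, 4)

# The mismatch is why a guard (or a recompute on the transposed weight) is needed.
assert stale_shape != needed_shape
```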

Collaborator

@Fridah-nv could you fix this?

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor Author

We are not supporting static NVFP4 BMM yet; it won't be triggered by the calibration algorithm. Let me add a guard here.

@Fridah-nv Fridah-nv changed the title Fridah/static fp4 export support static NVFP4 HF export Feb 7, 2026
Contributor

@Edwardf0t1 Edwardf0t1 left a comment

@Fridah-nv Could we add a sample command in the PR description and the MMLU results you have?

Also, is the checkpoint format the same comparing static NVFP4 vs our default NVFP4?

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv Fridah-nv requested a review from a team as a code owner February 7, 2026 01:09
@Fridah-nv Fridah-nv requested a review from meenchen February 7, 2026 01:09
Contributor Author

@Fridah-nv Could we add a sample command in the PR description and the MMLU results you have?

Added, thanks for reminding.

Also, is the checkpoint format the same comparing static NVFP4 vs our default NVFP4?

Yes, I checked that hf_quant_config.json and config.json are the same for the two formats.
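For reference, the check described above could be scripted roughly like this. This is a sketch: the file names come from the comment, while the helper name configs_match is made up.

```python
import json
from pathlib import Path

def configs_match(ckpt_a, ckpt_b, names=("hf_quant_config.json", "config.json")):
    """Return True if the named config files parse to identical JSON in both checkpoints."""
    return all(
        json.loads(Path(ckpt_a, name).read_text()) == json.loads(Path(ckpt_b, name).read_text())
        for name in names
    )
```

Comparing parsed JSON rather than raw bytes ignores irrelevant formatting differences between the two exports.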


# Handle NVFP4 variants (static or dynamic)
is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)
if is_nvfp4_static or quantization_format in [
Collaborator

what's the quantization_format for NVFP4StaticQuantizer? do we need is_nvfp4_static here?

Contributor Author

I added this branch because I think we don't need _ensure_weight_quantizer_calibrated for the NVFP4 static case. All NVFP4StaticQuantizer instances should be calibrated by the time of export.

Contributor Author

I'm not sure in which case _ensure_weight_quantizer_calibrated is needed. I understand activation quantizers might not be calibrated, but why might weight quantizers not be reached before export? If this is needed, I can make _ensure_weight_quantizer_calibrated work with NVFP4StaticQuantizer.

Contributor Author

I now understand the case where the weight quantizers of some experts might not be calibrated. Updated _ensure_weight_quantizer_calibrated to handle NVFP4StaticQuantizer.

# Calibrate weight quantizer if amax is not set for all NVFP4 variants
if quantization_format in [
is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)
if is_nvfp4_static or quantization_format in [
Collaborator

same

# For static NVFP4, weight_scale is already computed from static _amax values in get_weight_scaling_factor
is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)

if not is_nvfp4_static:
Collaborator

do we need to handle the else condition?

Contributor Author

The static NVFP4 case is already handled by L459-L462

        else:
            sub_module.register_buffer(
                quantizer_attrs.weight_scale, get_weight_scaling_factor(sub_module, weight_name)
            )

Which calls NVFP4QTensor.get_weights_scaling_factor_from_quantizer that handles both static and dynamic NVFP4 case.

Contributor Author

These lines (L511-L514) for handling dynamic NVFP4 could potentially be removed; I just need to double-check the dynamic BMM case.

Contributor Author

Update: removed this if branch and added a warning before the BMM quantizer handling.


    for module in modules:
        if hasattr(module.weight_quantizer, "_global_amax"):
            module.weight_quantizer._global_amax = unified_global_amax

Contributor Author

TRTLLM requires fusible linear layers to have the same weight_scale_2, so here I set global_amax to the max value across all modules being fused. It's also what the dynamic path does.
After discussing with Cursor, I think this won't have much impact on accuracy:

The quantization math:
weights_scaling_factor_2 = _global_amax / (6.0 * 448.0)
weights_scaling_factor = _amax / (6.0 * wsf2) = _amax * 448.0 / _global_amax
Effective scale used in quantization: weights_scaling_factor * weights_scaling_factor_2 = _amax / 6.0

Since the effective scale is wsf * wsf2 = _amax / 6.0, the _amax values are determined purely by the weight values themselves through MSE calibration, not by _global_amax.

The _global_amax is just a separate parameter that determines how to split the representation between the two scale factors (wsf and wsf2); since they multiply together and _global_amax cancels out, it has no effect on the actual quantization.

I think it would still have some effect, since wsf and wsf2 are kept in different precisions, but the gap should not be large. @realAsma please let me know if this makes sense to you.
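The cancellation argument can be checked numerically. This is a sketch in full precision; as noted above, in the real checkpoint wsf is stored in FP8, which is where the small residual effect comes from. The amax values below are made up.

```python
import numpy as np

amax = np.array([0.5, 1.25, 3.0])  # made-up per-block amax from MSE calibration

def effective_scale(global_amax):
    wsf2 = global_amax / (6.0 * 448.0)   # weights_scaling_factor_2
    wsf = amax / (6.0 * wsf2)            # weights_scaling_factor (= amax * 448 / global_amax)
    return wsf * wsf2                    # effective per-block scale

# Unifying global_amax across fused layers (e.g. QKV) leaves the product unchanged.
assert np.allclose(effective_scale(3.0), amax / 6.0)
assert np.allclose(effective_scale(7.5), amax / 6.0)
```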

Collaborator

@Fridah-nv are you saying that, e.g. for QKV, you keep wsf1 unchanged but pick the wsf2 of QKV to be the max of the individual wsf2 values?

Contributor Author

Only the global_amax (wsf1) needs to be unified; wsf2 won't be updated. The concern above is that the MSE algo ties wsf1 to wsf2, and I was worried that updating wsf1 alone might affect accuracy. But it looks fine both in the math and in our experiment results.



Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
…NVFP4StaticQuantizer, update changelog

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
@@ -52,12 +52,87 @@ def get_e2m1_bounds(cls, device):
cls.e2m1_bounds_on_device[device] = e2m1_bounds.to(device)


Review Summary

Overall this PR looks good. The author has made significant improvements in the latest commits to consolidate the NVFP4 logic rather than scattering it across export utilities. The unified approach follows the normal NVFP4 path convention well.

Major Points

1. Convention Alignment - GOOD
The latest refactor moves the NVFP4-specific logic into one place, which is the right approach.

2. Static vs Dynamic Path Handling
The branching to either the static path (pre-computed amax) or the dynamic path (scales computed from weights) is clean and follows existing patterns.

Responses to Existing Comments

On realAsma's config naming comment (line 391):
The rename is good: it clarifies this is W4A4 (weights 4-bit, activations 4-bit).

On realAsma's scale computation suggestion (line ~302):
The current implementation is mathematically equivalent to the suggestion, and the final form is cleaner: the simplified per-block scale is multiplied by 448 for FP8 quantization. This looks correct.

On cjluo-nv's isinstance question:
The author addressed this: the isinstance check is now sufficient, and the attribute check was removed in later commits.

On global_amax vs _amax:

  • _amax: per-block amax values (pre-computed during MSE calibration)
  • global_amax: per-tensor max for the global scale

The naming is clear.

Minor Suggestions

  1. Docstring consistency: The docstrings in the new methods are good and explain the dual path (static vs dynamic).

  2. BMM expert handling: CodeRabbit's concern about BMM-style expert weights with static NVFP4 is valid - the added warning is appropriate as a temporary measure.

  3. Test coverage: The added tests properly validate that static and dynamic paths produce identical outputs when given the same calibration data.

Conclusion

LGTM. The PR successfully:

  • Adds static NVFP4 export support following existing conventions
  • Consolidates the NVFP4 logic in one place
  • Maintains backward compatibility with dynamic quantizers
  • Includes appropriate tests
  • Addresses reviewer feedback from previous rounds

for expert_type in ["Llama4TextExperts", "GptOssExperts"]
)
if is_bmm_expert_weight and isinstance(weight_quantizer, NVFP4StaticQuantizer):
warnings.warn(
Collaborator

How about just throwing an exception here? Prefer failing fast over an unexpected accuracy bug.

- Add LTX-2 and Wan2.2 (T2V) support in the diffusers quantization workflow.
- Add PTQ support for GLM-4.7, including loading MTP layer weights from a separate ``mtp.safetensors`` file and export as-is.
- Add support for image-text data calibration in PTQ for Nemotron VL models.
- Add support for advanced weight scale search for NVFP4 quantization and its export path.
Collaborator

this is not a 0.42 feature. Please move to 0.43

Contributor

Copilot AI left a comment

Pull request overview

Adds support for exporting static NVFP4 (pre-calibrated) quantization into the unified Hugging Face checkpoint flow, enabling a deployment path for PTQ methods like MSE-based scale search.

Changes:

  • Add NVFP4 static-quantizer aware scale extraction and lazy calibration during export (quant_utils.py, nvfp4_tensor.py).
  • Introduce an NVFP4 W4A4 MSE (+ optional FP8 scale sweep) quantization config and expose it via hf_ptq.py as --qformat nvfp4_mse.
  • Expand unit/GPU tests to cover NVFP4 static quantizer export paths and unified HF export for the new qformat.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
|---|---|
| modelopt/torch/quantization/qtensor/nvfp4_tensor.py | Adds APIs to derive NVFP4 scaling factors from quantizers (static vs dynamic). |
| modelopt/torch/export/quant_utils.py | Adds lazy calibration for NVFP4StaticQuantizer, updates NVFP4 scaling-factor extraction, and threads nvfp4_static through config processing. |
| modelopt/torch/export/unified_export_hf.py | Imports NVFP4StaticQuantizer and warns about unsupported BMM expert weights for static NVFP4. |
| modelopt/torch/quantization/config.py | Adds NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG. |
| examples/llm_ptq/hf_ptq.py | Adds the nvfp4_mse qformat option mapped to the new config. |
| tests/_test_utils/torch/export/utils.py | Extends partial_nvfp4_config to quantize an additional toy layer. |
| tests/gpu/torch/export/test_export.py | Updates the expected NVFP4 exclude list to match the updated toy config. |
| tests/unit/torch/export/test_get_quantization.py | Adds a unit test covering static NVFP4 quantizer detection/config export. |
| tests/gpu/torch/export/test_export_weight_gpu.py | Adds a GPU test comparing dynamic vs static NVFP4 export results. |
| tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py | Adds unified HF export coverage for nvfp4_mse. |
| CHANGELOG.rst | Documents the new NVFP4 advanced scale search + export support. |


Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
6 participants