Conversation
📝 Walkthrough

Introduces support for NVFP4StaticQuantizer to handle pre-computed weight scales alongside existing dynamic quantizers. The changes distinguish between static and dynamic NVFP4 variants across export utilities, configuration definitions, and tensor scaling operations. Static quantizers bypass dynamic calibration; dynamic quantizers proceed normally.
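The static/dynamic split the walkthrough describes can be sketched in a few lines. This is a toy illustration with local stand-in classes, not the ModelOpt API: a static quantizer reuses its pre-computed `global_amax`, while a dynamic one derives the amax from the weights at export time.

```python
# Toy stand-ins for the quantizer classes; names mirror the PR, but the
# real ModelOpt classes have different signatures (assumption-labeled sketch).
class DynamicQuantizer:
    pass

class StaticQuantizer:
    def __init__(self, global_amax):
        self.global_amax = global_amax  # pre-computed during MSE calibration

def weight_scale(quantizer, weight):
    """Static quantizers reuse their calibrated amax; dynamic ones derive it."""
    if isinstance(quantizer, StaticQuantizer):
        return quantizer.global_amax / 6.0       # bypass dynamic calibration
    return max(abs(w) for w in weight) / 6.0     # dynamic: amax from weights

w = [1.0, -3.0, 2.0]
assert weight_scale(DynamicQuantizer(), w) == 0.5    # 3.0 / 6.0
assert weight_scale(StaticQuantizer(6.0), w) == 1.0  # 6.0 / 6.0
```

The `/ 6.0` divisor reflects the FP4 E2M1 max magnitude used throughout the discussion below.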
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes. Pre-merge checks: ✅ 3 passed.
Force-pushed from 9f69993 to df4e6a9.
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
…FP4QTensor Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Force-pushed from c1ea842 to e0606cb.
Codecov Report

❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #858 +/- ##
==========================================
- Coverage 73.09% 72.16% -0.93%
==========================================
Files 205 207 +2
Lines 22301 22680 +379
==========================================
+ Hits 16300 16367 +67
- Misses 6001 6313 +312

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 509-520: The static NVFP4 path uses a scale computed from the
untransposed weight which will mismatch when is_bmm_expert_weight is True;
update the branch that checks is_nvfp4_static to either (1) recompute/reshape
weight_scale using the transposed weight by calling
NVFP4QTensor.get_weights_scaling_factor (or
get_weights_scaling_factor_from_quantizer) on the transposed weight before
calling to_quantized_weight, or (2) add an explicit guard that raises a clear
error when is_bmm_expert_weight and isinstance(weight_quantizer,
NVFP4StaticQuantizer) to prevent misuse; modify the code around the
is_nvfp4_static check (referencing is_bmm_expert_weight, weight_quantizer,
NVFP4StaticQuantizer, NVFP4QTensor.get_weights_scaling_factor[_from_quantizer],
and to_quantized_weight) accordingly.
In `@modelopt/torch/quantization/config.py`:
- Around line 391-411: The new NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG config is
defined but not added to the exported choices set; update the choices collection
(the variable named choices) to include "NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG"
alongside the other NVFP4 entries so it becomes discoverable by algorithms.py
(which expects all supported quantization format names in choices). Locate the
choices definition and append the new config name in the same style/ordering as
the other NVFP4_* entries.
🧹 Nitpick comments (2)
modelopt/torch/quantization/qtensor/nvfp4_tensor.py (2)
55-58: Duck-typing check vs `isinstance` — inconsistency with other call sites.

`_is_static_quantizer` uses duck typing (hasattr + is not None), but `quant_utils.py` and `unified_export_hf.py` use `isinstance(weight_quantizer, NVFP4StaticQuantizer)`. If any non-`NVFP4StaticQuantizer` object happens to carry a `global_amax` attribute, this check could produce false positives. Consider aligning on `isinstance` for consistency, or documenting why duck typing is preferred here.

Proposed fix
```diff
+ from modelopt.torch.quantization.nn import NVFP4StaticQuantizer
+
  @classmethod
  def _is_static_quantizer(cls, weight_quantizer) -> bool:
      """Check if the weight quantizer is a static NVFP4 quantizer with pre-computed amax."""
-     return hasattr(weight_quantizer, "global_amax") and weight_quantizer.global_amax is not None
+     return isinstance(weight_quantizer, NVFP4StaticQuantizer) and weight_quantizer.global_amax is not None
```
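The false positive the reviewer worries about is easy to demonstrate with stand-in classes (these are local toys, not the real ModelOpt quantizers): any object that merely carries a `global_amax` attribute passes the hasattr check, while the `isinstance` variant correctly rejects it.

```python
# Toy illustration of the duck-typing false positive mentioned above;
# classes here are local stand-ins, not the real ModelOpt quantizers.
class NVFP4StaticQuantizerStub:
    def __init__(self):
        self.global_amax = 1.0

class UnrelatedQuantizer:
    def __init__(self):
        self.global_amax = 2.0  # coincidentally carries the same attribute

def is_static_duck(q):
    # hasattr-based check, as in _is_static_quantizer
    return hasattr(q, "global_amax") and q.global_amax is not None

def is_static_strict(q):
    # isinstance-based check, as in the other call sites
    return isinstance(q, NVFP4StaticQuantizerStub) and q.global_amax is not None

other = UnrelatedQuantizer()
assert is_static_duck(other)        # false positive under duck typing
assert not is_static_strict(other)  # isinstance rules it out
```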
110-130: Zero-scale handling differs between static and dynamic paths.

In the static path, `per_block_scale[per_block_scale == 0] = 1.0` is applied before the `448.0 / per_block_scale_max` normalization (Line 118 vs 127). In the dynamic path (`get_weights_scaling_factor`, Line 167), the same sentinel is applied after the normalization by `weights_scaling_factor_2`. For all-zero blocks this is harmless (quantized values are 0, so the scale is irrelevant during dequant), but the resulting FP8 scale values will differ between the two paths for those blocks. This could complicate debugging or round-trip comparisons.
To align, move the zero guard after normalization:
Align zero-handling with dynamic path
```diff
  per_block_scale = per_block_amax / 6.0
- per_block_scale[per_block_scale == 0] = 1.0
  # Reshape per_block_scale to match weight's block structure
  num_blocks_per_row = weight.shape[-1] // block_size
  expected_shape = (*weight.shape[:-1], num_blocks_per_row)
  per_block_scale = per_block_scale.view(expected_shape)
  # Quantize scales to FP8
  if not keep_high_precision:
      per_block_scale = (per_block_scale * 448.0 / per_block_scale_max).to(
          torch.float8_e4m3fn
      )
+ per_block_scale_float = per_block_scale.float()
+ per_block_scale_float[per_block_scale_float == 0] = 1.0
+ per_block_scale = per_block_scale_float.to(per_block_scale.dtype)
```
```python
# Check if this is a static NVFP4 quantizer (has pre-computed scales from MSE calibration)
# For static NVFP4, weight_scale is already computed from static _amax values in get_weight_scaling_factor
is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)

if not is_nvfp4_static:
    # For dynamic NVFP4, compute scales from weights
    weight_scale = NVFP4QTensor.get_weights_scaling_factor(
        weight,
        block_size=block_size,
        weights_scaling_factor_2=weight_scale_2,
    )[0]
```
Potential shape mismatch for BMM-style expert weights with static NVFP4.
When is_bmm_expert_weight is True, the weight is transposed at Line 506–508 (e.g., from (E, in_dim, out_dim) → (E, out_dim, in_dim)). The dynamic path (Lines 516–520) correctly recomputes weight_scale from the transposed weight. However, the static path skips recomputation and uses the scale that was computed by get_weight_scaling_factor (Line 461) from the untransposed weight.
Since the static path in NVFP4QTensor.get_weights_scaling_factor_from_quantizer (nvfp4_tensor.py Line 121–123) reshapes the per-block scale using the weight's original shape, the scale would have shape (*untransposed_shape[:-1], num_blocks) which won't match the transposed weight layout expected by to_quantized_weight.
This would fail with a shape error if static NVFP4 quantizers are ever used with Llama4TextExperts or GptOssExperts. If that combination is currently not expected, a guard would prevent a confusing error later:
Proposed guard
# Check if this is a static NVFP4 quantizer (has pre-computed scales from MSE calibration)
# For static NVFP4, weight_scale is already computed from static _amax values in get_weight_scaling_factor
is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)
+ if is_nvfp4_static and is_bmm_expert_weight:
+ raise NotImplementedError(
+ "Static NVFP4 quantization is not yet supported for BMM-style expert weights "
+ "(Llama4TextExperts, GptOssExperts). Use dynamic NVFP4 quantization instead."
+ )
+
  if not is_nvfp4_static:

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Check if this is a static NVFP4 quantizer (has pre-computed scales from MSE calibration)
# For static NVFP4, weight_scale is already computed from static _amax values in get_weight_scaling_factor
is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)
if is_nvfp4_static and is_bmm_expert_weight:
    raise NotImplementedError(
        "Static NVFP4 quantization is not yet supported for BMM-style expert weights "
        "(Llama4TextExperts, GptOssExperts). Use dynamic NVFP4 quantization instead."
    )
if not is_nvfp4_static:
    # For dynamic NVFP4, compute scales from weights
    weight_scale = NVFP4QTensor.get_weights_scaling_factor(
        weight,
        block_size=block_size,
        weights_scaling_factor_2=weight_scale_2,
    )[0]
```
We are not supporting static NVFP4 BMM yet; it won't be triggered by the calibration algorithm. Let me add a guard here.
Edwardf0t1 left a comment:
@Fridah-nv Could we add a sample command in the PR description and the MMLU results you have?
Also, is the checkpoint format the same comparing static NVFP4 vs our default NVFP4?
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Added, thanks for the reminder.
Yes, I checked that.
modelopt/torch/export/quant_utils.py (outdated)
```python
# Handle NVFP4 variants (static or dynamic)
is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)
if is_nvfp4_static or quantization_format in [
```
What's the `quantization_format` for `NVFP4StaticQuantizer`? Do we need `is_nvfp4_static` here?
I added this branch because I don't think we need `_ensure_weight_quantizer_calibrated` for the NVFP4 static case; every `NVFP4StaticQuantizer` should already be calibrated by export time.
I'm not sure in which case `_ensure_weight_quantizer_calibrated` would be effective. I understand activation quantizers might not be calibrated, but why would weight quantizers not be reached before export? If this is needed, I can make `_ensure_weight_quantizer_calibrated` work with `NVFP4StaticQuantizer`.

I now understand the case: the weight quantizers of some experts might not be calibrated. Updated `_ensure_weight_quantizer_calibrated` to handle `NVFP4StaticQuantizer`.
modelopt/torch/export/quant_utils.py (outdated)

```diff
  # Calibrate weight quantizer if amax is not set for all NVFP4 variants
- if quantization_format in [
+ is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)
+ if is_nvfp4_static or quantization_format in [
```
```python
# For static NVFP4, weight_scale is already computed from static _amax values in get_weight_scaling_factor
is_nvfp4_static = isinstance(weight_quantizer, NVFP4StaticQuantizer)

if not is_nvfp4_static:
```
do we need to handle the else condition?
The static NVFP4 case is already handled by L459-L462
```python
else:
    sub_module.register_buffer(
        quantizer_attrs.weight_scale, get_weight_scaling_factor(sub_module, weight_name)
    )
```

which calls `NVFP4QTensor.get_weights_scaling_factor_from_quantizer` and handles both the static and dynamic NVFP4 cases.
Lines L511-L514 handling dynamic NVFP4 could potentially be removed; I just need to double-check the dynamic BMM case.
Update: removed this if branch and added a warning before the BMM quantizer handling.
```python
for module in modules:
    if hasattr(module.weight_quantizer, "_global_amax"):
        module.weight_quantizer._global_amax = unified_global_amax
```
TRT-LLM requires fusible linear layers to have the same weight_scale_2, so here I set global_amax to the max value across all modules being fused. This is also what the dynamic path does.

After discussing with Cursor, I think this won't have much impact on accuracy. The quantization math:

```
weights_scaling_factor_2 (wsf2) = _global_amax / (6.0 * 448.0)
weights_scaling_factor (wsf)    = _amax / (6.0 * wsf2) = _amax * 448.0 / _global_amax
effective scale                 = wsf * wsf2 = _amax / 6.0
```

Since the effective scale is wsf * wsf2 = _amax / 6.0, the _amax values are determined purely by the weight values themselves through MSE calibration, not by _global_amax. The _global_amax is just a separate parameter that determines how to split the representation between the two scale factors (wsf and wsf2); since they multiply together, _global_amax cancels out and has no effect on the actual quantization.

I think it would still have some effect since wsf and wsf2 are kept in different precisions, but the gap should not be large. @realAsma please let me know if this makes sense to you.
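The cancellation argued here can be checked numerically. This is a plain-Python sketch with made-up scalar inputs; it ignores the FP8 rounding of the per-block scale, which is exactly the residual effect mentioned above.

```python
# Illustrative check, with hypothetical scalar inputs, that _global_amax
# cancels out of the effective NVFP4 scale (ignoring FP8 rounding of wsf).
def effective_scale(amax: float, global_amax: float) -> float:
    wsf2 = global_amax / (6.0 * 448.0)  # weights_scaling_factor_2
    wsf = amax / (6.0 * wsf2)           # = amax * 448.0 / global_amax
    return wsf * wsf2                   # algebraically amax / 6.0

s_before = effective_scale(amax=2.0, global_amax=10.0)
s_after = effective_scale(amax=2.0, global_amax=40.0)  # unified to the fused max
assert abs(s_before - s_after) < 1e-9
assert abs(s_before - 2.0 / 6.0) < 1e-9
```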
@Fridah-nv Are you saying that, e.g. for QKV, you keep wsf1 unchanged but just pick the wsf2 of QKV to be the max of the individual wsf2 values?
Only the global_amax (wsf1) needs to be unified; wsf2 won't be updated. The concern above is that the MSE algorithm ties wsf1 to wsf2, and I was worried that updating wsf1 alone might affect accuracy. But it looks fine both in the math and in our experiment results.
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
…NVFP4StaticQuantizer, update changelog Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
@@ -52,12 +52,87 @@ def get_e2m1_bounds(cls, device):
    cls.e2m1_bounds_on_device[device] = e2m1_bounds.to(device)
Review Summary

Overall this PR looks good. The author has made significant improvements in the latest commits to consolidate NVFP4 logic into one place rather than scattering it across export utilities. The unified approach follows the normal NVFP4 path convention well.

Major Points

1. Convention alignment - GOOD ✅
The latest refactor centralizes NVFP4-specific behavior in one place, which is the right approach.

2. Static vs dynamic path handling ✅
The `isinstance` check followed by branching to either the static (pre-computed amax) or dynamic (compute from weights) path is clean and follows existing patterns.
Responses to Existing Comments
On realAsma's config naming comment (line 391):
The rename to NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG is good - it clarifies this is W4A4 (weights 4-bit, activations 4-bit).
On realAsma's scale computation suggestion (line ~302):
The current implementation uses:
This is mathematically equivalent to the suggestion and the final form is cleaner. The simplification works out to , then multiplied by 448 for FP8 quantization. This looks correct.
On cjluo-nv's isinstance question:
The author addressed this: `isinstance` is now sufficient, and the attribute check was removed in later commits.
On global_amax vs _amax:
- `_amax`: per-block amax values (pre-computed during MSE calibration)
- `_global_amax`: per-tensor max for the global scale

The naming is clear.
Minor Suggestions

- Docstring consistency: the docstrings in the new methods are good and explain the dual path (static vs dynamic).
- BMM expert handling: CodeRabbit's concern about BMM-style expert weights with static NVFP4 is valid; the warning is appropriate as a temporary measure.
- Test coverage: the added tests properly validate that static and dynamic paths produce identical outputs when given the same calibration data.
Conclusion
LGTM. The PR successfully:
- Adds static NVFP4 export support following existing conventions
- Consolidates NVFP4 logic
- Maintains backward compatibility with dynamic quantizers
- Includes appropriate tests
- Addresses reviewer feedback from previous rounds
```python
    for expert_type in ["Llama4TextExperts", "GptOssExperts"]
)
if is_bmm_expert_weight and isinstance(weight_quantizer, NVFP4StaticQuantizer):
    warnings.warn(
```
How about just throwing an exception here? Prefer failing fast over an unexpected accuracy bug.
- Add LTX-2 and Wan2.2 (T2V) support in the diffusers quantization workflow.
- Add PTQ support for GLM-4.7, including loading MTP layer weights from a separate ``mtp.safetensors`` file and export as-is.
- Add support for image-text data calibration in PTQ for Nemotron VL models.
- Add support for advanced weight scale search for NVFP4 quantization and its export path.
This is not a 0.42 feature. Please move it to 0.43.
Pull request overview
Adds support for exporting static NVFP4 (pre-calibrated) quantization into the unified Hugging Face checkpoint flow, enabling a deployment path for PTQ methods like MSE-based scale search.
Changes:
- Add NVFP4 static-quantizer-aware scale extraction and lazy calibration during export (`quant_utils.py`, `nvfp4_tensor.py`).
- Introduce an NVFP4 W4A4 MSE (+ optional FP8 scale sweep) quantization config and expose it via `hf_ptq.py` as `--qformat nvfp4_mse`.
- Expand unit/GPU tests to cover NVFP4 static quantizer export paths and unified HF export for the new qformat.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| modelopt/torch/quantization/qtensor/nvfp4_tensor.py | Adds APIs to derive NVFP4 scaling factors from quantizers (static vs dynamic). |
| modelopt/torch/export/quant_utils.py | Adds lazy calibration for NVFP4StaticQuantizer, updates NVFP4 scaling-factor extraction, and threads nvfp4_static through config processing. |
| modelopt/torch/export/unified_export_hf.py | Imports NVFP4StaticQuantizer and warns about unsupported BMM expert weights for static NVFP4. |
| modelopt/torch/quantization/config.py | Adds NVFP4_W4A4_WEIGHT_MSE_FP8_SWEEP_CFG. |
| examples/llm_ptq/hf_ptq.py | Adds nvfp4_mse qformat option mapped to the new config. |
| tests/_test_utils/torch/export/utils.py | Extends partial_nvfp4_config to quantize an additional toy layer. |
| tests/gpu/torch/export/test_export.py | Updates expected NVFP4 exclude list to match the updated toy config. |
| tests/unit/torch/export/test_get_quantization.py | Adds a unit test covering static NVFP4 quantizer detection/config export. |
| tests/gpu/torch/export/test_export_weight_gpu.py | Adds a GPU test comparing dynamic vs static NVFP4 export results. |
| tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py | Adds unified HF export coverage for nvfp4_mse. |
| CHANGELOG.rst | Documents the new NVFP4 advanced scale search + export support. |
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
What does this PR do?

Type of change: new feature

Overview: Supports exporting `NVFP4StaticQuantizer` in the unified Hugging Face checkpoint, as a deployment path for PTQ algorithms such as MSE.

Usage

Testing

Tested the generated Qwen3 8B checkpoint with trtllm serve and the nv_eval example in `Model-Optimizer-Internal/examples/nv_eval`.

NV eval results:
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Performance Improvements