Support mixed-precision per layer quant config in config.json #929

Edwardf0t1 merged 3 commits into main.
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Important: Review skipped. Auto incremental reviews are disabled on this repository.
📝 Walkthrough

Introduces support for the MIXED_PRECISION quantization algorithm in Hugging Face config conversion by adding a helper function that maps quantization algorithms to group configurations, and by enhancing the main converter to aggregate layers by their quantization configs into multiple config groups.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
🧹 Nitpick comments (1)
modelopt/torch/export/convert_hf_config.py (1)
162-181: DRY: Existing FP8/NVFP4 branches duplicate the new helper.

The inline config dicts for FP8 (lines 163-166) and NVFP4 (lines 171-178) duplicate the logic in `_quant_algo_to_group_config`. Reusing the helper keeps both paths consistent and avoids future drift (e.g., if the FP8 config shape changes, you'd need to update two places).

Suggested refactor
```diff
     if quant_algo_value == "FP8":
-        config_group_details = {
-            "input_activations": {"dynamic": False, "num_bits": 8, "type": "float"},
-            "weights": {"dynamic": False, "num_bits": 8, "type": "float"},
-            "targets": ["Linear"],
-        }
+        config_group_details = _quant_algo_to_group_config("FP8")
+        config_group_details["targets"] = ["Linear"]
         new_config["config_groups"] = {"group_0": config_group_details}
     elif quant_algo_value == "NVFP4":
         group_size = original_quantization_details.get("group_size", 16)
-        config_group_details = {
-            "input_activations": {
-                "dynamic": False,
-                "num_bits": 4,
-                "type": "float",
-                "group_size": group_size,
-            },
-            "weights": {"dynamic": False, "num_bits": 4, "type": "float", "group_size": group_size},
-            "targets": ["Linear"],
-        }
+        config_group_details = _quant_algo_to_group_config("NVFP4", group_size)
+        config_group_details["targets"] = ["Linear"]
         new_config["config_groups"] = {"group_0": config_group_details}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/convert_hf_config.py` around lines 162 - 181, The FP8 and NVFP4 branches duplicate the config construction; replace the inline dicts with calls to the existing helper _quant_algo_to_group_config to avoid duplication: in the block that checks quant_algo_value == "FP8" and the block for "NVFP4" use _quant_algo_to_group_config(quant_algo_value, original_quantization_details) (or pass any required args used by the helper) to produce config_group_details and then set new_config["config_groups"] = {"group_0": config_group_details}; ensure you preserve group_size handling by relying on the helper’s logic rather than rebuilding the dict inline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/convert_hf_config.py`:
- Around line 54-71: The branch handling quant_algo in ("NVFP4_AWQ",
"W4A16_AWQ", "W4A8_AWQ") computes act_bits incorrectly by checking only "A8" in
quant_algo; update the act_bits logic to detect A16 explicitly (or parse the
numeric suffix after 'A') so "W4A16_AWQ" yields act_bits=16, "W4A8_AWQ" yields 8
and fall back to 4 otherwise; change the calculation near the quant_algo
variable in convert_hf_config.py and keep the returned dict structure unchanged.
- Around line 103-105: The fallback branch in convert_hf_config (the final else
that currently returns {"quant_algo": quant_algo}) produces a config shape
inconsistent with other branches (which return {"input_activations": {...},
"weights": {...}}) and can break downstream parsing; replace that silent
fallback with an explicit error: in the else branch (where quant_algo is
unsupported, e.g., during MIXED_PRECISION), raise a ValueError (including
quant_algo in the message) to refuse unsupported algorithms rather than
returning a structurally incompatible dict so callers must handle or convert to
a valid config.
- Around line 188-191: The grouping key creation using
tuple(sorted(layer_cfg.items())) will fail if layer_cfg contains nested
dicts/lists; replace this fragile approach by serializing layer_cfg to a
canonical JSON string (e.g., using json.dumps with sort_keys=True and compact
separators) to produce a stable, hashable key for algo_to_layers; update
convert_hf_config.py to import json and compute key = json.dumps(layer_cfg,
sort_keys=True, separators=(',', ':'), ensure_ascii=False) (or fall back to
repr(layer_cfg) if json serialization fails) when iterating over
quantized_layers to preserve deterministic grouping.
- Around line 198-201: The MIXED_PRECISION branch is incorrectly assigning layer
names (layer_names) to group_config["targets"] (used for module class matching)
which breaks compressed-tensors; change the assignment so
group_config["targets"] contains module class names (e.g.,
module.__class__.__name__) derived from the layer configs instead of the layer
keys, and if you still need the original layer name list preserve it under a
separate field (e.g., keep your existing quantized_layers or add
quantized_layer_names) so group_config (from _quant_algo_to_group_config)
retains class-name targets while layer_name lists live in a distinct key.
```python
for layer_name, layer_cfg in quantized_layers.items():
    # Create a hashable key from the layer config
    key = tuple(sorted(layer_cfg.items()))
    algo_to_layers[key].append(layer_name)
```
`tuple(sorted(layer_cfg.items()))` will crash if any layer config value is a dict or list.

If a `quantized_layers` entry ever contains a nested structure (e.g., `{"quant_algo": "X", "extra": {"key": "val"}}`), `tuple(sorted(...))` will raise `TypeError` because dicts/lists aren't hashable. Currently the expected payloads are flat, but this is fragile.

A safer approach is to serialize the config to a canonical JSON string for the grouping key:
Suggested defensive fix

```diff
+import json
 ...
     for layer_name, layer_cfg in quantized_layers.items():
-        # Create a hashable key from the layer config
-        key = tuple(sorted(layer_cfg.items()))
+        # Create a hashable key from the layer config (handles nested values)
+        key = json.dumps(layer_cfg, sort_keys=True)
         algo_to_layers[key].append(layer_name)
 ...
     for idx, (config_key, layer_names) in enumerate(algo_to_layers.items()):
-        layer_cfg = dict(config_key)
+        layer_cfg = json.loads(config_key)
```

🤖 Prompt for AI Agents
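As a self-contained illustration of the serialized-key approach (layer names and configs below are hypothetical, not taken from the PR), grouping stays deterministic even when a layer config carries nested values:

```python
import json
from collections import defaultdict

# Hypothetical per-layer configs; "extra" shows a nested value that would
# make tuple(sorted(cfg.items())) raise TypeError (dicts aren't hashable).
quantized_layers = {
    "model.layers.0.q_proj": {"quant_algo": "FP8"},
    "model.layers.1.q_proj": {"quant_algo": "FP8"},
    "model.layers.2.q_proj": {
        "quant_algo": "NVFP4",
        "group_size": 16,
        "extra": {"key": "val"},
    },
}

algo_to_layers: dict[str, list[str]] = defaultdict(list)
for layer_name, layer_cfg in quantized_layers.items():
    # A canonical JSON string is hashable and handles nested dicts/lists.
    key = json.dumps(layer_cfg, sort_keys=True, separators=(",", ":"))
    algo_to_layers[key].append(layer_name)

# Recover each group's config with json.loads when building config_groups.
groups = {json.loads(k)["quant_algo"]: v for k, v in algo_to_layers.items()}
# Two groups: FP8 (two layers) and NVFP4 (one layer).
```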
```python
group_config = _quant_algo_to_group_config(algo, layer_group_size)
group_config["targets"] = sorted(layer_names)
config_groups[f"group_{idx}"] = group_config
```
🧩 Analysis chain

🏁 Scripts executed (Repository: NVIDIA/Model-Optimizer):

```shell
# First, let's examine the file to understand the full context
wc -l modelopt/torch/export/convert_hf_config.py

# Read the relevant section around lines 198-201 plus surrounding context
sed -n '170,230p' modelopt/torch/export/convert_hf_config.py

# Check how _quant_algo_to_group_config is defined and what it returns for different algos
rg -n "_quant_algo_to_group_config" -A 20 -B 5 modelopt/torch/export/convert_hf_config.py

# Search for "targets" usage in the file to understand the semantics
rg -n "targets" -B 3 -A 3 modelopt/torch/export/convert_hf_config.py

# Check if compressed-tensors is available and how it uses "targets"
fd -e py -e md | xargs rg "targets" | grep -i "quant\|config" | head -30

# Check how quantized_layers is structured - trace back where it comes from
rg -n "quantized_layers" -B 5 -A 5 modelopt/torch/export/convert_hf_config.py | head -80

# Search for where quantized_layers gets populated in the original data
rg -rn "quantized_layers" --type py modelopt/torch/export/ | grep -v "Binary" | head -20

# Check the structure of quantized_layers in the input to understand what layer_name looks like
rg -rn "quantized_layers" --type py modelopt/torch/ -B 3 -A 3 | grep -A 10 "layer_name" | head -40

# Search for test files or examples that show the structure of quantized_layers
fd -e py -path "*/test*" | xargs rg "quantized_layers" -B 2 -A 2 | head -50

# Look at more of the test to see the full context of quantized_layers structure
sed -n '1,100p' tests/gpu/torch/export/test_export.py | grep -A 20 "quantized_layers"

# Check if there's any downstream code that validates or processes "targets" field
rg -rn "targets" --type py modelopt/ | grep -E "(config|quant)" | head -20

# Check the exact current state of lines around 200 to confirm if it's "targets" or "n"
sed -n '195,210p' modelopt/torch/export/convert_hf_config.py

# Also check lines 163-180 to see what the FP8/NVFP4 branches use
sed -n '160,182p' modelopt/torch/export/convert_hf_config.py

# Check the docstring that mentions the field name
sed -n '110,125p' modelopt/torch/export/convert_hf_config.py
```
`targets` in the MIXED_PRECISION branch contains layer names instead of module types, breaking quantization matching.

In the FP8 and NVFP4 branches, `targets` correctly holds module class names (e.g., `["Linear"]`), which compressed-tensors matches against `module.__class__.__name__`. However, the MIXED_PRECISION branch assigns layer names (e.g., `"layer1"`, `"layer3"`) to `targets` on line 200, which violates the documented contract in lines 115–117 and produces non-functional configs.

compressed-tensors will fail to match these layer names against module types. Either extract module class names from the layer configs, or move the layer name list to a separate field, as is already done for `quantized_layers` (line 205).
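A minimal sketch of the fix being suggested. The helper name, the `"Linear"` target, and the `quantized_layer_names` key are illustrative assumptions, not the PR's actual code: `targets` stays a module-class list while layer names move to a distinct key.

```python
def build_mixed_precision_group(layer_names: list[str], group_config: dict) -> dict:
    """Hypothetical grouping step: keep `targets` as module class names so
    compressed-tensors can match them against module.__class__.__name__."""
    # "Linear" is assumed here, mirroring the FP8/NVFP4 branches.
    group_config["targets"] = ["Linear"]
    # Preserve the per-layer name list under a separate, distinct key.
    group_config["quantized_layer_names"] = sorted(layer_names)
    return group_config

cfg = build_mixed_precision_group(
    ["model.layers.1.q_proj", "model.layers.0.q_proj"],
    {"weights": {"dynamic": False, "num_bits": 8, "type": "float"}},
)
```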
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #929      +/-   ##
==========================================
+ Coverage   72.07%   72.09%   +0.01%
==========================================
  Files         207      207
  Lines       22691    22691
==========================================
+ Hits        16355    16358      +3
+ Misses       6336     6333      -3
```

☔ View full report in Codecov by Sentry.
Pull request overview

Adds support for exporting mixed-precision, per-layer quantization settings into the Hugging Face config.json quantization config (via `convert_hf_quant_config_format`), aligning it with the compressed-tensors/llm-compressor-style `config_groups` layout while preserving per-layer detail.

Changes:
- Introduces `_quant_algo_to_group_config` to map per-layer `quant_algo` settings to `config_groups` entries.
- Adds a new `MIXED_PRECISION` conversion path that groups layers by identical per-layer quantization configs and emits multiple `config_groups`.
- Preserves the original per-layer mapping under `quantized_layers` for consumers needing full detail.
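For orientation, a config produced by a MIXED_PRECISION conversion along these lines might be shaped as follows. The layer names, bit-widths, and group assignments are invented for illustration, not taken from the PR:

```python
import json

# Illustrative output shape only: two distinct per-layer configs become two
# config_groups, and the raw per-layer mapping is preserved separately.
new_config = {
    "config_groups": {
        "group_0": {
            "input_activations": {"dynamic": False, "num_bits": 8, "type": "float"},
            "weights": {"dynamic": False, "num_bits": 8, "type": "float"},
            "targets": ["model.layers.0.self_attn.q_proj"],
        },
        "group_1": {
            "input_activations": {"dynamic": False, "num_bits": 4, "type": "float", "group_size": 16},
            "weights": {"dynamic": False, "num_bits": 4, "type": "float", "group_size": 16},
            "targets": ["model.layers.1.mlp.down_proj"],
        },
    },
    "quantized_layers": {
        "model.layers.0.self_attn.q_proj": {"quant_algo": "FP8"},
        "model.layers.1.mlp.down_proj": {"quant_algo": "NVFP4", "group_size": 16},
    },
}
print(json.dumps(new_config, indent=2))
```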
```python
elif quant_algo_value == "MIXED_PRECISION":
    quantized_layers = original_quantization_details.get("quantized_layers", {})

    # Group layers by their unique quantization config so each distinct
    # (quant_algo, group_size, ...) combination becomes one config_group.
    algo_to_layers: dict[tuple, list[str]] = defaultdict(list)
    for layer_name, layer_cfg in quantized_layers.items():
        # Create a hashable key from the layer config
        key = tuple(sorted(layer_cfg.items()))
        algo_to_layers[key].append(layer_name)

    config_groups: dict[str, Any] = {}
    for idx, (config_key, layer_names) in enumerate(algo_to_layers.items()):
        layer_cfg = dict(config_key)
        algo = layer_cfg.get("quant_algo", "")
        layer_group_size = layer_cfg.get("group_size")

        group_config = _quant_algo_to_group_config(algo, layer_group_size)
        group_config["targets"] = sorted(layer_names)
        config_groups[f"group_{idx}"] = group_config

    new_config["config_groups"] = config_groups
    # Preserve the full per-layer detail for consumers that need it.
    new_config["quantized_layers"] = quantized_layers
```
The new MIXED_PRECISION conversion branch introduces non-trivial behavior (layer grouping, `config_groups` synthesis, preserving `quantized_layers`), but there isn't any automated test coverage for it. Since `convert_hf_quant_config_format` is already used in tests for existing FP8 export behavior, it would be good to add a unit test that asserts: (1) grouping creates the expected number of groups for a mixed config, and (2) per-layer fields like `group_size` are reflected correctly in the produced group configs.
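A test along these lines could exercise both assertions. Since the exact signature of `convert_hf_quant_config_format` isn't shown here, the sketch runs against a stand-in reimplementation of the grouping step; a real test would call the exported function instead:

```python
import json
from collections import defaultdict


def group_mixed_precision(quantized_layers: dict) -> dict:
    """Stand-in for the converter's MIXED_PRECISION grouping step."""
    algo_to_layers = defaultdict(list)
    for layer_name, layer_cfg in quantized_layers.items():
        key = json.dumps(layer_cfg, sort_keys=True)
        algo_to_layers[key].append(layer_name)
    config_groups = {}
    for idx, (config_key, layer_names) in enumerate(algo_to_layers.items()):
        layer_cfg = json.loads(config_key)
        config_groups[f"group_{idx}"] = {
            "quant_algo": layer_cfg.get("quant_algo", ""),
            "group_size": layer_cfg.get("group_size"),
            "targets": sorted(layer_names),
        }
    return config_groups


def test_mixed_precision_grouping():
    layers = {
        "l0": {"quant_algo": "FP8"},
        "l1": {"quant_algo": "NVFP4", "group_size": 16},
        "l2": {"quant_algo": "FP8"},
    }
    groups = group_mixed_precision(layers)
    # (1) identical configs collapse into a single group
    assert len(groups) == 2
    # (2) per-layer fields like group_size survive into the group config
    nvfp4 = next(g for g in groups.values() if g["quant_algo"] == "NVFP4")
    assert nvfp4["group_size"] == 16
    assert nvfp4["targets"] == ["l1"]


test_mixed_precision_grouping()
```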
```python
# Group layers by their unique quantization config so each distinct
# (quant_algo, group_size, ...) combination becomes one config_group.
algo_to_layers: dict[tuple, list[str]] = defaultdict(list)
```
Type annotation `algo_to_layers: dict[tuple, list[str]]` uses `tuple` without type parameters. With the repo's strict mypy settings, this is typically flagged as "Missing type parameters for generic type 'tuple'". Consider tightening this to the actual key type being used here (e.g., `tuple[tuple[str, Any], ...]`) so type checking stays clean.

```diff
-    algo_to_layers: dict[tuple, list[str]] = defaultdict(list)
+    algo_to_layers: dict[tuple[tuple[str, Any], ...], list[str]] = defaultdict(list)
```
```python
elif quant_algo in ("NVFP4_AWQ", "W4A16_AWQ", "W4A8_AWQ"):
    gs = group_size or 128
    weight_bits = 4
    act_bits = 8 if "A8" in quant_algo else 4
    return {
        "input_activations": {
            "dynamic": False,
            "num_bits": act_bits,
```
In `_quant_algo_to_group_config`, `W4A16_AWQ` is treated the same as `NVFP4_AWQ`/`W4A8_AWQ` and assigns `input_activations` bits (`act_bits` becomes 4 here). But `W4A16_*` implies weight-only quantization with unquantized (16-bit) activations (see `INT4_AWQ_CFG`, where `*input_quantizer` is disabled). This mapping will incorrectly advertise activation quantization in the exported `config_groups`. Consider handling `W4A16_AWQ` as a separate case (no `input_activations`, and an appropriate weight config) instead of grouping it with the A8/NVFP4 AWQ variants.
```python
algo = layer_cfg.get("quant_algo", "")
layer_group_size = layer_cfg.get("group_size")

group_config = _quant_algo_to_group_config(algo, layer_group_size)
```
The function docstring notes that `targets` are PyTorch module types (e.g., "Linear" via `module.__class__.__name__`), but the new MIXED_PRECISION path sets `targets` to fully-qualified layer names (e.g., `model.layers.0.self_attn.q_proj`). Either update the docstring/commentary to reflect the MIXED_PRECISION behavior, or switch MIXED_PRECISION `targets` to the same targeting semantics used by the FP8/NVFP4 branches to avoid confusing downstream consumers.

```diff
     group_config = _quant_algo_to_group_config(algo, layer_group_size)
+    # Note: in the MIXED_PRECISION path, `targets` is a list of fully-qualified
+    # layer *names* (e.g., "model.layers.0.self_attn.q_proj"), not module
+    # type names (e.g., "Linear" from `module.__class__.__name__`). This
+    # differs from other branches (such as FP8/NVFP4), and consumers that
+    # need per-layer details should use `quantized_layers` instead.
```
```python
elif quant_algo == "NVFP4":
    gs = group_size or 16
    return {
        "input_activations": {
            "dynamic": False,
            "num_bits": 4,
            "type": "float",
            "group_size": gs,
        },
        "weights": {"dynamic": False, "num_bits": 4, "type": "float", "group_size": gs},
    }
elif quant_algo in ("NVFP4_AWQ", "W4A16_AWQ", "W4A8_AWQ"):
    gs = group_size or 128
    weight_bits = 4
```
`_quant_algo_to_group_config` uses `gs = group_size or <default>`, which treats 0 as "missing" and silently replaces it with the default. Elsewhere in this file (e.g., the NVFP4 non-mixed path), `group_size` is read with `.get(..., default)` and would preserve an explicit 0. For consistency, and to avoid hiding invalid inputs, consider checking `group_size is None` (or explicitly validating `group_size > 0`) instead of relying on truthiness.
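A small sketch of the None-check this comment suggests (the helper name is illustrative, not from the PR): only `None` falls back to the default, while an explicit 0 is rejected rather than silently replaced.

```python
def resolve_group_size(group_size, default: int) -> int:
    # Treat only None as "missing"; `group_size or default` would also
    # swallow an explicit 0, hiding an invalid input.
    if group_size is None:
        return default
    if group_size <= 0:
        raise ValueError(f"group_size must be positive, got {group_size}")
    return group_size
```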
```python
elif quant_algo in ("NVFP4_AWQ", "W4A16_AWQ", "W4A8_AWQ"):
    gs = group_size or 128
    weight_bits = 4
    act_bits = 8 if "A8" in quant_algo else 4
```
Is this a bug?

`"A8"` is not in `"W4A16_AWQ"`, so `act_bits` falls back to 4, but for W4A16 the activation bits should be 16.
It's already addressed in the latest commit.
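One robust way to derive the activation width, as the earlier AI prompt suggests, is to parse the numeric suffix after "A" instead of using substring checks. This is a sketch of that idea, not the PR's actual fix:

```python
import re

def act_bits_from_algo(quant_algo: str, default: int = 4) -> int:
    """Extract activation bit-width from names like W4A8_AWQ or W4A16_AWQ."""
    # Match "A" followed by digits; names without such a suffix (e.g.,
    # "NVFP4_AWQ", where "A" in "_AWQ" is not digit-followed) use the default.
    m = re.search(r"A(\d+)", quant_algo)
    return int(m.group(1)) if m else default

# "W4A16_AWQ" → 16, "W4A8_AWQ" → 8, "NVFP4_AWQ" → 4 (fallback)
```

Note that per the W4A16 review comment above, a width of 16 should then be treated as weight-only quantization (no `input_activations` entry) rather than emitted as a quantized-activation config.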
Force-pushed from 77b54c5 to 8852b12 (compare)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Force-pushed from 8852b12 to 92a3fa7 (compare)
What does this PR do?
Type of change: ?
Overview: Support mixed-precision per-layer quant config in config.json, since it's the first-class source of truth in deployment frameworks.
Usage
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Release Notes