
add triton-fp8w8a8g128 quant type. #1214

Merged
hiworldwzj merged 8 commits into main from wzj_dev
Feb 26, 2026

Conversation

@hiworldwzj
Collaborator

Add a new quantization method: activations use per-group quantization, and weights use per-channel quantization.

@gemini-code-assist
Contributor

Summary of Changes

Hello @hiworldwzj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request extends LightLLM's quantization capabilities by introducing two new FP8 quantization methods, triton-fp8w8a8g128 and triton-fp8w8a8g64. They combine per-channel quantization of weights with per-group quantization of activations, offering finer-grained control and potential performance gains. The core changes add Triton kernels for weight quantization and for the mixed-granularity matrix multiplication, extend the existing activation quantization kernel, and update the documentation and command-line interface to cover the new options.

Highlights

  • New quantization methods: two new Triton FP8 quantization types, triton-fp8w8a8g128 and triton-fp8w8a8g64, supporting activation per-group quantization with group sizes 128 and 64 respectively, while weights use per-channel quantization.
  • New Triton kernels: added fp8w8a8_perchannel_quant_kernel.py for per-channel weight quantization, and scaled_mm_per_token_group_quant_kernel.py implementing the mixed matrix multiplication with per-group activations and per-channel weights.
  • Activation quantization kernel improvement: the existing fp8act_quant_kernel.py gains a NEED_MASK parameter on its _per_token_group_quant_fp8 function, enabling conditional masking when loading data for greater flexibility.
  • Documentation and CLI updates: the Chinese and English api_server_args.rst docs and the command-line help text in api_cli.py now cover the newly added quantization methods.
  • Quantization method registration: the FP8w8a8g128QuantizationMethod and FP8w8a8g64QuantizationMethod classes are registered in the lightllm.common.quantization module, integrating them into LightLLM's quantization framework.
  • Bias assertions: the existing fp8w8a8 and fp8w8a8-b128 quantization methods now assert that adding a bias term is not supported.
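The scheme the highlights describe (per-channel FP8 scales for weights, per-token per-group FP8 scales for activations) can be sketched in plain NumPy. This is an illustrative reference only, not the PR's Triton kernels; the function names and the symmetric abs-max scaling are assumptions, and the final cast to float8 is simulated by clamping to the e4m3 range.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn

def quant_weight_per_channel(w):
    """Per-channel quantization: one scale per output row of a 2D weight."""
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / FP8_E4M3_MAX
    # On GPU this would be cast to torch.float8_e4m3fn; here we only rescale/clamp.
    w_q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return w_q, scale  # shapes: (N, K) and (N, 1)

def quant_act_per_group(x, group_size=128):
    """Per-token, per-group quantization: one scale per contiguous group of features."""
    m, k = x.shape
    assert k % group_size == 0
    g = x.reshape(m, k // group_size, group_size)
    scale = np.maximum(np.abs(g).max(axis=-1, keepdims=True), 1e-12) / FP8_E4M3_MAX
    x_q = np.clip(g / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX).reshape(m, k)
    return x_q, scale.reshape(m, k // group_size)  # shapes: (M, K) and (M, K // group_size)
```

With group size 128 this mirrors triton-fp8w8a8g128; passing group_size=64 mirrors the g64 variant.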


Changelog
  • docs/CN/source/tutorial/api_server_args.rst
    • Added Chinese descriptions for the triton-fp8w8a8g128 and triton-fp8w8a8g64 quantization methods.
  • docs/EN/source/tutorial/api_server_args.rst
    • Added English descriptions for the triton-fp8w8a8g128 and triton-fp8w8a8g64 quantization methods.
  • lightllm/common/basemodel/triton_kernel/quantization/fp8act_quant_kernel.py
    • Modified the _per_token_group_quant_fp8 function to accept a NEED_MASK parameter for conditional masking.
    • Passed the NEED_MASK argument in the lightllm_per_token_group_quant_fp8 call.
  • lightllm/common/basemodel/triton_kernel/quantization/fp8w8a8_perchannel_quant_kernel.py
    • Added the weight_quant_kernel Triton kernel implementing per-channel weight quantization.
    • Added the mm_weight_quant and weight_quant Python functions wrapping the Triton kernel for batched weight quantization.
  • lightllm/common/basemodel/triton_kernel/quantization/scaled_mm_per_token_group_quant_kernel.py
    • Added the grouped_launch function for the Triton kernel's grouped launch pattern.
    • Added the _scaled_mm_act_per_group_w_perchannel_kernel Triton kernel implementing the mixed matrix multiplication with per-group activations and per-channel weights.
    • Added the scaled_mm_act_per_group_w_perchannel Python wrapper, integrating autotuning and TMA support.
    • Included an if __name__ == '__main__': block for testing and benchmarking the new kernel.
  • lightllm/common/quantization/__init__.py
    • Imported the new w8a8gx module to register the new quantization methods.
  • lightllm/common/quantization/w8a8.py
    • Added an assertion in the apply method that bias is unsupported for fp8w8a8 quantization.
    • Updated the import path of weight_quant in the quantize method.
    • Added an assertion in the apply method that bias is unsupported for fp8w8a8-b128 quantization.
  • lightllm/common/quantization/w8a8gx.py
    • Added the _BaseQuantizationMethod abstract base class.
    • Added the FP8w8a8g128QuantizationMethod class implementing weight per-channel quantization and activation per-group 128 quantization.
    • Added the FP8w8a8g64QuantizationMethod class, inheriting from FP8w8a8g128QuantizationMethod with the activation group size set to 64.
  • lightllm/server/api_cli.py
    • Updated the --quant_method command-line help text to include the triton-fp8w8a8g128 and triton-fp8w8a8g64 options.
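For reference, the math that the new scaled_mm_act_per_group_w_perchannel kernel computes can be written out in a few lines of NumPy. This is a hedged sketch of the dequantize-and-multiply semantics only (the real Triton kernel fuses the scaling into the matmul); the function name and argument shapes are assumptions based on the summary above.

```python
import numpy as np

def scaled_mm_reference(x_q, x_scale, w_q, w_scale, group_size=128):
    """Reference for a mixed-granularity scaled matmul.

    x_q:     (M, K) quantized activations
    x_scale: (M, K // group_size) per-token per-group activation scales
    w_q:     (N, K) quantized weights
    w_scale: (N, 1) per-channel weight scales
    Returns y = (x_q * x_scale) @ (w_q * w_scale).T, shape (M, N).
    """
    # Expand each per-group scale over its group of K columns.
    xs_full = np.repeat(x_scale, group_size, axis=1)  # (M, K)
    x_deq = x_q * xs_full
    w_deq = w_q * w_scale                             # (N, K), broadcast over rows
    return x_deq @ w_deq.T
```

A production kernel would instead accumulate int/fp8 products per group and apply both scales inside the K loop, but the result is numerically the same dequantized matmul.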

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces a new FP8 quantization method, triton-fp8w8a8g128, in which weights use per-channel quantization and activations use per-group quantization, along with a corresponding g64 variant. The implementation is solid overall, covering the new Triton kernels, the quantization method classes, and the matching documentation and command-line updates. I have a few suggestions:

  1. Improve the performance of the 3D-tensor path in the weight_quant function.
  2. Make the descriptions of triton-fp8w8a8g64 in the docs and CLI help clearer and more precise.

These changes will help improve performance and maintainability.

Comment on lines +41 to +49

    if x.dim() == 3:
        y_quant = torch.empty((x.shape[0], x.shape[1], x.shape[2]), dtype=torch.float8_e4m3fn, device=x.device)
        s_scales = torch.empty((x.shape[0], x.shape[1], 1), dtype=torch.float32, device=x.device)
        for i in range(x.shape[0]):
            y_quant[i], s_scales[i] = mm_weight_quant(x[i])
        return y_quant, s_scales
    else:
        y_quant, s_scales = mm_weight_quant(x)
        return y_quant, s_scales
high

The for loop in the 3D-tensor path launches a separate CUDA kernel for each batch item, which is inefficient for large batches.
To improve performance, reshape the 3D tensor to 2D, quantize it in a single batched call, and reshape the results back to 3D. This greatly reduces kernel-launch overhead.

Suggested change

    if x.dim() == 3:
        b, m, n = x.shape
        y_quant, s_scales = mm_weight_quant(x.view(-1, n))
        return y_quant.view(b, m, n), s_scales.view(b, m, 1)
    else:
        y_quant, s_scales = mm_weight_quant(x)
        return y_quant, s_scales
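The equivalence the reviewer relies on (one per-row quantization call over the flattened tensor gives the same result as a loop over batch items) can be checked with a small NumPy script. The mm_weight_quant_ref below is a stand-in for the PR's mm_weight_quant, assumed here to do per-row abs-max scaling; the real function lives in fp8w8a8_perchannel_quant_kernel.py.

```python
import numpy as np

def mm_weight_quant_ref(x):
    """Stand-in for mm_weight_quant: per-row abs-max scale (per-channel quant)."""
    scale = np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-12)
    return x / scale, scale

x = np.random.default_rng(1).standard_normal((4, 8, 16))  # (batch, M, N)
b, m, n = x.shape

# Looped version: one call per batch item (the original code path).
yq_loop = np.empty_like(x)
sc_loop = np.empty((b, m, 1))
for i in range(b):
    yq_loop[i], sc_loop[i] = mm_weight_quant_ref(x[i])

# Reshaped version: a single batched call (the suggested change).
yq_flat, sc_flat = mm_weight_quant_ref(x.reshape(-1, n))
yq_batch = yq_flat.reshape(b, m, n)
sc_batch = sc_flat.reshape(b, m, 1)

assert np.allclose(yq_loop, yq_batch) and np.allclose(sc_loop, sc_batch)
```

The two paths agree because per-channel scales are computed independently per row, so flattening the batch dimension does not mix rows across batch items.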

* ``deepgemm-fp8w8a8-b128``
* ``triton-fp8w8a8-block128``
* ``triton-fp8w8a8g128``: weight per-channel quantization and activation per-group 128 quantization
* ``triton-fp8w8a8g64``: weight per-channel quantization, group size 64
medium

The description of triton-fp8w8a8g64 could be more complete: state explicitly that activations are also quantized per group, matching the g128 description, to avoid confusing users.

Suggested change

    * ``triton-fp8w8a8g64``: weight per-channel quantization, group size 64
    * ``triton-fp8w8a8g64``: weight per-channel quantization and activation per-group 64 quantization

* ``deepgemm-fp8w8a8-b128``
* ``triton-fp8w8a8-block128``
* ``triton-fp8w8a8g128``: weight per-channel quant and activation per-group 128 quant
* ``triton-fp8w8a8g64``: weight per-channel quantization with group size 64
medium

The description of triton-fp8w8a8g64 is incomplete. To avoid confusion, state explicitly that activations are also quantized per group, matching the triton-fp8w8a8g128 description.

Suggested change

    * ``triton-fp8w8a8g64``: weight per-channel quantization with group size 64
    * ``triton-fp8w8a8g64``: weight per-channel quant and activation per-group 64 quant

Comment on lines 478 to +481

        help="""Quantization method: vllm-w8a8 | vllm-fp8w8a8 | vllm-fp8w8a8-b128
    -     | deepgemm-fp8w8a8-b128 | triton-fp8w8a8-block128 | awq | awq_marlin""",
    +     | deepgemm-fp8w8a8-b128 | triton-fp8w8a8-block128 | awq | awq_marlin |
    +     | triton-fp8w8a8g128 (weight perchannel quant and act per group quant) |
    +     triton-fp8w8a8g64 (weight perchannel quantization with group size 64)""",
medium

The help text for triton-fp8w8a8g64 could be more complete: state explicitly that activations are also quantized per group, matching the triton-fp8w8a8g128 description, for clarity.

Suggested change

    help="""Quantization method: vllm-w8a8 | vllm-fp8w8a8 | vllm-fp8w8a8-b128
    | deepgemm-fp8w8a8-b128 | triton-fp8w8a8-block128 | awq | awq_marlin |
    | triton-fp8w8a8g128 (weight perchannel quant and act per group quant) |
    triton-fp8w8a8g64 (weight perchannel quant and act per group 64 quant)"""
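A minimal argparse sketch of how such a --quant_method flag might be declared and parsed; this is illustrative only, and the actual definition (default value, surrounding parser setup) lives in lightllm/server/api_cli.py.

```python
import argparse

# Hypothetical stand-alone parser; LightLLM's real CLI wires this into a larger parser.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--quant_method",
    type=str,
    default=None,
    help="""Quantization method: vllm-w8a8 | vllm-fp8w8a8 | vllm-fp8w8a8-b128
| deepgemm-fp8w8a8-b128 | triton-fp8w8a8-block128 | awq | awq_marlin |
| triton-fp8w8a8g128 (weight perchannel quant and act per group quant) |
triton-fp8w8a8g64 (weight perchannel quant and act per group 64 quant)""",
)

args = parser.parse_args(["--quant_method", "triton-fp8w8a8g128"])
print(args.quant_method)  # → triton-fp8w8a8g128
```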

@hiworldwzj hiworldwzj merged commit f2ab54e into main Feb 26, 2026
1 check passed
@hiworldwzj hiworldwzj deleted the wzj_dev branch February 26, 2026 04:31
