
add triton-fp8w8a8g128 quant type. #1214

Merged
hiworldwzj merged 8 commits into main from wzj_dev
Feb 26, 2026

Conversation

@hiworldwzj
Collaborator

Add a new quantization method: activations use per-group quantization, and weights use per-channel quantization.

@gemini-code-assist
Contributor

Summary of Changes

Hello @hiworldwzj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request extends LightLLM's quantization capabilities by introducing two new FP8 quantization methods, triton-fp8w8a8g128 and triton-fp8w8a8g64. They combine per-channel quantization of weights with per-group quantization of activations, offering finer-grained control and potential performance gains. The core changes add Triton kernels for weight quantization and for the mixed-granularity matrix multiplication, extend the existing activation quantization kernel, and update the documentation and command-line interface to cover the new options.

Highlights

  • New quantization methods: two new Triton FP8 quantization types, triton-fp8w8a8g128 and triton-fp8w8a8g64, supporting activation per-group quantization with group sizes 128 and 64 respectively, while weights use per-channel quantization.
  • New Triton kernels: added fp8w8a8_perchannel_quant_kernel.py for per-channel weight quantization, and scaled_mm_per_token_group_quant_kernel.py implementing the mixed matrix multiplication with per-group activations and per-channel weights.
  • Activation quantization kernel improvement: the existing fp8act_quant_kernel.py gains a NEED_MASK parameter on its _per_token_group_quant_fp8 function, enabling conditional masking when loading data for greater flexibility.
  • Documentation and CLI updates: the Chinese and English api_server_args.rst docs and the command-line help text in api_cli.py now cover the newly added quantization methods.
  • Quantization method registration: the FP8w8a8g128QuantizationMethod and FP8w8a8g64QuantizationMethod classes are registered in the lightllm.common.quantization module, integrating them into LightLLM's quantization framework.
  • Bias assertions: the existing fp8w8a8 and fp8w8a8-b128 quantization methods now assert that adding a bias term is not supported.
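The scheme the highlights describe (per-channel FP8 scales for weights, per-token per-group FP8 scales for activations) can be sketched in plain NumPy. This is an illustrative reference only, not the PR's Triton kernels; the function names and the symmetric abs-max scaling are assumptions, and the final cast to float8 is simulated by clamping to the e4m3 range.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn

def quant_weight_per_channel(w):
    """Per-channel quantization: one scale per output row of a 2D weight."""
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / FP8_E4M3_MAX
    # On GPU this would be cast to torch.float8_e4m3fn; here we only rescale/clamp.
    w_q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return w_q, scale  # shapes: (N, K) and (N, 1)

def quant_act_per_group(x, group_size=128):
    """Per-token, per-group quantization: one scale per contiguous group of features."""
    m, k = x.shape
    assert k % group_size == 0
    g = x.reshape(m, k // group_size, group_size)
    scale = np.maximum(np.abs(g).max(axis=-1, keepdims=True), 1e-12) / FP8_E4M3_MAX
    x_q = np.clip(g / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX).reshape(m, k)
    return x_q, scale.reshape(m, k // group_size)  # shapes: (M, K) and (M, K // group_size)
```

With group size 128 this mirrors triton-fp8w8a8g128; passing group_size=64 mirrors the g64 variant.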


Changelog
  • docs/CN/source/tutorial/api_server_args.rst
    • Added Chinese descriptions for the triton-fp8w8a8g128 and triton-fp8w8a8g64 quantization methods.
  • docs/EN/source/tutorial/api_server_args.rst
    • Added English descriptions for the triton-fp8w8a8g128 and triton-fp8w8a8g64 quantization methods.
  • lightllm/common/basemodel/triton_kernel/quantization/fp8act_quant_kernel.py
    • Modified the _per_token_group_quant_fp8 function to accept a NEED_MASK parameter for conditional masking.
    • Passed the NEED_MASK argument in the lightllm_per_token_group_quant_fp8 call.
  • lightllm/common/basemodel/triton_kernel/quantization/fp8w8a8_perchannel_quant_kernel.py
    • Added the weight_quant_kernel Triton kernel implementing per-channel weight quantization.
    • Added the mm_weight_quant and weight_quant Python functions wrapping the Triton kernel for batched weight quantization.
  • lightllm/common/basemodel/triton_kernel/quantization/scaled_mm_per_token_group_quant_kernel.py
    • Added the grouped_launch function for the Triton kernel's grouped launch pattern.
    • Added the _scaled_mm_act_per_group_w_perchannel_kernel Triton kernel implementing the mixed matrix multiplication with per-group activations and per-channel weights.
    • Added the scaled_mm_act_per_group_w_perchannel Python wrapper, integrating autotuning and TMA support.
    • Included an if __name__ == '__main__': block for testing and benchmarking the new kernel.
  • lightllm/common/quantization/__init__.py
    • Imported the new w8a8gx module to register the new quantization methods.
  • lightllm/common/quantization/w8a8.py
    • Added an assertion in the apply method that bias is unsupported for fp8w8a8 quantization.
    • Updated the import path of weight_quant in the quantize method.
    • Added an assertion in the apply method that bias is unsupported for fp8w8a8-b128 quantization.
  • lightllm/common/quantization/w8a8gx.py
    • Added the _BaseQuantizationMethod abstract base class.
    • Added the FP8w8a8g128QuantizationMethod class implementing weight per-channel quantization and activation per-group 128 quantization.
    • Added the FP8w8a8g64QuantizationMethod class, inheriting from FP8w8a8g128QuantizationMethod with the activation group size set to 64.
  • lightllm/server/api_cli.py
    • Updated the --quant_method command-line help text to include the triton-fp8w8a8g128 and triton-fp8w8a8g64 options.
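For reference, the math that the new scaled_mm_act_per_group_w_perchannel kernel computes can be written out in a few lines of NumPy. This is a hedged sketch of the dequantize-and-multiply semantics only (the real Triton kernel fuses the scaling into the matmul); the function name and argument shapes are assumptions based on the summary above.

```python
import numpy as np

def scaled_mm_reference(x_q, x_scale, w_q, w_scale, group_size=128):
    """Reference for a mixed-granularity scaled matmul.

    x_q:     (M, K) quantized activations
    x_scale: (M, K // group_size) per-token per-group activation scales
    w_q:     (N, K) quantized weights
    w_scale: (N, 1) per-channel weight scales
    Returns y = (x_q * x_scale) @ (w_q * w_scale).T, shape (M, N).
    """
    # Expand each per-group scale over its group of K columns.
    xs_full = np.repeat(x_scale, group_size, axis=1)  # (M, K)
    x_deq = x_q * xs_full
    w_deq = w_q * w_scale                             # (N, K), broadcast over rows
    return x_deq @ w_deq.T
```

A production kernel would instead accumulate int/fp8 products per group and apply both scales inside the K loop, but the result is numerically the same dequantized matmul.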

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces a new FP8 quantization method, triton-fp8w8a8g128, in which weights use per-channel quantization and activations use per-group quantization, along with a corresponding g64 variant. The implementation is solid overall, covering the new Triton kernels, the quantization method classes, and the matching documentation and command-line updates. I have a few suggestions:

  1. Improve the performance of the 3D-tensor path in the weight_quant function.
  2. Make the descriptions of triton-fp8w8a8g64 in the docs and CLI help clearer and more precise.

These changes will help improve performance and maintainability.

Comment on lines +41 to +49

    if x.dim() == 3:
        y_quant = torch.empty((x.shape[0], x.shape[1], x.shape[2]), dtype=torch.float8_e4m3fn, device=x.device)
        s_scales = torch.empty((x.shape[0], x.shape[1], 1), dtype=torch.float32, device=x.device)
        for i in range(x.shape[0]):
            y_quant[i], s_scales[i] = mm_weight_quant(x[i])
        return y_quant, s_scales
    else:
        y_quant, s_scales = mm_weight_quant(x)
        return y_quant, s_scales
high

The for loop in the 3D-tensor path launches a separate CUDA kernel for each batch item, which is inefficient for large batches.
To improve performance, reshape the 3D tensor to 2D, quantize it in a single batched call, and reshape the results back to 3D. This greatly reduces kernel-launch overhead.

Suggested change

    if x.dim() == 3:
        b, m, n = x.shape
        y_quant, s_scales = mm_weight_quant(x.view(-1, n))
        return y_quant.view(b, m, n), s_scales.view(b, m, 1)
    else:
        y_quant, s_scales = mm_weight_quant(x)
        return y_quant, s_scales
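The equivalence the reviewer relies on (one per-row quantization call over the flattened tensor gives the same result as a loop over batch items) can be checked with a small NumPy script. The mm_weight_quant_ref below is a stand-in for the PR's mm_weight_quant, assumed here to do per-row abs-max scaling; the real function lives in fp8w8a8_perchannel_quant_kernel.py.

```python
import numpy as np

def mm_weight_quant_ref(x):
    """Stand-in for mm_weight_quant: per-row abs-max scale (per-channel quant)."""
    scale = np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-12)
    return x / scale, scale

x = np.random.default_rng(1).standard_normal((4, 8, 16))  # (batch, M, N)
b, m, n = x.shape

# Looped version: one call per batch item (the original code path).
yq_loop = np.empty_like(x)
sc_loop = np.empty((b, m, 1))
for i in range(b):
    yq_loop[i], sc_loop[i] = mm_weight_quant_ref(x[i])

# Reshaped version: a single batched call (the suggested change).
yq_flat, sc_flat = mm_weight_quant_ref(x.reshape(-1, n))
yq_batch = yq_flat.reshape(b, m, n)
sc_batch = sc_flat.reshape(b, m, 1)

assert np.allclose(yq_loop, yq_batch) and np.allclose(sc_loop, sc_batch)
```

The two paths agree because per-channel scales are computed independently per row, so flattening the batch dimension does not mix rows across batch items.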

* ``deepgemm-fp8w8a8-b128``
* ``triton-fp8w8a8-block128``
* ``triton-fp8w8a8g128``: weight per-channel quantization and activation per-group 128 quantization
* ``triton-fp8w8a8g64``: weight per-channel quantization, group size 64
medium

The description of triton-fp8w8a8g64 could be more complete: state explicitly that activations are also quantized per group, matching the g128 description, to avoid confusing users.

Suggested change

    * ``triton-fp8w8a8g64``: weight per-channel quantization, group size 64
    * ``triton-fp8w8a8g64``: weight per-channel quantization and activation per-group 64 quantization

* ``deepgemm-fp8w8a8-b128``
* ``triton-fp8w8a8-block128``
* ``triton-fp8w8a8g128``: weight per-channel quant and activation per-group 128 quant
* ``triton-fp8w8a8g64``: weight per-channel quantization with group size 64
medium

The description of triton-fp8w8a8g64 is incomplete. To avoid confusion, state explicitly that activations are also quantized per group, matching the triton-fp8w8a8g128 description.

Suggested change

    * ``triton-fp8w8a8g64``: weight per-channel quantization with group size 64
    * ``triton-fp8w8a8g64``: weight per-channel quant and activation per-group 64 quant

Comment on lines 478 to +481

        help="""Quantization method: vllm-w8a8 | vllm-fp8w8a8 | vllm-fp8w8a8-b128
    -     | deepgemm-fp8w8a8-b128 | triton-fp8w8a8-block128 | awq | awq_marlin""",
    +     | deepgemm-fp8w8a8-b128 | triton-fp8w8a8-block128 | awq | awq_marlin |
    +     | triton-fp8w8a8g128 (weight perchannel quant and act per group quant) |
    +     triton-fp8w8a8g64 (weight perchannel quantization with group size 64)""",
medium

The help text for triton-fp8w8a8g64 could be more complete: state explicitly that activations are also quantized per group, matching the triton-fp8w8a8g128 description, for clarity.

Suggested change

    help="""Quantization method: vllm-w8a8 | vllm-fp8w8a8 | vllm-fp8w8a8-b128
    | deepgemm-fp8w8a8-b128 | triton-fp8w8a8-block128 | awq | awq_marlin |
    | triton-fp8w8a8g128 (weight perchannel quant and act per group quant) |
    triton-fp8w8a8g64 (weight perchannel quant and act per group 64 quant)"""
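A minimal argparse sketch of how such a --quant_method flag might be declared and parsed; this is illustrative only, and the actual definition (default value, surrounding parser setup) lives in lightllm/server/api_cli.py.

```python
import argparse

# Hypothetical stand-alone parser; LightLLM's real CLI wires this into a larger parser.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--quant_method",
    type=str,
    default=None,
    help="""Quantization method: vllm-w8a8 | vllm-fp8w8a8 | vllm-fp8w8a8-b128
| deepgemm-fp8w8a8-b128 | triton-fp8w8a8-block128 | awq | awq_marlin |
| triton-fp8w8a8g128 (weight perchannel quant and act per group quant) |
triton-fp8w8a8g64 (weight perchannel quant and act per group 64 quant)""",
)

args = parser.parse_args(["--quant_method", "triton-fp8w8a8g128"])
print(args.quant_method)  # → triton-fp8w8a8g128
```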

@hiworldwzj hiworldwzj merged commit f2ab54e into main Feb 26, 2026
1 check passed
@hiworldwzj hiworldwzj deleted the wzj_dev branch February 26, 2026 04:31
