Conversation
Code Review
This pull request introduces "TPSP mix mode" to integrate sequence parallelism with tensor parallelism, optimizing microbatch overlap performance. It refactors BaseLayerInfer with unified communication helpers and updates CudaGraph to support TP-aligned batching. Feedback points out a calculation error in microbatch_overlap_prefill regarding prefix tokens, a logically flawed assertion in the sp_pad_copy kernel that causes unavoidable failures, and a missing assignment of gathered data in Llama's post-inference. Additionally, caching environment flags is recommended to improve efficiency in hot inference paths.
```python
infer_handle_token_num0 = triton.cdiv(model_input0.total_token_num, self.tp_world_size_) * self.tp_world_size_
infer_handle_token_num1 = triton.cdiv(model_input1.total_token_num, self.tp_world_size_) * self.tp_world_size_
```
In microbatch_overlap_prefill, the calculation of infer_handle_token_num0 and infer_handle_token_num1 uses model_input.total_token_num instead of the actual handle token count (total_token_num - prefix_total_token_num). This is inconsistent with the logic in _prefill (line 485) and will lead to incorrect padding if prefix tokens are present, as the sequence parallelism split should be aligned based on the tokens being processed in the current forward pass.
```diff
- infer_handle_token_num0 = triton.cdiv(model_input0.total_token_num, self.tp_world_size_) * self.tp_world_size_
- infer_handle_token_num1 = triton.cdiv(model_input1.total_token_num, self.tp_world_size_) * self.tp_world_size_
+ infer_handle_token_num0 = triton.cdiv(origin_handle_token_num0, self.tp_world_size_) * self.tp_world_size_
+ infer_handle_token_num1 = triton.cdiv(origin_handle_token_num1, self.tp_world_size_) * self.tp_world_size_
```
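The alignment math at issue can be checked with a small sketch. This is a hypothetical illustration, not the repository's code: `cdiv` mirrors `triton.cdiv`, and `padded_handle_token_num` is an assumed helper name showing why padding must be computed from the tokens handled this pass (total minus prefix) rather than from the raw total.

```python
def cdiv(a: int, b: int) -> int:
    # Mirrors triton.cdiv: ceiling division.
    return (a + b - 1) // b

def padded_handle_token_num(total_token_num: int, prefix_total_token_num: int, tp_world_size: int) -> int:
    # The review's point: the SP split must be aligned on the tokens actually
    # processed in this forward pass, i.e. total - prefix, then rounded up to
    # a multiple of tp_world_size.
    origin_handle_token_num = total_token_num - prefix_total_token_num
    return cdiv(origin_handle_token_num, tp_world_size) * tp_world_size
```

With 10 total tokens, 3 of them prefix, and a TP world size of 4, the correct padded count is 8, whereas padding the raw total would give 12.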
```python
assert (
    in_token_num % sp_world_size == 0
), f"in_token_num % sp_world_size != 0, in_token_num: {in_token_num}, sp_world_size: {sp_world_size}"
```
This assertion is logically flawed. If in_token_num % sp_world_size == 0, the function returns at line 50. Therefore, if execution reaches line 51, the condition in_token_num % sp_world_size == 0 is guaranteed to be false, causing the assertion to always fail. If the intention is to enforce alignment, the padding logic below (lines 54-73) becomes dead code and should be removed. If padding is intended to be supported, this assertion should be removed or corrected.
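The control-flow problem described above can be illustrated with a pure-Python sketch. The function name matches the kernel under review, but the body is an assumed simplification (a Python list standing in for the tensor) used only to show why the assertion is unreachable without failing:

```python
def sp_pad_copy(tokens, sp_world_size, pad_value=0):
    # Hypothetical sketch of the kernel's intended logic.
    in_token_num = len(tokens)
    if in_token_num % sp_world_size == 0:
        return tokens  # already aligned: early return (the review's "line 50")
    # Reaching this point guarantees in_token_num % sp_world_size != 0, so an
    # `assert in_token_num % sp_world_size == 0` here would always fail and
    # make the padding below dead code. Dropping the assert lets padding run:
    pad = sp_world_size - in_token_num % sp_world_size
    return tokens + [pad_value] * pad
```

With the assertion removed, an unaligned input is padded up to the next multiple of `sp_world_size`, which appears to be what the padding branch was written to do.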
```python
)
# len(infer_state.input_ids) gives the real input length
input_embdings = gather_data[0 : len(infer_state.input_ids)]
self._tpsp_allgather(input=input_embdings, infer_state=infer_state)
```
The result of self._tpsp_allgather is ignored. Since _tpsp_allgather returns a new tensor when gathering is performed, input_embdings must be updated with the return value to ensure the subsequent _token_forward call uses the gathered data.
```diff
- self._tpsp_allgather(input=input_embdings, infer_state=infer_state)
+ input_embdings = self._tpsp_allgather(input=input_embdings, infer_state=infer_state)
```
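Why the reassignment matters can be shown with a stand-in for the helper. This is an assumed simplification, not `_tpsp_allgather` itself: a gather's output is `world_size` times larger than each rank's input, so it cannot be written in place and the caller must rebind the name to the returned object.

```python
def tpsp_allgather_sketch(local_chunk, world_size):
    # Hypothetical stand-in: simulate concatenating per-rank chunks by
    # repeating the local chunk world_size times. The point is that the
    # result is a NEW object, so the return value must not be dropped.
    return local_chunk * world_size

chunk = [1.0, 2.0]
tpsp_allgather_sketch(chunk, 2)            # bug: return value dropped, chunk unchanged
chunk = tpsp_allgather_sketch(chunk, 2)    # fix: rebind to the gathered result
```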
```python
raise Exception("need to impl")


def _tpsp_allgather(self, input: torch.Tensor, infer_state: InferStateInfo):
    if self.tp_world_size_ > 1 and get_env_start_args().enable_tpsp_mix_mode:
```
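The review summary recommends caching environment flags rather than re-reading them in hot inference paths, and the snippet above calls `get_env_start_args().enable_tpsp_mix_mode` on every `_tpsp_allgather` invocation. A minimal sketch of the caching pattern (the class, constructor parameters, and cached attribute name are all assumptions for illustration):

```python
class BaseLayerInferSketch:
    # Hypothetical sketch: read the env flag once at construction instead of
    # calling get_env_start_args() inside the per-token inference path.
    def __init__(self, tp_world_size: int, enable_tpsp_mix_mode: bool):
        self.tp_world_size_ = tp_world_size
        # Cached once; the snippet above re-reads it on every call.
        self._enable_tpsp_mix_mode = enable_tpsp_mix_mode

    def _needs_tpsp_allgather(self) -> bool:
        return self.tp_world_size_ > 1 and self._enable_tpsp_mix_mode
```

Since start-up arguments do not change after launch, hoisting the read out of the hot path trades one attribute lookup for a repeated function call per forward step.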