[hotfix] fix parameter shape checking #6124
flybird11111 wants to merge 1 commit into hpcaitech:main from
Conversation
        break
if partition_dim is not None:
    assert (
        original_shape[partition_dim] == tp_size * current_shape[partition_dim]
original_shape = [128256, 4096]
current_shape = [16064, 4096]
tp_size = 8
16064 * 8 = 128512 != 128256
When I delete lines 114~118, the size of model.embed_tokens.weight in the saved model is [16064, 4096].
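For context, a short sketch (plain Python, variable names are mine) that reproduces these numbers with the padding formula from VocabParallelEmbedding1D.__init__ quoted further down this thread:

```python
original_vocab = 128256                    # original_shape[0], i.e. the unpadded vocab size
tp_size = 8
make_vocab_size_divisible_by = 64

multiple = make_vocab_size_divisible_by * tp_size                     # 512
# remainder is 256 here, so the padding branch in __init__ is taken
padded_vocab = original_vocab + multiple - original_vocab % multiple  # 128512
shard_rows = padded_vocab // tp_size                                  # 16064 rows per rank

print(shard_rows * tp_size, "vs", original_vocab)                     # 128512 vs 128256 -> the assert fails
```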
if partition_dim is not None:
    assert (
        original_shape[partition_dim] == tp_size * current_shape[partition_dim]
    ), f"The parameter isn't evenly distributed among tensor parallel group: \
current_shape should be [16032, 4096]
original_shape refers to the shape before padding, while tp_size * current_shape refers to the shape after padding. Therefore, the assert will throw an error.
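One possible way to reconcile the two sides of the check is to compare against the padded size rather than the raw one. A minimal sketch, assuming the checking code could recompute the padding the same way the embedding does (the expected_padded_size helper and the availability of make_vocab_size_divisible_by at this point are assumptions, not the actual fix in this PR):

```python
def expected_padded_size(original_size: int, tp_size: int, make_vocab_size_divisible_by: int = 64) -> int:
    # Round the unpadded dimension up to the same multiple that
    # VocabParallelEmbedding1D pads to (make_vocab_size_divisible_by * tp_size).
    multiple = make_vocab_size_divisible_by * tp_size
    remainder = original_size % multiple
    return original_size if remainder == 0 else original_size + multiple - remainder

# With the numbers from this thread, the adjusted comparison holds:
original_shape, current_shape, tp_size = [128256, 4096], [16064, 4096], 8
partition_dim = 0
assert expected_padded_size(original_shape[partition_dim], tp_size) == tp_size * current_shape[partition_dim]
```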
Why?

class VocabParallelEmbedding1D(PaddingParallelModule):
    r"""Embedding parallelized in the vocabulary dimension.

    Args:
        num_embeddings (int): number of embeddings.
        embedding_dim (int): dimension of embedding.
        padding_idx (int, optional): If specified, the entries at padding_idx do not contribute to the gradient;
            therefore, the embedding vector at padding_idx is not updated during training,
            i.e. it remains as a fixed "pad", defaults to None.
        dtype (:class:`torch.dtype`, optional): The dtype of parameters, defaults to None.
        weight_initializer (:class:`typing.Callable`, optional):
            The initializer of weight, defaults to normal initializer.

    The ``args`` and ``kwargs`` used in :class:``torch.nn.functional.embedding`` should contain:
    ::

        max_norm (float, optional): If given, each embedding vector with norm larger than max_norm is
            renormalized to have norm max_norm. Note: this will modify weight in-place.
        norm_type (float, optional): The p of the p-norm to compute for the max_norm option. Default 2.
        scale_grad_by_freq (bool, optional): If given, this will scale gradients by the inverse
            of frequency of the words in the mini-batch. Default False.
        sparse (bool, optional): If True, gradient w.r.t. weight will be a sparse tensor. Default False.

    More details about ``args`` and ``kwargs`` could be found in
    `Embedding <https://pytorch.org/docs/stable/generated/torch.nn.functional.embedding.html#torch.nn.functional.embedding>`_.

    More details about initializer please refer to
    `init <https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/nn/init.py>`_.
    """

    def __init__(
        self,
        num_embeddings: int,
        embedding_dim: int,
        padding_idx: int = None,
        dtype: torch.dtype = None,
        device: torch.device = None,
        process_group: ProcessGroup = None,
        weight: Optional[nn.Parameter] = None,
        weight_initializer: Callable = init.normal_(),
        make_vocab_size_divisible_by: int = 64,
        fp8_communication: bool = False,
        *args,
        **kwargs,
    ):
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.embed_args = args
        self.embed_kwargs = kwargs
        self.process_group = process_group
        self.fp8_communication = fp8_communication

        tensor_parallel_size = dist.get_world_size(group=process_group)
        tensor_parallel_rank = dist.get_rank(group=process_group)

        # generate weight and bias
        if weight is None:
            factory_kwargs = {"device": device, "dtype": dtype}
            weight = nn.Parameter(torch.empty((num_embeddings, self.embedding_dim), **factory_kwargs))
        else:
            weight.data = weight.data.to(device=device, dtype=dtype)

        # calculate new padding size
        multiple = make_vocab_size_divisible_by * tensor_parallel_size
        if num_embeddings % multiple != 0:
            self.num_embeddings = num_embeddings + multiple - (num_embeddings % multiple)
I propose revising the following code:
The make_vocab_size_divisible_by argument is used to ensure that the vocabulary size is divisible by 64, because torch.mm selects a faster kernel when the size is a multiple of 64. The padding is removed (unpadded) from the result afterwards.
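For illustration, a minimal pad/unpad round trip in plain PyTorch (the tensor names and the use of F.pad are mine; this is not the actual ColossalAI implementation):

```python
import torch
import torch.nn.functional as F

vocab_size, hidden, tp_size, divisor = 128256, 16, 8, 64
multiple = divisor * tp_size                              # 512
pad_rows = (multiple - vocab_size % multiple) % multiple  # 256 extra rows

weight = torch.randn(vocab_size, hidden)
padded = F.pad(weight, (0, 0, 0, pad_rows))               # [128512, 16], splits evenly across 8 ranks
shard_rows = padded.shape[0] // tp_size                   # 16064 rows per rank

# Once the shards are gathered back, the padding is simply sliced off again.
unpadded = padded[:vocab_size]
assert torch.equal(unpadded, weight)
```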
This issue has been closed. Could you please take a look at #6110?