# LTX2 distilled checkpoint support (#12934)
Merged (+508 −74, 36 commits)
Commits:

- 3d78f9d add constants for distill sigmas values and allow ltx pipeline to pas… (rootonchair)
- 9c754a4 add time conditioning conversion and token packing for latents (rootonchair)
- 6fbeacf make style & quality (rootonchair)
- 82c2e7f remove prenorm (rootonchair)
- d988fc3 Merge branch 'main' into feat/distill-ltx2 (sayakpaul)
- 837fd85 add sigma param to ltx2 i2v (rootonchair)
- 96fbcd8 fix copies and add pack latents to i2v (rootonchair)
- faeccc5 Merge branch 'feat/distill-ltx2' of github.com:rootonchair/diffusers … (rootonchair)
- 9575e06 Apply suggestions from code review (rootonchair)
- ce5a514 Merge branch 'main' into feat/distill-ltx2 (sayakpaul)
- 18f1603 Merge branch 'main' into feat/distill-ltx2 (sayakpaul)
- eb01780 Infer latent dims if latents/audio_latents is supplied (dg845)
- 31b0f5d Merge branch 'feat/distill-ltx2' of github.com:rootonchair/diffusers … (rootonchair)
- 7574bf9 add note for predefined sigmas (rootonchair)
- c22eed5 run make style and quality (rootonchair)
- 4ee1c3d Merge branch 'main' into feat/distill-ltx2 (sayakpaul)
- c282485 revert distill timesteps & set original_state_dict_repo_idd to defaul… (rootonchair)
- 62acd4c add latent normalize (rootonchair)
- 7a56648 add create noised state, delete last sigmas (rootonchair)
- 68788f7 remove normalize step in latent upsample pipeline and move it to ltx2… (rootonchair)
- 8a9179b add create noise latent to i2v pipeline (rootonchair)
- 7e637be fix copies (rootonchair)
- 121c085 parse none value in weight conversion script (rootonchair)
- d0650d7 explicit shape handling (rootonchair)
- 1e6a8b9 Apply suggestions from code review (rootonchair)
- f6f682f make style (rootonchair)
- af43114 Merge branch 'main' into feat/distill-ltx2 (sayakpaul)
- 12f2514 add two stage inference tests (rootonchair)
- ce6adfb add ltx2 documentation (rootonchair)
- fd4084e Merge branch 'feat/distill-ltx2' of github.com:rootonchair/diffusers … (rootonchair)
- 0c70c8b update i2v expected_audio_slice (rootonchair)
- 7303754 Apply suggestions from code review (rootonchair)
- 833b427 Apply suggestion from @dg845 (rootonchair)
- 8d8b649 Merge branch 'main' into feat/distill-ltx2 (sayakpaul)
- 0191986 Update ltx2.md to remove one-stage example (rootonchair)
- c45bd6d Merge branch 'main' into feat/distill-ltx2 (dg845)
@@ -24,6 +24,179 @@ You can find all the original LTX-Video checkpoints under the [Lightricks](https
The original codebase for LTX-2 can be found [here](https://github.com/Lightricks/LTX-2).
## Two-Stage Generation

This is the recommended pipeline for production-quality generation. It is composed of two stages:

- Stage 1: Generate a video at the target resolution using diffusion sampling with classifier-free guidance (CFG). This stage produces a coherent low-noise video sequence that respects the text/image conditioning.
- Stage 2: Upsample the Stage 1 output by 2× and refine details using a distilled LoRA model to improve fidelity and visual quality. Stage 2 may apply lighter CFG to preserve the structure from Stage 1 while enhancing texture and sharpness.
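Only the spatial resolution changes between the two stages. As a minimal sketch (the helper below is hypothetical, not a diffusers API), the Stage 2 working resolution is simply the Stage 1 resolution doubled:

```python
def stage2_resolution(width: int, height: int, scale: int = 2) -> tuple[int, int]:
    # Stage 2 upsamples the Stage 1 output spatially by `scale` (2x by default).
    return width * scale, height * scale

print(stage2_resolution(768, 512))  # (1536, 1024)
```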
Sample usage of the two-stage text-to-video pipeline:
```py
import torch
from diffusers import FlowMatchEulerDiscreteScheduler
from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video

device = "cuda:0"
width = 768
height = 512

pipe = LTX2Pipeline.from_pretrained("Lightricks/LTX-2", torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload(device=device)

prompt = "A beautiful sunset over the ocean"
negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."

# Stage 1 default (non-distilled) inference
frame_rate = 24.0
video_latent, audio_latent = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    sigmas=None,
    guidance_scale=4.0,
    output_type="latent",
    return_dict=False,
)

latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    "Lightricks/LTX-2",
    subfolder="latent_upsampler",
    torch_dtype=torch.bfloat16,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
    latents=video_latent,
    output_type="latent",
    return_dict=False,
)[0]

# Load the Stage 2 distilled LoRA
pipe.load_lora_weights(
    "Lightricks/LTX-2", adapter_name="stage_2_distilled", weight_name="ltx-2-19b-distilled-lora-384.safetensors"
)
pipe.set_adapters("stage_2_distilled", 1.0)
# VAE tiling is usually necessary to avoid OOM errors during VAE decoding
pipe.vae.enable_tiling()
# Change the scheduler to use the Stage 2 distilled sigmas as-is
new_scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None
)
pipe.scheduler = new_scheduler
# Stage 2 inference with the distilled LoRA and sigmas
video, audio = pipe(
    latents=upscaled_video_latent,
    audio_latents=audio_latent,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=3,
    noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0],  # renoise with the first sigma value, see https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/ti2vid_two_stages.py#L218
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,
    output_type="np",
    return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_lora_distilled_sample.mp4",
)
```
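In the Stage 2 call above, `noise_scale` controls how much fresh noise is injected into the upscaled latents before refinement. As a simplified sketch of the idea (an illustration only, not the pipeline's actual implementation), flow-matching renoising linearly interpolates between the latents and Gaussian noise:

```python
import torch

def renoise(latents: torch.Tensor, noise_scale: float, generator=None) -> torch.Tensor:
    # Linear interpolation toward Gaussian noise: noise_scale=0 leaves the
    # latents unchanged, noise_scale=1 replaces them with pure noise.
    noise = torch.randn(latents.shape, generator=generator, dtype=latents.dtype)
    return (1.0 - noise_scale) * latents + noise_scale * noise

latents = torch.ones(1, 4, 2, 2)
print(torch.equal(renoise(latents, 0.0), latents))  # True
```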
## Distilled Checkpoint Generation

The fastest two-stage generation pipeline, using a distilled checkpoint:
```py
import torch
from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video

device = "cuda"
width = 768
height = 512
random_seed = 42
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "rootonchair/LTX-2-19b-distilled"

pipe = LTX2Pipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload(device=device)

prompt = "A beautiful sunset over the ocean"
negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."

frame_rate = 24.0
video_latent, audio_latent = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,
    generator=generator,
    output_type="latent",
    return_dict=False,
)

latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    model_path,
    subfolder="latent_upsampler",
    torch_dtype=torch.bfloat16,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
    latents=video_latent,
    output_type="latent",
    return_dict=False,
)[0]

video, audio = pipe(
    latents=upscaled_video_latent,
    audio_latents=audio_latent,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=3,
    noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0],  # renoise with the first sigma value, see https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/distilled.py#L178
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    generator=generator,
    guidance_scale=1.0,
    output_type="np",
    return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_distilled_sample.mp4",
)
```

> Review comment on `model_path = "rootonchair/LTX-2-19b-distilled"` (Member): @dg845 let's get these transferred to the Lightricks org. Would you be able to check internally?
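Both examples convert the pipeline's floating-point frames (values in [0, 1]) to uint8 pixels before encoding. A standalone sketch of that post-processing step (the helper name is ours, not a diffusers API):

```python
import numpy as np

def to_uint8_frames(video: np.ndarray) -> np.ndarray:
    # Clamp to [0, 1], scale to [0, 255], and round to uint8 pixel values,
    # mirroring the `(video * 255).round().astype("uint8")` step above.
    return (np.clip(video, 0.0, 1.0) * 255).round().astype("uint8")

frames = np.array([[0.0, 0.5, 1.0]])
print(to_uint8_frames(frames).tolist())  # [[0, 128, 255]]
```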
## LTX2Pipeline

[[autodoc]] LTX2Pipeline
> Review comment: Sorry to interfere, but I've tested this quite a lot (RTX 4090), and using `enable_sequential_cpu_offload` is actually a very good default, considering how big this model is, since it makes it possible to run locally.