@basilwong
Contributor

Description

This PR adds a comprehensive tutorial on using Mosaic for GPU memory profiling in PyTorch.

Mosaic is a post-hoc memory analysis tool that was instrumental in debugging out-of-memory (OOM) issues during training of the 405B LLaMA model.

What users will learn

  1. Categorical Memory Profiling - Breaking down memory by category (activations, gradients, optimizer state, parameters)
  2. Debugging Unexpected Memory - Using stack trace analysis to find abandoned debug code causing memory bloat
  3. Pipeline Integration - Using Mosaic's Python API for automated memory monitoring and CI/CD regression testing
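
As a rough illustration of the CI/CD regression testing mentioned in item 3, the comparison step might look like the sketch below. It assumes a peak-memory figure in bytes has already been obtained (for example from Mosaic's Python API, whose calls are not shown because this description does not list them). The function name `check_memory_regression` and the 5% default tolerance are hypothetical.

```python
def check_memory_regression(
    peak_bytes: int,
    baseline_bytes: int,
    tolerance: float = 0.05,  # hypothetical default: allow 5% growth
) -> bool:
    """Return True when peak memory stays within `tolerance` of the baseline."""
    if baseline_bytes <= 0:
        raise ValueError("baseline_bytes must be positive")
    allowed_bytes = baseline_bytes * (1.0 + tolerance)
    return peak_bytes <= allowed_bytes


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Illustrative numbers only, not measurements from the tutorial.
    ok = check_memory_regression(peak_bytes=10 * GIB, baseline_bytes=10 * GIB)
    print("within budget" if ok else "memory regression detected")
```

A CI job could fail the build when the function returns False, which keeps the threshold logic separate from however the peak value is collected.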

Tutorial structure

  • Introduction to Mosaic and installation
  • Simple usage examples (CLI commands)
  • Real-World Case 1: Activation Checkpointing Analysis
  • Real-World Case 2: Debugging Unexpected Memory Usage
  • Real-World Case 3: Pipeline Integration with Python API

Requirements

  • PyTorch with CUDA support
  • pip install git+https://github.com/facebookresearch/mosaic.git
  • GPU required to run the examples

Checklist

  • Tutorial runs without errors
  • Tutorial follows sphinx-gallery format
  • Images included in _static/img/mosaic/
  • CI passes

Related Links

Introduces a beginner tutorial demonstrating how to use Mosaic for
GPU memory analysis in PyTorch. The tutorial covers:

- Analyzing memory savings from activation checkpointing
- Debugging unexpected memory usage from abandoned code
- Integrating Mosaic into training pipelines for CI/CD

Includes graceful handling for environments without GPU access.
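
One way such graceful handling could be written is sketched below; it is not necessarily how the tutorial does it. `torch.cuda.is_available()` is the standard PyTorch check, while the helper names `cuda_available` and `run_gpu_section` are hypothetical.

```python
import importlib.util
from typing import Callable


def cuda_available() -> bool:
    """Return True only if torch is importable and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


def run_gpu_section(section: Callable[[], None]) -> str:
    """Run a GPU-only section, or skip it with a message when no GPU is present."""
    if not cuda_available():
        print("CUDA not available; skipping GPU memory profiling section.")
        return "skipped"
    section()
    return "ran"
```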
@pytorch-bot

pytorch-bot bot commented Jan 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3744

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the cla signed label Jan 26, 2026
@svekars svekars added the mosaic label Jan 26, 2026
svekars and others added 4 commits January 26, 2026 13:04
Add HAS_MOSAIC_CLI check to skip Mosaic CLI subprocess calls
when the mosaic package is not installed. This prevents
FileNotFoundError in CI environments that have CUDA but
don't have Mosaic installed.
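
A minimal sketch of how a flag like `HAS_MOSAIC_CLI` could be computed, assuming the CLI is exposed as an executable on `PATH`. The executable name below is an assumption for illustration; this commit message does not state which command the tutorial checks for.

```python
import shutil

# Assumed executable name; substitute the command your Mosaic install provides.
MOSAIC_CLI_NAME = "mosaic_get_memory_usage_peak"

# shutil.which returns None when the executable is not found on PATH.
HAS_MOSAIC_CLI = shutil.which(MOSAIC_CLI_NAME) is not None

if not HAS_MOSAIC_CLI:
    print(f"{MOSAIC_CLI_NAME} not found; skipping Mosaic CLI examples.")
```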
Remove check=True from subprocess.run calls to prevent
exceptions when Mosaic CLI commands fail. Instead, check
return codes and print informative messages. This allows
the tutorial to run in environments where Mosaic is
partially installed or configured differently.
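
The pattern described in this commit (no `check=True`, inspect the return code instead) can be sketched as follows. The helper name `run_cli` is hypothetical and the printed messages are illustrative, not the tutorial's actual wording.

```python
import subprocess


def run_cli(cmd: list[str]) -> bool:
    """Run a command without raising on failure; report problems and return success."""
    try:
        # No check=True: a non-zero exit code will not raise CalledProcessError.
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        print(f"Command not found: {cmd[0]}")
        return False
    if result.returncode != 0:
        print(f"{cmd[0]} exited with code {result.returncode}: {result.stderr.strip()}")
        return False
    print(result.stdout)
    return True
```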
Set __main__.__file__ to a valid file path if not present.
Transformers library reads this file to inspect source code,
so we provide the tutorial file path or fall back to the
transformers module path if __file__ is not available.
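
A sketch of the workaround this commit describes, assuming the goal is only to guarantee that `__main__.__file__` holds some valid path. The helper name `ensure_main_file` and its `fallback_path` parameter are hypothetical; the actual tutorial chooses between its own path and the transformers module path.

```python
import os
import sys


def ensure_main_file(fallback_path: str) -> str:
    """Give __main__ a __file__ attribute if it lacks one, and return the value in use."""
    main_module = sys.modules["__main__"]
    if not getattr(main_module, "__file__", None):
        # Some runners execute code without setting __file__ on __main__.
        main_module.__file__ = os.path.abspath(fallback_path)
    return main_module.__file__
```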
@basilwong basilwong force-pushed the mosaic-memory-profiling-tutorial branch from 16248f0 to a64f6c0 Compare January 27, 2026 02:16
Wrap buggy model instantiation in try/except to handle
ValueError from newer transformers versions that don't
support experts implementation on GPT2Model. Falls back
gracefully when the demo cannot run.
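
The fallback described here amounts to catching `ValueError` around construction and skipping the demo. A generic sketch is shown below; the helper `build_or_skip` is hypothetical and takes any zero-argument factory, so no specific transformers call signature is assumed.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def build_or_skip(factory: Callable[[], T], label: str) -> Optional[T]:
    """Call `factory`; on ValueError, print a notice and return None."""
    try:
        return factory()
    except ValueError as exc:
        print(f"Skipping {label}: {exc}")
        return None


if __name__ == "__main__":
    def unsupported_model():
        # Stand-in for a constructor that rejects an unsupported configuration.
        raise ValueError("configuration not supported by this library version")

    model = build_or_skip(unsupported_model, "buggy model demo")
    if model is None:
        print("Demo skipped.")
```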
basilwong and others added 2 commits January 27, 2026 10:57
- Add tutorial to "What's new in PyTorch tutorials" section
- Add customcarditem in Profiling section of index.rst
- Add customcarditem and toctree entry in ecosystem.rst
@sekyondaMeta sekyondaMeta added the skip-link-check (Will allow you to skip linkcheck on a PR. Should only be used when a link can't be fixed.) label Jan 27, 2026
@svekars svekars merged commit 36a4b88 into pytorch:main Jan 27, 2026
35 checks passed

3 participants