Add Automated QDQ placement reference - part 4.2 #842

willg-nv wants to merge 2 commits into NVIDIA:main from
Conversation
📝 Walkthrough

A new comprehensive documentation file describing the Automatic Q/DQ Placement Optimizer architecture for ONNX models. The document covers framework components, optimization workflows, data models, the full lifecycle from initialization through evaluation, usage examples via CLI and Python API, design rationale, limitations, and related terminology.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
In `@docs/source/reference/2_qdq_placement.rst`:
- Around line 143-144: The summary line claiming "Supports forward-only,
backward-only, and bidirectional search" is inconsistent with later text that
says only forward (downstream) region expansion is implemented and backward
expansion is a potential future addition; update the brief feature list to
accurately reflect the current implementation by changing that line to state
that only forward (downstream) region expansion is currently implemented and
that backward and bidirectional search are planned/experimental, and ensure the
change references the same "region search" wording so readers see the consistent
status across the document.
- Around line 343-347: Add explicit documentation for configuring TensorRT
plugin library paths: state the config key name (tensorrt_plugin_paths) and the
CLI flag (--tensorrt-plugin-paths) and an environment variable alternative
(TENSORRT_PLUGIN_PATH), describe the accepted formats (comma-separated list or
JSON array in the config file, single string for a single path) and where to
place them in the config/CLI, and note that plugins are loaded during TensorRT
engine creation/initialization (before inference/warmup) so they must be
accessible at startup; include a short example for both CLI and config file
usage and clarify failure behavior if a plugin path is invalid.
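A hedged sketch of what such startup plugin loading might look like on the Python side. The key and flag names above come from the review suggestion, and both helpers below are hypothetical illustrations, not the tool's actual API:

```python
import ctypes
import os


def parse_plugin_paths(value):
    """Split a comma-separated plugin-path string (the format proposed for
    `tensorrt_plugin_paths` / `TENSORRT_PLUGIN_PATH`) into a clean list."""
    return [p.strip() for p in value.split(",") if p.strip()]


def load_tensorrt_plugins(value):
    """Load each plugin shared library before engine creation; fail fast on
    invalid paths so errors surface at startup rather than mid-optimization."""
    paths = parse_plugin_paths(value)
    for path in paths:
        if not os.path.exists(path):
            raise FileNotFoundError(f"TensorRT plugin library not found: {path}")
        ctypes.CDLL(path)  # dlopen registers the plugin creators with TensorRT
    return paths
```

Loading eagerly at startup (rather than lazily at first inference) matches the failure behavior the comment asks to document: a bad path aborts the run immediately with a clear error.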
- Line 299: Update the documentation for init_benchmark_instance(timing_cache,
warmup, timing) to include explicit parameter types and descriptions (e.g.,
whether timing_cache is a file path string or a binary blob, the expected
type/units for warmup and timing), indicate which parameters are optional and
their default values, specify the function's return type and what the return
value represents, and add a short usage example showing a typical call and
expected outcome; reference the exact symbol init_benchmark_instance and its
parameters timing_cache, warmup, timing so readers can locate the function
implementation if they need more detail.
- Around line 1088-1091: Fix the two broken/renamed links in
docs/source/reference/2_qdq_placement.rst by updating the "ONNX Quantization"
link (currently https://onnx.ai/onnx/technical/quantization.html) to the correct
ONNX quantization documentation path or remove that list item if no direct
replacement exists, and update the "NVIDIA ModelOpt" GitHub URL from
https://github.com/NVIDIA/TensorRT-Model-Optimizer to the renamed repository
https://github.com/NVIDIA/Model-Optimizer; target the list entries labeled "ONNX
Quantization" and "NVIDIA ModelOpt" in the diff when making the changes.
🧹 Nitpick comments (3)
docs/source/reference/2_qdq_placement.rst (3)
899-914: Clarify dual behavior of `export_onnx` and `benchmark.run`. The example shows two different usage patterns:

- Lines 899-900: `export_onnx("baseline.onnx", ...)` exports to file, then `benchmark.run("baseline.onnx")` reads from file
- Lines 912-913: `export_onnx(None, ...)` returns bytes, then `benchmark.run(model_bytes)` accepts bytes

This dual behavior (file path vs. in-memory bytes) should be explicitly documented in the method descriptions. Consider adding a note explaining when to use each approach (e.g., use bytes for intermediate schemes to avoid disk I/O).
390-407: Consider adding parameter tuning guidance. The configuration parameters are well-documented with clear defaults, but users would benefit from guidance on:

- When to increase `maximum_generation_attempts` (e.g., for models with many constraints)
- When to decrease `pattern_cache_minimum_distance` (e.g., for fine-grained exploration)
- Trade-offs between `schemes_per_region` and optimization time
- Performance implications of `pattern_cache_max_entries_per_pattern`

This would help users make informed decisions when tuning the autotuner for their specific use cases.
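Such guidance could accompany a tuning sketch like the one below; the values are placeholders for illustration, not the tool's actual defaults:

```python
# Hypothetical tuning profile; comments capture the trade-offs a user weighs.
autotuner_config = {
    # Raise for models with many constraints, where valid schemes are rare.
    "maximum_generation_attempts": 100,
    # Lower for finer-grained exploration (more similar schemes admitted).
    "pattern_cache_minimum_distance": 1,
    # More schemes per region improves coverage but lengthens optimization.
    "schemes_per_region": 8,
    # Caps memory and lookup cost of the warm-start pattern cache.
    "pattern_cache_max_entries_per_pattern": 16,
}
```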
920-1037: Consider adding troubleshooting and operational guidance sections. The documentation thoroughly covers architecture and usage but would benefit from additional sections on:

Error Handling: What happens when a TensorRT engine build fails? How are failures logged? Can users configure retry behavior?

Troubleshooting: Common issues and solutions:

- What if no speedup is achieved?
- What if TensorRT can't parse the model?
- What if optimization takes too long?

Performance Expectations: How long does typical optimization take? What factors affect optimization time? (This would help users set realistic expectations.)

Logging Configuration: How to enable verbose logging? What information is logged at different verbosity levels?

These additions would make the documentation more complete for production use cases.
Suggestion to rename this as
    * Input tensor index to that region
    * Enables quantization of data flowing into subregions

    **InsertionScheme**
The beginning of this section says "Defines 3 types of Q/DQ insertion locations".
How does InsertionScheme and Resolution Process fit into this section? Should a new section be created?
InsertionScheme is a collection of insertion points.
I have added one line to explain why the "Resolution Process" is needed for insertion points.
I think it should be reasonable to keep them in this section.
    6. Configuration (common.py)
    -----------------------------

    Config Class
    Autotuning Workflow
    ===================

    Complete Optimization Process
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
bf5a4c2 to 3778027 (Compare)
Pull request overview
Adds a new reference document describing the architecture and usage of the automatic ONNX Q/DQ placement autotuning tool (modelopt.onnx.quantization.autotune) within the ModelOpt documentation set.
Changes:
- Introduces a comprehensive reference page covering architecture, core components, workflows, CLI usage, and Python API usage for automatic Q/DQ placement.
- Documents state management / crash recovery and pattern-cache warm-start concepts.
- Adds a glossary and external references relevant to ONNX quantization and TensorRT benchmarking.
    Measure of how different two insertion schemes are, typically computed as Hamming
    distance between their insertion point sets.

Glossary entry says scheme diversity is "typically computed as Hamming distance", but the implementation uses an edit distance based on symmetric difference of insertion-point sets (InsertionScheme.distance). Please adjust the glossary to match the actual metric used.

Suggested change:

    Measure of how different two insertion schemes are, computed as an edit distance
    based on the symmetric difference between their insertion point sets.
@@ -0,0 +1,1117 @@

    ====================================================

The document title uses an overline+underline style. Other pages under docs/source/reference/ (e.g., 2_security.rst, 1_modelopt_api.rst) use an underline-only top-level heading; consider matching that convention for consistent rendering/styling.

Suggested change (remove the overline):

    ====================================================
    **Region Discovery Algorithm:**

    1. **Bottom-Up Search**: Start from individual operations
    2. **Local Expansion**: Expand forward/backward from seed nodes within step limits

This algorithm description says regions expand "forward/backward" from seed nodes, but the current implementation in region_search.py is forward-only (it builds forward reachability and traverses from inputs → outputs). This also conflicts with the later "Forward-Only Region Search" section; please update this step to reflect forward-only behavior (or document any backward traversal if it exists).

Suggested change:

    2. **Local Expansion**: Expand forward from seed nodes (inputs → outputs) within step limits
    * Individual operations or small operation sequences
    * Conv, MatMul, Gemm, Add, etc.
    * Forward/backward expansion around seed nodes

This bullet mentions "Forward/backward expansion" for leaf region building, but the current RegionPartitioner logic in region_search.py traverses forward along tensor users (dataflow direction). Consider rewording to "forward expansion" for accuracy and to avoid contradicting the later forward-only section.

Suggested change:

    * Forward expansion around seed nodes
    * Schemes are filtered by minimum Hamming distance
    * Ensures cache contains diverse candidates
    * Prevents redundant similar schemes

The docs describe diversity filtering as using "Hamming distance", but the implementation uses InsertionScheme.distance(), which is an edit distance computed as the size of the symmetric difference of insertion-point sets. To avoid confusion (and match the code terminology), consider calling this "edit distance" / "symmetric-difference distance" consistently.

Suggested change:

    * Schemes are filtered by a minimum edit distance (symmetric difference over insertion-point sets)
    * Ensures cache contains diverse candidates
    * Prevents redundant, highly similar schemes
    Diversity Filtering
    ~~~~~~~~~~~~~~~~~~~

    * Compute Hamming distance between schemes

This section again refers to "Hamming distance" between schemes, but the code computes an edit distance via symmetric difference of insertion-point sets (InsertionScheme.distance). Please align the terminology here with the implementation (or clarify the equivalence if you intend a Hamming-over-bitvector interpretation).

Suggested change:

    * Compute distance between schemes via symmetric difference of insertion-point sets
    Low-Level API
    ~~~~~~~~~~~~~

    Initialize the global benchmark with ``init_benchmark_instance``, then use ``benchmark_onnx_model`` for measurements. The workflow uses this same global; when calling the workflow from Python you do not need to call ``init_benchmark_instance`` yourself (the CLI does it).

This sentence is internally inconsistent: it says you don't need to call init_benchmark_instance when calling the workflow from Python because the CLI does it. In practice, the CLI initializes the global benchmark only for CLI runs; for Python usage you still need to call init_benchmark_instance (or the workflow needs to do it) before benchmark_onnx_model can return real latencies.

Suggested change:

    Initialize the global benchmark with ``init_benchmark_instance``, then use ``benchmark_onnx_model`` for measurements. The workflow and low-level helpers share this same global instance; when calling them from Python you must call ``init_benchmark_instance`` yourself before any benchmarking. The CLI initializes the benchmark automatically only for CLI runs.
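The corrected contract can be illustrated with a minimal stub of the global-instance pattern; the names mirror the document, but the bodies are illustrative, not the real implementation:

```python
_BENCHMARK = None  # module-level global shared by the workflow and helpers


def init_benchmark_instance(timing_cache=None, warmup=10, timing=100):
    """Python callers must invoke this themselves; the CLI does it only
    for CLI runs."""
    global _BENCHMARK
    _BENCHMARK = {"timing_cache": timing_cache, "warmup": warmup, "timing": timing}


def benchmark_onnx_model(model):
    """Fail fast with a clear message if the global was never initialized."""
    if _BENCHMARK is None:
        raise RuntimeError("call init_benchmark_instance() before benchmarking")
    return 0.0  # stand-in for a measured latency in milliseconds
```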
According to the discussion, ModelOpt API documents are generated automatically, so a separate reference doc is not needed. I have moved some sections to the Auto QDQ guide doc. I will close this PR.
What does this PR do?
This PR uploads a reference document for the Automated QDQ placement tool, which automatically searches for QDQ insertion points that yield better performance.
Usage
# Add a code snippet demonstrating how to use this

Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit