
Add Automated QDQ placement reference - part 4.2#842

Closed
willg-nv wants to merge 2 commits into NVIDIA:main from willg-nv:dev-willg-integrate-auto-qdq-placement-part4.2

Conversation

@willg-nv
Contributor

@willg-nv willg-nv commented Feb 3, 2026

What does this PR do?

This PR adds the reference document for the Automated QDQ placement tool. The tool automatically searches for QDQ insertion points that yield better performance.

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • Documentation
    • Added comprehensive documentation for the Automatic Q/DQ Placement Optimizer, covering architecture overview, core components, optimization lifecycle, usage examples for both command-line and Python APIs, design rationale, current limitations, and a glossary of relevant terms.

@copy-pr-bot

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 3, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR adds a comprehensive documentation file describing the Automatic Q/DQ Placement Optimizer architecture for ONNX models. The document covers framework components, optimization workflows, data models, the full lifecycle from initialization through evaluation, usage examples via the CLI and Python API, design rationale, limitations, and related terminology.

Changes

  • Documentation — docs/source/reference/2_qdq_placement.rst: New comprehensive reference documentation (1092 lines) detailing the Q/DQ Placement Optimizer architecture, including component descriptions, lifecycle stages, data models, usage examples, design rationale, and glossary.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks — ✅ 3 passed

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title clearly and specifically identifies the main change: adding documentation for the Automated QDQ placement optimizer, identified as part 4.2 of a series.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@docs/source/reference/2_qdq_placement.rst`:
- Around line 143-144: The summary line claiming "Supports forward-only,
backward-only, and bidirectional search" is inconsistent with later text that
says only forward (downstream) region expansion is implemented and backward
expansion is a potential future addition; update the brief feature list to
accurately reflect the current implementation by changing that line to state
that only forward (downstream) region expansion is currently implemented and
that backward and bidirectional search are planned/experimental, and ensure the
change references the same "region search" wording so readers see the consistent
status across the document.
- Around line 343-347: Add explicit documentation for configuring TensorRT
plugin library paths: state the config key name (tensorrt_plugin_paths) and the
CLI flag (--tensorrt-plugin-paths) and an environment variable alternative
(TENSORRT_PLUGIN_PATH), describe the accepted formats (comma-separated list or
JSON array in the config file, single string for a single path) and where to
place them in the config/CLI, and note that plugins are loaded during TensorRT
engine creation/initialization (before inference/warmup) so they must be
accessible at startup; include a short example for both CLI and config file
usage and clarify failure behavior if a plugin path is invalid.
- Line 299: Update the documentation for init_benchmark_instance(timing_cache,
warmup, timing) to include explicit parameter types and descriptions (e.g.,
whether timing_cache is a file path string or a binary blob, the expected
type/units for warmup and timing), indicate which parameters are optional and
their default values, specify the function's return type and what the return
value represents, and add a short usage example showing a typical call and
expected outcome; reference the exact symbol init_benchmark_instance and its
parameters timing_cache, warmup, timing so readers can locate the function
implementation if they need more detail.
- Around line 1088-1091: Fix the two broken/renamed links in
docs/source/reference/2_qdq_placement.rst by updating the "ONNX Quantization"
link (currently https://onnx.ai/onnx/technical/quantization.html) to the correct
ONNX quantization documentation path or remove that list item if no direct
replacement exists, and update the "NVIDIA ModelOpt" GitHub URL from
https://github.com/NVIDIA/TensorRT-Model-Optimizer to the renamed repository
https://github.com/NVIDIA/Model-Optimizer; target the list entries labeled "ONNX
Quantization" and "NVIDIA ModelOpt" in the diff when making the changes.
🧹 Nitpick comments (3)
docs/source/reference/2_qdq_placement.rst (3)

899-914: Clarify dual behavior of export_onnx and benchmark.run.

The example shows two different usage patterns:

  1. Lines 899-900: export_onnx("baseline.onnx", ...) exports to file, then benchmark.run("baseline.onnx") reads from file
  2. Lines 912-913: export_onnx(None, ...) returns bytes, then benchmark.run(model_bytes) accepts bytes

This dual behavior (file path vs. in-memory bytes) should be explicitly documented in the method descriptions. Consider adding a note explaining when to use each approach (e.g., use bytes for intermediate schemes to avoid disk I/O).
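The dual behavior could be illustrated with a minimal sketch; `export_onnx` and `run_benchmark` below are simplified stand-ins modeled on the documented pattern, not the real ModelOpt implementations:

```python
from pathlib import Path
import tempfile

# Hypothetical stand-ins mimicking the dual file-path / in-memory pattern
# described in the review comment; not the real ModelOpt API.
def export_onnx(path, model_bytes: bytes):
    """Write the model to `path` if given; otherwise return the bytes."""
    if path is None:
        return model_bytes                   # in-memory: avoids disk I/O
    Path(path).write_bytes(model_bytes)      # on-disk: persists the model
    return None

def run_benchmark(model) -> float:
    """Accept either a file path (str) or raw model bytes."""
    data = Path(model).read_bytes() if isinstance(model, str) else model
    # A real benchmark would build a TensorRT engine and time inference;
    # here we return a placeholder value derived from the model size.
    return float(len(data))

with tempfile.TemporaryDirectory() as tmp:
    blob = b"fake-onnx-model"
    # Pattern 1: export to file, then benchmark from the file path
    path = f"{tmp}/baseline.onnx"
    export_onnx(path, blob)
    assert run_benchmark(path) == run_benchmark(blob)
    # Pattern 2: keep bytes in memory for intermediate schemes
    model_bytes = export_onnx(None, blob)
    assert model_bytes == blob
```

Using bytes for intermediate schemes skips a write/read round-trip per candidate, which matters when the autotuner evaluates many schemes.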


390-407: Consider adding parameter tuning guidance.

The configuration parameters are well-documented with clear defaults, but users would benefit from guidance on:

  • When to increase maximum_generation_attempts (e.g., for models with many constraints)
  • When to decrease pattern_cache_minimum_distance (e.g., for fine-grained exploration)
  • Trade-offs between schemes_per_region and optimization time
  • Performance implications of pattern_cache_max_entries_per_pattern

This would help users make informed decisions when tuning the autotuner for their specific use cases.


920-1037: Consider adding troubleshooting and operational guidance sections.

The documentation thoroughly covers architecture and usage but would benefit from additional sections on:

  1. Error Handling: What happens when TensorRT engine build fails? How are failures logged? Can users configure retry behavior?

  2. Troubleshooting: Common issues and solutions:

    • What if no speedup is achieved?
    • What if TensorRT can't parse the model?
    • What if optimization takes too long?
  3. Performance Expectations: How long does typical optimization take? What factors affect optimization time? (This would help users set realistic expectations)

  4. Logging Configuration: How to enable verbose logging? What information is logged at different verbosity levels?

These additions would make the documentation more complete for production use cases.

@gcunhase
Contributor

Suggestion to rename this as *_onnx_autotuner or *_onnx_autoqdq as per #841 (comment).

* Input tensor index to that region
* Enables quantization of data flowing into subregions

**InsertionScheme**
Contributor


The beginning of this section says "Defines 3 types of Q/DQ insertion locations".

How does InsertionScheme and Resolution Process fit into this section? Should a new section be created?

Contributor Author


InsertionScheme is a collection of insertion points.
I have added one line explaining why the "Resolution Process" is needed for insertion points.

I think it is reasonable to keep them in this section.
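As a rough sketch of that relationship (hypothetical names and fields, not the actual ModelOpt data model), an InsertionScheme simply aggregates individual insertion points:

```python
from dataclasses import dataclass, field

# Illustrative data model: an InsertionScheme is a set of insertion points.
# All names and fields here are hypothetical.
@dataclass(frozen=True)
class InsertionPoint:
    node: str      # ONNX node the Q/DQ pair attaches to
    tensor: str    # "input" or "output" side
    index: int     # tensor index on that node

@dataclass
class InsertionScheme:
    points: frozenset = field(default_factory=frozenset)  # set of InsertionPoint

scheme = InsertionScheme(points=frozenset({
    InsertionPoint("conv1", "input", 0),
    InsertionPoint("matmul2", "input", 0),
}))
assert len(scheme.points) == 2
```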

6. Configuration (common.py)
-----------------------------

Config Class
Contributor

@gcunhase gcunhase Feb 26, 2026


Will this be included in the ModelOpt ONNX API doc here? @cjluo-nv

We should find a way to make this doc generation automatic as it will be very difficult to maintain it manually.

Autotuning Workflow
===================

Complete Optimization Process
Contributor


Potentially move this to #843

willg-nv added 2 commits March 2, 2026 09:34
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part4.2 branch from bf5a4c2 to 3778027 on March 2, 2026 10:00
Contributor

Copilot AI left a comment


Pull request overview

Adds a new reference document describing the architecture and usage of the automatic ONNX Q/DQ placement autotuning tool (modelopt.onnx.quantization.autotune) within the ModelOpt documentation set.

Changes:

  • Introduces a comprehensive reference page covering architecture, core components, workflows, CLI usage, and Python API usage for automatic Q/DQ placement.
  • Documents state management / crash recovery and pattern-cache warm-start concepts.
  • Adds a glossary and external references relevant to ONNX quantization and TensorRT benchmarking.


Comment on lines +1107 to +1108
Measure of how different two insertion schemes are, typically computed as Hamming
distance between their insertion point sets.

Copilot AI Mar 2, 2026


Glossary entry says scheme diversity is "typically computed as Hamming distance", but the implementation uses an edit distance based on symmetric difference of insertion-point sets (InsertionScheme.distance). Please adjust the glossary to match the actual metric used.

Suggested change
Measure of how different two insertion schemes are, typically computed as Hamming
distance between their insertion point sets.
Measure of how different two insertion schemes are, computed as an edit distance
based on the symmetric difference between their insertion point sets.

Copilot uses AI. Check for mistakes.
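A minimal sketch of the metric Copilot describes (illustrative names; the actual implementation lives in `InsertionScheme.distance`):

```python
# Edit distance between two schemes as the size of the symmetric difference
# of their insertion-point sets: each point present in exactly one scheme
# costs one edit (one removal or one addition).
def scheme_distance(points_a: frozenset, points_b: frozenset) -> int:
    return len(points_a ^ points_b)

a = frozenset({("conv1", "input", 0), ("matmul2", "input", 0)})
b = frozenset({("conv1", "input", 0), ("gemm3", "output", 0)})
assert scheme_distance(a, b) == 2  # drop one point, add another
assert scheme_distance(a, a) == 0  # identical schemes
```

The two framings coincide only in a special case: over a fixed universe of candidate points encoded as bit vectors, Hamming distance equals the size of the symmetric difference.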
@@ -0,0 +1,1117 @@
====================================================

Copilot AI Mar 2, 2026


The document title uses an overline+underline style. Other pages under docs/source/reference/ (e.g., 2_security.rst, 1_modelopt_api.rst) use an underline-only top-level heading; consider matching that convention for consistent rendering/styling.

Suggested change
====================================================

**Region Discovery Algorithm:**

1. **Bottom-Up Search**: Start from individual operations
2. **Local Expansion**: Expand forward/backward from seed nodes within step limits

Copilot AI Mar 2, 2026


This algorithm description says regions expand "forward/backward" from seed nodes, but the current implementation in region_search.py is forward-only (it builds forward reachability and traverses from inputs → outputs). This also conflicts with the later "Forward-Only Region Search" section; please update this step to reflect forward-only behavior (or document any backward traversal if it exists).

Suggested change
2. **Local Expansion**: Expand forward/backward from seed nodes within step limits
2. **Local Expansion**: Expand forward from seed nodes (inputs → outputs) within step limits

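A forward-only expansion of this kind could be sketched as a bounded BFS over downstream users (illustrative graph representation; the real region_search.py logic is richer):

```python
from collections import deque

# Hypothetical sketch of forward-only region expansion from a seed node,
# traversing inputs -> outputs only, bounded by a step limit.
def expand_forward(graph: dict, seed: str, max_steps: int) -> set:
    """BFS over forward users of `seed`, never visiting producers."""
    region = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_steps:
            continue
        for user in graph.get(node, []):   # downstream users only
            if user not in region:
                region.add(user)
                frontier.append((user, depth + 1))
    return region

# Tiny dataflow graph: Conv -> Relu -> MatMul -> Add
g = {"Conv": ["Relu"], "Relu": ["MatMul"], "MatMul": ["Add"]}
assert expand_forward(g, "Conv", 2) == {"Conv", "Relu", "MatMul"}
```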

* Individual operations or small operation sequences
* Conv, MatMul, Gemm, Add, etc.
* Forward/backward expansion around seed nodes

Copilot AI Mar 2, 2026


This bullet mentions "Forward/backward expansion" for leaf region building, but the current RegionPartitioner logic in region_search.py traverses forward along tensor users (dataflow direction). Consider rewording to "forward expansion" for accuracy and to avoid contradicting the later forward-only section.

Suggested change
* Forward/backward expansion around seed nodes
* Forward expansion around seed nodes

Comment on lines +638 to +640
* Schemes are filtered by minimum Hamming distance
* Ensures cache contains diverse candidates
* Prevents redundant similar schemes

Copilot AI Mar 2, 2026


The docs describe diversity filtering as using "Hamming distance", but the implementation uses InsertionScheme.distance() which is an edit distance computed as the size of the symmetric difference of insertion-point sets. To avoid confusion (and match the code terminology), consider calling this "edit distance" / "symmetric-difference distance" consistently.

Suggested change
* Schemes are filtered by minimum Hamming distance
* Ensures cache contains diverse candidates
* Prevents redundant similar schemes
* Schemes are filtered by a minimum edit distance (symmetric-difference over insertion-point sets)
* Ensures cache contains diverse candidates
* Prevents redundant, highly similar schemes

Diversity Filtering
~~~~~~~~~~~~~~~~~~~

* Compute Hamming distance between schemes

Copilot AI Mar 2, 2026


This section again refers to "Hamming distance" between schemes, but the code computes an edit distance via symmetric difference of insertion-point sets (InsertionScheme.distance). Please align the terminology here with the implementation (or clarify the equivalence if you intend a Hamming-over-bitvector interpretation).

Suggested change
* Compute Hamming distance between schemes
* Compute distance between schemes via symmetric difference of insertion-point sets

Low-Level API
~~~~~~~~~~~~~

Initialize the global benchmark with ``init_benchmark_instance``, then use ``benchmark_onnx_model`` for measurements. The workflow uses this same global; when calling the workflow from Python you do not need to call ``init_benchmark_instance`` yourself (the CLI does it).

Copilot AI Mar 2, 2026


This sentence is internally inconsistent: it says you don't need to call init_benchmark_instance when calling the workflow from Python because the CLI does it. In practice, the CLI initializes the global benchmark only for CLI runs; for Python usage you still need to call init_benchmark_instance (or the workflow needs to do it) before benchmark_onnx_model can return real latencies.

Suggested change
Initialize the global benchmark with ``init_benchmark_instance``, then use ``benchmark_onnx_model`` for measurements. The workflow uses this same global; when calling the workflow from Python you do not need to call ``init_benchmark_instance`` yourself (the CLI does it).
Initialize the global benchmark with ``init_benchmark_instance``, then use ``benchmark_onnx_model`` for measurements. The workflow and low-level helpers share this same global instance; when calling them from Python you must call ``init_benchmark_instance`` yourself before any benchmarking. The CLI initializes the benchmark automatically only for CLI runs.

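The corrected contract could be sketched as follows; `init_benchmark_instance` and `benchmark_onnx_model` here are simplified stand-ins that only model the initialization requirement, not the real TensorRT-backed implementations:

```python
# Hypothetical sketch of the global-benchmark pattern: Python callers must
# initialize the shared instance before measuring; the CLI does this only
# for CLI runs.
_BENCHMARK = None

def init_benchmark_instance(timing_cache=None, warmup=5, timing=20):
    """Create the process-wide benchmark instance (stand-in)."""
    global _BENCHMARK
    _BENCHMARK = {"timing_cache": timing_cache, "warmup": warmup, "timing": timing}

def benchmark_onnx_model(model_bytes: bytes) -> float:
    if _BENCHMARK is None:
        raise RuntimeError("call init_benchmark_instance() before benchmarking")
    # Real code would build a TensorRT engine and time inference here;
    # we return a placeholder value derived from the model size.
    return float(len(model_bytes))

init_benchmark_instance(warmup=3, timing=10)
latency = benchmark_onnx_model(b"fake-onnx-bytes")
assert latency > 0
```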
@willg-nv
Contributor Author

willg-nv commented Mar 3, 2026

According to the discussion, ModelOpt API documents are generated automatically, so a separate reference doc is not needed. I have moved some sections to the Auto QDQ guide doc. I will close this PR.

@willg-nv willg-nv closed this Mar 3, 2026