
Integrate Automated QDQ benchmark - part 3.1#837

Merged
cjluo-nv merged 10 commits into NVIDIA:main from willg-nv:dev-willg-integrate-auto-qdq-placement-part3.1
Feb 25, 2026

Conversation

@willg-nv
Contributor

@willg-nv willg-nv commented Feb 2, 2026

What does this PR do?

This PR integrates a benchmark module into the QDQ autotuner. The benchmark module is used to evaluate ONNX model performance.
This PR is one of three PRs split out from #703. Once all of them are merged, #703 can be closed.

PR 3.1: #837
PR 3.2: #838
PR 3.3: #839

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No, document will be added in part 4.
  • Did you update Changelog?: No, change log will be updated when all changes are merged.

Additional Information

Summary by CodeRabbit

  • New Features
    • Added ONNX quantization autotuning capabilities with a consolidated module providing streamlined import paths for core components.
    • Introduced unified benchmarking framework supporting TensorRT-based model evaluation with both command-line and Python API implementations.
    • Added support for timing cache persistence, custom plugin libraries, shape validation, and dynamic input shape configuration for flexible model testing and optimization.


@willg-nv willg-nv requested a review from a team as a code owner February 2, 2026 02:51
@willg-nv willg-nv requested a review from gcunhase February 2, 2026 02:51
@copy-pr-bot

copy-pr-bot bot commented Feb 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 2, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This PR introduces a benchmarking framework for ONNX model evaluation within the quantization autotuning subsystem. It provides an abstract Benchmark interface with two concrete implementations: TrtExecBenchmark for CLI-based benchmarking via trtexec, and TensorRTPyBenchmark for Python API-based benchmarking via TensorRT. A package initializer consolidates public exports from the autotune submodules.

Changes

Cohort / File(s) Summary
Public API Exports
modelopt/onnx/quantization/autotune/__init__.py
Centralizes and re-exports 17 public entities from the autotune subsystem, including benchmark classes, region abstractions, configuration, error types, and search orchestration driver.
Benchmarking Framework
modelopt/onnx/quantization/autotune/benchmark.py
Introduces abstract Benchmark base class and two concrete implementations: TrtExecBenchmark (CLI-based timing via trtexec) and TensorRTPyBenchmark (Python API-based timing via TensorRT with CUDA integration, plugin loading, dynamic shape support, and statistics collection).

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant TensorRTPyBenchmark
    participant TensorRT as TensorRT Engine
    participant CUDA as CUDA Runtime
    participant FileSystem

    User->>TensorRTPyBenchmark: run(onnx_model_bytes)
    TensorRTPyBenchmark->>FileSystem: Load/create timing cache
    TensorRTPyBenchmark->>TensorRT: Load plugins if configured
    TensorRTPyBenchmark->>TensorRT: Build engine from ONNX
    TensorRTPyBenchmark->>TensorRT: Create context
    
    TensorRTPyBenchmark->>CUDA: Allocate device memory
    TensorRTPyBenchmark->>TensorRT: Warmup runs (N iterations)
    CUDA-->>TensorRTPyBenchmark: Kernel execution
    
    TensorRTPyBenchmark->>TensorRT: Timing runs (M iterations)
    loop Collect Latencies
        TensorRT->>CUDA: Execute inference
        CUDA-->>TensorRTPyBenchmark: Timing data
    end
    
    TensorRTPyBenchmark->>TensorRTPyBenchmark: Calculate statistics (min/max/mean/median)
    TensorRTPyBenchmark->>FileSystem: Save timing cache & results
    TensorRTPyBenchmark-->>User: Return median latency (float)
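The warmup/timing/statistics flow in the diagram can be sketched in plain Python. This is a hedged illustration: a generic `infer` callable stands in for TensorRT engine execution, and `run_timed` is an illustrative name, not the PR's actual API.

```python
import statistics
import time


def run_timed(infer, warmup_runs: int = 5, timing_runs: int = 20) -> float:
    """Run warmup iterations, then timed iterations; return median latency in ms."""
    for _ in range(warmup_runs):  # warmup: keep cache/startup effects out of timing
        infer()
    latencies = []
    for _ in range(timing_runs):
        start = time.perf_counter()
        infer()
        latencies.append((time.perf_counter() - start) * 1000.0)  # ms
    # The benchmark collects min/max/mean/median; the median is what run() returns.
    return statistics.median(latencies)
```

The real implementation times GPU kernels via CUDA events/streams rather than host-side `perf_counter`, but the control flow is the same.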

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Integrate Automated QDQ benchmark - part 3.1' directly reflects the main change: introducing benchmark module (TrtExecBenchmark, TensorRTPyBenchmark) into the QDQ autotuner package, as confirmed by the added files and PR objectives.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 207-212: The parameter flush_timing_cache on TrtExecBenchmark.run
is declared but unused; either remove it from the signature or implement the
same behavior as TensorRTPyBenchmark.run by flushing the timing cache before
running. To implement, add a small helper (e.g., _flush_timing_cache) or call
the appropriate trtexec/engine cache clear routine at the start of
TrtExecBenchmark.run when flush_timing_cache is True, then proceed with the
existing benchmarking flow; if you prefer removal, delete the flush_timing_cache
parameter from TrtExecBenchmark.run and any callers to keep APIs consistent with
actual behavior.
- Around line 262-266: The code raises NameError when log_file is None because
it references the undefined variable output; in the block that handles a failed
regex match (where you have match = re.search(self.latency_pattern,
result.stdout, ...)), replace the use of output with the already-available
result.stdout (or otherwise initialize output before the conditional) so the
logger.debug call uses a defined variable; update the code around the match
check that references log_file/result.stdout and self.logger to log
result.stdout instead of output when log_file is None.
- Around line 676-691: The _save_timing_cache method has inverted and wrong
checks, a typo, unsafe finally cleanup, and incorrect serialization logic:
ensure a builder config is created before use (assign config =
self.builder.create_builder_config() before the try or inside try but guard
finally), only attempt to combine when self._timing_cache is not None (change
the condition from if self._timing_cache is None to if self._timing_cache is not
None), call output_cache.combine(self._timing_cache, ignore_errors=True) (fix
typo combine), then serialize the resulting cache (or if the intent is to
persist the existing cache directly, serialize self._timing_cache) and write it
to self.timing_cache_file; finally, in the finally block only delete or close
config if it was successfully created (e.g., check if 'config' in locals() or
use a variable initialized to None) to avoid UnboundLocalError.
- Line 145: The function parameter trtexec_args currently uses a mutable default
list (trtexec_args: list = []) which can be shared across calls; change the
signature to use None (e.g., trtexec_args: Optional[list] = None) and inside the
function (in the body of the same function that declares trtexec_args)
initialize it to an empty list if None (trtexec_args = trtexec_args or [])
before use; update the type hint to Optional[list] (or List[str]) and add the
necessary typing import if absent, ensuring all references to trtexec_args in
the function use the newly initialized local list.
🧹 Nitpick comments (4)
modelopt/onnx/quantization/autotune/benchmark.py (4)

97-98: Unnecessary global statement.

The global logger declaration is not needed here since logger is already available at module scope from the import on line 57. The global keyword is only required when you intend to reassign a module-level variable inside a function/method.

Suggested fix
-        global logger
         self.timing_cache_file = timing_cache_file or "/tmp/trtexec_timing.cache"  # nosec B108

543-544: Potential dtype mismatch for integer input tensors.

np.random.randn(size) generates float64 values, which are then cast via .astype(dtype). For integer dtypes, this truncates floats to integers (all becoming 0 for values in (-1, 1)). Consider using np.random.randint for integer types or ensuring the random data is appropriate for the tensor's dtype.

Suggested approach
                 if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
-                    np.copyto(host_mem, np.random.randn(size).astype(dtype))
+                    if np.issubdtype(dtype, np.integer):
+                        np.copyto(host_mem, np.random.randint(0, 100, size=size, dtype=dtype))
+                    else:
+                        np.copyto(host_mem, np.random.randn(size).astype(dtype))
                     inputs.append({"host": host_mem, "device": device_mem, "name": tensor_name})
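The truncation behavior is easy to reproduce with NumPy (illustrative values; `np.int8` is just an example dtype):

```python
import numpy as np

# randn draws float64 ~ N(0, 1); casting to an integer dtype truncates toward zero,
# so most samples (those in (-1, 1)) collapse to 0 — a poor benchmark input.
floats = np.random.randn(1000)
as_int = floats.astype(np.int8)
zero_fraction = np.mean(as_int == 0)  # roughly the N(0,1) mass in (-1, 1), ~0.68

# the suggested fix: draw integers directly when the tensor dtype is integral
dtype = np.dtype(np.int8)
if np.issubdtype(dtype, np.integer):
    data = np.random.randint(0, 100, size=1000, dtype=dtype)
else:
    data = np.random.randn(1000).astype(dtype)
```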

104-115: Base class signature mismatch with subclasses.

The abstract run() method has signature run(path_or_bytes, log_file), but both TrtExecBenchmark.run() and TensorRTPyBenchmark.run() add a flush_timing_cache parameter. This breaks Liskov substitution - code using Benchmark type hints won't know about the extra parameter. Consider adding the parameter to the base class signature.

Suggested fix for base class
     `@abstractmethod`
-    def run(self, path_or_bytes: str | bytes, log_file: str | None = None) -> float:
+    def run(self, path_or_bytes: str | bytes, log_file: str | None = None, flush_timing_cache: bool = False) -> float:
         """Run benchmark on the given ONNX model.

         Args:
             path_or_bytes: Path to the ONNX model (str) or raw model data (bytes)
             log_file: Optional path to save benchmark logs
+            flush_timing_cache: Whether to flush timing cache to disk after build

         Returns:
             Measured latency in milliseconds, or float("inf") on failure
         """

185-191: Inconsistent plugin validation behavior between benchmark implementations.

TrtExecBenchmark logs a warning and continues when a plugin library is not found (line 188-189), while TensorRTPyBenchmark raises FileNotFoundError (line 359). Consider aligning the behavior - failing fast is generally safer for configuration errors.

@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.1 branch 2 times, most recently from 3780d4f to f405dbb Compare February 9, 2026 08:39
@gcunhase
Contributor

gcunhase commented Feb 9, 2026

How will Auto Q/DQ dependencies be ensured in ModelOpt?

See previous discussion on pycuda requirement: https://github.com/NVIDIA/Model-Optimizer/pull/703/changes#r2733859240

How will TensorRT's dependency be ensured?

@gcunhase
Contributor

gcunhase commented Feb 9, 2026

Should we add a test file here? This could potentially be merged with test_workflows.py in #839 (comment)

@willg-nv willg-nv requested a review from a team as a code owner February 10, 2026 09:33
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.1 branch from a548f43 to 6a7f38a Compare February 10, 2026 09:44
@willg-nv
Contributor Author

Should we add a test file here? This could potentially be merged with test_workflows.py in #839 (comment)

I think this will be covered by test_workflow.py. I have a question, does ModelOpt CI support running trtexec and benchmark tests?

@willg-nv
Contributor Author

How will Auto Q/DQ dependencies be ensured in ModelOpt?

See previous discussion on pycuda requirement: https://github.com/NVIDIA/Model-Optimizer/pull/703/changes#r2733859240

How will TensorRT's dependency be ensured?

As discussed, the TensorRT dependency should be installed by users, and the pycuda dependency should be added to the example requirements.

@ajrasane
Contributor

/ok to test b4855ad

@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.1 branch from b4855ad to fdff6c7 Compare February 12, 2026 23:49
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.1 branch from d264be4 to bcbf3d6 Compare February 12, 2026 23:58
@gcunhase
Contributor

Should we add a test file here? This could potentially be merged with test_workflows.py in #839 (comment)

I think this will be covered by test_workflow.py. I have a question, does ModelOpt CI support running trtexec and benchmark tests?

It doesn't seem like unittests run with TRT support, but examples testing do. See

##### ONNX/TensorRT Example Tests #####

@kevalmorabia97 / @cjluo-nv to confirm this. Also, do GPU tests have TRT support? We have some tests that benefit from that (see test_plugin.py, for example).

@kevalmorabia97
Collaborator

Should we add a test file here? This could potentially be merged with test_workflows.py in #839 (comment)

I think this will be covered by test_workflow.py. I have a question, does ModelOpt CI support running trtexec and benchmark tests?

It doesn't seem like unittests run with TRT support, but examples testing do. See

##### ONNX/TensorRT Example Tests #####

@kevalmorabia97 / @cjluo-nv to confirm this. Also, do GPU tests have TRT support? We have some tests that benefit from that (see test_plugin.py, for example).

@gcunhase ONNX/TensorRT Example tests (tensorrt docker) and GPU tests (pytorch docker) both run on GPU runner and both docker images have TRT installed

@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.1 branch 2 times, most recently from fcd81a3 to c941aca Compare February 23, 2026 07:18

@modelopt-bot modelopt-bot left a comment


Review: PR #837 - Integrate Automated QDQ Benchmark

Overall Assessment

This PR adds benchmark infrastructure for QDQ autotuning with two benchmark implementations (TrtExecBenchmark using CLI and TensorRTPyBenchmark using Python API). The code is functional but has significant readability concerns in the benchmark.py file.

Key Issues Identified

1. Massive run() Method in TensorRTPyBenchmark

The run() method spans ~300 lines with deep nesting and multiple concerns:

  • Building TensorRT engine
  • Handling dynamic shapes
  • Memory allocation (host + device)
  • Warmup and timing loops
  • Cleanup

Recommendation: Split into helper methods:

  • _build_engine() - returns (serialized_engine, build_time)
  • _allocate_buffers() - returns (inputs, outputs, stream_handle)
  • _setup_execution_context() - returns context
  • _run_warmup()
  • _run_timing() - returns latencies array

Location: benchmark.py lines ~360-640

2. Hardcoded Regex Pattern

self.latency_pattern = r"\[I\]\s+Latency:.*?median\s*=\s*([\d.]+)\s*ms"

This regex is brittle and may break with trtexec output format changes. Consider making it configurable or using structured output (XML/JSON) if trtexec supports it.

Location: benchmark.py line ~272
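The pattern can be exercised against a captured log line; the sample below mimics trtexec's latency summary format and is illustrative, not output from a real run:

```python
import re

# The regex from the PR, extracting the median latency from trtexec's log output.
LATENCY_PATTERN = r"\[I\]\s+Latency:.*?median\s*=\s*([\d.]+)\s*ms"

sample = "[I] Latency: min = 1.1 ms, max = 2.3 ms, mean = 1.5 ms, median = 1.42 ms"
match = re.search(LATENCY_PATTERN, sample)
# Fall back to float("inf") on a failed match, matching the benchmark's failure value.
median_ms = float(match.group(1)) if match else float("inf")
```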

3. Typo in Docstring

The package docstring has "benchamrk" instead of "benchmark":

This benchamrk module is used to evaluate ONNX model perf.

Location: PR description line 7, also appears in PR title logic

4. Duplicate Docstring Parameters

TrtExecBenchmark.__init__ repeats the same parameter descriptions that are already in the parent class. Consider using :param timing_cache_file: references instead of full descriptions to reduce duplication.

Location: benchmark.py lines ~203-222

5. Complex Conditional in Shape Validation

if not (min_dim <= opt_dim <= max_dim):
    raise ValueError(
        f"Invalid shape range at dimension {i}: "
        f"min={min_dim}, opt={opt_dim}, max={max_dim}. "
        f"Must satisfy min <= opt <= max"
    )

This is readable but could be simplified by extracting into a helper or using assert style validation.

Location: benchmark.py lines ~410-416

6. Inconsistent Error Patterns

Some methods return float("inf") on failure while others raise exceptions. This inconsistency makes error handling confusing for callers. Standardize on one approach (exceptions preferred for unexpected errors).

Location: Various places in benchmark.py

Minor Suggestions

  1. Nit: The _alloc_pinned_host() nested function inside run() could be a static method for better testability.

  2. Nit: The logger.debug(f"Running...") messages could be consolidated or use a progress indicator for long-running benchmarks.

  3. Nit: Consider validating warmup_runs and timing_runs are positive integers in Benchmark.__init__.

Context from Related PRs (#844-846)

Having reviewed PRs 844-846 which implement RegionPattern, RegionSearch, and RegionInspect, this PR fits well as part 3.1 (benchmark infrastructure) in the overall QDQ autotune pipeline. The region-based partitioning from those PRs will feed into these benchmarks.

Summary

The functionality is solid but the run() method needs refactoring before this merges. The 300+ line method with 5+ levels of nesting is difficult to maintain and review. Splitting it into focused helpers would significantly improve readability and testability.

Priority: Medium-High - recommend addressing the run() method split before merge.


@modelopt-bot modelopt-bot left a comment


See inline comments for specific feedback.

Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.1 branch from c941aca to a77a663 Compare February 24, 2026 03:08
@willg-nv
Contributor Author

Review: PR #837 - Integrate Automated QDQ Benchmark (quoted in full above)

resolved 1, 3, 4, 5

@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.1 branch from a77a663 to 93e2c38 Compare February 24, 2026 03:27
Signed-off-by: Will Guo <willg@nvidia.com>
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.1 branch from 93e2c38 to d74bb08 Compare February 24, 2026 06:58
@modelopt-bot

Follow-up Review: PR #837

Thank you for the revisions! The code readability has improved significantly.

✅ Issues Addressed

  1. Massive run() method refactored - The TensorRTPyBenchmark.run() method has been split into focused helpers:

    • _validate_shape_range() (module-level function)
    • _build_engine()
    • _allocate_buffers()
    • _setup_execution_context()
    • _run_warmup()
    • _run_timing()

    This is excellent - the run() method is now readable and maintainable.

  2. Duplicate docstrings fixed - TrtExecBenchmark.__init__ now uses :meth:Benchmark.__init__ references instead of duplicating parameters.

  3. Shape validation extracted - The complex conditional in set_shapes now calls _validate_shape_range().

⚠️ Remaining Concerns (Minor)

  1. Hardcoded regex still present - The self.latency_pattern in TrtExecBenchmark remains:

    self.latency_pattern = r"\[I\]\s+Latency:.*?median\s*=\s*([\d.]+)\s*ms"

    Consider making this configurable or adding fallback patterns for different trtexec versions.

  2. Nested function - _alloc_pinned_host is still nested inside _allocate_buffers. Consider making it a @staticmethod for better testability:

    @staticmethod
    def _alloc_pinned_host(size: int, dtype: np.dtype) -> tuple[Any, np.ndarray]:
        ...
  3. Inconsistent error handling - Some methods return float("inf") on failure (e.g., _setup_execution_context, run), while others raise exceptions (e.g., _allocate_buffers raises RuntimeError). Consider standardizing - either always raise for unexpected errors, or document when float("inf") is expected.

Overall

The major readability issues have been resolved. The code is now much cleaner and easier to follow. LGTM pending the minor items above (which could be addressed in a follow-up if preferred).

@codecov

codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 14.64789% with 303 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.17%. Comparing base (52e662d) to head (57e03ab).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/onnx/quantization/autotune/benchmark.py 13.18% 303 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #837      +/-   ##
==========================================
- Coverage   73.10%   72.17%   -0.93%     
==========================================
  Files         205      207       +2     
  Lines       22294    22656     +362     
==========================================
+ Hits        16297    16352      +55     
- Misses       5997     6304     +307     

☔ View full report in Codecov by Sentry.


@cjluo-nv
Collaborator

@willg-nv could you fix the 2 and 3 above?

Follow-up Review: PR #837 (quoted in full above)

Signed-off-by: Will Guo <willg@nvidia.com>
@willg-nv
Contributor Author

@willg-nv could you fix the 2 and 3 above?

fixed

@cjluo-nv
Collaborator

/ok to test 57e03ab

@cjluo-nv cjluo-nv enabled auto-merge (squash) February 25, 2026 06:53
@cjluo-nv cjluo-nv merged commit e589ac8 into NVIDIA:main Feb 25, 2026
38 checks passed
