
Arm backend: Split and decouple model evaluators#18118

Open
martinlsm wants to merge 1 commit into pytorch:main from martinlsm:marlin-evaluators

Conversation

@martinlsm
Collaborator

@martinlsm martinlsm commented Mar 12, 2026

Tidy up the code in arm_model_evaluator.py by:

  • Making the evaluators no longer overlap. For example, ImageNetEvaluator no longer carries out numerical evaluation or checks the file compression ratio of the TOSA file; these evaluations are instead carried out solely by NumericalModelEvaluator and FileCompressionEvaluator, respectively.
  • Renaming GenericModelEvaluator to NumericalModelEvaluator and making it evaluate only via elementwise numerical comparison between the reference and test models.
  • Adding FileCompressionEvaluator, which measures the file compression ratio of a TOSA file.

This change makes it easier for a user to deliberately select exactly which measures they want to evaluate for a model.
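The decoupled layout described above can be sketched roughly as follows. This is an illustration only: the shared Evaluator base is mentioned in the review summary, but the constructor signatures, metric bodies, and the zlib-based compression measure are assumptions, not the actual implementation.

```python
import zlib
from abc import ABC, abstractmethod
from typing import Any


class Evaluator(ABC):
    """Common base so callers can pick exactly the measures they need."""

    @abstractmethod
    def evaluate(self) -> dict[str, Any]: ...


class NumericalModelEvaluator(Evaluator):
    """Elementwise numerical comparison between reference and test model.

    Assumes both models return torch-tensor-like outputs supporting
    subtraction, .abs(), and .max().
    """

    def __init__(self, ref_model, eval_model, example_inputs):
        self._ref_model = ref_model
        self._eval_model = eval_model
        self._example_inputs = example_inputs

    def evaluate(self) -> dict[str, Any]:
        ref = self._ref_model(*self._example_inputs)
        test = self._eval_model(*self._example_inputs)
        diff = ref - test
        return {"max_absolute_error": float(diff.abs().max())}


class FileCompressionEvaluator(Evaluator):
    """Compression ratio of a serialized TOSA file (illustrated with zlib)."""

    def __init__(self, tosa_path: str):
        self._tosa_path = tosa_path

    def evaluate(self) -> dict[str, Any]:
        with open(self._tosa_path, "rb") as f:
            raw = f.read()
        compressed = zlib.compress(raw)
        # Ratio > 1 means the file is compressible (contains redundancy).
        return {"compression_ratio": len(raw) / len(compressed)}
```

With this split, a user runs only the evaluators they care about, e.g. `FileCompressionEvaluator(path).evaluate()`, instead of one class computing everything.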

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell


Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: Ic98a00409f637264359658eaa17219c86f2520f9
@martinlsm martinlsm requested a review from digantdesai as a code owner March 12, 2026 12:07
Copilot AI review requested due to automatic review settings March 12, 2026 12:07
@pytorch-bot

pytorch-bot bot commented Mar 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18118

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Awaiting Approval, 17 New Failures, 1 Cancelled Job

As of commit eb06c08 with merge base 48bd687 (image):

AWAITING APPROVAL - The following workflows need approval before CI can run:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 12, 2026
@martinlsm
Collaborator Author

@pytorchbot label ciflow/trunk

@martinlsm
Collaborator Author

@pytorchbot label "partner: arm"

@pytorch-bot pytorch-bot bot added the partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm label Mar 12, 2026
@martinlsm
Collaborator Author

@pytorchbot label "release notes: none"

@pytorch-bot pytorch-bot bot added the release notes: none Do not include this in the release notes label Mar 12, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors arm_model_evaluator.py by splitting the previously coupled evaluator classes into independent, single-responsibility evaluators. The monolithic GenericModelEvaluator (with its subclass ImageNetEvaluator) and the orchestration functions (evaluate_model, evaluator_calibration_data) are replaced by three focused classes behind a common Evaluator base.

Changes:

  • Renamed GenericModelEvaluator to NumericalModelEvaluator, now exclusively computing elementwise numerical error metrics between a reference and test model.
  • Decoupled ImageNetEvaluator from numerical evaluation; it now only computes top-1/top-5 accuracy and owns its own dataset loading/transforms.
  • Added FileCompressionEvaluator as a standalone evaluator for TOSA flatbuffer compression ratio.
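The top-1/top-5 accuracy that ImageNetEvaluator now focuses on can be illustrated with a small sketch. The real evaluator works on torch tensors and an ImageNet-layout dataset; the plain-Python form and the function names here are assumptions for illustration.

```python
def topk_hits(logits, label, ks=(1, 5)):
    """For one sample's class scores, report whether the true label
    appears among the k highest-scoring classes, for each k."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return {k: label in ranked[:k] for k in ks}


def evaluate_accuracy(samples, ks=(1, 5)):
    """samples: iterable of (logits, true_label) pairs.
    Returns top-k accuracy over all samples."""
    hits = {k: 0 for k in ks}
    total = 0
    for logits, label in samples:
        per_sample = topk_hits(logits, label, ks)
        for k in ks:
            hits[k] += per_sample[k]
        total += 1
    return {f"top{k}_accuracy": hits[k] / total for k in ks}
```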

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File: backends/arm/util/arm_model_evaluator.py
  Splits evaluators into NumericalModelEvaluator, ImageNetEvaluator, and FileCompressionEvaluator with a shared Evaluator base; removes orchestration code.
File: backends/arm/test/misc/test_model_evaluator.py
  Updates tests to use the new evaluator classes and their simplified APIs.


Comment on lines +53 to +92

        Metrics (lists per output tensor):
          * max_error
          * max_absolute_error
          * max_percentage_error (safe-divided; zero ref elements -> 0%)
          * mean_absolute_error
        """
        if self._eval_dtype is not None:
            eval_inputs = tuple(
                inp.to(self._eval_dtype) for inp in self._example_inputs
            )
        else:
            eval_inputs = self._example_inputs

        ref_outputs, _ = tree_flatten(self._ref_model(*self._example_inputs))
        eval_outputs, _ = tree_flatten(self._eval_model(*eval_inputs))

        metrics = self._get_model_error(ref_outputs, eval_outputs)

        return metrics

    @staticmethod
    def _get_model_error(ref_outputs, eval_outputs) -> dict[str, Any]:
        metrics = {}

        for ref_output, eval_output in zip(ref_outputs, eval_outputs):
            difference = ref_output - eval_output
            # Avoid divide by zero: elements where ref_output == 0 produce 0% contribution
            percentage_error = torch.where(
                ref_output != 0,
                difference / ref_output * 100,
                torch.zeros_like(difference),
            )

            metrics["max_error"] = torch.max(difference).item()
            metrics["max_absolute_error"] = torch.max(torch.abs(difference)).item()
            metrics["max_percentage_error"] = torch.max(percentage_error).item()
            metrics["mean_absolute_error"] = torch.mean(
                torch.abs(difference).float()
            ).item()

        seed = default_seed
        rng = random.Random(
            seed
        )  # nosec B311 - deterministic shuffling for evaluation only
        indices = list(range(len(dataset)))
        rng.shuffle(indices)
        selected = sorted(indices[:k])
        return torch.utils.data.DataLoader(
            torch.utils.data.Subset(dataset, selected), batch_size=1, shuffle=False
        )

    def _load_imagenet_folder(directory: str) -> datasets.ImageFolder:
        """Shared helper to load an ImageNet-layout folder.

        Raises FileNotFoundError for a missing directory early to aid debugging.
        """
        directory_path = Path(directory)
        if not directory_path.exists():
            raise FileNotFoundError(f"Directory: {directory} does not exist.")
        transform = _get_imagenet_224_transforms()
        return datasets.ImageFolder(directory_path, transform=transform)

Copilot AI Mar 12, 2026


The docstring says "Metrics (lists per output tensor)" but the implementation now stores plain scalars, silently overwriting metrics from earlier outputs when a model produces multiple output tensors. The old code used defaultdict(list) and .append() to accumulate per-output metrics. Either the docstring should be updated to reflect that only the last output's metrics are kept (if that's intentional for single-output models), or the implementation should accumulate metrics across outputs (e.g. by appending to lists or indexing by output position).
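One way the accumulation suggested in this comment could look, sketched with plain floats instead of torch tensors (illustrative only, not the PR's code): metrics are appended per output tensor, so a multi-output model keeps one entry per output instead of silently overwriting earlier ones.

```python
from collections import defaultdict


def get_model_error(ref_outputs, eval_outputs):
    """Accumulate per-output-tensor metrics in lists, matching the
    'lists per output tensor' docstring. Each output here is a flat
    list of floats for simplicity."""
    metrics = defaultdict(list)
    for ref, test in zip(ref_outputs, eval_outputs):
        diffs = [r - t for r, t in zip(ref, test)]
        abs_diffs = [abs(d) for d in diffs]
        # Safe division: zero reference elements contribute 0%.
        pct = [(d / r * 100) if r != 0 else 0.0 for d, r in zip(diffs, ref)]
        metrics["max_error"].append(max(diffs))
        metrics["max_absolute_error"].append(max(abs_diffs))
        metrics["max_percentage_error"].append(max(pct))
        metrics["mean_absolute_error"].append(sum(abs_diffs) / len(abs_diffs))
    return dict(metrics)
```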


Labels

  • ciflow/trunk
  • CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
  • partner: arm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm)
  • release notes: none (Do not include this in the release notes)


3 participants