
Arm backend: Split and decouple model evaluators#18118

Open
martinlsm wants to merge 1 commit into pytorch:main from martinlsm:marlin-evaluators

Conversation

@martinlsm
Collaborator

@martinlsm martinlsm commented Mar 12, 2026

Tidy up the code in arm_model_evaluator.py by:

  • Making the evaluators no longer overlap. For example, ImageNetEvaluator no longer carries out numerical evaluation or checks the file compression ratio of the TOSA file; these evaluations are instead carried out solely by NumericalModelEvaluator and FileCompressionEvaluator, respectively.
  • Renaming GenericModelEvaluator to NumericalModelEvaluator and making it evaluate only via elementwise numerical comparison between the reference and test models.
  • Adding FileCompressionEvaluator, which measures the file compression ratio of a TOSA file.

This change makes it easier for a user to deliberately select exactly which measures they want to evaluate for a model.
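The decoupled layout described above can be sketched roughly as follows. This is an illustration only: the shared Evaluator base is mentioned in the review summary, but the constructor signatures, metric bodies, and the zlib-based compression measure are assumptions, not the actual implementation.

```python
import zlib
from abc import ABC, abstractmethod
from typing import Any


class Evaluator(ABC):
    """Common base so callers can pick exactly the measures they need."""

    @abstractmethod
    def evaluate(self) -> dict[str, Any]: ...


class NumericalModelEvaluator(Evaluator):
    """Elementwise numerical comparison between reference and test model.

    Assumes both models return torch-tensor-like outputs supporting
    subtraction, .abs(), and .max().
    """

    def __init__(self, ref_model, eval_model, example_inputs):
        self._ref_model = ref_model
        self._eval_model = eval_model
        self._example_inputs = example_inputs

    def evaluate(self) -> dict[str, Any]:
        ref = self._ref_model(*self._example_inputs)
        test = self._eval_model(*self._example_inputs)
        diff = ref - test
        return {"max_absolute_error": float(diff.abs().max())}


class FileCompressionEvaluator(Evaluator):
    """Compression ratio of a serialized TOSA file (illustrated with zlib)."""

    def __init__(self, tosa_path: str):
        self._tosa_path = tosa_path

    def evaluate(self) -> dict[str, Any]:
        with open(self._tosa_path, "rb") as f:
            raw = f.read()
        compressed = zlib.compress(raw)
        # Ratio > 1 means the file is compressible (contains redundancy).
        return {"compression_ratio": len(raw) / len(compressed)}
```

With this split, a user runs only the evaluators they care about, e.g. `FileCompressionEvaluator(path).evaluate()`, instead of one class computing everything.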

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell


Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Change-Id: Ic98a00409f637264359658eaa17219c86f2520f9
@martinlsm martinlsm requested a review from digantdesai as a code owner March 12, 2026 12:07
Copilot AI review requested due to automatic review settings March 12, 2026 12:07
@pytorch-bot

pytorch-bot bot commented Mar 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18118

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Awaiting Approval, 17 New Failures, 1 Cancelled Job

As of commit eb06c08 with merge base 48bd687 (image):

AWAITING APPROVAL - The following workflows need approval before CI can run:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 12, 2026
@martinlsm
Collaborator Author

@pytorchbot label ciflow/trunk

@martinlsm
Collaborator Author

@pytorchbot label "partner: arm"

@pytorch-bot pytorch-bot bot added the partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm label Mar 12, 2026
@martinlsm
Collaborator Author

@pytorchbot label "release notes: none"

@pytorch-bot pytorch-bot bot added the release notes: none Do not include this in the release notes label Mar 12, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors arm_model_evaluator.py by splitting the previously coupled evaluator classes into independent, single-responsibility evaluators. The monolithic GenericModelEvaluator (with its subclass ImageNetEvaluator) and the orchestration functions (evaluate_model, evaluator_calibration_data) are replaced by three focused classes behind a common Evaluator base.

Changes:

  • Renamed GenericModelEvaluator to NumericalModelEvaluator, now exclusively computing elementwise numerical error metrics between a reference and test model.
  • Decoupled ImageNetEvaluator from numerical evaluation; it now only computes top-1/top-5 accuracy and owns its own dataset loading/transforms.
  • Added FileCompressionEvaluator as a standalone evaluator for TOSA flatbuffer compression ratio.
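The top-1/top-5 accuracy that ImageNetEvaluator now focuses on can be illustrated with a small sketch. The real evaluator works on torch tensors and an ImageNet-layout dataset; the plain-Python form and the function names here are assumptions for illustration.

```python
def topk_hits(logits, label, ks=(1, 5)):
    """For one sample's class scores, report whether the true label
    appears among the k highest-scoring classes, for each k."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return {k: label in ranked[:k] for k in ks}


def evaluate_accuracy(samples, ks=(1, 5)):
    """samples: iterable of (logits, true_label) pairs.
    Returns top-k accuracy over all samples."""
    hits = {k: 0 for k in ks}
    total = 0
    for logits, label in samples:
        per_sample = topk_hits(logits, label, ks)
        for k in ks:
            hits[k] += per_sample[k]
        total += 1
    return {f"top{k}_accuracy": hits[k] / total for k in ks}
```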

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File: backends/arm/util/arm_model_evaluator.py
  Splits evaluators into NumericalModelEvaluator, ImageNetEvaluator, and FileCompressionEvaluator with a shared Evaluator base; removes orchestration code.
File: backends/arm/test/misc/test_model_evaluator.py
  Updates tests to use the new evaluator classes and their simplified APIs.


Comment on lines +53 to +92

        Metrics (lists per output tensor):
          * max_error
          * max_absolute_error
          * max_percentage_error (safe-divided; zero ref elements -> 0%)
          * mean_absolute_error
        """
        if self._eval_dtype is not None:
            eval_inputs = tuple(
                inp.to(self._eval_dtype) for inp in self._example_inputs
            )
        else:
            eval_inputs = self._example_inputs

        ref_outputs, _ = tree_flatten(self._ref_model(*self._example_inputs))
        eval_outputs, _ = tree_flatten(self._eval_model(*eval_inputs))

        metrics = self._get_model_error(ref_outputs, eval_outputs)

        return metrics

    @staticmethod
    def _get_model_error(ref_outputs, eval_outputs) -> dict[str, Any]:
        metrics = {}

        for ref_output, eval_output in zip(ref_outputs, eval_outputs):
            difference = ref_output - eval_output
            # Avoid divide by zero: elements where ref_output == 0 produce 0% contribution
            percentage_error = torch.where(
                ref_output != 0,
                difference / ref_output * 100,
                torch.zeros_like(difference),
            )

            metrics["max_error"] = torch.max(difference).item()
            metrics["max_absolute_error"] = torch.max(torch.abs(difference)).item()
            metrics["max_percentage_error"] = torch.max(percentage_error).item()
            metrics["mean_absolute_error"] = torch.mean(
                torch.abs(difference).float()
            ).item()

        seed = default_seed
        rng = random.Random(
            seed
        )  # nosec B311 - deterministic shuffling for evaluation only
        indices = list(range(len(dataset)))
        rng.shuffle(indices)
        selected = sorted(indices[:k])
        return torch.utils.data.DataLoader(
            torch.utils.data.Subset(dataset, selected), batch_size=1, shuffle=False
        )

    def _load_imagenet_folder(directory: str) -> datasets.ImageFolder:
        """Shared helper to load an ImageNet-layout folder.

        Raises FileNotFoundError for a missing directory early to aid debugging.
        """
        directory_path = Path(directory)
        if not directory_path.exists():
            raise FileNotFoundError(f"Directory: {directory} does not exist.")
        transform = _get_imagenet_224_transforms()
        return datasets.ImageFolder(directory_path, transform=transform)

Copilot AI Mar 12, 2026


The docstring says "Metrics (lists per output tensor)" but the implementation now stores plain scalars, silently overwriting metrics from earlier outputs when a model produces multiple output tensors. The old code used defaultdict(list) and .append() to accumulate per-output metrics. Either the docstring should be updated to reflect that only the last output's metrics are kept (if that's intentional for single-output models), or the implementation should accumulate metrics across outputs (e.g. by appending to lists or indexing by output position).
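One way the accumulation suggested in this comment could look, sketched with plain floats instead of torch tensors (illustrative only, not the PR's code): metrics are appended per output tensor, so a multi-output model keeps one entry per output instead of silently overwriting earlier ones.

```python
from collections import defaultdict


def get_model_error(ref_outputs, eval_outputs):
    """Accumulate per-output-tensor metrics in lists, matching the
    'lists per output tensor' docstring. Each output here is a flat
    list of floats for simplicity."""
    metrics = defaultdict(list)
    for ref, test in zip(ref_outputs, eval_outputs):
        diffs = [r - t for r, t in zip(ref, test)]
        abs_diffs = [abs(d) for d in diffs]
        # Safe division: zero reference elements contribute 0%.
        pct = [(d / r * 100) if r != 0 else 0.0 for d, r in zip(diffs, ref)]
        metrics["max_error"].append(max(diffs))
        metrics["max_absolute_error"].append(max(abs_diffs))
        metrics["max_percentage_error"].append(max(pct))
        metrics["mean_absolute_error"].append(sum(abs_diffs) / len(abs_diffs))
    return dict(metrics)
```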


Labels

  • ciflow/trunk
  • CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
  • partner: arm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm)
  • release notes: none (Do not include this in the release notes)


3 participants