
[SPARK-56346][SQL] Use PartitionPredicate in DSV2 Metadata Only Delete#55179

Open
szehon-ho wants to merge 1 commit into apache:master from szehon-ho:delete_partition_filter

Conversation

@szehon-ho
Member

Summary

When OptimizeMetadataOnlyDeleteFromTable fails to push standard V2 predicates (e.g. IN, STARTS_WITH) for a metadata-only delete, it now falls back to a second pass that:

  1. Converts partition-column filters to PartitionPredicates (reusing SPARK-55596 infrastructure)
  2. Translates remaining data-column filters to standard V2 predicates
  3. Combines them (partition predicates first) and calls table.canDeleteWhere

This mirrors the two-pass approach already used for scan filter pushdown in PushDownUtils.pushPartitionPredicates.
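The three steps above can be sketched in plain Scala. This is an illustrative model with stand-in types (`Filter`, `PartitionPredicate`, `V2Predicate` here are not the actual Spark classes), showing the partition/data split, the all-or-nothing data translation, and the partition-predicates-first ordering:

```scala
// Stand-in types for Spark's V2 Predicate and PartitionPredicate (illustrative only).
sealed trait Pred
case class PartitionPredicate(column: String, op: String) extends Pred
case class V2Predicate(op: String) extends Pred

case class Filter(column: String, op: String, isPartitionColumn: Boolean)

// Step 1: split filters into partition-column and data-column groups.
// Step 2: convert partition filters to PartitionPredicates.
// Step 3: translate remaining data filters; fail the whole pass if any
//         data filter cannot be translated (all-or-nothing semantics).
def secondPass(
    filters: Seq[Filter],
    translatable: String => Boolean): Option[Seq[Pred]] = {
  val (partFilters, dataFilters) = filters.partition(_.isPartitionColumn)
  val partPreds = partFilters.map(f => PartitionPredicate(f.column, f.op))
  if (dataFilters.forall(f => translatable(f.op))) {
    val dataPreds = dataFilters.map(f => V2Predicate(f.op))
    Some(partPreds ++ dataPreds) // partition predicates come first
  } else {
    None // fall back to a row-level delete plan
  }
}
```

If `secondPass` returns `Some`, the combined array would then be offered to `table.canDeleteWhere`.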

Changes

  • OptimizeMetadataOnlyDeleteFromTable: Added tryDeleteWithPartitionPredicates fallback method and tryTranslateToV2 helper
  • PushDownUtils: Extracted createPartitionPredicates and made flattenNestedPartitionFilters package-private for reuse; getPartitionPredicateSchema now returns None for empty partition fields
  • InMemoryTableWithV2Filter: Extracted evalPredicate to companion object for reuse by test tables
  • InMemoryPartitionPredicateDeleteTable (new): Test table supporting PartitionPredicates and configurable data predicate acceptance
  • DataSourceV2EnhancedDeleteFilterSuite (new): 9 test cases covering first-pass accept, second-pass accept/reject, mixed partition+data filters, UDF on non-contiguous partition columns, multiple PartitionPredicates, and row-level fallback
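A hedged sketch of the acceptance logic the new test table presumably implements: accept every partition predicate unconditionally, and accept data predicates only when a configurable flag allows it. The names and shapes below are illustrative, not the actual `InMemoryPartitionPredicateDeleteTable` code:

```scala
// Stand-in predicate types (not Spark's actual classes).
sealed trait TablePred
case class PartitionPred(column: String) extends TablePred
case class DataPred(column: String) extends TablePred

class StubPartitionPredicateDeleteTable(acceptDataPredicates: Boolean) {
  // Plays the role of SupportsDeleteV2.canDeleteWhere: a metadata-only
  // delete proceeds only if every pushed predicate is acceptable.
  def canDeleteWhere(predicates: Seq[TablePred]): Boolean =
    predicates.forall {
      case _: PartitionPred => true
      case _: DataPred      => acceptDataPredicates
    }
}
```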

Test plan

  • DataSourceV2EnhancedDeleteFilterSuite — 9/9 pass
  • DataSourceV2EnhancedPartitionFilterSuite — 19/19 pass (no regressions)
  • GroupBasedDeleteFromTableSuite — 32/32 pass (no regressions)
  • Scalastyle — 0 errors

When `OptimizeMetadataOnlyDeleteFromTable` fails to push standard V2 predicates for a metadata-only delete, it now falls back to a second pass that converts partition-column filters to `PartitionPredicate`s (SPARK-55596) and combines them with translated V2 data filters.
@szehon-ho force-pushed the delete_partition_filter branch from cb0ff92 to 33d100e (April 3, 2026 00:05)
```scala
/**
 * Evaluates a single V2 predicate by resolving column values through the
 * given function. Supports =, <=>, IS_NULL, IS_NOT_NULL, and ALWAYS_TRUE.
 */
def evalPredicate(
```
Member Author


just refactor for re-use in new test InMemoryTable

}
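The documented semantics of `evalPredicate` can be modeled in plain Scala. This is an illustrative re-implementation of the five supported operators with stub predicate types, not the `InMemoryTableWithV2Filter` code itself; `resolve` stands in for the column-resolution function:

```scala
// Stub predicate ADT covering the documented operators (illustrative names).
sealed trait Pred
case class Eq(col: String, value: Any) extends Pred         // =   (null-unsafe)
case class NullSafeEq(col: String, value: Any) extends Pred // <=> (null-safe)
case class IsNull(col: String) extends Pred
case class IsNotNull(col: String) extends Pred
case object AlwaysTrue extends Pred

def evalPredicate(pred: Pred, resolve: String => Any): Boolean = pred match {
  case Eq(col, v) =>
    val actual = resolve(col)
    actual != null && actual == v   // `null = x` does not match
  case NullSafeEq(col, v) =>
    resolve(col) == v               // `null <=> null` matches
  case IsNull(col)    => resolve(col) == null
  case IsNotNull(col) => resolve(col) != null
  case AlwaysTrue     => true
}
```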

/**
* Separates partition filters from data filters and converts pushable partition
Member Author


again, refactor for re-use in OptimizeMetadataOnlyDeleteQuery

```diff
  * Returns a map from flattened expression to original.
  */
-private def normalizeNestedPartitionFilters(
+private[v2] def flattenNestedPartitionFilters(
```
Member Author


rename, because 'normalize' is already used in OptimizeMetadataOnlyDelete
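The doc comment above says the method "returns a map from flattened expression to original". One plausible reading (an assumption, not verified against the Spark source) is that filters over nested struct fields are rewritten so each nested reference looks like a flat top-level column, with the map letting callers translate back. A minimal sketch under that assumption, with hypothetical `FieldRef`/`EqFilter` types:

```scala
// Hypothetical stand-ins: a (possibly nested) field reference and an equality filter on it.
case class FieldRef(path: Seq[String])        // e.g. FieldRef(Seq("s", "p1")) for s.p1
case class EqFilter(ref: FieldRef, value: Any)

// Flatten each nested reference to a single dotted name, keyed back to the original.
def flattenNestedPartitionFilters(filters: Seq[EqFilter]): Map[EqFilter, EqFilter] =
  filters.map { f =>
    val flat = f.copy(ref = FieldRef(Seq(f.ref.path.mkString("."))))
    flat -> f
  }.toMap
```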

Contributor

@cloud-fan left a comment


Summary

Prior state and problem: OptimizeMetadataOnlyDeleteFromTable could only perform metadata-only deletes when all filter expressions translated to standard V2 predicates (e.g., =, <=>, IS_NULL). Filters like IN, STARTS_WITH, or UDFs on partition columns caused a fallback to expensive row-level operations even though the table might accept PartitionPredicates.

Design approach: Add a second-pass fallback in the delete optimization rule that mirrors the existing two-pass approach in PushDownUtils.pushPartitionPredicates for scan filter pushdown. When the first pass (V2 translation) fails or is rejected, the second pass:

  1. Separates filters into partition-column and data-column categories
  2. Converts partition filters to PartitionPredicates via PartitionPredicateImpl
  3. Translates remaining data filters to standard V2 predicates
  4. Combines both and calls table.canDeleteWhere

Key design decisions:

  • No supportsIterativePushdown gate for the delete path (the scan path has one). This is intentional — canDeleteWhere already serves as the acceptance gate, and the supportsIterativePushdown opt-in is specific to ScanBuilder.
  • All-or-nothing semantics: if any remaining data filter can't translate to V2, the entire second pass fails and falls back to row-level. This differs from the scan path (which returns remaining filters for post-scan evaluation) because metadata-only deletes require complete filter acceptance.

Implementation sketch: OptimizeMetadataOnlyDeleteFromTable.apply → first tries tryTranslateToV2 (standard V2 path), on failure → tryDeleteWithPartitionPredicates (second pass via shared PushDownUtils.createPartitionPredicates and flattenNestedPartitionFilters), on failure → row-level plan. The shared methods were extracted from the existing pushPartitionPredicates and made package-private for reuse.
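The three-outcome chain in the implementation sketch above can be modeled as an `Option` fallback (an illustrative control-flow sketch; the names and `String` results are stand-ins, not Spark's types):

```scala
// Each pass either produces a pushed-down plan (Some) or declines (None).
def optimizeDelete(
    tryV2: () => Option[String],                  // first pass: standard V2 translation
    tryPartitionPredicates: () => Option[String]  // second pass: PartitionPredicates
): String =
  tryV2()
    .orElse(tryPartitionPredicates())
    .getOrElse("row-level delete plan")           // final fallback
```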

General comments

  • The logDebug message on the original first-pass success path was removed. With three possible outcomes now (first-pass V2, second-pass partition predicates, row-level fallback), adding logDebug for each path would help with debugging filter pushdown behavior.

```diff
 }
 if (fields.length == transforms.length) {
-  Some(fields.toSeq)
+  Some(fields.toSeq).filter(_.nonEmpty)
```
Contributor


This .filter(_.nonEmpty) guard is redundant: the outer check at line 139 guarantees transforms.nonEmpty, and fields.length == transforms.length at line 151 ensures fields is non-empty.

Suggested change

```diff
-Some(fields.toSeq).filter(_.nonEmpty)
+Some(fields.toSeq)
```

candidateKeys
}

// Handle data predicates (simulate data source with data column statistics)
Contributor


The comment says "data column statistics" but the code evaluates predicates row-by-row, not via statistics.

Suggested change

```diff
-// Handle data predicates (simulate data source with data column statistics)
+// Handle data predicates (simulate a data source applying row-level data filters)
```
