fix(approx_fns): use exact percentile when no compression#21388

Open
aryan-212 wants to merge 2 commits into apache:main from aryan-212:approx-percentile-fixes


aryan-212 (Contributor) commented Apr 5, 2026

The interpolation step assumes centroids represent clusters of multiple points. But if the number of input rows is small (≤ the digest's max_size / compression threshold), no compression ever happens: every centroid has weight 1 and corresponds to exactly one input value.

In that regime, interpolation is not just unnecessary — it is actively wrong. The t-digest interpolates between adjacent centroids based on where the rank falls inside the centroid's weight, using half-deltas to neighbors. When every centroid has weight 1, this produces values that drift away from any actual data point.
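To make the drift concrete, here is a minimal sketch of a simplified midpoint-style interpolation over centroids. It is not DataFusion's `TDigest` code, whose exact interpolation may differ; the `Centroid` struct and the helpers `singletons`, `is_uncompressed`, and `interpolated_quantile` are hypothetical names introduced only for illustration. The sketch assumes the centroids are already sorted by mean.

```rust
/// A t-digest-style centroid (hypothetical type for this sketch):
/// the cluster mean and the number of points merged into it.
#[derive(Debug, Clone, Copy)]
struct Centroid {
    mean: f64,
    weight: f64,
}

/// Build one weight-1 centroid per input value (the "no compression" regime).
fn singletons(sorted_values: &[f64]) -> Vec<Centroid> {
    sorted_values
        .iter()
        .map(|&v| Centroid { mean: v, weight: 1.0 })
        .collect()
}

/// True when no centroid has absorbed more than one point.
fn is_uncompressed(centroids: &[Centroid]) -> bool {
    centroids.iter().all(|c| c.weight == 1.0)
}

/// Simplified interpolation (an assumption, not the library's algorithm):
/// each centroid is placed at the cumulative weight before it plus half its
/// own weight, and the result is interpolated linearly between the two
/// centroids whose positions bracket the target rank `q * total_weight`.
fn interpolated_quantile(centroids: &[Centroid], q: f64) -> Option<f64> {
    if centroids.is_empty() {
        return None;
    }
    let total: f64 = centroids.iter().map(|c| c.weight).sum();
    let target = q * total;
    let mut cumulative = 0.0;
    let mut prev: Option<(f64, f64)> = None; // (position, mean) of previous centroid
    for c in centroids {
        let position = cumulative + c.weight / 2.0;
        if target <= position {
            return Some(match prev {
                // Target lies before the first centroid's position: clamp.
                None => c.mean,
                Some((prev_pos, prev_mean)) => {
                    let t = (target - prev_pos) / (position - prev_pos);
                    prev_mean + t * (c.mean - prev_mean)
                }
            });
        }
        prev = Some((position, c.mean));
        cumulative += c.weight;
    }
    // Target lies past the last centroid's position: clamp to the last mean.
    prev.map(|(_, mean)| mean)
}
```

With the four singleton centroids 10, 20, 30, 40 and `q = 0.5`, this sketch returns 25.0, a value that does not appear in the input, which is the kind of drift described above.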

This is particularly surprising for users running small queries or unit tests — they expect percentile functions on a handful of values to return one of those values.

(used gpt to frame this a bit properly)

Concrete Example

Let's take a small example from the TPC-DS schema:

select cc_sq_ft from call_center;
row cc_sq_ft
1 6144
2 6144
3 19345
4 21156
5 21156
6 22743
7 34643
8 42935
9 52514
10 65772
11 76815
12 84336
13 105138
14 119886

Now take a small APPROX_PERCENTILE query like:

select approx_percentile(cc_sq_ft,0.85) from call_center limit 50

Here, 0.85 * 14 = 11.9, which rounds up to rank 12, so the output of the above APPROX_PERCENTILE query should be 84336 (the 12th value in the list). That is what we get when we run the same query in Databricks:

[Screenshot 2026-04-06 at 12:11:21 AM]

But in DataFusion this comes up as:

[Screenshot 2026-04-06 at 12:12:21 AM]

This PR aims to fix this.
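The rank arithmetic in the example above can be sketched as a nearest-rank percentile over the raw values. This is only an illustration under the assumption that "exact" means rank = ceil(q * n), which matches the 0.85 * 14 reasoning in the description; the helper `nearest_rank_percentile` and the constant `CC_SQ_FT` are hypothetical and are not part of DataFusion's API or of this PR's diff.

```rust
/// The 14 `cc_sq_ft` values listed in the PR description.
const CC_SQ_FT: [f64; 14] = [
    6144.0, 6144.0, 19345.0, 21156.0, 21156.0, 22743.0, 34643.0,
    42935.0, 52514.0, 65772.0, 76815.0, 84336.0, 105138.0, 119886.0,
];

/// Nearest-rank percentile (hypothetical helper): sort the values, compute
/// rank = ceil(q * n) clamped to 1..=n, and return the value at that rank.
/// The result is always one of the input values.
fn nearest_rank_percentile(values: &[f64], q: f64) -> Option<f64> {
    // Reject empty input and quantiles outside [0, 1] (this also rejects NaN).
    if values.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    // 1-based rank; q = 0 would give rank 0, so clamp it up to 1.
    let rank = ((q * n as f64).ceil() as usize).clamp(1, n);
    Some(sorted[rank - 1])
}

// Usage: nearest_rank_percentile(&CC_SQ_FT, 0.85) yields Some(84336.0),
// because ceil(0.85 * 14) = ceil(11.9) = 12 and the 12th value is 84336.
```

One caveat of this design: `q * n` is computed in floating point, so a product that should be an exact integer can land slightly above it and round up one rank further; a production version would need to decide how to handle that.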

@github-actions github-actions bot added the functions Changes to functions implementation label Apr 5, 2026
@aryan-212 aryan-212 force-pushed the approx-percentile-fixes branch 2 times, most recently from 22718b8 to e997594 Compare April 6, 2026 06:02
@github-actions github-actions bot added the core Core DataFusion crate label Apr 6, 2026
@aryan-212 aryan-212 force-pushed the approx-percentile-fixes branch 2 times, most recently from 40f862d to 95a4eff Compare April 6, 2026 06:26
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 6, 2026
@aryan-212 aryan-212 force-pushed the approx-percentile-fixes branch from 95a4eff to d8339ff Compare April 6, 2026 06:39
@aryan-212 aryan-212 force-pushed the approx-percentile-fixes branch from d8339ff to 4f86249 Compare April 6, 2026 07:02

Labels

core: Core DataFusion crate
functions: Changes to functions implementation
sqllogictest: SQL Logic Tests (.slt)
