fix(approx_fns): use exact percentile when no compression by aryan-212 · Pull Request #21388 · apache/datafusion

aryan-212 · 2026-04-05T18:43:34Z

The interpolation step assumes centroids represent clusters of multiple points. But if the number of input rows is small (at most the digest's max_size, i.e. its compression threshold), no compression ever happens: every centroid has weight 1 and corresponds to exactly one input value.

In that regime, interpolation is not just unnecessary — it is actively wrong. The t-digest interpolates between adjacent centroids based on where the rank falls inside the centroid's weight, using half-deltas to neighbors. When every centroid has weight 1, this produces values that drift away from any actual data point.
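
To make the drift concrete, here is a minimal sketch of a midpoint-style interpolation over weight-1 centroids. It is an illustration under stated assumptions, not DataFusion's actual t-digest code: the `Centroid` struct, `is_uncompressed`, and `interpolated_quantile` are hypothetical names, and the scheme (each centroid centered at its cumulative-weight midpoint, linear interpolation between neighboring means) is a simplification of the half-delta logic described above.

```rust
/// Hypothetical centroid type: a mean and the number of points it summarizes.
#[derive(Debug, Clone, Copy)]
pub struct Centroid {
    pub mean: f64,
    pub weight: f64,
}

/// True when no compression has happened, i.e. every centroid is a single input value.
pub fn is_uncompressed(centroids: &[Centroid]) -> bool {
    centroids.iter().all(|c| c.weight == 1.0)
}

/// Simplified midpoint interpolation (illustrative only, not DataFusion's code).
/// Each centroid is treated as centered at its cumulative-weight midpoint, and the
/// estimate is linearly interpolated between the two neighboring means.
/// `centroids` must be non-empty and sorted by mean.
pub fn interpolated_quantile(centroids: &[Centroid], q: f64) -> f64 {
    let total: f64 = centroids.iter().map(|c| c.weight).sum();
    let target = q * total;
    let mut cumulative = 0.0;
    // (midpoint rank, mean) of the previous centroid, if any.
    let mut prev: Option<(f64, f64)> = None;
    for c in centroids {
        let mid = cumulative + c.weight / 2.0;
        if target <= mid {
            return match prev {
                // Target falls before the first midpoint: clamp to the first mean.
                None => c.mean,
                Some((prev_mid, prev_mean)) => {
                    let t = (target - prev_mid) / (mid - prev_mid);
                    prev_mean + t * (c.mean - prev_mean)
                }
            };
        }
        prev = Some((mid, c.mean));
        cumulative += c.weight;
    }
    // Target falls after the last midpoint: clamp to the last mean.
    centroids[centroids.len() - 1].mean
}
```

With 14 weight-1 centroids and q = 0.85, the target rank 11.9 falls between the midpoints 11.5 and 12.5, so this sketch blends the 12th and 13th means instead of returning either one. The exact number DataFusion produces depends on its real interpolation code, which this sketch does not reproduce.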

This is particularly surprising for users running small queries or unit tests — they expect percentile functions on a handful of values to return one of those values.

(used gpt to frame this a bit properly)

Concrete Example

Let's take a small example from the TPC-DS schema:

select cc_sq_ft from call_center;

row	cc_sq_ft
1	6144
2	6144
3	19345
4	21156
5	21156
6	22743
7	34643
8	42935
9	52514
10	65772
11	76815
12	84336
13	105138
14	119886

Now take a small APPROX_PERCENTILE query such as:

select approx_percentile(cc_sq_ft,0.85) from call_center limit 50

Here, 0.85 * 14 = 11.9, which rounds up to rank 12, so the APPROX_PERCENTILE query above should return 84336, the 12th value in sorted order. That is what the same query returns in Databricks.
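
The arithmetic above can be sketched as a nearest-rank lookup. This assumes the rule is "take the value at 1-based rank ceil(q * n) in sorted order", which matches the 11.9 to 12 rounding in the example; `nearest_rank_percentile` is a hypothetical helper for illustration and is not taken from the PR's diff.

```rust
/// Nearest-rank percentile: the value at 1-based rank ceil(q * n) in sorted order.
/// Returns None for empty input or when q is outside [0, 1] (including NaN).
pub fn nearest_rank_percentile(values: &[i64], q: f64) -> Option<i64> {
    if values.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    // ceil(q * n) is the 1-based rank; clamp so q = 0 maps to the first element.
    let rank = ((q * n as f64).ceil() as usize).clamp(1, n);
    Some(sorted[rank - 1])
}
```

Applied to the 14 `cc_sq_ft` values with q = 0.85, the rank is ceil(11.9) = 12 and the result is 84336, an actual input value.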

But in DataFusion this comes up as

This PR aims to fix this.

github-actions bot added the functions Changes to functions implementation label Apr 5, 2026

aryan-212 force-pushed the approx-percentile-fixes branch 2 times, most recently from 22718b8 to e997594 Compare April 6, 2026 06:02

github-actions bot added the core Core DataFusion crate label Apr 6, 2026

fix(approx_fns): use exact percentile when no compression

dad9ce1

aryan-212 force-pushed the approx-percentile-fixes branch 2 times, most recently from 40f862d to 95a4eff Compare April 6, 2026 06:26

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 6, 2026

aryan-212 force-pushed the approx-percentile-fixes branch from 95a4eff to d8339ff Compare April 6, 2026 06:39

fix test

4f86249

aryan-212 force-pushed the approx-percentile-fixes branch from d8339ff to 4f86249 Compare April 6, 2026 07:02

aryan-212 wants to merge 2 commits into apache:main from aryan-212:approx-percentile-fixes
