Arrow IPC binary fetch path for DataFrame execution #1489

Merged
Martozar merged 7 commits into gooddata:master from Martozar:c.mze-cq-105
Apr 13, 2026

@Martozar Martozar commented Mar 30, 2026

Summary

Adds a native Arrow IPC binary fetch path to gooddata-pandas, providing a faster alternative to the existing JSON-paged AFM path for large result sets.

What changed

gooddata-sdk — binary fetch

  • BareExecutionResponse.read_result_arrow() fetches execution results from the server's binary IPC endpoint and returns a pyarrow.Table.

gooddata-pandas — Arrow→DataFrame conversion

  • DataFrameFactory.for_exec_def_arrow() — new public method that mirrors for_exec_def() but uses the binary path.
  • for_arrow_table() — pure conversion from pa.Table to (pd.DataFrame, DataFrameMetadata), enabling callers to bring their own Arrow data.
  • convert_arrow_table_to_dataframe() — low-level converter that reconstructs row/column MultiIndex, subtotals, primary labels, and types from Arrow field metadata.

Why

The JSON paging path serialises every result to JSON and fetches it in pages, which is CPU-heavy and slow for wide or deep result sets. Arrow IPC transfers binary columnar data in a single round-trip. End-to-end benchmarks against the GoodData demo workspace show a 1.3×–33× speedup depending on table shape, with larger tables benefiting most.

Test coverage

  • 140 unit tests covering: missing metadata keys (all three required keys), self_destruct mode, _build_field_index edge cases (subtotal padding, asymmetric depth), compute_row_totals_indexes with empty dimensions, for_arrow_table correctness across flat/transposed/subtotals/both-dim-totals cases.
  • 47 ground-truth fixture cases generated against the live API and committed to tests/dataframe/fixtures/arrow/, including 3-metric tables, 3-level nested subtotals, multi-aggregation multi-metric tables, and asymmetric totals (different levels/aggregations per metric).
  • IPC test fixture updated to use ipc.new_file to match the server format.

@Martozar Martozar force-pushed the c.mze-cq-105 branch 3 times, most recently from 7453528 to 0380d40 Compare March 30, 2026 09:22

codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 91.99134% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.66%. Comparing base (49ea0d5) to head (6828bad).
⚠️ Report is 27 commits behind head on master.

Files with missing lines                                 Patch %   Lines
...data-pandas/src/gooddata_pandas/arrow_convertor.py    96.85%    10 Missing ⚠️
...s/gooddata-pandas/src/gooddata_pandas/dataframe.py    83.33%    10 Missing ⚠️
...ages/gooddata-pandas/src/gooddata_pandas/series.py    57.14%    6 Missing ⚠️
...ta-sdk/src/gooddata_sdk/compute/model/execution.py    76.47%    4 Missing ⚠️
...es/gooddata-pandas/src/gooddata_pandas/__init__.py    50.00%    3 Missing ⚠️
...gooddata-pandas/src/gooddata_pandas/data_access.py    88.00%    3 Missing ⚠️
...ddata-sdk/src/gooddata_sdk/compute/model/filter.py    0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1489      +/-   ##
==========================================
+ Coverage   78.13%   78.66%   +0.53%     
==========================================
  Files         228      230       +2     
  Lines       14926    15400     +474     
==========================================
+ Hits        11662    12114     +452     
- Misses       3264     3286      +22     
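
The percentages in this report can be re-derived from the hit and line counts in the diff. The short check below does that arithmetic; the patch size of 462 lines is not stated in the report and is inferred here from the 37 missing lines and the 91.99134% figure, so treat it as an assumption.

```python
def pct(part, whole, digits=2):
    """Return part/whole as a percentage rounded to the given number of digits."""
    return round(100 * part / whole, digits)


# Project coverage before and after, from the Hits and Lines rows of the diff.
base_cov = pct(11662, 14926)
head_cov = pct(12114, 15400)
delta = round(head_cov - base_cov, 2)

# Per-file missing lines from the table; their sum is the "37 lines" in the summary.
missing_by_file = [10, 10, 6, 4, 3, 3, 1]
patch_missing = sum(missing_by_file)

# Assumption: 462 changed lines, inferred rather than reported.
patch_lines = 462
patch_cov = pct(patch_lines - patch_missing, patch_lines, digits=5)
```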

@Martozar Martozar marked this pull request as draft March 30, 2026 10:47
Comment thread packages/gooddata-sdk/pyproject.toml Outdated
Comment thread packages/gooddata-pandas/src/gooddata_pandas/dataframe.py
@Martozar Martozar changed the title from "C.mze cq 105" to "Arrow IPC binary fetch path for DataFrame execution" Apr 1, 2026
@Martozar Martozar marked this pull request as ready for review April 1, 2026 11:01
Comment thread packages/gooddata-sdk/src/gooddata_sdk/catalog/export/service.py
Comment thread packages/gooddata-sdk/src/gooddata_sdk/compute/model/execution.py Outdated
@Martozar Martozar force-pushed the c.mze-cq-105 branch 2 times, most recently from d7fbc76 to 4e99271 Compare April 1, 2026 13:54
Martozar added 4 commits April 1, 2026 15:57
Switch read_result_arrow to explicitly request application/vnd.apache.arrow.stream
via Accept header and pipe the HTTP response directly into ipc.open_stream(),
eliminating the intermediate BytesIO buffer. Update tests accordingly.
no23reason previously approved these changes Apr 1, 2026
@Martozar Martozar force-pushed the c.mze-cq-105 branch 3 times, most recently from 2468c32 to 8dc2511 Compare April 13, 2026 08:34
Add a parallel Arrow IPC execution path to DataFrameFactory and SeriesFactory
that fetches results via the binary endpoint instead of JSON pagination:

- arrow_convertor: pa.Table -> DataFrame conversion with label_overrides,
  grand_totals reordering, column_totals_indexes, primary_labels resolution,
  and metric field index helper
- dataframe: for_exec_def_arrow(), for_arrow_table(), for_exec_result_id Arrow
  branch; Arrow path wired through for_visualization(), for_created_visualization()
- series: use_arrow=True on indexed() / not_indexed()
- ArrowConfig holds conversion params (self_destruct, types_mapper, custom_mapping);
  use_arrow is a dedicated DataFrameFactory.__init__ parameter

risk: nonprod
Backfill column_totals_indexes into all 36 fixture meta.json files; extend
parity tests to cover all four DataFrameMetadata fields
(row_totals_indexes, column_totals_indexes, primary_labels_from_index,
primary_labels_from_columns) and expand for_arrow_table tests from
4 hand-picked cases to the full fixture set.

risk: nonprod
# (C) 2026 GoodData Corporation
from __future__ import annotations

import json

Consider using orjson – can be done as a follow-up.

Comment thread packages/gooddata-sdk/pyproject.toml Outdated
"python-dotenv~=1.0.0",
"deepdiff~=8.5.0",
"tests_support",
"pyarrow>=16.1.0",

There is pyarrow>=23.0.1 in project.optional-dependencies. Consider unifying it.

@Martozar Martozar force-pushed the c.mze-cq-105 branch 2 times, most recently from 40b16be to 3dc178e Compare April 13, 2026 09:11
Replace stdlib json with orjson in arrow_convertor.py for faster metadata parsing. Add orjson>=3.11.0 to the arrow optional dependency group and align the test group's pyarrow floor to match the arrow extra (>=23.0.1).

risk: nonprod
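
A minimal sketch of parsing Arrow schema metadata with orjson, assuming the metadata values are JSON-encoded bytes. The three key names and the `parse_schema_metadata` helper are hypothetical (the PR mentions three required keys but does not name them here), and the fallback to stdlib `json` is added only so the example runs without the optional dependency.

```python
import json

try:
    import orjson

    _loads = orjson.loads  # accepts bytes directly
except ImportError:  # fallback for illustration; not necessarily what the PR does
    _loads = json.loads

# Hypothetical key names standing in for the three required metadata keys.
REQUIRED_KEYS = (b"dimensions", b"headers", b"totals")


def parse_schema_metadata(metadata):
    """Decode required JSON values from Arrow-style metadata (bytes keys and values)."""
    metadata = metadata or {}
    missing = [key.decode() for key in REQUIRED_KEYS if key not in metadata]
    if missing:
        raise ValueError(f"Arrow schema metadata is missing required keys: {missing}")
    return {key.decode(): _loads(metadata[key]) for key in REQUIRED_KEYS}


sample = {
    b"dimensions": b'[{"id": "dim_0"}]',
    b"headers": b'{"rows": ["region"]}',
    b"totals": b"[]",
}
parsed = parse_schema_metadata(sample)
```

Arrow stores schema metadata as bytes, so a parser that takes bytes without an intermediate decode step fits this use naturally.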
@Martozar Martozar merged commit d7f50b7 into gooddata:master Apr 13, 2026
13 checks passed

3 participants