@pitrou
Member

@pitrou pitrou commented Jan 15, 2026

Rationale for this change

Accessing Arrow data or any of the formats can have non-trivial security implications; this PR is an attempt at documenting them.

What changes are included in this PR?

Add a Security Considerations page in the Format section.

Doc preview: https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

Are these changes tested?

N/A

Are there any user-facing changes?

No.

@pitrou
Member Author

pitrou commented Jan 15, 2026

@github-actions crossbow submit preview-docs

@github-actions

Revision: 593babb

Submitted crossbow builds: ursacomputing/crossbow @ actions-4f7018459b

Task | Status
preview-docs | GitHub Actions

Member

@raboof raboof left a comment

Looks reasonable (without any particular Arrow expertise)

(noticed two typos)

------------------

A less obvious pitfall is when some parts of an Arrow array are left uninitialized.
For example, if a element of a primitive Arrow array is marked null through its
Member

Suggested change
- For example, if a element of a primitive Arrow array is marked null through its
+ For example, if an element of a primitive Arrow array is marked null through its

purposes. It is therefore tempting, when creating an array with null values, to
not initialize the corresponding value slots.

However, this then introduces a serious security if the Arrow data is serialized
Member

Suggested change
- However, this then introduces a serious security if the Arrow data is serialized
+ However, this then introduces a serious security risk if the Arrow data is serialized

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 15, 2026
uninitialized in a buffer if the array might be sent to, or read by, a untrusted
third-party, even when the uninitialized data is logically irrelevant. The
easiest way to do this, though perhaps not the most efficient, is to zero-initialize
any buffer that will not be populated in full.
Contributor

Worth pointing out something about query engines and dataframe libraries deciding to not do so for internal/intermediate values in computations but applying a canonicalization pass when data leaves the system.
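
For illustration, such a canonicalization pass could look roughly like the sketch below. It is not Arrow library code: `canonicalize_null_slots` is a hypothetical helper, and it assumes a fixed-width primitive array held as a plain values buffer plus an LSB-first validity bitmap.

```python
import struct


def canonicalize_null_slots(values: bytearray, validity: bytes,
                            length: int, width: int) -> None:
    """Zero every value slot whose validity bit is unset (hypothetical helper).

    Assumes an LSB-first validity bitmap, one bit per element, and
    `width` bytes per value slot.
    """
    for i in range(length):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if not is_valid:
            # Overwrite whatever stale bytes were left in the null slot.
            values[i * width:(i + 1) * width] = bytes(width)


# Hypothetical int32 array [1, null, 3]: the null slot still holds
# leftover bytes, e.g. from a reused scratch buffer.
values = bytearray(struct.pack("<3i", 1, 0x5EC2E7, 3))
validity = bytes([0b00000101])  # elements 0 and 2 valid, element 1 null

canonicalize_null_slots(values, validity, length=3, width=4)
result = struct.unpack("<3i", values)
```

An engine could skip this for intermediate results and run it only at the boundary where buffers are serialized or handed to another party.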

from an untrusted source (for example because you are writing a proxy to
an arbitrary third-party service), it is **recommended** that you validate
the data first, as the consumer may assume that the data is valid already.

Contributor

Suggested change
In addition to invalid pointers, some array types have offsets, sizes, and buffer indices that might be out-of-bounds. The library producing arrays through the
C data interface might be performing only very light validation of these values.
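
To make the offsets case concrete, here is a minimal sketch of the kind of bounds check a consumer might run on a variable-size binary array before touching the data buffer. `validate_offsets` is a hypothetical helper, not an Arrow API, and the offsets are modeled as a plain Python list rather than a real buffer.

```python
from typing import Sequence


def validate_offsets(offsets: Sequence[int], length: int,
                     data_size: int) -> bool:
    """Return True if `offsets` safely describes `length` slices of a
    data buffer of `data_size` bytes (hypothetical helper)."""
    # An array of `length` elements needs `length + 1` offsets.
    if length < 0 or len(offsets) < length + 1:
        return False
    previous = offsets[0]
    if previous < 0:
        return False
    for current in offsets[1:length + 1]:
        # Offsets must never decrease, otherwise a slice has negative size.
        if current < previous:
            return False
        previous = current
    # The last offset must stay within the data buffer.
    return previous <= data_size


ok = validate_offsets([0, 3, 5], length=2, data_size=5)
out_of_bounds = validate_offsets([0, 3, 9], length=2, data_size=5)
decreasing = validate_offsets([0, 4, 2], length=2, data_size=5)
```

Only the first call passes; the other two would read past the data buffer or produce a negative-size slice if used unchecked.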

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jan 15, 2026