GH-48868: [Doc] Document security model for the Arrow formats #48870
@github-actions crossbow submit preview-docs |
Revision: 593babb Submitted crossbow builds: ursacomputing/crossbow @ actions-4f7018459b
raboof
left a comment
Looks reasonable (without any particular Arrow expertise)
(noticed two typos)
| A less obvious pitfall is when some parts of an Arrow array are left uninitialized. | ||
| For example, if a element of a primitive Arrow array is marked null through its |
| For example, if a element of a primitive Arrow array is marked null through its | |
| For example, if an element of a primitive Arrow array is marked null through its |
| purposes. It is therefore tempting, when creating an array with null values, to | ||
| not initialize the corresponding value slots. | ||
| However, this then introduces a serious security if the Arrow data is serialized |
| However, this then introduces a serious security if the Arrow data is serialized | |
| However, this then introduces a serious security risk if the Arrow data is serialized |
| uninitialized in a buffer if the array might be sent to, or read by, a untrusted | ||
| third-party, even when the uninitialized data is logically irrelevant. The | ||
| easiest way to do this, though perhaps not the most efficient, is to zero-initialize | ||
| any buffer that will not be populated in full. |
Worth pointing out something about query engines and dataframe libraries deciding to not do so for internal/intermediate values in computations but applying a canonicalization pass when data leaves the system.
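To make the canonicalization idea concrete, here is a minimal sketch in plain Python (no Arrow library involved). It assumes a fixed-width primitive array described by a validity bitmap and a values buffer, with the bitmap using least-significant-bit numbering where a set bit means "valid". The function name `canonicalize_null_slots` and its parameters are hypothetical, chosen only for illustration; a real engine would do this in bulk rather than slot by slot.

```python
import struct


def canonicalize_null_slots(validity, values, length, byte_width):
    """Zero every value slot whose validity bit is 0, plus any trailing padding.

    Hypothetical helper: `validity` is the validity bitmap (bytes),
    `values` is a mutable fixed-width values buffer (bytearray).
    """
    for i in range(length):
        # Bit i lives in byte i // 8 at position i % 8 (LSB numbering); 1 = valid.
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if not is_valid:
            start = i * byte_width
            values[start:start + byte_width] = bytes(byte_width)
    # Bytes past the last logical element are padding and may also hold stale data.
    tail = length * byte_width
    values[tail:] = bytes(len(values) - tail)


# Four int32 slots; slots 1 and 3 are null but hold leftover (stale) bytes.
values = bytearray(struct.pack("<4i", 1, 0x5EC4E7, 3, 0x5EC4E7))
validity = bytes([0b00000101])  # slots 0 and 2 are valid
canonicalize_null_slots(validity, values, length=4, byte_width=4)
result = struct.unpack("<4i", values)
```

Running such a pass only at the point where data leaves the system keeps the cost out of intermediate computations, which is the trade-off described in the comment above.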
| from an untrusted source (for example because you are writing a proxy to | ||
| an arbitrary third-party service), it is **recommended** that you validate | ||
| the data first, as the consumer may assume that the data is valid already. | ||
| In addition to invalid pointers, some array types have offsets, sizes, and buffer indices that might be out-of-bounds. The library producing arrays through the | |
| C data interface might be performing only very light validation of these values. |
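As an illustration of the kind of bounds check a consumer might run, the sketch below validates the offsets of a variable-length binary or string array before any offset is used to index the data buffer. It is plain Python under stated assumptions: the offsets have already been decoded into a sequence of integers, and the name `validate_offsets` and its parameters are hypothetical, not part of any Arrow API.

```python
def validate_offsets(offsets, length, data_size):
    """Raise ValueError if offsets could cause an out-of-bounds read.

    Hypothetical helper: `offsets` is a sequence of ints, `length` is the
    number of logical elements, `data_size` is the data buffer size in bytes.
    """
    if length < 0:
        raise ValueError("negative array length")
    # An array of `length` elements needs `length + 1` offsets.
    if len(offsets) < length + 1:
        raise ValueError("offsets buffer too small for array length")
    previous = offsets[0]
    if previous < 0:
        raise ValueError("negative first offset")
    for i in range(1, length + 1):
        current = offsets[i]
        # Offsets must never decrease, otherwise a slot has negative size.
        if current < previous:
            raise ValueError(f"offset {i} decreases")
        previous = current
    # The last offset bounds every earlier one, so one check suffices.
    if previous > data_size:
        raise ValueError("last offset exceeds data buffer size")


validate_offsets([0, 3, 3, 7], length=3, data_size=7)  # passes silently

try:
    validate_offsets([0, 3, 99], length=2, data_size=7)
    rejected = False
except ValueError:
    rejected = True
```

Sizes and buffer indices in other layouts would need analogous checks; this sketch covers only the offsets case.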
Rationale for this change
Accessing Arrow data in any of the formats can have non-trivial security implications; this PR is an attempt to document them.
What changes are included in this PR?
Add a Security Considerations page in the Format section.
Doc preview: https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html
Are these changes tested?
N/A
Are there any user-facing changes?
No.