cat: avoid unnecessary allocation #11675

Merged
sylvestre merged 1 commit into uutils:main from oech3:cat-alloc
Apr 12, 2026
Conversation

@oech3
Contributor

@oech3 oech3 commented Apr 6, 2026

Allocate the buffer on the heap instead of the stack for the read()/write() slow path, which is unnecessary if the splice() fast path succeeds.

$ echo 1 > /tmp/1
> taskset -c 0 hyperfine -N --runs 10000 "/tmp/coreutils/target/release/cat-stack /tmp/1" "target/release/cat-heap /tmp/1"
Benchmark 1: /tmp/coreutils/target/release/cat-stack /tmp/1
  Time (mean ± σ):     921.2 µs ±  84.4 µs    [User: 372.9 µs, System: 443.7 µs]
  Range (min … max):   843.0 µs … 3926.9 µs    10000 runs
Benchmark 2: target/release/cat-heap /tmp/1
  Time (mean ± σ):     908.6 µs ± 117.0 µs    [User: 380.6 µs, System: 424.1 µs]
  Range (min … max):   821.4 µs … 4337.6 µs    10000 runs 
Summary
  target/release/cat-heap /tmp/1 ran
    1.01 ± 0.16 times faster than /tmp/coreutils/target/release/cat-stack /tmp/1

related #10832
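
A minimal sketch of the idea in the description above: allocate the copy buffer only once the fast path has declined to handle the data. The actual uutils code is not shown in this thread, so `try_fast_path` and `copy_stream` are hypothetical names, and the fast path here is a placeholder that always falls through rather than a real splice() call.

```rust
use std::io::{self, ErrorKind, Read, Write};

/// Buffer size discussed in this thread (64 KiB).
const BUF_SIZE: usize = 64 * 1024;

/// Hypothetical stand-in for the splice() fast path.
/// Returns Ok(true) if it copied everything; this placeholder never does.
fn try_fast_path<R: Read, W: Write>(_reader: &mut R, _writer: &mut W) -> io::Result<bool> {
    Ok(false)
}

/// Copy `reader` to `writer`, allocating the buffer only on the slow path.
fn copy_stream<R: Read, W: Write>(reader: &mut R, writer: &mut W) -> io::Result<()> {
    if try_fast_path(reader, writer)? {
        // Fast path succeeded: the Vec below is never created.
        return Ok(());
    }

    // Slow path: heap buffer created only when it is actually needed.
    let mut buf = vec![0u8; BUF_SIZE];
    loop {
        match reader.read(&mut buf) {
            Ok(0) => return Ok(()), // end of input
            Ok(n) => writer.write_all(&buf[..n])?,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

Calling `copy_stream(&mut input, &mut output)` with any `Read`/`Write` pair exercises the slow path in this sketch, since the placeholder fast path always returns `Ok(false)`.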

@github-actions

github-actions bot commented Apr 6, 2026

GNU testsuite comparison:

Skipping an intermittent issue tests/tty/tty-eof (passes in this run but fails in the 'main' branch)
Note: The gnu test tests/basenc/bounded-memory is now being skipped but was previously passing.
Note: The gnu test tests/dd/no-allocate is now being skipped but was previously passing.
Note: The gnu test tests/tail/tail-n0f is now being skipped but was previously passing.
Congrats! The gnu test tests/cut/bounded-memory is now passing!

@oech3 oech3 marked this pull request as ready for review April 6, 2026 07:55
@oech3 oech3 marked this pull request as draft April 6, 2026 08:04
@github-actions

github-actions bot commented Apr 6, 2026

GNU testsuite comparison:

Skip an intermittent issue tests/cut/bounded-memory (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/date/date-locale-hour (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/date/resolution (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/cut/cut-huge-range is now passing!

@oech3
Contributor Author

oech3 commented Apr 6, 2026

hyperfine is flakey

@oech3 oech3 marked this pull request as ready for review April 6, 2026 08:49
@xtqqczze
Contributor

xtqqczze commented Apr 6, 2026

Switching from a stack allocation to a heap allocation doesn’t avoid allocation...

@oech3
Contributor Author

oech3 commented Apr 6, 2026 via email

@oech3
Contributor Author

oech3 commented Apr 6, 2026

I saw a larger perf difference with a 1024 * 1024 buffer after switching to vec, so I think the vec's allocation is deferred.

@github-actions

github-actions bot commented Apr 8, 2026

GNU testsuite comparison:

Skipping an intermittent issue tests/cut/bounded-memory (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/date/date-locale-hour (passes in this run but fails in the 'main' branch)
Note: The gnu test tests/rm/many-dir-entries-vs-OOM is now being skipped but was previously passing.

@github-actions

github-actions bot commented Apr 8, 2026

GNU testsuite comparison:

Skipping an intermittent issue tests/tty/tty-eof (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/cut/cut-huge-range is now passing!

@sylvestre sylvestre merged commit efd0f0c into uutils:main Apr 12, 2026
169 checks passed
@oech3 oech3 deleted the cat-alloc branch April 12, 2026 14:38
@oech3
Contributor Author

oech3 commented Apr 12, 2026

We might use nightly fill_buf in the future to avoid the zero-fill here.

@xtqqczze
Contributor

> We might use nightly fill_buf in the future to avoid the zero-fill here.

Presumably you mean nightly-only Read::read_buf. Might be worth prototyping an implementation to validate this approach.

@xtqqczze
Contributor

> 1.01 ± 0.16 times faster

This doesn’t appear to be a statistically significant improvement; the reported uncertainty is large enough that the result is consistent with both a slowdown and a speedup.
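
The arithmetic behind this can be checked from the numbers in the PR description. The sketch below assumes standard first-order error propagation for a ratio of two means, which happens to reproduce the reported "1.01 ± 0.16"; the function names are illustrative and this is not taken from hyperfine's source.

```rust
/// Ratio of two means with first-order propagated uncertainty:
/// ratio = a / b, sigma = ratio * sqrt((sd_a / a)^2 + (sd_b / b)^2).
fn ratio_with_uncertainty(mean_a: f64, sd_a: f64, mean_b: f64, sd_b: f64) -> (f64, f64) {
    let ratio = mean_a / mean_b;
    let relative = ((sd_a / mean_a).powi(2) + (sd_b / mean_b).powi(2)).sqrt();
    (ratio, ratio * relative)
}

/// True if "no difference" (a ratio of exactly 1.0) lies within ratio ± sigma.
fn interval_contains_one(ratio: f64, sigma: f64) -> bool {
    ratio - sigma <= 1.0 && 1.0 <= ratio + sigma
}
```

With the reported means and standard deviations (921.2 ± 84.4 µs for cat-stack, 908.6 ± 117.0 µs for cat-heap), the interval is roughly 0.85 to 1.17, so it covers both a slowdown and a speedup.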

@oech3
Contributor Author

oech3 commented Apr 19, 2026

When I manually changed it to a large multi-MiB size, it caused a stack overflow without vec!. So I think Linux is saving RAM usage, at least.

(but we should avoid N MiB pipe usage for small input)

@xtqqczze
Contributor

> When I manually changed it to a large multi-MiB size

But we’re talking about the 64 KiB stack allocation here.

@oech3
Contributor Author

oech3 commented Apr 19, 2026

Linux can still save 64KiB

@xtqqczze
Contributor

The stack space is already reserved, so switching to a heap allocation actually increases overall memory usage, at least in theory.

@oech3
Contributor Author

oech3 commented Apr 19, 2026

If the splice() fast path succeeds, cat does not take the code path that allocates buf.

@oech3
Contributor Author

oech3 commented Apr 19, 2026

This is impossible to test on macOS, but changing buf to a large stack array causes a serious perf drop when splice() succeeds, while a vec does not. So the allocation is omitted on Linux.

@xtqqczze
Contributor

This PR introduced a heap allocation on Linux where there wasn’t one previously. Based on the data in the description, there is no statistically significant improvement. Using a significantly larger stack array would risk stack overflow and violate clippy::large_stack_arrays.

@oech3
Contributor Author

oech3 commented Apr 19, 2026

> changing buf to a large stack array

This was just to verify the allocation bypass. I don't intend to do this in production.

@oech3
Contributor Author

oech3 commented Apr 19, 2026

In your view, how can we bypass the allocation completely when the splice() fast path succeeds?

@xtqqczze
Contributor

Reverting the PR would avoid the unnecessary heap allocation and allocate for free using existing stack space. Your observed improvement in hyperfine is likely just noise or an artifact of LLVM optimization.

@oech3
Contributor Author

oech3 commented Apr 19, 2026

I want to stop allocating it entirely when splice() succeeds. How can we do that? Who guarantees the "existing stack space"?

@xtqqczze
Contributor

There is typically a 2 MiB stack already reserved per thread; see https://doc.rust-lang.org/std/thread/#stack-size. Using a fixed-size stack buffer will not introduce an additional system allocation.
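
To make the stack-reservation point concrete, here is a small sketch that runs a function holding a 64 KiB stack array on a thread whose stack size is set explicitly with `std::thread::Builder::stack_size`. Note that the linked documentation describes the default for spawned threads; the main thread's stack size is determined by the platform rather than by Rust. The function names are illustrative only.

```rust
use std::thread;

/// Holds a 64 KiB array on the stack and sums its bytes.
fn sum_with_stack_buffer() -> usize {
    let buf = [1u8; 64 * 1024];
    buf.iter().map(|&b| b as usize).sum()
}

/// Runs `sum_with_stack_buffer` on a new thread with the given stack size.
fn run_on_thread_with_stack(stack_bytes: usize) -> usize {
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(sum_with_stack_buffer)
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}
```

For example, `run_on_thread_with_stack(2 * 1024 * 1024)` leaves ample room for the 64 KiB array, which lives in stack space that was reserved when the thread was created.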

@oech3
Contributor Author

oech3 commented Apr 19, 2026

Hmm. At least, a 1 MiB vec! with the pure splice path was clearly faster than a 1 MiB stack array.

@xtqqczze
Contributor

1 MiB is too large for a stack array and risks stack overflow. It also violates clippy::large_stack_arrays.

@oech3
Contributor Author

oech3 commented Apr 19, 2026

Did you see #11675 (comment)? It is just for local verification.

If the 2 MiB stack is actually free, a 1 MiB stack array should not hurt perf. But it did.

@oech3

This comment was marked as outdated.

@xtqqczze
Contributor

Ah, the likely reason for your performance drop is that a 1 MiB stack buffer must be zeroed at function entry. In our case we only use a 64 KiB buffer, so that overhead is negligible. If an uninitialized buffer could be used via Read::read_buf, this would not be a factor.

@oech3
Contributor Author

oech3 commented Apr 19, 2026

I would split out the function containing the stack array, so the fast path avoids that stack frame too.
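
One way to read this suggestion, as a sketch rather than the code that was eventually written: keep the 64 KiB array in its own function and ask the compiler not to inline it, so the array's stack frame and zero-fill only occur when the slow path is actually called. The names `copy_with_stack_buf` and `copy_stream`, and the `fast_path_done` flag standing in for the result of splice(), are hypothetical.

```rust
use std::io::{self, ErrorKind, Read, Write};

/// Slow path kept in its own non-inlined function so that the 64 KiB
/// array is only set up when this function is actually entered.
#[inline(never)]
fn copy_with_stack_buf<R: Read, W: Write>(reader: &mut R, writer: &mut W) -> io::Result<()> {
    let mut buf = [0u8; 64 * 1024];
    loop {
        match reader.read(&mut buf) {
            Ok(0) => return Ok(()), // end of input
            Ok(n) => writer.write_all(&buf[..n])?,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}

/// `fast_path_done` is a hypothetical flag standing in for "splice() handled everything".
fn copy_stream<R: Read, W: Write>(
    reader: &mut R,
    writer: &mut W,
    fast_path_done: bool,
) -> io::Result<()> {
    if fast_path_done {
        return Ok(()); // the buffer function is never called
    }
    copy_with_stack_buf(reader, writer)
}
```

`#[inline(never)]` is a request to the compiler, so whether this removes the overhead in practice would still need to be measured.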

@xtqqczze
Contributor

I guess the change made sense to avoid unnecessary zero-initialization, but the following would also have worked:

    // Use a small stack array to avoid unnecessary zero-initialization overhead when splice() was used
    #[cfg(any(target_os = "linux", target_os = "android"))]
    let mut buf = [0; 512];
    #[cfg(not(any(target_os = "linux", target_os = "android")))]
    let mut buf = [0; 1024 * 8];

@oech3
Contributor Author

oech3 commented Apr 19, 2026

Of course. But I also wanted to save syscalls on the slow path.

@xtqqczze
Contributor

I think your new approach in #11906 is much easier to understand.

3 participants