cat: avoid unnecessary allocation #11675
|
GNU testsuite comparison: |
|
hyperfine is flaky |
|
Switching from a stack allocation to a heap allocation doesn’t avoid allocation... |
|
buf is declared after splice code.
|
|
I saw a larger perf difference with 1024 * 1024 when switching to vec!. So I think vec!'s allocation is deferred. |
|
We might use nightly fill_buf in the future to avoid the zero-fill here. |
Presumably you mean nightly-only |
This doesn’t appear to be a statistically significant improvement; the reported uncertainty is large enough that the result is consistent with both a slowdown and a speedup. |
|
When I manually changed it to a large multi-MiB size, it caused a stack overflow without vec!. So I think Linux is saving RAM usage at least for. (But we should avoid N MiB pipe usage for small input.) |
But we’re talking about the 64 KiB stack allocation here. |
|
Linux can still save 64 KiB
|
The stack space is already reserved, so switching to a heap allocation actually increases overall memory usage, at least in theory. |
|
If the splice() fast path succeeds, cat does not take the code path that allocates buf.
|
This is impossible to test on macOS, but when splice() succeeds, changing buf to a large stack array causes a serious perf drop while vec! does not. So the allocation is omitted on Linux.
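To make the claim above concrete, here is a minimal sketch of a copy routine that only allocates its buffer once the fast path has declined. It is not the actual uutils `cat` code: `copy_with_fallback` and the `fast_path` closure (a stand-in for a splice()-based copy) are hypothetical names, and the 64 KiB size is taken from this thread.

```rust
use std::io::{self, Read, Write};

/// Fallback buffer size; 64 KiB is the size discussed in this thread.
const BUF_SIZE: usize = 64 * 1024;

/// Hypothetical helper. `fast_path` stands in for a splice()-based copy and
/// returns Ok(true) when it has already transferred all of the data.
/// Returns the number of bytes copied by the read()/write() slow path.
fn copy_with_fallback<R: Read, W: Write>(
    reader: &mut R,
    writer: &mut W,
    fast_path: impl FnOnce(&mut R, &mut W) -> io::Result<bool>,
) -> io::Result<u64> {
    if fast_path(&mut *reader, &mut *writer)? {
        // Fast path succeeded: the buffer below is never allocated.
        return Ok(0);
    }

    // Slow path: the heap buffer is created only when we get here.
    let mut buf = vec![0u8; BUF_SIZE];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        writer.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}
```

Passing `|_, _| Ok(false)` as `fast_path` forces the slow path, which is handy for exercising it on platforms without splice().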
|
This PR introduced a heap allocation on Linux where there wasn’t one previously. Based on the data in the description, there is no statistically significant improvement. Using a significantly larger stack array would risk stack overflow and violate |
This is just to verify the allocation bypass. I don't intend to do this in production.
|
In your view, how can we actually bypass the allocation completely when the splice() fast path succeeds?
|
Reverting the PR would avoid the unnecessary heap allocation and allocate for free using existing stack space. Your observed improvement in hyperfine is likely just noise or an artifact of LLVM optimization. |
|
I want to completely stop allocating it when splice() succeeds. How can I do that? Who guarantees the "existing stack space"?
|
There is typically a 2 MiB stack already reserved per thread, see https://doc.rust-lang.org/std/thread/#stack-size. Using a fixed-size stack buffer will not introduce an additional system allocation.
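As a rough illustration of the stack-size point, the sketch below runs a function with a stack array on a thread whose stack size is set explicitly through `std::thread::Builder::stack_size`. The helper names are hypothetical. Note that the linked documentation describes the 2 MiB default for threads spawned by Rust; the main thread's stack size is determined by the platform, not by Rust.

```rust
use std::thread;

/// Sums a zeroed array that lives on the current thread's stack.
/// `black_box` discourages the optimizer from removing the array.
fn sum_stack_array<const N: usize>() -> u64 {
    let buf = [0u8; N];
    std::hint::black_box(&buf).iter().map(|&b| u64::from(b)).sum()
}

/// Runs `sum_stack_array::<N>` on a new thread with `stack_bytes` of stack.
/// The stack must be comfortably larger than N, or the thread overflows.
fn run_with_stack<const N: usize>(stack_bytes: usize) -> u64 {
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(sum_stack_array::<N>)
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}
```

A 64 KiB array fits easily in a 2 MiB stack, whereas a 1 MiB array only leaves room for one more frame of that size, which is why the larger experiment in this thread needed either a bigger stack or a heap buffer.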
|
Hmm. At least, a 1 MiB vec! with the pure splice path was clearly faster than a 1 MiB stack array.
|
1 MiB is too large for a stack array and risks stack overflow. It also violates |
|
Did you see #11675 (comment)? It is just for local verification. If a 2 MiB stack is actually free, a 1 MiB stack array should not drop perf. But it did.
|
Ah, the likely reason for your performance drop is that a 1 MiB stack buffer must be zeroed at function entry. In our case we only use a 64 KiB buffer, so that overhead is negligible. If an uninitialized buffer could be used via |
|
I would split out the function containing the stack array, so the fast path avoids that stack frame too.
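One way to read that suggestion, sketched under assumptions: keep the stack array in its own `#[inline(never)]` function so that its frame, and the zeroing of the array, only exist when the slow path actually runs. `copy_via_stack_buf`, `copy_all`, and the `fast_path_done` flag (standing in for the outcome of a splice() attempt) are hypothetical names, not code from this PR.

```rust
use std::io::{self, Read, Write};

/// Slow path in its own non-inlined function: the 64 KiB array is part of
/// this function's frame only, so callers that never reach it do not pay
/// for it.
#[inline(never)]
fn copy_via_stack_buf<R: Read, W: Write>(reader: &mut R, writer: &mut W) -> io::Result<u64> {
    let mut buf = [0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        writer.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}

/// Hypothetical entry point. `fast_path_done` stands in for "splice()
/// already copied everything"; only otherwise is the slow path entered.
fn copy_all<R: Read, W: Write>(
    reader: &mut R,
    writer: &mut W,
    fast_path_done: bool,
) -> io::Result<u64> {
    if fast_path_done {
        return Ok(0);
    }
    copy_via_stack_buf(reader, writer)
}
```

Compared with a heap buffer, this keeps the slow path allocation-free while still leaving the fast path untouched by the array.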
|
I guess the change made sense to avoid unnecessary zero-initialization, but the following would also have worked:
// Use a small stack array to avoid unnecessary zero-initialization overhead when splice() was used
#[cfg(any(target_os = "linux", target_os = "android"))]
let mut buf = [0; 512];
#[cfg(not(any(target_os = "linux", target_os = "android")))]
let mut buf = [0; 1024 * 8]; |
|
Of course. But I wanted to save syscalls on the slow path.
|
I think your new approach in #11906 is much easier to understand. |
Allocate the buffer on the heap instead of the stack for the read()/write() slow path, since the buffer is unnecessary if the splice() fast path succeeds.
related #10832