
fix(topology): Fix for issue causing stalling on shutdown for sinks configured w/ disk buffers#24949

Open
graphcareful wants to merge 6 commits into vectordotdev:master from graphcareful:rob/fix-disk-buffer-reload-bug

Conversation

Contributor

@graphcareful commented Mar 17, 2026

Summary

This PR includes two fixes for shutting down sinks configured with buffers. The first issue is observed when reloading a config that contains a sink with a disk buffer: Vector waits until batch.timeout_secs elapses, at which point the buffer flushes and the reload unblocks. The fix is to send the cancel() signal to the sink so it does not block on its buffer being flushed downstream.

The second fix addresses an issue with the same root cause: I noticed the same hang on Ctrl-C with the aws_s3 sink. I applied the same solution there, calling the cancel() signal early in the stop() method.
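The shutdown-ordering idea can be sketched in plain std Rust. This is a minimal sketch only: `Trigger`, `cancel`, and `wait_batch_timeout` are illustrative stand-ins, not Vector's actual topology types. Firing the cancel signal first lets a task that would otherwise sleep out `batch.timeout_secs` wake immediately.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Illustrative stand-in for a sink that would otherwise sleep out
// `batch.timeout_secs` before letting shutdown/reload proceed.
struct Trigger {
    cancelled: Mutex<bool>,
    cv: Condvar,
}

impl Trigger {
    fn new() -> Self {
        Trigger { cancelled: Mutex::new(false), cv: Condvar::new() }
    }

    // The fix in spirit: signal cancellation *before* waiting on the sink
    // to drain, so it stops blocking on an unflushed buffer.
    fn cancel(&self) {
        *self.cancelled.lock().unwrap() = true;
        self.cv.notify_all();
    }

    // Sink side: wait out the batch timeout, but wake early on cancel().
    // Returns true if woken by cancel() rather than by the timeout.
    fn wait_batch_timeout(&self, timeout: Duration) -> bool {
        let guard = self.cancelled.lock().unwrap();
        let (guard, _) = self
            .cv
            .wait_timeout_while(guard, timeout, |c| !*c)
            .unwrap();
        *guard
    }
}

fn main() {
    let trigger = Arc::new(Trigger::new());
    let t = Arc::clone(&trigger);
    // Sink task with a long batch timeout (think `timeout_secs: 500`).
    let sink = thread::spawn(move || t.wait_batch_timeout(Duration::from_secs(500)));
    // Reload path: cancel first instead of waiting out the batch timeout.
    trigger.cancel();
    assert!(sink.join().unwrap()); // woke via cancel(), not the timeout
}
```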

Vector configuration

sources:
  http-server-source:
    type: "http_server"
    address: "127.0.0.1:8001"

sinks:
  amazon-s3-sink:
    inputs:
      - "http-server-source"
    type: "aws_s3"
    bucket: e2e-tests
    region: us-east-1
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 30000000000
      when_full: block
    batch:
      timeout_secs: 500
      max_bytes: 100000000

How did you test this PR?

Using the config above and a simple HTTP traffic generator: I modified the sink buffer, re-saved the file, and watched for errors from the source.

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector's dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details here.

@graphcareful requested a review from a team as a code owner March 17, 2026 19:05
@graphcareful added the labels domain: topology (Anything related to Vector's topology code), domain: config (Anything related to configuring Vector), and domain: buffers (Anything related to Vector's memory/disk buffers) on Mar 17, 2026
Member

pront commented Mar 17, 2026

@codex review

@bruceg added the label type: bug (A code related bug) on Mar 17, 2026
@pront

This comment was marked as outdated.


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb5630a0fd


Comment on lines +156 to +157
for (_, trigger) in self.detach_triggers {
    trigger.into_inner().cancel();


P1 Badge Preserve graceful draining during global shutdown

Canceling every sink detach_trigger at the top of RunningTopology::stop forces all sink input streams (take_until_if(tripwire)) to end before shutdown_coordinator.shutdown_all(...) starts upstream shutdown, so sinks can terminate while events are still buffered or in flight. That changes graceful shutdown semantics into a lossy path for any pipeline with pending events, because those events are never drained by the sink.


@graphcareful force-pushed the rob/fix-disk-buffer-reload-bug branch from eb5630a to 28a64b0 on March 23, 2026 17:26
@github-actions bot removed the label domain: topology (Anything related to Vector's topology code) on Mar 23, 2026
Comment on lines +71 to +72
let notified = ledger.writer_notified();
let mut wait_for_writer = spawn(notified);
Member


This is effectively the following, which is closer to the original code, right?

Suggested change:
- let notified = ledger.writer_notified();
- let mut wait_for_writer = spawn(notified);
+ let notified = spawn(ledger.writer_notified());

Comment on lines +374 to +375
pub fn writer_notified(&self) -> Notified<'_> {
    self.writer_notify.notified()
Member


I don't see how this functionally changes anything, other than changing the return type from an effective impl Future into a concrete type. Am I missing something?

Contributor Author


This allows the caller to register to be notified before checking the pre-condition it will await on, avoiding any race between the check and the await. This is more of a defensive change; the core of the bugfix is checking is_writer_done in reader.rs.
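For illustration only (none of this is Vector's code): the race being defended against is a lost wakeup between checking a condition and starting to wait. std's park/unpark has the same shape as tokio Notify's permit: a wakeup that arrives before the wait is stored, so a check-in-a-loop around park() can never sleep through the writer's notification.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Waits until `done` is true. The loop around park() means a wakeup that
// lands between the load and the park is not lost: unpark() stores a
// token, so the following park() returns immediately and we re-check.
fn wait_until_done(done: &AtomicBool) {
    while !done.load(Ordering::Acquire) {
        thread::park();
    }
}

fn main() {
    let done = Arc::new(AtomicBool::new(false));
    let waiter = thread::spawn({
        let done = Arc::clone(&done);
        move || wait_until_done(&done)
    });
    // "Writer" side: flip the condition first, then notify.
    done.store(true, Ordering::Release);
    waiter.thread().unpark();
    waiter.join().unwrap();
    assert!(done.load(Ordering::Acquire));
}
```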

Member


What I'm saying is that async fn wait_for_writer(&self) desugars to fn wait_for_writer(&self) -> impl Future<Output = ()> and contains a state machine that just waits on writer_notify.notified when it's polled (which the optimizer will simplify to just calling the original future), and the new code returns the inner future directly, which will behave the same way. I guess what I'm keying on is that the comment change seems to indicate that the behavior of this function changed, but it hasn't.
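The desugaring point above can be checked in isolation with a toy sketch. Everything here is a stand-in, not the buffer code: std::future::ready plays the role of writer_notify.notified(), and poll_once is a throwaway no-op-waker poller. Both shapes of the function resolve identically when polled.

```rust
use std::future::{self, Future};
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Stand-in for `writer_notify.notified()`: an already-ready future.
fn writer_notified() -> impl Future<Output = ()> {
    future::ready(())
}

// Old shape: `async fn` desugars to `fn -> impl Future` whose state
// machine simply polls the inner future.
async fn wait_for_writer() {
    writer_notified().await
}

// New shape: hand back the inner future directly.
fn wait_for_writer_direct() -> impl Future<Output = ()> {
    writer_notified()
}

// Minimal no-op-waker poller, enough to drive an already-ready future.
fn poll_once<F: Future>(fut: F) -> Poll<F::Output> {
    const VTABLE: RawWakerVTable = RawWakerVTable::new(
        |_| RawWaker::new(std::ptr::null(), &VTABLE),
        |_| {},
        |_| {},
        |_| {},
    );
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    pin!(fut).poll(&mut cx)
}

fn main() {
    // Both shapes behave the same when polled.
    assert_eq!(poll_once(wait_for_writer()), Poll::Ready(()));
    assert_eq!(poll_once(wait_for_writer_direct()), Poll::Ready(()));
}
```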

@graphcareful requested review from bruceg and pront March 23, 2026 21:06

Labels

domain: buffers (Anything related to Vector's memory/disk buffers), domain: config (Anything related to configuring Vector), type: bug (A code related bug)

Projects

None yet

3 participants