Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Jan 15, 2026

This PR addresses several concurrency and resource management issues in the async shuffle external sorter that could cause CPU waste, memory leaks, and improper cleanup during error conditions.

Builds on #3192

Changes

  1. Replace busy-wait loop with proper blocking

Problem: When the maximum number of concurrent spill tasks was reached, the code used a CPU-burning spin loop to check if any task completed.

Before:

while (asyncSpillTasks.size() == threadNum) {                                                                                                                                                                       
    for (Future<Void> spillingTask : asyncSpillTasks) {                                                                                                                                                             
        if (spillingTask.isDone()) {                                                                                                                                                                                
            asyncSpillTasks.remove(spillingTask);                                                                                                                                                                   
            break;                                                                                                                                                                                                  
        }                                                                                                                                                                                                           
    }                                                                                                                                                                                                               
}                                                                                                                                                                                                                   

After: Block on the oldest task using Future.get(), which properly yields the CPU while waiting.

  1. Fix exception handling in closeAndGetSpills()

Problem: If one async task threw an exception, subsequent tasks were not awaited, potentially leaving background threads running and resources unreleased.

After: Wait for ALL tasks to complete, collecting exceptions using addSuppressed() to preserve error information from multiple failures.

  1. Add runtime validation for thread pool initialization

Problem: Used assert for validating threadNum > 0 (disabled in production) and no null check for thread pool.

After: Proper runtime checks with descriptive error messages:

  • IllegalArgumentException if threadNum <= 0
  • IllegalStateException if thread pool is null
  1. Fix memory leak on exception in async spill task

Problem: If writeSortedFileNative() threw an exception, the SpillSorter remained in the spillingSorters queue with its memory unreleased.

After: Wrap in try-finally to ensure freeMemory(), freeArray(), and spillingSorters.remove() always execute.

  1. Add cancellation support in cleanupResources()

Problem: When a task was killed or aborted, background spill threads continued running with no way to stop them.

After: Cancel all pending async tasks and wait briefly for their cleanup to complete before freeing memory and deleting spill files.

  1. Make peakMemoryUsedBytes volatile

Problem: Field could have stale reads when accessed from different threads (main thread vs background spill threads).

After: Added volatile modifier to ensure visibility across threads.

@codecov-commenter
Copy link

codecov-commenter commented Jan 15, 2026

Codecov Report

❌ Patch coverage is 65.97222% with 98 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.81%. Comparing base (f09f8af) to head (9488c4b).
⚠️ Report is 849 commits behind head on main.

Files with missing lines Patch % Lines
.../shuffle/sort/CometShuffleExternalSorterAsync.java 61.71% 55 Missing and 12 partials ⚠️
...k/shuffle/sort/CometShuffleExternalSorterSync.java 71.29% 25 Missing and 6 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3195      +/-   ##
============================================
+ Coverage     56.12%   59.81%   +3.68%     
- Complexity      976     1426     +450     
============================================
  Files           119      170      +51     
  Lines         11743    15715    +3972     
  Branches       2251     2600     +349     
============================================
+ Hits           6591     9400    +2809     
- Misses         4012     4996     +984     
- Partials       1140     1319     +179     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants