Description
Describe the bug, including details regarding any error messages, version, and platform.
We are encountering a critical issue when using the Apache Arrow Java Dataset API to read a large Parquet file (15 GB) from HDFS. The JVM process is killed by the OS OOM killer even though the Docker container has 15 GB of memory and the JVM heap and direct-memory limits are set conservatively.
Environment
Apache Arrow Version: 17.0.0
OS: Linux
JDK: 8
File Format: Parquet (15GB, generated by Pandas/PyArrow default config)
Hardware: Docker container limited to 15GB RAM.
Reproduction Steps
Configuration
Docker Memory Limit: 15GB
JVM Arguments:
java
-Xms1g -Xmx3g
-XX:MaxDirectMemorySize=3g
-XX:MaxMetaspaceSize=256m
-XX:+UseG1GC
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:gc.log
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=./heap_dump.hprof
-Darrow.memory.debug.allocator=true
-jar arrow-test-1.0-SNAPSHOT.jar
Code Snippet
// fileSize: 15G
String hdfsUri = "hdfs://node01.cdh5:8020/user/hive/warehouse/perf_test_200col_500w_nopk_parquet/ds=1/perf_test_200col_500w_nopk_parquet";
ScanOptions options = new ScanOptions(100000);
try (RootAllocator allocator = new RootAllocator(512 * 1024 * 1024);
     NativeMemoryPool nativePool = NativeMemoryPool.createListenable(DirectReservationListener.instance());
     FileSystemDatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, nativePool, FileFormat.PARQUET, hdfsUri);
     Dataset dataset = datasetFactory.finish();
     Scanner scanner = dataset.newScan(options);
     ArrowReader reader = scanner.scanBatches()) {
    long totalRows = 0;
    int batchCount = 0;
    while (reader.loadNextBatch()) {
        try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
            int rowCount = root.getRowCount();
            totalRows += rowCount;
            batchCount++;
            System.out.println("Batch " + batchCount + " - rows: " + rowCount + ", total: " + totalRows);
        }
    }
    System.out.println("\n===== Processing Complete =====");
    System.out.println("Total batches: " + batchCount);
    System.out.println("Total rows: " + totalRows);
} catch (Exception e) {
    logger.error("Error processing parquet file", e);
    e.printStackTrace();
}
Observed Behavior:
Batch 1 - rows: 41547, total: 41547
/mnt/executor/sandbox/shell-0000001022165.sh: line 21: 92 Killed java -Xms1g -Xmx3g -XX:MaxDirectMemorySize=3g -XX:MaxMetaspaceSize=256m -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heap_dump.hprof -Darrow.memory.debug.allocator=true -jar /mnt/executor/sandbox/resources/0000001022165/resource.arrow-test-1.0-SNAPSHOT.jar
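The gap between the JVM's own counters and what the OOM killer sees can also be sampled from inside the process. A Linux-only sketch (the RssProbe class name is illustrative) that reads the resident set size from /proc/self/status:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RssProbe {
    // Returns this process's resident set size in kB, parsed from
    // /proc/self/status (Linux only); -1 if the field is not found.
    static long rssKb() throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/self/status"))) {
            if (line.startsWith("VmRSS:")) {
                // Line looks like "VmRSS:   123456 kB"; keep the digits.
                return Long.parseLong(line.replaceAll("[^0-9]", ""));
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("VmRSS kB: " + rssKb());
    }
}
```

Logging this per batch alongside the heap and direct-buffer numbers would show how fast the untracked native footprint grows before the container limit is hit.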