Description
Describe the bug, including details regarding any error messages, version, and platform.
We are encountering a critical issue when using the Apache Arrow Java Dataset API to read a large Parquet file (15 GB) from HDFS. The JVM process is killed by the OS OOM killer even though the Docker container has 15 GB of memory and the JVM heap and direct-memory limits are set conservatively.
Environment
Apache Arrow Version: 17.0.0
OS: Linux
JDK: 8
File Format: Parquet (15GB, generated by Pandas/PyArrow default config)
Hardware: Docker container limited to 15GB RAM.
Reproduction Steps
Configuration
Docker Memory Limit: 15GB
JVM Arguments:
java
-Xms1g -Xmx3g
-XX:MaxDirectMemorySize=3g
-XX:MaxMetaspaceSize=256m
-XX:+UseG1GC
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:gc.log
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=./heap_dump.hprof
-Darrow.memory.debug.allocator=true
-jar arrow-test-1.0-SNAPSHOT.jar
Code Snippet
// fileSize: 15G
String hdfsUri = "hdfs://node01.cdh5:8020/user/hive/warehouse/perf_test_200col_500w_nopk_parquet/ds=1/perf_test_200col_500w_nopk_parquet";
ScanOptions options = new ScanOptions(100000);
try (RootAllocator allocator = new RootAllocator(512 * 1024 * 1024);
     NativeMemoryPool nativePool = NativeMemoryPool.createListenable(DirectReservationListener.instance());
     FileSystemDatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, nativePool, FileFormat.PARQUET, hdfsUri);
     Dataset dataset = datasetFactory.finish();
     Scanner scanner = dataset.newScan(options);
     ArrowReader reader = scanner.scanBatches()) {
    long totalRows = 0;
    int batchCount = 0;
    while (reader.loadNextBatch()) {
        try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
            int rowCount = root.getRowCount();
            totalRows += rowCount;
            batchCount++;
            System.out.println("Batch " + batchCount + " - rows: " + rowCount + ", total: " + totalRows);
        }
    }
    System.out.println("\n===== Processing Complete =====");
    System.out.println("Total batches: " + batchCount);
    System.out.println("Total rows: " + totalRows);
} catch (Exception e) {
    logger.error("Error processing parquet file", e);
    e.printStackTrace();
}
Observed Behavior:
Batch 1 - rows: 41547, total: 41547
/mnt/executor/sandbox/shell-0000001022165.sh: line 21: 92 Killed java -Xms1g -Xmx3g -XX:MaxDirectMemorySize=3g -XX:MaxMetaspaceSize=256m -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heap_dump.hprof -Darrow.memory.debug.allocator=true -jar /mnt/executor/sandbox/resources/0000001022165/resource.arrow-test-1.0-SNAPSHOT.jar
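The gap between the JVM's own counters and what the OOM killer sees can also be sampled from inside the process. A Linux-only sketch (the RssProbe class name is illustrative) that reads the resident set size from /proc/self/status:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RssProbe {
    // Returns this process's resident set size in kB, parsed from
    // /proc/self/status (Linux only); -1 if the field is not found.
    static long rssKb() throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/self/status"))) {
            if (line.startsWith("VmRSS:")) {
                // Line looks like "VmRSS:   123456 kB"; keep the digits.
                return Long.parseLong(line.replaceAll("[^0-9]", ""));
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("VmRSS kB: " + rssKb());
    }
}
```

Logging this per batch alongside the heap and direct-buffer numbers would show how fast the untracked native footprint grows before the container limit is hit.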