dd_oome_notifier.sh and dd_crash_uploader.sh fail when JDK_JAVA_OPTIONS contains JMX or other port-binding flags #10766

@florianmutter

Description

Tracer Version(s)

1.55.0

Java Version(s)

21.0.9

JVM Vendor

Eclipse Adoptium / Temurin

Bug Report

The dd_oome_notifier.sh script spawned via -XX:OnOutOfMemoryError (and likewise dd_crash_uploader.sh) does not unset JDK_JAVA_OPTIONS, JAVA_TOOL_OPTIONS, or _JAVA_OPTIONS before launching a child java process. The child JVM therefore inherits the full application JVM configuration, which causes three distinct problems:

  1. Port conflicts — flags like JMX remote (-Dcom.sun.management.jmxremote.port=9012) cause BindException because the parent JVM still holds the port
  2. cgroup OOMKill — memory flags like -Xms/-Xmx or -XX:MaxRAMPercentage=90 cause the child JVM to compete with the still-alive parent for container memory, potentially triggering a kernel OOMKill
  3. Lost OOM diagnostics — when the script fails for any of the above reasons, no OOME event reaches Datadog, and the original OOM exception details (stack trace, thread name) are also lost because -XX:+ExitOnOutOfMemoryError force-terminates after the handler runs
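The inheritance behind all three problems can be reproduced without a JVM. The following is a minimal sketch, not code from the tracer: `print_child_opts` is a hypothetical helper that stands in for the child `java` process, and the option value is shortened from the log in this report.

```shell
#!/bin/sh
# Demo: a child process started from a handler script sees the parent's
# JVM option variables unless they are explicitly unset first.
JDK_JAVA_OPTIONS='-Dcom.sun.management.jmxremote.port=9012'
export JDK_JAVA_OPTIONS

# Hypothetical helper (not part of the tracer scripts): spawns a child shell
# and prints what that child sees in JDK_JAVA_OPTIONS.
print_child_opts() {
    sh -c 'printf "%s\n" "${JDK_JAVA_OPTIONS:-<unset>}"'
}

# Child started as-is: inherits the exported variable.
inherited=$(print_child_opts)

# Child started after unsetting the variable; the unset happens inside the
# command-substitution subshell, so this script's own environment is unchanged.
cleaned=$(unset JDK_JAVA_OPTIONS; print_child_opts)

echo "inherited: $inherited"
echo "cleaned:   $cleaned"
```

With a real `java` child, the inherited case is the one that produces the "Picked up JDK_JAVA_OPTIONS" line shown in the output below.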

Actual output:

# java.lang.OutOfMemoryError: Metaspace
# -XX:OnOutOfMemoryError="/tmp/datadog/java/dd_oome_notifier.sh %p"
#   Executing /bin/sh -c "/tmp/datadog/java/dd_oome_notifier.sh 1"...
Agent Jar: /opt/datadog/apm/library/java/dd-java-agent.jar
Tags: host:order-664fc65797-2bclc,...
JAVA_HOME: /opt/java/openjdk
PID: 1
NOTE: Picked up JDK_JAVA_OPTIONS: -XX:MaxGCPauseMillis=4000 -XX:MinRAMPercentage=25 -XX:MaxRAMPercentage=90 -XX:MaxMetaspaceSize=128m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9012 -Dcom.sun.management.jmxremote.rmi.port=9012 ...
Picked up JAVA_TOOL_OPTIONS: -javaagent:/opt/datadog/apm/library/java/dd-java-agent.jar ...

Caused by: java.rmi.server.ExportException: Port already in use: 9012; nested exception is:
	java.net.BindException: Address already in use
	...
Error: Failed to generate OOME event
Terminating due to java.lang.OutOfMemoryError: Metaspace

Expected Behavior

The OOME event should be sent to Datadog successfully, regardless of what JDK_JAVA_OPTIONS or JAVA_TOOL_OPTIONS contain. When the script fails, the original OOM exception details (stack trace, thread name) should still be visible in the application logs.

Reproduction Code

  1. Configure a JVM application with JMX remote monitoring via JDK_JAVA_OPTIONS:

    JDK_JAVA_OPTIONS=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9012 -Dcom.sun.management.jmxremote.rmi.port=9012 ...
    
  2. Have dd-java-agent injected (e.g. via Admission Controller), which sets:

    JAVA_TOOL_OPTIONS=-javaagent:/opt/datadog/apm/library/java/dd-java-agent.jar -XX:OnOutOfMemoryError="/tmp/datadog/java/dd_oome_notifier.sh %p"
    
  3. Trigger an OutOfMemoryError (in our case: java.lang.OutOfMemoryError: Metaspace)

  4. The JVM invokes dd_oome_notifier.sh, which spawns a child java process:

    "$config_java_home/bin/java" -Ddd.dogstatsd.start-delay=0 -jar "$config_agent" sendOomeEvent "$config_tags"

  5. This child process inherits JDK_JAVA_OPTIONS (including JMX port flags) and JAVA_TOOL_OPTIONS (including the agent jar) from the environment.

  6. The child JVM tries to bind JMX to port 9012, which is still held by the dying parent JVM, and fails with BindException: Address already in use. The script then prints "Error: Failed to generate OOME event".
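One possible shape for a fix, given the steps above, is to clear the three variables only for the child process. This is a sketch under assumptions, not the tracer's actual change: `run_clean_jvm` is a hypothetical helper name, and a plain `sh -c` stands in for the java launcher so the block runs without a JDK.

```shell
#!/bin/sh
# Hypothetical helper (name not taken from the tracer's scripts): runs the
# given command in a subshell with the JVM option variables removed, leaving
# the calling script's environment untouched.
run_clean_jvm() {
    (
        unset JDK_JAVA_OPTIONS JAVA_TOOL_OPTIONS _JAVA_OPTIONS
        exec "$@"
    )
}

# Intended use in the notifier (command line copied from step 4 above):
#   run_clean_jvm "$config_java_home/bin/java" -Ddd.dogstatsd.start-delay=0 \
#       -jar "$config_agent" sendOomeEvent "$config_tags"

# Simulate the application environment from the reproduction steps.
JDK_JAVA_OPTIONS='-Dcom.sun.management.jmxremote.port=9012'
JAVA_TOOL_OPTIONS='-javaagent:/opt/datadog/apm/library/java/dd-java-agent.jar'
export JDK_JAVA_OPTIONS JAVA_TOOL_OPTIONS

# Stand-in for the java launcher: report what the child process sees.
seen=$(run_clean_jvm sh -c 'printf "%s|%s|%s" "$JDK_JAVA_OPTIONS" "$JAVA_TOOL_OPTIONS" "$_JAVA_OPTIONS"')

echo "child saw: $seen"
echo "parent still has: $JDK_JAVA_OPTIONS"
```

Clearing the variables addresses the port conflict directly. For the memory problem, the child would fall back to default heap sizing, so an explicit small heap limit on the child's command line may also be worth considering; that part is an assumption and is not shown here.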

Metadata

Labels

type: bug (Bug report and fix)
