
feat(chaos): production-pattern antagonists + Java-25 thread-churn crash simulation #558

Open
jbachorik wants to merge 12 commits into main from jb/production-pattern-antagonists


@jbachorik
Collaborator

@jbachorik jbachorik commented May 29, 2026

What does this PR do?:

Extends the chaos reliability harness in two areas:

1. Six new production-pattern antagonists targeting failure patterns observed when profiling gRPC/Netty/Kafka-style workloads:

  • BoundedThreadPoolAntagonist (bounded-pool): 4 scheduled thread pools cross-submitting tasks; one pool torn down and recreated every 5 s — targets signal-vs-thread-end races during pool shutdown.
  • ContextHopAntagonist (context-hop): 8 self-renewing CompletableFuture chains hopping A→B→C→A across 3 pools, each stage setting/clearing a ThreadLocal — targets RefCountGuard slot contention under cross-pool handoff.
  • ConsumerGroupAntagonist (consumer-group): 16 consumer threads with 4 replaced in a burst every 3 s, simulating Kafka rebalance — targets ProfiledThread recycling and signal-vs-thread-end race window.
  • HiddenClassChurnAntagonist (hidden-class-churn): generates hidden classes via MethodHandles.Lookup.defineHiddenClass (Java 15+, reflective no-op on older JDKs) — targets StringDictionary concurrent eviction racing hidden-class GC.
  • DirectMemoryAntagonist (direct-memory): ring-buffer + burst ByteBuffer.allocateDirect() allocation — targets liveness-table overflow under high off-heap churn.
  • WeakRefWaveAntagonist (weakref-wave): alternating fill/drop phases with 10k WeakReference<byte[]> and a concurrent reader — targets jweak ref leak and liveness-table clearAll() race.
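
As an illustration of the context-hop pattern described above, here is a minimal sketch, not the PR's `ContextHopAntagonist`. The class name `ContextHopSketch`, the pool sizes, and the bounded hop count are assumptions made for the example; the real antagonist runs 8 self-renewing chains, whereas this one runs a single finite chain so it terminates.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of the context-hop pattern; names and sizes are assumptions. */
public class ContextHopSketch {
    // Stand-in for the per-request context a framework would install on each hop.
    private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    /** Runs one chain of {@code hops} stages rotating A -> B -> C -> A; returns stages completed. */
    public static int runChain(int hops) throws Exception {
        ExecutorService[] pools = {
            Executors.newFixedThreadPool(2), // pool A
            Executors.newFixedThreadPool(2), // pool B
            Executors.newFixedThreadPool(2)  // pool C
        };
        AtomicInteger completed = new AtomicInteger();
        try {
            CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);
            for (int i = 0; i < hops; i++) {
                final int stage = i;
                // Passing an explicit executor forces a handoff to the next pool.
                chain = chain.thenRunAsync(() -> {
                    CONTEXT.set("stage-" + stage);   // set on entry to the stage
                    try {
                        completed.incrementAndGet();
                    } finally {
                        CONTEXT.remove();            // clear before the next hop
                    }
                }, pools[i % pools.length]);
            }
            chain.get(10, TimeUnit.SECONDS);
        } finally {
            for (ExecutorService pool : pools) {
                pool.shutdownNow();
            }
        }
        return completed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("completed hops: " + runChain(9));
    }
}
```

Each stage sets and then clears the `ThreadLocal` on whichever worker thread it lands on, which is the part of the pattern that exercises per-thread state under cross-pool handoff.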

2. Crash simulation for the confirmed Java-25 thread-churn × recording-dump UAF (PROF quality escape from logs-backend internal services):

  • DumpStormAntagonist (dump-storm): spawns 96 short-lived, uniquely-named threads each with a distinct stack shape, maximising churn in the thread-name table and call-trace storage concurrent with recording dumps — targets Recording::switchChunk/writeCpool, updateJavaThreadNames → ThreadInfo::set, Dictionary::clear.
  • stress_threadLifecycle_ut.cpp (Layer-1 gtest): drives ProfiledThread, ThreadFilter, CallTraceStorage, and Dictionary concurrently without a JVM so ASan/TSan can locate the UAF at its origin. Runs under the existing buildGtestAsan/buildGtestTsan targets.
  • Chaos CI matrix extended with CHAOS_JDK dimension (21.0.3-tem + 25.0.3-tem) — all crashes confirmed on Java 25 in production.
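
The thread-churn side of the dump-storm scenario can be sketched roughly as follows. This is an assumption-laden illustration rather than the PR's `DumpStormAntagonist`: the class `ThreadChurnSketch`, the `churn-` name prefix, and the use of recursion depth to vary the stack shape are all hypothetical choices, and no recording dump is triggered here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: one wave of short-lived, uniquely named threads. */
public class ThreadChurnSketch {

    /** Recurses {@code depth} frames so each thread has a different stack depth. */
    private static int descend(int depth) {
        if (depth == 0) {
            return Thread.currentThread().getName().length();
        }
        return descend(depth - 1) + 1;
    }

    /** Starts {@code threads} threads, waits for all of them, and returns the names observed. */
    public static Set<String> churnOnce(int threads) throws InterruptedException {
        Set<String> names = Collections.synchronizedSet(new HashSet<String>());
        List<Thread> started = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            final int depth = i + 1;
            // The index suffix keeps every name in the wave unique.
            String name = "churn-" + System.nanoTime() + "-" + i;
            Thread t = new Thread(() -> {
                descend(depth);
                names.add(Thread.currentThread().getName());
            }, name);
            t.setDaemon(true);
            started.add(t);
            t.start();
        }
        for (Thread t : started) {
            t.join();
        }
        return names;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("distinct thread names: " + churnOnce(96).size());
    }
}
```

Calling `churnOnce` in a loop while a profiler is attached and dumping would approximate the churn described above; the sketch on its own only produces the thread-name and stack-shape variety.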

CI fixes (introduced with the JDK matrix):

  • REASON dotenv key now includes CHAOS_JDK slug to prevent key collisions when two JDK legs of the same job both fail.
  • tee replaced with direct redirect so chaos_check.sh exit status is not swallowed.
  • JDK version verified after sdk use; emits FAIL:wrong JDK if install silently fell back.

Motivation:

Running the profiler against logs-backend production services (gRPC servers, Kafka consumers, Netty pipelines) reveals two classes of crashes our existing tests miss:

  1. Framework-driven thread lifecycle patterns not present in synthetic benchmarks (covered by the 6 production-pattern antagonists).
  2. A confirmed profiler UAF on Java 25 under high thread churn × concurrent recording dump — ~200 crashes/7 days on Datadog's own staging services. Disabling the profiler removes the crashes; the crash_datadog attribution tag undercounted the real impact by ~8× because most crashes surface in JVM/JFR code rather than profiler frames. DumpStormAntagonist + the JDK-25 chaos cell reproduce this scenario end-to-end.

Additional Notes:

All antagonists compile under --release 8 and use only JDK APIs. HiddenClassChurnAntagonist uses reflection for the Java 15+ defineHiddenClass API and prints a skip message on older JDKs.
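
A reflective lookup of this kind can be sketched as below. This is a hedged illustration, not the PR's code: the class `HiddenClassProbe` is hypothetical, and instead of generating bytecode it reuses its own class file as input, which is an assumption made to keep the example self-contained. Only reflection is used to reach `defineHiddenClass`, so the source still compiles against the Java 8 API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Array;
import java.lang.reflect.Method;

/** Hypothetical sketch: call Lookup.defineHiddenClass via reflection, skip when absent. */
public class HiddenClassProbe {

    /** Returns the hidden class, or null when the API or the class bytes are unavailable. */
    public static Class<?> tryDefine() throws Exception {
        Method define;
        Object noOptions;
        try {
            Class<?> optionType =
                Class.forName("java.lang.invoke.MethodHandles$Lookup$ClassOption");
            noOptions = Array.newInstance(optionType, 0); // empty ClassOption[] for the varargs
            define = MethodHandles.Lookup.class.getMethod(
                "defineHiddenClass", byte[].class, boolean.class, noOptions.getClass());
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return null; // JDK older than 15: nothing to do
        }
        byte[] bytes = ownClassBytes();
        if (bytes == null) {
            return null; // class file not reachable as a resource
        }
        Object lookup = define.invoke(MethodHandles.lookup(), bytes, Boolean.FALSE, noOptions);
        return ((MethodHandles.Lookup) lookup).lookupClass();
    }

    /** Reads this class's own bytecode; it shares the lookup class's package, as required. */
    private static byte[] ownClassBytes() throws IOException {
        String resource = HiddenClassProbe.class.getSimpleName() + ".class";
        try (InputStream in = HiddenClassProbe.class.getResourceAsStream(resource)) {
            if (in == null) {
                return null;
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws Exception {
        Class<?> hidden = tryDefine();
        System.out.println(hidden == null
            ? "[skip] defineHiddenClass unavailable"
            : "defined " + hidden.getName());
    }
}
```

On JDKs where the method is missing the probe returns `null` instead of failing, which mirrors the skip behaviour described above.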

Investigation and design: doc/plans/2026-05-29-logs-backend-crash-simulation-design.md. A follow-up plan for JFR-seeded antagonist calibration is at doc/plans/2026-05-29-jfr-seeded-antagonist-calibration.md.

How to test the change?:

The chaos cell is the primary test. Antagonists run under all three allocators × both archs × both configs × both JDKs in the scheduled reliability pipeline.

Local smoke test (no profiler attached — verifies no immediate crash or hang):

./gradlew :ddprof-stresstest:chaosJar
java -jar ddprof-stresstest/build/libs/chaos.jar \
  --duration 30s \
  --antagonists bounded-pool,context-hop,consumer-group,hidden-class-churn,direct-memory,weakref-wave,dump-storm

Expected: exits 0, last line [chaos] completed cleanly.

ASan gtest (Linux, requires Docker):

./utils/run-docker-tests.sh --config=asan --gtest --jdk=25 --mount

Expected: stress_threadLifecycle_ut compiles and all three cases pass under ASan.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: [JIRA-XXXX]

@jbachorik jbachorik added the AI label May 29, 2026

Contributor

Copilot AI left a comment


Pull request overview

Adds six new Antagonist implementations to the chaos reliability harness that simulate production-grade gRPC/Netty/Kafka workload patterns (bounded thread pool recycling, cross-pool context hopping, consumer-group rebalances, hidden-class churn, direct-memory allocation churn, and weak-reference waves), and wires them into the scheduled CI chaos cell.

Changes:

  • Six new antagonist classes under ddprof-stresstest/src/chaos/java/com/datadoghq/profiler/chaos/.
  • Main.create(...) dispatch extended with cases for the six new antagonist names.
  • .gitlab/reliability/chaos_check.sh ANTAGONISTS lists updated to include the new entries for both the profiler and profiler+tracer configs.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| ddprof-stresstest/.../BoundedThreadPoolAntagonist.java | 4 cross-submitting scheduled pools; one is torn down and recreated every 5 s. |
| ddprof-stresstest/.../ContextHopAntagonist.java | 8 self-renewing CompletableFuture chains hopping A→B→C→A across 3 pools, toggling a ThreadLocal. |
| ddprof-stresstest/.../ConsumerGroupAntagonist.java | 16 spinning "consumer" threads with bursts of 4 simultaneous replacements every 3 s. |
| ddprof-stresstest/.../HiddenClassChurnAntagonist.java | Reflectively calls defineHiddenClass (Java 15+) on ASM-generated bytecode; skips on older JDKs. |
| ddprof-stresstest/.../DirectMemoryAntagonist.java | Ring-buffer + burst direct ByteBuffer allocation with OOM recovery. |
| ddprof-stresstest/.../WeakRefWaveAntagonist.java | Fill/drop waves of 10k WeakReference<byte[]> plus a concurrent reader. |
| ddprof-stresstest/.../Main.java | Adds switch cases mapping the six new names to constructors. |
| .gitlab/reliability/chaos_check.sh | Adds the six new antagonist names to both config variants. |


@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 29, 2026

CI Test Results

Run: #26665226882 | Commit: 988856d | Duration: 12m 18s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-05-29 22:40:27 UTC

@jbachorik jbachorik marked this pull request as ready for review May 29, 2026 09:00
@jbachorik jbachorik requested a review from a team as a code owner May 29, 2026 09:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: faeefcb5ab


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comment thread docs/sphinx/specs/2026-05-29-generate-hidden-classes-in-the-lookup.md Outdated
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

@jbachorik jbachorik changed the title from "feat(chaos): add 6 production-pattern antagonists" to "feat(chaos): production-pattern antagonists + Java-25 thread-churn crash simulation" May 29, 2026
jbachorik and others added 11 commits May 29, 2026 23:22
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oin; fix spec typo

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…× dump)

Layer 1: C++ gtest stress reproducer (stress_threadLifecycle_ut) drives
ProfiledThread, ThreadFilter, CallTraceStorage, and Dictionary concurrently
without a JVM so ASan/TSan can locate the UAF at its origin.

Layer 2: DumpStormAntagonist chaos antagonist spawns 96 short-lived,
uniquely-named threads to stress the thread-name/dump path under a real JVM.
Chaos CI matrix extended with CHAOS_JDK dimension (21 + 25).

CI fixes: REASON dotenv key includes CHAOS_JDK to prevent key collisions;
tee replaced with redirect to preserve exit status; JDK version verified
after sdk use to catch silent wrong-JDK fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the jb/production-pattern-antagonists branch from fba26a3 to bc52f9a Compare May 29, 2026 21:24

2 participants