feat(chaos): production-pattern antagonists + Java-25 thread-churn crash simulation #558
jbachorik wants to merge 12 commits into
Pull request overview
Adds six new Antagonist implementations to the chaos reliability harness that simulate production-grade gRPC/Netty/Kafka workload patterns (bounded thread pool recycling, cross-pool context hopping, consumer-group rebalances, hidden-class churn, direct-memory allocation churn, and weak-reference waves), and wires them into the scheduled CI chaos cell.
Changes:
- Six new antagonist classes under `ddprof-stresstest/src/chaos/java/com/datadoghq/profiler/chaos/`.
- `Main.create(...)` dispatch extended with cases for the six new antagonist names.
- `.gitlab/reliability/chaos_check.sh` `ANTAGONISTS` lists updated to include the new entries for both the `profiler` and `profiler+tracer` configs.
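As a rough illustration of the first pattern named in the overview (bounded thread pool recycling), the sketch below repeatedly builds a small scheduled pool, hands it work, shuts it down, and recreates it. It is not the PR's `BoundedThreadPoolAntagonist`: the class name `PoolRecycleSketch`, the method `run`, the pool size of 2, and the fixed cycle count are all assumptions chosen so the example terminates deterministically, whereas the antagonist described here recycles one of four pools every 5 s while tasks are still in flight.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names and sizes are illustrative, not taken from the PR.
final class PoolRecycleSketch {

    /** Recreates a scheduled pool {@code cycles} times and returns how many tasks completed. */
    static int run(int cycles, int tasksPerCycle) {
        AtomicInteger completed = new AtomicInteger();
        for (int c = 0; c < cycles; c++) {
            // A fresh pool per cycle means fresh worker threads that start and then exit.
            ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
            for (int t = 0; t < tasksPerCycle; t++) {
                pool.execute(completed::incrementAndGet);
            }
            // Tear the pool down; already-submitted tasks still run before workers exit.
            pool.shutdown();
            try {
                if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                    pool.shutdownNow();
                    throw new IllegalStateException("pool did not terminate");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return completed.get();
    }
}
```

Calling `PoolRecycleSketch.run(3, 8)` creates and destroys three pools in sequence. Each teardown is a point where worker threads exit, which is the window a sampling profiler's signal handler could race against.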
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `ddprof-stresstest/.../BoundedThreadPoolAntagonist.java` | 4 cross-submitting scheduled pools; one is torn down and recreated every 5 s. |
| `ddprof-stresstest/.../ContextHopAntagonist.java` | 8 self-renewing `CompletableFuture` chains hopping A→B→C→A across 3 pools, toggling a `ThreadLocal`. |
| `ddprof-stresstest/.../ConsumerGroupAntagonist.java` | 16 spinning "consumer" threads with bursts of 4 simultaneous replacements every 3 s. |
| `ddprof-stresstest/.../HiddenClassChurnAntagonist.java` | Reflectively calls `defineHiddenClass` (Java 15+) on ASM-generated bytecode; skips on older JDKs. |
| `ddprof-stresstest/.../DirectMemoryAntagonist.java` | Ring-buffer + burst direct `ByteBuffer` allocation with OOM recovery. |
| `ddprof-stresstest/.../WeakRefWaveAntagonist.java` | Fill/drop waves of 10k `WeakReference<byte[]>` plus a concurrent reader. |
| `ddprof-stresstest/.../Main.java` | Adds switch cases mapping the six new names to constructors. |
| `.gitlab/reliability/chaos_check.sh` | Adds the six new antagonist names to both config variants. |
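The context-hop row in the table above can be sketched as a single bounded `CompletableFuture` chain. This is only an approximation under stated assumptions: `ContextHopSketch`, `run`, the `hop-A`/`hop-B`/`hop-C` thread names, and the `span-N` values are hypothetical, each pool is single-threaded, and the chain stops after a fixed number of hops instead of renewing itself as the PR's `ContextHopAntagonist` is described as doing.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; names are illustrative, not taken from the PR.
final class ContextHopSketch {
    private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    /** Runs {@code hops} stages round-robin over pools A, B, C; returns the thread visited per hop. */
    static List<String> run(int hops) {
        String[] names = {"A", "B", "C"};
        ExecutorService[] pools = new ExecutorService[names.length];
        for (int i = 0; i < names.length; i++) {
            final String name = names[i];
            pools[i] = Executors.newSingleThreadExecutor(r -> new Thread(r, "hop-" + name));
        }
        List<String> visited = new CopyOnWriteArrayList<>();
        CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);
        for (int i = 0; i < hops; i++) {
            final int hop = i;
            // Each stage runs only after the previous one, on the next pool in the cycle.
            chain = chain.thenRunAsync(() -> {
                if (CONTEXT.get() != null) {
                    throw new IllegalStateException("context leaked from an earlier stage");
                }
                CONTEXT.set("span-" + hop);      // set the per-thread context on entry
                try {
                    visited.add(Thread.currentThread().getName());
                } finally {
                    CONTEXT.remove();            // clear it before handing off to the next pool
                }
            }, pools[i % pools.length]);
        }
        try {
            chain.join();
        } finally {
            for (ExecutorService pool : pools) {
                pool.shutdownNow();
            }
        }
        return visited;
    }
}
```

Because every stage depends on the one before it, the visit order is sequential even though three different threads execute the work. Setting and removing the `ThreadLocal` inside each stage mirrors the "toggling" behaviour in the table; the leak check at stage entry is an addition of this sketch.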
CI Test Results
Run: #26665226882 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled
Summary: Total: 32 | Passed: 32 | Failed: 0
Updated: 2026-05-29 22:40:27 UTC
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: faeefcb5ab
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oin; fix spec typo Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…× dump)
Layer 1: C++ gtest stress reproducer (stress_threadLifecycle_ut) drives ProfiledThread, ThreadFilter, CallTraceStorage, and Dictionary concurrently without a JVM so ASan/TSan can locate the UAF at its origin.
Layer 2: DumpStormAntagonist chaos antagonist spawns 96 short-lived, uniquely-named threads to stress the thread-name/dump path under a real JVM.
Chaos CI matrix extended with CHAOS_JDK dimension (21 + 25).
CI fixes: REASON dotenv key includes CHAOS_JDK to prevent key collisions; tee replaced with redirect to preserve exit status; JDK version verified after sdk use to catch silent wrong-JDK fallback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fba26a3 to bc52f9a
What does this PR do?:
Extends the chaos reliability harness in two areas:
1. Six new production-pattern antagonists targeting failure patterns observed when profiling gRPC/Netty/Kafka-style workloads:
   - `BoundedThreadPoolAntagonist` (`bounded-pool`): 4 scheduled thread pools cross-submitting tasks; one pool is torn down and recreated every 5 s. Targets signal-vs-thread-end races during pool shutdown.
   - `ContextHopAntagonist` (`context-hop`): 8 self-renewing `CompletableFuture` chains hopping A→B→C→A across 3 pools, each stage setting/clearing a `ThreadLocal`. Targets RefCountGuard slot contention under cross-pool handoff.
   - `ConsumerGroupAntagonist` (`consumer-group`): 16 consumer threads with 4 replaced in a burst every 3 s, simulating a Kafka rebalance. Targets ProfiledThread recycling and the signal-vs-thread-end race window.
   - `HiddenClassChurnAntagonist` (`hidden-class-churn`): generates hidden classes via `MethodHandles.Lookup.defineHiddenClass` (Java 15+, reflective no-op on older JDKs). Targets StringDictionary concurrent eviction racing hidden-class GC.
   - `DirectMemoryAntagonist` (`direct-memory`): ring-buffer + burst `ByteBuffer.allocateDirect()` allocation. Targets liveness-table overflow under high off-heap churn.
   - `WeakRefWaveAntagonist` (`weakref-wave`): alternating fill/drop phases with 10k `WeakReference<byte[]>` and a concurrent reader. Targets the jweak ref leak and the liveness-table `clearAll()` race.
2. Crash simulation for the confirmed Java-25 thread-churn × recording-dump UAF (PROF quality escape from logs-backend internal services):
   - `DumpStormAntagonist` (`dump-storm`): spawns 96 short-lived, uniquely-named threads, each with a distinct stack shape, maximising churn in the thread-name table and call-trace storage concurrent with recording dumps. Targets `Recording::switchChunk`/`writeCpool`, `updateJavaThreadNames → ThreadInfo::set`, and `Dictionary::clear`.
   - `stress_threadLifecycle_ut.cpp` (Layer-1 gtest): drives `ProfiledThread`, `ThreadFilter`, `CallTraceStorage`, and `Dictionary` concurrently without a JVM so ASan/TSan can locate the UAF at its origin. Runs under the existing `buildGtestAsan`/`buildGtestTsan` targets.
   - Chaos CI matrix extended with a `CHAOS_JDK` dimension (`21.0.3-tem` + `25.0.3-tem`); all crashes were confirmed on Java 25 in production.

CI fixes (introduced with the JDK matrix):
- `REASON` dotenv key now includes the `CHAOS_JDK` slug to prevent key collisions when two JDK legs of the same job both fail.
- `tee` replaced with a direct redirect so the `chaos_check.sh` exit status is not swallowed.
- JDK version verified after `sdk use`; emits `FAIL:wrong JDK` if the install silently fell back.

Motivation:
Running the profiler against logs-backend production services (gRPC servers, Kafka consumers, Netty pipelines) reveals two classes of crashes our existing tests miss:
… `crash_datadog` attribution tag undercounted the real impact by ~8× because most crashes surface in JVM/JFR code rather than profiler frames. `DumpStormAntagonist` + the JDK-25 chaos cell reproduce this scenario end-to-end.

Additional Notes:
All antagonists compile under `--release 8` and use only JDK APIs. `HiddenClassChurnAntagonist` uses reflection for the Java 15+ `defineHiddenClass` API and prints a skip message on older JDKs.

Investigation and design: `doc/plans/2026-05-29-logs-backend-crash-simulation-design.md`. A follow-up plan for JFR-seeded antagonist calibration is at `doc/plans/2026-05-29-jfr-seeded-antagonist-calibration.md`.

How to test the change?:
The chaos cell is the primary test. Antagonists run under all three allocators × both archs × both configs × both JDKs in the scheduled reliability pipeline.
Local smoke test (no profiler attached — verifies no immediate crash or hang):
Expected: exits 0, last line `[chaos] completed cleanly`.

ASan gtest (Linux, requires Docker):
Expected: `stress_threadLifecycle_ut` compiles and all three cases pass under ASan.

For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.