[SPARK-57134][SDP] Implement SCD2 Batch Processor; Preprocess Microbatch by AnishMahto · Pull Request #56208 · apache/spark

AnishMahto · 2026-05-29T16:38:59Z

What changes were proposed in this pull request?

Preamble:

The SCD Type 2 (SCD2) flow is a foreachBatch streaming query over an input change data feed. It is responsible for reconciling the incoming change data onto a target table that follows SCD2 replication semantics.

SCD2 flows also maintain an "auxiliary" table that tracks state for early-arriving and out-of-order events. Each microbatch must be reconciled against this auxiliary table as well, and must update the auxiliary table's state appropriately for future microbatches.

Preprocess Microbatch

For SCD2, preprocessing the microbatch means getting it into the right shape: the shape of the target table the microbatch will be merged into, plus the columns that the SCD2 standard itself requires.

That is:

The microbatch must have start-at and end-at columns projected, as per the SCD2 standard, to indicate that a historical or live record was active between those sequence stamps.
The microbatch must have the operational CDC metadata column projected, which is needed for bookkeeping and for reconciling late-arriving events.
As per the Spark AutoCDC API, the microbatch must be projected down to just the user-specified column selection.

This PR implements the part of the core SCD2 microbatch processor that does this microbatch preprocessing.
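
To make the three requirements above concrete, here is a minimal plain-Scala sketch with no Spark dependency. It is not the code in this PR. A row is modeled as a map from column name to value, and the names `Scd2PreprocessSketch`, `cdcMetadata`, the `isDelete` predicate, and the rule that a delete closes its interval at its own sequence value are all assumptions made for illustration.

```scala
object Scd2PreprocessSketch {
  // A row is modeled as column name -> value; this stands in for a DataFrame row.
  type Row = Map[String, Any]

  // Step 1 (assumed semantics): project start-at / end-at from the sequencing column.
  // An upsert opens an interval at its sequence value and leaves it open (None);
  // a delete is modeled here as closing at that same value.
  def withInterval(row: Row, sequenceCol: String, isDelete: Row => Boolean): Row = {
    val seq = row(sequenceCol)
    row + ("startAt" -> seq) + ("endAt" -> (if (isDelete(row)) Some(seq) else None))
  }

  // Step 2: project the operational CDC metadata column (hypothetical name and layout).
  def withCdcMetadata(row: Row, sequenceCol: String): Row =
    row + ("cdcMetadata" -> Map("recordStartAt" -> row(sequenceCol)))

  // Step 3: keep only the user-selected columns plus the columns added in steps 1 and 2.
  def selectColumns(row: Row, selected: Set[String]): Row = {
    val keep = selected ++ Set("startAt", "endAt", "cdcMetadata")
    row.filter { case (name, _) => keep.contains(name) }
  }

  // Selection runs last because steps 1 and 2 read columns it may drop.
  def preprocess(
      batch: Seq[Row],
      sequenceCol: String,
      selected: Set[String],
      isDelete: Row => Boolean): Seq[Row] =
    batch
      .map(withInterval(_, sequenceCol, isDelete))
      .map(withCdcMetadata(_, sequenceCol))
      .map(selectColumns(_, selected))
}
```

For example, calling `preprocess` with `sequenceCol = "seq"` and `selected = Set("id", "value")` reads `seq` while building the interval and metadata columns, then drops it in the final step. Running the selection first would leave steps 1 and 2 with no sequencing column to read.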

Why are the changes needed?

To support AutoCDC SCD2 transformations, as per the approved SPIP: https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7

Does this PR introduce any user-facing change?

No. New feature.

How was this patch tested?

Scd2BatchProcessorSuite

Was this patch authored or co-authored using generative AI tooling?

Co-written with Claude Opus 4.7

AnishMahto · 2026-05-29T18:57:06Z

@jose-torres

AnishMahto · 2026-05-29T20:41:30Z

+/**
+ * Concept: run of upsert events.
+ *
+ * A run is a maximal sequence of consecutive upsert events (in sorted order by sequencing)


Just a heads-up: I explain several concepts in this scaladoc so readers have context on the startAt, endAt, and recordStartAt columns I introduce below, but none of these concepts are actually used in this PR.
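
For readers who want a concrete picture of a "run", here is a small plain-Scala sketch. The quoted scaladoc is cut off, so this reads "consecutive" as "not separated by a non-upsert event such as a delete", which is an assumption. The `Event` type and `upsertRuns` name are hypothetical, and the input is taken to be the events of a single key.

```scala
object UpsertRunsSketch {
  // Hypothetical event model: a sequence stamp and whether the event is an upsert.
  final case class Event(sequence: Long, isUpsert: Boolean)

  // Split events into maximal runs of consecutive upserts, in sequence order.
  def upsertRuns(events: Seq[Event]): List[List[Event]] = {
    val sorted = events.sortBy(_.sequence).toList
    // Fold left to right: an upsert extends the current run, anything else closes it.
    val (done, current) =
      sorted.foldLeft((List.empty[List[Event]], List.empty[Event])) {
        case ((runs, run), e) if e.isUpsert => (runs, e :: run)
        case ((runs, run), _) => (if (run.isEmpty) runs else run.reverse :: runs, Nil)
      }
    // Close the trailing run, then restore sequence order across runs.
    (if (current.isEmpty) done else current.reverse :: done).reverse
  }
}
```

Under that reading, upserts at sequence 1 and 2, a delete at 3, and an upsert at 4 form two runs: one covering 1 and 2, and one covering 4.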

jose-torres

In general, it's a bit hard to understand the concrete abstraction that this PR implements: it applies such-and-such a list of transformations, and does so correctly as far as I can tell, but how do we know it's the right list? Since this is all inside a new component, I'm OK with proceeding as is (after handling the duplicates question), but the structure may mean we have to go back and revisit parts of this in future PRs.

jose-torres · 2026-05-29T22:32:00Z

+   *
+   * Step ordering is load-bearing: the row-extension steps reference user data columns that
+   * target-column selection is allowed to drop, so selection runs last. Unlike SCD1, no per-key
+   * deduplication step is needed - SCD2 preserves every event as part of the row's history.


Does it also preserve full-event duplicates (which would eventually map to a START == END row)?

jose-torres · 2026-05-29T22:35:55Z

+        Row(1, null, 20L, 20L, Row(20L))
+      )
+    )
+  }


Since the SCD1 equivalent preprocessing drops duplicates, I think we should have a case explicitly confirming that duplicates are not dropped here. (Or perhaps confirming that full row duplicates are dropped, if that's the intended behavior.)
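
The kind of check being asked for can be sketched in plain Scala as a contrast between the two behaviors. This is only an illustration: the `Event` type and both function names are hypothetical, "keep the latest event per key" is an assumed stand-in for the SCD1 deduplication, and whether full-row duplicates should actually survive SCD2 preprocessing is exactly the open question above.

```scala
object DedupContrastSketch {
  // Hypothetical event model: key, sequence stamp, and one data column.
  final case class Event(key: Int, sequence: Long, value: String)

  // SCD1-style preprocessing (assumed): keep only the latest event per key.
  def scd1Style(batch: Seq[Event]): Seq[Event] =
    batch.groupBy(_.key).values.map(_.maxBy(_.sequence)).toSeq.sortBy(_.key)

  // SCD2-style preprocessing as the scaladoc describes it: every event is kept,
  // so exact duplicates pass through unchanged.
  def scd2Style(batch: Seq[Event]): Seq[Event] = batch
}
```

A test in this spirit would feed a batch containing two identical events for the same key and assert on how many rows come out, which pins down whichever behavior is intended.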

AnishMahto added 2 commits May 29, 2026 16:48

SCD2 preprocess microbatch

c162b1b

add scaladocs

67c76a8

AnishMahto force-pushed the SPARK-57134-SCD2-preprocess-microbatch branch from e22e3d8 to 67c76a8 on May 29, 2026 16:49

fixing indenting

14dc1b9

jose-torres reviewed May 29, 2026

AnishMahto wants to merge 3 commits into apache:master from AnishMahto:SPARK-57134-SCD2-preprocess-microbatch

2 participants
