Skip to content

[SPARK-57154][INFRA] Harden notify_test_workflow against fork run/check-run race#56212

Open
viirya wants to merge 1 commit into
apache:masterfrom
viirya:SPARK-57154-notify-race
Open

[SPARK-57154][INFRA] Harden notify_test_workflow against fork run/check-run race#56212
viirya wants to merge 1 commit into
apache:masterfrom
viirya:SPARK-57154-notify-race

Conversation

@viirya
Copy link
Copy Markdown
Member

@viirya viirya commented May 29, 2026

What changes were proposed in this pull request?

This PR makes notify_test_workflow.yml resilient to two timing races between
the upstream pull_request_target notify run and the fork's CI:

  1. When listing the fork's workflow runs, instead of blindly taking the most
    recent run (workflow_runs[0]) and throwing if its head_sha does not match
    the PR head SHA, the script now retries (up to 3 times, 3s apart) looking for
    the run whose head_sha matches the PR head SHA. The listing endpoint orders
    by most recent, so the run for the just-pushed SHA may not be registered yet
    and a stale run from a previous push could be returned.

  2. When resolving the Run / Check changes check-run id (used only to render a
    Check-run view link instead of the Actions view, see SPARK-37879), a missing
    check-run no longer throws. The check-run materializes later than the workflow
    run, especially when the matrix is queued, so this is now best-effort: if it
    cannot be found, the Build check is still created pointing at the Actions
    run URL.

Behavior is otherwise preserved: when no runs exist at all, the action_required
("workflow run detection failed") check is still created; when runs exist but
none match the PR head SHA, the script still throws so a fresh notify run handles
the newer commit.

Why are the changes needed?

Previously these races caused the notify run to throw, leaving the PR with no
Build check at all. Because the scheduled update_build_status.yml only syncs
existing Build checks, a PR that hit this race had no status reported and no
way for the updater to recover until the next push. Creating the check (falling
back to the Actions URL when needed) lets the updater take over.

Does this PR introduce any user-facing change?

No. CI infrastructure only.

How was this patch tested?

Static verification: the embedded actions/github-script body passes
node --check, and the workflow YAML parses. The behavior paths (matching run
found, runs present but unmatched SHA, no runs, check-run present, check-run
absent) were traced against the existing control flow.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

…heck-run race

### What changes were proposed in this pull request?

This PR makes `notify_test_workflow.yml` resilient to two timing races between
the upstream `pull_request_target` notify run and the fork's CI:

1. When listing the fork's workflow runs, instead of blindly taking the most
   recent run (`workflow_runs[0]`) and throwing if its `head_sha` does not match
   the PR head SHA, the script now retries (up to 3 times, 3s apart) looking for
   the run whose `head_sha` matches the PR head SHA. The listing endpoint orders
   by most recent, so the run for the just-pushed SHA may not be registered yet
   and a stale run from a previous push could be returned.

2. When resolving the `Run / Check changes` check-run id (used only to render a
   Check-run view link instead of the Actions view, see SPARK-37879), a missing
   check-run no longer throws. The check-run materializes later than the workflow
   run, especially when the matrix is queued, so this is now best-effort: if it
   cannot be found, the `Build` check is still created pointing at the Actions
   run URL.

Behavior is otherwise preserved: when no runs exist at all, the `action_required`
("workflow run detection failed") check is still created; when runs exist but
none match the PR head SHA, the script still throws so a fresh notify run handles
the newer commit.

### Why are the changes needed?

Previously these races caused the notify run to `throw`, leaving the PR with no
`Build` check at all. Because the scheduled `update_build_status.yml` only syncs
existing `Build` checks, a PR that hit this race had no status reported and no
way for the updater to recover until the next push. Creating the check (falling
back to the Actions URL when needed) lets the updater take over.

### Does this PR introduce _any_ user-facing change?

No. CI infrastructure only.

### How was this patch tested?

Static verification: the embedded `actions/github-script` body passes
`node --check`, and the workflow YAML parses. The behavior paths (matching run
found, runs present but unmatched SHA, no runs, check-run present, check-run
absent) were traced against the existing control flow.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

Co-authored-by: Claude Code
Copy link
Copy Markdown
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @viirya .

If possible, could you make this contribution to the sub-projects, too, please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants