prio-queue: use cascade-down sift for faster extract-min#2132

Draft: spkrka wants to merge 1 commit into gitgitgadget:master from spkrka:cascade-sift-down


spkrka commented May 30, 2026

Summary

Replace the standard sift-down in prio_queue_get() with a cascade-down approach that, in the common case, roughly halves the number of comparisons and reduces data movement from 3 copies per level (a swap) to 1 copy per level.

The standard extract-min places the last array element at the root, then sifts it down. At each level this requires two comparisons (left vs right child, then element vs winner) and, when the element is larger, a swap (three 16-byte copies).
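For reference, the standard approach described above can be sketched on a plain int min-heap. This is a minimal illustration, not git's code: `extract_min_standard()` and `swap_int()` are hypothetical names, and git's queue stores 16-byte entries rather than ints.

```c
#include <assert.h>
#include <stddef.h>

/* Swap two ints through a temporary: three copies. */
static void swap_int(int *a, int *b)
{
	int tmp = *a;
	*a = *b;
	*b = tmp;
}

/*
 * Classic extract-min on a binary min-heap stored in a[0..*nr-1].
 * Hypothetical helper for illustration; not git's prio_queue_get().
 */
static int extract_min_standard(int *a, size_t *nr)
{
	int min;
	size_t i = 0, child;

	assert(*nr > 0);
	min = a[0];
	a[0] = a[--*nr];	/* move the last element to the root */

	while ((child = 2 * i + 1) < *nr) {
		/* comparison 1: left child vs right child */
		if (child + 1 < *nr && a[child + 1] < a[child])
			child++;
		/* comparison 2: sinking element vs the smaller child */
		if (a[i] <= a[child])
			break;
		swap_int(&a[i], &a[child]);	/* three copies per level */
		i = child;
	}
	return min;
}
```

Calling it repeatedly on a valid heap such as `{1, 3, 2, 7, 4, 5, 6}` yields the elements in ascending order.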

The cascade approach promotes the smaller child into the vacant root slot at each level — one comparison and one copy. The vacancy sinks to a leaf, where the last array element is placed and sifted up if needed — typically zero levels since the last array element tends to be large.
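A minimal sketch of the cascade idea on a toy int min-heap follows. The names `struct int_heap`, `heap_put()` and `heap_get_cascade()` are assumptions made for illustration; the actual patch operates on git's `struct prio_queue` with a caller-supplied comparison function.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_CAP 64

/* Hypothetical fixed-capacity min-heap of ints (not git's prio_queue). */
struct int_heap {
	int array[HEAP_CAP];
	size_t nr;
};

/* Insert by opening a hole at the end and sifting it up. */
static void heap_put(struct int_heap *h, int v)
{
	size_t i;

	assert(h->nr < HEAP_CAP);
	for (i = h->nr++; i > 0 && v < h->array[(i - 1) / 2]; i = (i - 1) / 2)
		h->array[i] = h->array[(i - 1) / 2];
	h->array[i] = v;
}

/* Extract-min, cascade style. */
static int heap_get_cascade(struct int_heap *h)
{
	int min, last;
	size_t hole = 0, child;

	assert(h->nr > 0);
	min = h->array[0];
	last = h->array[--h->nr];	/* element that must be reinserted */
	if (!h->nr)
		return min;

	/*
	 * Phase 1: sink the vacancy to a leaf by promoting the smaller
	 * child. One comparison and one copy per level.
	 */
	while ((child = 2 * hole + 1) < h->nr) {
		if (child + 1 < h->nr && h->array[child + 1] < h->array[child])
			child++;
		h->array[hole] = h->array[child];
		hole = child;
	}

	/*
	 * Phase 2: place the former last element in the vacancy and
	 * sift it up while it is smaller than its parent.
	 */
	while (hole > 0 && last < h->array[(hole - 1) / 2]) {
		h->array[hole] = h->array[(hole - 1) / 2];
		hole = (hole - 1) / 2;
	}
	h->array[hole] = last;
	return min;
}
```

In this sketch the phase 2 loop is what keeps the heap valid when the reinserted element is smaller than the leaf's ancestors; when it is large, the loop exits immediately.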

In the common case, work per extract drops from 2d comparisons + 3d copies to d comparisons + d copies, where d is the depth of the heap. The sift-up phase can add work when the last element is smaller than ancestors of the leaf vacancy, but this is rare in practice.

prio_queue_replace() is simplified to a plain get+put sequence. This is semantically equivalent: the old implementation wrote to slot 0 and sifted down, which has the same observable effect as removing the root and inserting a new element. No caller observes queue state between the two operations. Replace is only called from pop_most_recent_commit() (fetch-pack, object-name, walker) and show-branch, none of which is on a hot path.
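The equivalence can be illustrated on a toy int min-heap with distinct keys. Everything below is a hypothetical sketch (`struct int_heap`, `sift_down_from_root()`, `replace_in_place()`, `replace_via_get_put()` are made-up names) and ignores details of the real queue such as its comparison callback and LIFO mode.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_CAP 32

/* Hypothetical toy min-heap of ints; git's struct prio_queue differs. */
struct int_heap {
	int array[HEAP_CAP];
	size_t nr;
};

/* Restore heap order after the root slot was overwritten. */
static void sift_down_from_root(struct int_heap *h)
{
	size_t i = 0, child;

	while ((child = 2 * i + 1) < h->nr) {
		int tmp;

		if (child + 1 < h->nr && h->array[child + 1] < h->array[child])
			child++;
		if (h->array[i] <= h->array[child])
			break;
		tmp = h->array[i];
		h->array[i] = h->array[child];
		h->array[child] = tmp;
		i = child;
	}
}

static void heap_put(struct int_heap *h, int v)
{
	size_t i;

	assert(h->nr < HEAP_CAP);
	for (i = h->nr++; i > 0 && v < h->array[(i - 1) / 2]; i = (i - 1) / 2)
		h->array[i] = h->array[(i - 1) / 2];
	h->array[i] = v;
}

static int heap_get(struct int_heap *h)
{
	int min;

	assert(h->nr > 0);
	min = h->array[0];
	h->array[0] = h->array[--h->nr];
	sift_down_from_root(h);
	return min;
}

/* Old-style replace: overwrite the root in place, then sift down. */
static void replace_in_place(struct int_heap *h, int v)
{
	assert(h->nr > 0);
	h->array[0] = v;
	sift_down_from_root(h);
}

/* Simplified replace: remove the root, then insert the new element. */
static void replace_via_get_put(struct int_heap *h, int v)
{
	(void)heap_get(h);
	heap_put(h, v);
}
```

The two variants may leave the array in different internal layouts, but draining both heaps afterwards returns the same sequence, which is the only thing a caller can observe.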

Performance

Profiling git rev-list --count on a 2.5M-commit monorepo shows sift_down_root dropping from 8.2% to 0.4% of total runtime, effectively eliminated as significant overhead.

Synthetic benchmark

10 rounds of 10M put+get cycles, CPU-pinned, median of 3 runs, same compiler and Makefile flags.

Ascending keys (git's typical pattern — parents have lower priority than children):

    queue width       baseline    cascade    speedup
    ------------------------------------------------
             10        4.32s      3.97s      1.09x
            100        7.95s      6.49s      1.23x
          1,000       11.30s      9.66s      1.17x
         10,000       16.34s     14.15s      1.16x
        100,000       21.43s     18.66s      1.15x

Descending keys (worst case — last element always sinks to leaf in both approaches):

    queue width       baseline    cascade    speedup
    ------------------------------------------------
             10        4.84s      4.78s      1.01x
            100        9.43s      9.20s      1.03x
          1,000       15.28s     14.71s      1.04x
         10,000       23.61s     23.49s      1.01x
        100,000       29.16s     28.22s      1.03x

No regressions in any scenario.

End-to-end benchmarks

All benchmarks use git-bench with 1 warmup run followed by 10 timed runs. Each configuration is built from the same source tree and tested on the same repo in alternating order.

Large monorepo (2.5M commits, wide DAG, 8.4K remote branches) — range of ~200 first-parent / ~434 total commits:

    Command                     baseline    cascade    speedup
    ----------------------------------------------------------
    rev-list --count A..B        2.08s      1.97s      1.03x
    log --oneline A..B           2.06s      2.01s      1.04x

linux kernel (1.4M commits) — range v5.0..v6.0 (311K commits):

    Command                          baseline    cascade    speedup
    ---------------------------------------------------------------
    rev-list --count v5.0..v6.0       455ms      440ms      1.04x

git.git (81K commits) — range v2.0.0..v2.45.0 (37K commits):

    Command                                baseline    cascade    speedup
    ---------------------------------------------------------------------
    rev-list --count v2.0.0..v2.45.0         81ms       80ms     ~1.00x

The improvement scales with DAG width: wider DAGs produce larger priority queues, amplifying the per-level savings. In small or narrow repositories the priority queues stay shallow and the sift-down cost is already negligible, so the change is not noticeable.
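To make the scaling concrete, here is a small sketch relating queue size to heap depth and to the per-extract comparison counts quoted earlier (2 per level for the standard sift-down, 1 per level for the cascade descent). The helper names are hypothetical, and the counts are rough upper bounds that leave out the cascade's occasional sift-up work.

```c
#include <assert.h>
#include <stddef.h>

/* Levels below the root in a binary heap of n elements: floor(log2(n)). */
static unsigned heap_depth(size_t n)
{
	unsigned d = 0;

	assert(n > 0);
	while (n >>= 1)
		d++;
	return d;
}

/* Standard sift-down: up to two comparisons per level. */
static unsigned standard_cmps(size_t n)
{
	return 2 * heap_depth(n);
}

/* Cascade descent: one comparison per level (sift-up not counted). */
static unsigned cascade_cmps(size_t n)
{
	return heap_depth(n);
}
```

For a queue of 10 elements the depth is 3, so the estimate is 6 vs 3 comparisons per extract; for 100,000 elements the depth is 16, giving 32 vs 16. The absolute saving per extract therefore grows with queue size.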

Testing

Existing test coverage is thorough:

  • t/unit-tests/u-prio-queue.c (7 tests): verifies ordering invariants for sorted put/get, interleaved operations, replace, empty-queue edge cases, and LIFO mode at heap depths of 3-4.
  • t6600-test-reach.sh (50 tests): exercises real commit-graph traversals through merge-base, is-ancestor, rev-list topo-order, for-each-ref ahead-behind, and branch --merged/--contains.

All tests pass.

spkrka force-pushed the cascade-sift-down branch 8 times, most recently from a45f027 to 0a3a2b0 on May 31, 2026 08:20
Replace the standard sift-down in prio_queue_get() with a
cascade-down approach.

The standard approach places the last array element at the root,
then sifts it down.  At each level this requires two comparisons
(left vs right child, then element vs winner) and, when the
element is larger, a swap (three 16-byte copies).

The cascade approach instead promotes the smaller child into the
vacant root slot at each level — one comparison and one copy.
The vacancy sinks to a leaf, where the last array element is
placed and sifted up if needed — typically zero levels since the
last array element tends to be large.

In the common case, work per extract drops from 2d comparisons
+ 3d copies to d comparisons + d copies, where d is the heap
depth: roughly half the comparisons and a third of the data
movement.  The sift-up phase can add work when the last element
is smaller than ancestors of the leaf vacancy, but this is rare
in practice.

Simplify prio_queue_replace() to a plain get+put sequence.  This
is semantically equivalent: the old implementation wrote to slot 0
and sifted down, which has the same observable effect as removing
the root and inserting a new element.  No caller observes queue
state between the two operations.  The previous implementation
shared sift_down_root() with get, but the cascade approach no
longer accommodates that cleanly since sift_down_root() now
expects the element to reinsert at queue->array[queue->nr], left
there by prio_queue_get() after decrementing nr.  This is fine in
practice: replace is only called from pop_most_recent_commit()
(fetch-pack, object-name, walker) and show-branch, none of
which is on a hot path.

A synthetic benchmark (10 rounds of 10M put+get cycles, ascending
integer keys, CPU-pinned, median of 3 runs, same compiler and
Makefile flags) shows consistent improvement across all queue
sizes, with no regressions:

    queue width       baseline    cascade    speedup
    ------------------------------------------------
             10        4.32s      3.97s      1.09x
            100        7.95s      6.49s      1.23x
          1,000       11.30s      9.66s      1.17x
         10,000       16.34s     14.15s      1.16x
        100,000       21.43s     18.66s      1.15x

With descending keys (worst case — the last element always sinks
to a leaf in both approaches) the cascade still wins slightly
(1-4%) by replacing swaps with copies, and never regresses.

In end-to-end git commands the improvement is modest because
sift_down_root is only ~8% of total runtime.  Profiling
rev-list --count on a 2.5M-commit monorepo shows sift_down_root
dropping from 8.2% to 0.4% of total runtime.  The improvement
scales with DAG width: wider DAGs produce larger priority queues,
amplifying the per-level savings.  In small or narrow repos the
queues stay shallow and the effect is negligible.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
spkrka force-pushed the cascade-sift-down branch from 0a3a2b0 to 9ca2fab on May 31, 2026 08:25