Skip to content

gh-100687: Reduce frequency of overallocation in long_add#100688

Open
mdickinson wants to merge 12 commits intopython:mainfrom
mdickinson:leaner-long-add
Open

gh-100687: Reduce frequency of overallocation in long_add#100688
mdickinson wants to merge 12 commits intopython:mainfrom
mdickinson:leaner-long-add

Conversation

@mdickinson
Copy link
Member

@mdickinson mdickinson commented Jan 2, 2023

Addresses #100687; see the issue for discussion and rationale.

This PR does some extra work to get a tighter upper bound on the number of digits needed of addition of two like-signed multidigit ints (or subtraction of two oppositely-signed multidigit ints).

Here's a benchmark script, tracking both memory allocations and performance in the case that's most significant, where both of the addends have exactly two digits (the case of single digit ints takes a fast path and doesn't use x_add)

import random
import timeit
import tracemalloc

def sum_pairs(pairs):
    for x, y in pairs:
        x + y

int_samples = 10**5
timeit_samples = 1000
seed = 12345

# Create test pairs, keeping track of memory allocations
tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()
random.seed(seed)
test_pairs = [
    (random.randrange(2**30, 2**60), random.randrange(2**30, 2**60))
    for _ in range(int_samples)
]
snapshot2 = tracemalloc.take_snapshot()
stats = snapshot2.compare_to(snapshot1, "lineno")
tracemalloc.stop()

# Get iteration time
time = timeit.timeit(stmt="sum_pairs(test_pairs)", globals=globals(), number=timeit_samples) / (timeit_samples * int_samples)

# Report
for stat in stats:
    print(stat)
print(f"Time per iteration (ns): {time * 1e9:.1f}")

Some sample results on my macOS / Intel machine. (cpython is main, cpython-modified is this branch; both were built with --enable-optimizations). Surprisingly, I actually get a minor speed increase where I was expecting a decrease.

mdickinson@lovelace python % ./cpython/python.exe time_int_sum.py         
/Users/mdickinson/Repositories/python/cpython/Lib/random.py:314: size=7031 KiB (+7031 KiB), count=200000 (+200000), average=36 B
/Users/mdickinson/Repositories/python/time_int_sum.py:18: size=6250 KiB (+6250 KiB), count=99984 (+99984), average=64 B
/Users/mdickinson/Repositories/python/cpython/Lib/tracemalloc.py:560: size=312 B (+312 B), count=2 (+2), average=156 B
/Users/mdickinson/Repositories/python/cpython/Lib/tracemalloc.py:423: size=312 B (+312 B), count=2 (+2), average=156 B
/Users/mdickinson/Repositories/python/cpython/Lib/tracemalloc.py:558: size=56 B (+56 B), count=1 (+1), average=56 B
Time per iteration (ns): 39.7
mdickinson@lovelace python % ./cpython-modified/python.exe time_int_sum.py
/Users/mdickinson/Repositories/python/cpython-modified/Lib/random.py:314: size=6250 KiB (+6250 KiB), count=200000 (+200000), average=32 B
/Users/mdickinson/Repositories/python/time_int_sum.py:18: size=6250 KiB (+6250 KiB), count=99984 (+99984), average=64 B
/Users/mdickinson/Repositories/python/cpython-modified/Lib/tracemalloc.py:560: size=312 B (+312 B), count=2 (+2), average=156 B
/Users/mdickinson/Repositories/python/cpython-modified/Lib/tracemalloc.py:423: size=312 B (+312 B), count=2 (+2), average=156 B
/Users/mdickinson/Repositories/python/cpython-modified/Lib/tracemalloc.py:558: size=56 B (+56 B), count=1 (+1), average=56 B
Time per iteration (ns): 37.5

To do:

  • More benchmarking, particularly the case of two digits + single digit; we should also check subtractions
  • Verify that there unit tests that exercise the case where we still overallocate

@eendebakpt
Copy link
Contributor

I tested the PR with the pidigits benchmark (from the pyperformance package). Performance impact is neutral:

Mean +- std dev: [main] 209 ms +- 1 ms -> [pr] 209 ms +- 1 ms: 1.00x slower
Not significant!

(some profiling with valgrind shows that only 11% is spend in the x_add method, and within x_add most time is spend in the for loops performing the actual additions)

In the pidigits calculation there are roughly 156200 calls to x_add for which only 760 times the extra_digit is set (and zero times a false positive).

@mdickinson mdickinson marked this pull request as draft October 25, 2023 07:59
@mdickinson
Copy link
Member Author

This needs a couple more changes thanks to the recent changes in long format. Converting to draft while I investigate.

@mdickinson
Copy link
Member Author

Converting to draft while I investigate.

Should be good to go now. I've removed the special-case handling of normalization: it seems safer to call long_normalize just like everything else does, in case normalization rules change in the future.

@mdickinson mdickinson marked this pull request as ready for review October 25, 2023 09:06
@mdickinson
Copy link
Member Author

mdickinson commented Oct 25, 2023

@eendebakpt Thank you for looking into this, and for the suggestions and test! I fear that my changes have invalidated your benchmarks, but I suspect the take-away that performance isn't significantly affected either way remains valid.

@eendebakpt
Copy link
Contributor

@eendebakpt Thank you for looking into this, and for the suggestions and test! I fear that my changes have invalidated your benchmarks, but I suspect the take-away that performance isn't significantly affected either way remains valid.

The pidigits benchmark results are indeed unaffected. The computation time is dominated by the actual addition (size_a is large for the additions in pidigits). The gain of this PR is in the slightly improved memory allocation for small size_a >= 2.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@mdickinson
Copy link
Member Author

PR is stale; I suggest closing. I'll do that in a week or two if there are no objections.

@skirpichev
Copy link
Member

skirpichev commented Dec 18, 2025

Why? This looks good for me too.

PR has conflicts, that seems easy to fix. Probably it worth also to update benchmark results after #100688 (comment) (though, I doubt it will affect something).

Edit:

diff after conflict resolution
diff --git a/Lib/test/test_long.py b/Lib/test/test_long.py
index b48a8812a1a..c23d141ad7d 100644
--- a/Lib/test/test_long.py
+++ b/Lib/test/test_long.py
@@ -1698,6 +1698,13 @@ class MyInt(int):
         # GH-117195 -- This shouldn't crash
         object.__sizeof__(1)
 
+    def test_long_add_overallocate(self):
+        # see gh-100688
+        x = (MASK//2) * (MASK+1)
+        x2 = (MASK//2 + 1) * (MASK+1)
+        z = x + x2
+        self.assertEqual(x + x2, MASK * (MASK + 1))
+
     def test_hash(self):
         # gh-136599
         self.assertEqual(hash(-1), -2)
diff --git a/Misc/NEWS.d/next/Core_and_Builtins/2023-01-02-14-54-12.gh-issue-100688.yenIag.rst b/Misc/NEWS.d/next/Core_and_Builtins/2023-01-02-14-54-12.gh-issue-100688.yenIag.rst
new file mode 100644
index 00000000000..06c11e28c61
--- /dev/null
+++ b/Misc/NEWS.d/next/Core_and_Builtins/2023-01-02-14-54-12.gh-issue-100688.yenIag.rst
@@ -0,0 +1,2 @@
+Reduce frequency of overallocation in some cases of multidigit integer
+additions and subtractions. Thanks Pieter Eendebak for contributions.
diff --git a/Objects/longobject.c b/Objects/longobject.c
index 8ba1fd65078..4b5ba6bc61b 100644
--- a/Objects/longobject.c
+++ b/Objects/longobject.c
@@ -3751,6 +3751,10 @@ x_add(PyLongObject *a, PyLongObject *b)
     Py_ssize_t i;
     digit carry = 0;
 
+    /* There are fast paths for cases where a and b both have at most a single
+       digit, so if we end up here then at least one of them is multi-digit. */
+    assert(size_a >= 2 || size_b >= 2);
+
     /* Ensure a is the larger of the two: */
     if (size_a < size_b) {
         { PyLongObject *temp = a; a = b; b = temp; }
@@ -3758,7 +3762,15 @@ x_add(PyLongObject *a, PyLongObject *b)
             size_a = size_b;
             size_b = size_temp; }
     }
-    z = long_alloc(size_a+1);
+
+    /* Allocate sufficient space for the result. In the majority of cases
+       we allocate exactly the right number of digits, but in the (relatively
+       rare) case where the sum of the topmost digits is exactly PyLong_MASK,
+       we'll sometimes end up overallocating by a single digit. */
+    digit top_sum = a->long_value.ob_digit[size_a - 1]
+        + (size_b == size_a ? b->long_value.ob_digit[size_b - 1] : (digit)0);
+    int extra_digit = top_sum >= PyLong_MASK;
+    z = long_alloc(size_a + extra_digit);
     if (z == NULL)
         return NULL;
     for (i = 0; i < size_b; ++i) {
@@ -3771,7 +3783,10 @@ x_add(PyLongObject *a, PyLongObject *b)
         z->long_value.ob_digit[i] = carry & PyLong_MASK;
         carry >>= PyLong_SHIFT;
     }
-    z->long_value.ob_digit[i] = carry;
+    assert(carry == 0 || extra_digit);
+    if (extra_digit) {
+        z->long_value.ob_digit[i] = carry;
+    }
     return long_normalize(z);
 }
 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants