gh-100687: Reduce frequency of overallocation in long_add#100688
gh-100687: Reduce frequency of overallocation in long_add#100688mdickinson wants to merge 12 commits intopython:mainfrom
Conversation
|
I tested the PR with the (some profiling with valgrind shows that only 11% is spend in the In the |
Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>
add test for long addition with over allocation
|
This needs a couple more changes thanks to the recent changes in long format. Converting to draft while I investigate. |
Should be good to go now. I've removed the special-case handling of normalization: it seems safer to call |
|
@eendebakpt Thank you for looking into this, and for the suggestions and test! I fear that my changes have invalidated your benchmarks, but I suspect the take-away that performance isn't significantly affected either way remains valid. |
The |
|
PR is stale; I suggest closing. I'll do that in a week or two if there are no objections. |
|
Why? This looks good for me too. PR has conflicts, that seems easy to fix. Probably it worth also to update benchmark results after #100688 (comment) (though, I doubt it will affect something). Edit: diff after conflict resolutiondiff --git a/Lib/test/test_long.py b/Lib/test/test_long.py
index b48a8812a1a..c23d141ad7d 100644
--- a/Lib/test/test_long.py
+++ b/Lib/test/test_long.py
@@ -1698,6 +1698,13 @@ class MyInt(int):
# GH-117195 -- This shouldn't crash
object.__sizeof__(1)
+ def test_long_add_overallocate(self):
+ # see gh-100688
+ x = (MASK//2) * (MASK+1)
+ x2 = (MASK//2 + 1) * (MASK+1)
+ z = x + x2
+ self.assertEqual(x + x2, MASK * (MASK + 1))
+
def test_hash(self):
# gh-136599
self.assertEqual(hash(-1), -2)
diff --git a/Misc/NEWS.d/next/Core_and_Builtins/2023-01-02-14-54-12.gh-issue-100688.yenIag.rst b/Misc/NEWS.d/next/Core_and_Builtins/2023-01-02-14-54-12.gh-issue-100688.yenIag.rst
new file mode 100644
index 00000000000..06c11e28c61
--- /dev/null
+++ b/Misc/NEWS.d/next/Core_and_Builtins/2023-01-02-14-54-12.gh-issue-100688.yenIag.rst
@@ -0,0 +1,2 @@
+Reduce frequency of overallocation in some cases of multidigit integer
+additions and subtractions. Thanks Pieter Eendebak for contributions.
diff --git a/Objects/longobject.c b/Objects/longobject.c
index 8ba1fd65078..4b5ba6bc61b 100644
--- a/Objects/longobject.c
+++ b/Objects/longobject.c
@@ -3751,6 +3751,10 @@ x_add(PyLongObject *a, PyLongObject *b)
Py_ssize_t i;
digit carry = 0;
+ /* There are fast paths for cases where a and b both have at most a single
+ digit, so if we end up here then at least one of them is multi-digit. */
+ assert(size_a >= 2 || size_b >= 2);
+
/* Ensure a is the larger of the two: */
if (size_a < size_b) {
{ PyLongObject *temp = a; a = b; b = temp; }
@@ -3758,7 +3762,15 @@ x_add(PyLongObject *a, PyLongObject *b)
size_a = size_b;
size_b = size_temp; }
}
- z = long_alloc(size_a+1);
+
+ /* Allocate sufficient space for the result. In the majority of cases
+ we allocate exactly the right number of digits, but in the (relatively
+ rare) case where the sum of the topmost digits is exactly PyLong_MASK,
+ we'll sometimes end up overallocating by a single digit. */
+ digit top_sum = a->long_value.ob_digit[size_a - 1]
+ + (size_b == size_a ? b->long_value.ob_digit[size_b - 1] : (digit)0);
+ int extra_digit = top_sum >= PyLong_MASK;
+ z = long_alloc(size_a + extra_digit);
if (z == NULL)
return NULL;
for (i = 0; i < size_b; ++i) {
@@ -3771,7 +3783,10 @@ x_add(PyLongObject *a, PyLongObject *b)
z->long_value.ob_digit[i] = carry & PyLong_MASK;
carry >>= PyLong_SHIFT;
}
- z->long_value.ob_digit[i] = carry;
+ assert(carry == 0 || extra_digit);
+ if (extra_digit) {
+ z->long_value.ob_digit[i] = carry;
+ }
return long_normalize(z);
}
|
Addresses #100687; see the issue for discussion and rationale.
This PR does some extra work to get a tighter upper bound on the number of digits needed of addition of two like-signed multidigit ints (or subtraction of two oppositely-signed multidigit ints).
Here's a benchmark script, tracking both memory allocations and performance in the case that's most significant, where both of the addends have exactly two digits (the case of single digit ints takes a fast path and doesn't use
x_add)Some sample results on my macOS / Intel machine. (
cpythonismain,cpython-modifiedis this branch; both were built with--enable-optimizations). Surprisingly, I actually get a minor speed increase where I was expecting a decrease.To do: