We currently specialize `TO_BOOL` for many common types. This avoids the overhead of API calls, but we still need to load either `True` or `False`, then test against `True` or `False`. Loading and comparing with `Py_True` and `Py_False` is expensive for what are often quite simple operations. E.g. `_TO_BOOL_LIST` is 10 instructions (AArch64 Linux), but only half of those perform the actual comparison.

We can break down `_TO_BOOL_FOO` into `_TO_BOOL_BIT_FOO; _BIT_TO_BOOL` and then optimize `_BIT_TO_BOOL; _GUARD_IS_TRUE_POP` to `_GUARD_IS_TRUE_BIT_POP`, where the "bit" versions produce a single-bit boolean (0 for False, 1 for True).

Whereas `_TO_BOOL_LIST` is 10 instructions, a hypothetical `_TO_BOOL_BIT_LIST` would only be 5 instructions. We already optimize `_GUARD_IS_TRUE_POP` to `_GUARD_BIT_IS_SET_POP`, reducing the number of machine instructions from 5 to 2, but replacing it with `_GUARD_IS_TRUE_BIT_POP` would reduce it to a single machine instruction and remove the need for the replication in `_GUARD_BIT_IS_SET_POP`.

We can also replace many of the comparisons with a "bit" form, e.g. replacing `_COMPARE_OP_FLOAT` with `_COMPARE_OP_BIT_FLOAT` would reduce the code size from 19 to 13 instructions (21 to 14 accounting for the following guard as well).
[ Specializing for the actual operation can further reduce the stencil size to 8 instructions ]

--------------------

All instruction sizes are for the variant with all inputs and outputs in registers.

<!-- gh-linked-prs -->
### Linked PRs
* gh-149418
<!-- /gh-linked-prs -->
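
The proposed split-then-fuse rewrite can be sketched as a toy pass over a list of uop names. This is only an illustrative model in Python, not CPython's optimizer: the functions `split_to_bool`, `fuse_bit_guards` and `optimize` and the placeholder `_OTHER_UOP` are hypothetical, and a real trace carries operands and stack state that are ignored here.

```python
# Toy model of the proposed rewrite. Uops are plain strings; the helper
# names below are hypothetical and do not exist in CPython.
BIT_TO_BOOL = "_BIT_TO_BOOL"
GUARD_IS_TRUE_POP = "_GUARD_IS_TRUE_POP"
GUARD_IS_TRUE_BIT_POP = "_GUARD_IS_TRUE_BIT_POP"


def split_to_bool(trace):
    """Rewrite each _TO_BOOL_FOO as _TO_BOOL_BIT_FOO; _BIT_TO_BOOL."""
    out = []
    for uop in trace:
        already_bit = uop.startswith("_TO_BOOL_BIT_")
        if uop.startswith("_TO_BOOL_") and not already_bit:
            suffix = uop[len("_TO_BOOL_"):]
            out.append("_TO_BOOL_BIT_" + suffix)  # produces a 0/1 bit
            out.append(BIT_TO_BOOL)               # converts bit to True/False
        else:
            out.append(uop)
    return out


def fuse_bit_guards(trace):
    """Fuse _BIT_TO_BOOL; _GUARD_IS_TRUE_POP into _GUARD_IS_TRUE_BIT_POP."""
    out = []
    i = 0
    while i < len(trace):
        followed_by_guard = (
            i + 1 < len(trace) and trace[i + 1] == GUARD_IS_TRUE_POP
        )
        if trace[i] == BIT_TO_BOOL and followed_by_guard:
            out.append(GUARD_IS_TRUE_BIT_POP)  # guard tests the bit directly
            i += 2
        else:
            out.append(trace[i])
            i += 1
    return out


def optimize(trace):
    """Apply the split, then the fusion."""
    return fuse_bit_guards(split_to_bool(trace))
```

In this model, a conversion that feeds a guard never materializes `True` or `False`, while a conversion whose result is used elsewhere keeps its `_BIT_TO_BOOL` so the consumer still sees a real boolean object.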