Skip to content

gh-96859: [argparse] Avoid O(N^2) behavior while consuming optionals #96904

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

alexeypa
Copy link

@alexeypa alexeypa commented Sep 18, 2022

The loop consuming optional and positional arguments in ArgumentParser._parse_known_args() rescans option_string_indices on every iteration of the loop in order to compute the next option index:

next_option_string_index = min([
    index
    for index in option_string_indices
    if index >= start_index])

This becomes very slow when parsing a very long list of arguments, e.g. when parsing a response file.

This commit refactors the loop to scan the option indices list only once.

…onals

The loop consuming optional and positional arguments in ArgumentParser._parse_known_args()
rescans option_string_indices on every iteration of the loop in order to compute the next
option index:

```py
next_option_string_index = min([
    index
    for index in option_string_indices
    if index >= start_index])
```

This becomes very slow when parsing a very long list of arguments, e.g. when parsing
a response file.

This commit refactors the loop to scan the option indices list only once.
@bedevere-bot
Copy link

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@ghost
Copy link

ghost commented Sep 18, 2022

All commit authors signed the Contributor License Agreement.
CLA signed

@kumiDa
Copy link
Contributor

kumiDa commented Sep 18, 2022

@bitdancer, please share your reviews.

Copy link
Contributor

@DanielNoord DanielNoord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code itself looks good, although I had some trouble trying to understand the new flow based on the old flow. Having no particular experience with the code didn't help.

I had one question and some comments about comments. Those should be updated a little bit.

Copy link
Author

@alexeypa alexeypa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, what is the etiquette around commits addressing review feedback that are not really meaningful changes by themselves? Should I squash them before the PR is merged?

@alexeypa
Copy link
Author

Tests / Windows (x86) (pull_request) Failing after 31m
https://github.com/python/cpython/actions/runs/3091293898/jobs/5001192381

This does not seem related to the changes on the PR. Also the latest commit only touched comments.

@AlexWaygood
Copy link
Member

Tests / Windows (x86) (pull_request) Failing after 31m
https://github.com/python/cpython/actions/runs/3091293898/jobs/5001192381

This does not seem related to the changes on the PR. Also the latest commit only touched comments.

Try closing and reopening the PR to get the check to re-run.

@AlexWaygood
Copy link
Member

BTW, what is the etiquette around commits addressing review feedback that are not really meaningful changes by themselves? Should I squash them before the PR is merged?

Please don't! All CPython PRs are squash-merged anyway, so you don't need to; and PR authors squashing commits together makes it harder for reviewers to see what changed between commits due to the GitHub UI :)

@AlexWaygood AlexWaygood reopened this Sep 20, 2022
@alexeypa
Copy link
Author

A friendly ping to bring some attention to this PR.

@rhettinger, would you be the best person to do a core review? You listed as the maintainer of argparse.py at the top of the file. Please let me know if I should ping someone else.

@AlexWaygood AlexWaygood added the performance Performance or resource usage label Oct 22, 2022
@hpaulj
Copy link

hpaulj commented Nov 4, 2022

I used to contribute a lot to argparse as paul.j3, but haven't followed it much since the github move. So this will be a general comment.

Backward compatibility has been a major issue; I can think of several changes that either had to be removed, or have remained and continue to cause problem. Thus I favor bug fixes, and feature enhancements that have little chance of causing problems. I discourage core changes that aren't essential.

This looks like a core change simply for the sake of performance. I'd vote against it, unless it has been thoroughly tested, not just read through by one or two people. I don't think there are enough argparse developers to do that well.

Most users don't use argparse for large inputs. If performance came up on StackOverFlow, I'd tell them to read the input(s) from a file, and reserve argparse for short lists - file names and controls, not data.

@alexeypa
Copy link
Author

alexeypa commented Nov 7, 2022

Thanks for your feedback, @hpaulj. The change is indeed fixes a performance issue and attempts to not modify the existing behavior in any way. The test_argparse.py suite looks pretty comprehensive and tests a number is of cases involving append actions and a mix of optional/positional arguments. However it does not test the mix of argparse and 3rd party apps using it - it is not an appcompat test suite. I don't believe I can provide any evidence showing that this change does not introduce breakage with in a large body of 3rd party code. If this means this change does not meet the bar - so be it.

If performance came up on StackOverFlow, I'd tell them to read the input(s) from a file, and reserve argparse for short lists - file names and controls, not data.

The specific case this issue came up in was parsing a response file generated by a build system. It was a reasonable legitimate case of passing diverse command line parameters via a file due to size constraints. I'd agree this is a corner case.

@hpaulj
Copy link

hpaulj commented Nov 7, 2022

From a quick read of your changes it looks like you are replacing the set, option_string_indices with list option_tuples and option_tuple_index.

Years ago I added parse_intermixed_args. I created a whole new pair of functions rather than add some sort of parameter and alternative parsing path in the default parse_args. You might consider something similar with a parse_faster_args. At least until we get more feedback, leave the current parser as the default, but provide interested users with an alternative.

A down side to that is it won't get as much beta testing. But I don't have a good feel for how many problems get resolved with Python beta releases.

@alexeypa
Copy link
Author

alexeypa commented Nov 9, 2022

You might consider something similar with a parse_faster_args

Hm, this is an interesting suggestion...

_parse_known_args() is 200+ lines long. It would not be wise just clone it and apply this patch to the new version. We can probably refactor _parse_known_args() to make it more manageable - move sub-functions out, maybe break it up into smaller functions. It should be possible to do it and demonstrate that all these changes are just moving existing code around.

After this refactoring re-implementing the core parsing loop and exposing it as a new top-level parsing function should be relatively simple. Having two parallel functions also allow cross-checking one against the other.

I'm not sure this plan has lower risk of introducing a backward compatibility issue. The refactoring alone is going to be pretty substantial in terms of the size of the patch...

@alexeypa
Copy link
Author

Closing this PR due to lack of interest.

@alexeypa alexeypa closed this Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting core review performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants