[SPARK-57130][BUILD] make-distribution.sh copies only git-tracked files for python #56186

Open
pan3793 wants to merge 2 commits into apache:master from pan3793:SPARK-57130

Conversation

Member

pan3793 commented May 28, 2026

What changes were proposed in this pull request?

When the git and cpio commands are available and the build runs inside a git repository, make-distribution.sh now copies only the git-tracked files of the python folder, instead of copying the whole folder with a raw cp -r.
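
A minimal sketch of the copy logic described above, wrapped in a hypothetical helper named copy_python_dir. The real script inlines this logic, and the cp -r fallback branch shown here is an assumption about what happens when the condition fails. The destination is assumed to be an absolute path, because the helper changes directory before copying.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; not part of make-distribution.sh.
# $1: source root containing a python/ folder
# $2: destination directory (assumed to be an absolute path)
copy_python_dir() {
  local src_root="$1" dist_dir="$2"
  (
    cd "$src_root" || exit 1
    if command -v git && command -v cpio && git rev-parse --git-dir 2>/dev/null; then
      # Copy only tracked files. -z emits NUL-delimited names, so cpio needs -0.
      git ls-files -z python | cpio -0pdm "$dist_dir"
    else
      # Assumed fallback: copy the whole folder, tracked or not.
      cp -r python "$dist_dir"
    fi
  )
}
```

Running the body in a subshell keeps the cd from leaking into the caller.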

Why are the changes needed?

I found that make-distribution.sh sometimes produces an unreasonably large tarball because it copies the entire python folder to the dist directory, including generated files such as compiled PySpark docs.
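
To see which files a raw cp -r would pick up beyond the tracked ones, something like the following sketch may help. list_untracked is a hypothetical helper; it relies on git ls-files --others, which lists untracked files and, without --exclude-standard, also includes ignored ones such as generated docs.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only.
# $1: repository root, $2: path inside the repository to inspect
list_untracked() {
  git -C "$1" ls-files --others -- "$2"
}

# Example use, guarded so it only runs when SPARK_HOME is set and git exists.
if [ -n "${SPARK_HOME:-}" ] && command -v git >/dev/null 2>&1; then
  list_untracked "$SPARK_HOME" python | head -n 20
fi
```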

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Ran dev/make-distribution.sh manually.

I also measured the new command. On macOS, cpio is slower than raw cp (about 1.5s versus 0.7s in the run below), but good enough.

$ time git ls-files -z "$PWD/python" | cpio -0pdm "target"
42452 blocks
git ls-files -z "$PWD/python"  0.01s user 0.01s system 76% cpu 0.027 total
cpio -0pdm "target"  0.05s user 1.10s system 77% cpu 1.480 total
$ rm -rf target/python
$ time cp -r "$PWD/python" "target" 
cp -r "$PWD/python" "target"  0.02s user 0.56s system 78% cpu 0.731 total

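On Linux, cpio runs a little faster than on macOS (about 1.3s versus 1.5s), though it is still slower than cp -r in the run below.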

$ time git ls-files -z "$PWD/python" | cpio -0pdm "target"
46385 blocks
git ls-files -z "$PWD/python"  0.01s user 0.01s system 81% cpu 0.022 total
cpio -0pdm "target"  0.05s user 1.02s system 84% cpu 1.260 total
$ rm -rf target/python
$ time cp -r "$PWD/python" "target"
cp -r "$PWD/python" "target"  0.02s user 0.57s system 73% cpu 0.807 total

Was this patch authored or co-authored using generative AI tooling?

Generated-by: DeepSeek V4 Pro.

Comment thread dev/make-distribution.sh
cp "$SPARK_HOME/README.md" "$DISTDIR"
cp -r "$SPARK_HOME/bin" "$DISTDIR"
cp -r "$SPARK_HOME/python" "$DISTDIR"
if command -v git && command -v cpio && git rev-parse --git-dir 2>/dev/null; then
Member

To keep the console clean, how about redirecting the stdout/stderr of command -v to /dev/null?

Member Author

It follows the existing command -v pattern in this script, e.g., line 128: if [ $(command -v git) ]; then

Member

Got it. If we decide to do this, let's fix all occurrences of this pattern in a separate PR.
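
For reference, a quiet variant of the availability check discussed in this thread could look like the following sketch. have_cmd is a hypothetical helper, not something that exists in make-distribution.sh; it only wraps command -v and discards both output streams.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the named command is on PATH, prints nothing.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd git && have_cmd cpio; then
  echo "git and cpio are both available"
else
  echo "falling back to plain cp"
fi
```

Because the exit status of command -v is used directly, no command substitution or test brackets are needed.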

Comment thread dev/make-distribution.sh Outdated
cp -r "$SPARK_HOME/bin" "$DISTDIR"
cp -r "$SPARK_HOME/python" "$DISTDIR"
if command -v git && command -v cpio && git rev-parse --git-dir 2>/dev/null; then
git ls-files -z "$SPARK_HOME/python" | cpio -pdm "$DISTDIR"
Member

I'm concerned about whether cpio behaves the same way in GNU and BSD. Did you confirm this script works on macOS and Linux?

Member Author

My testing (also confirmed by asking an LLM) shows that BSD cpio behaves differently when symlinks are involved, but I don't think that affects the make-distribution.sh case.

Member

sarutak commented May 28, 2026

I tried your change and I noticed:

  • On Linux, cpio copies only .coveragerc to $DISTDIR/python. The -0 option seems to be required, and it should also work with cpio on macOS.
  • If we run make-distribution.sh from a directory other than $SPARK_HOME, files under python aren't copied to $DISTDIR.

Suggested change
- git ls-files -z "$SPARK_HOME/python" | cpio -pdm "$DISTDIR"
+ (cd "$SPARK_HOME" && git ls-files -z python | cpio -0pdm "$DISTDIR")

Member Author

@sarutak thanks for checking. Yes, -0 should be used, as the cpio help text says:

 Operation modifiers valid in copy-out and copy-pass modes:

  -0, --null                 Filenames in the list are delimited by null

But I think cd is not required, as the script already runs cd "$SPARK_HOME" in line 169. I tested it by running make-distribution.sh from $SPARK_HOME/common and it works fine. Am I missing something?

Member

Sorry, I misunderstood the cd $SPARK_HOME part. Adding the -0 option to cpio is enough.
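
The -0 requirement can be illustrated without cpio itself. The sketch below uses a made-up three-entry file list in the NUL-delimited form that git ls-files -z emits, and counts how many records a newline-oriented reader and a NUL-aware reader would each see. A likely explanation for only .coveragerc being copied is that cpio without -0 treats the whole stream as one name and stops at the first NUL; that is an assumption, not something confirmed in this thread.

```shell
#!/usr/bin/env bash
# Made-up file list, NUL-delimited like the output of `git ls-files -z`.
emit_list() {
  printf 'python/.coveragerc\0python/setup.py\0python/README.md\0'
}

# A newline-oriented reader finds no line terminators in the stream.
newline_records=$(emit_list | wc -l | tr -d ' ')

# A NUL-aware reader (xargs -0 here, cpio -0 in the script) sees every name.
nul_records=$(emit_list | xargs -0 -n 1 echo | wc -l | tr -d ' ')

echo "newline-delimited records: $newline_records"
echo "NUL-delimited records: $nul_records"
```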

Member

dongjoon-hyun left a comment

+1, LGTM. Thank you, @pan3793 and @sarutak.
