[SPARK-57130][BUILD] make-distribution.sh copies only git-tracked files for python#56186
[SPARK-57130][BUILD] make-distribution.sh copies only git-tracked files for python#56186pan3793 wants to merge 2 commits into
make-distribution.sh copies only git-tracked files for python#56186Conversation
| cp "$SPARK_HOME/README.md" "$DISTDIR" | ||
| cp -r "$SPARK_HOME/bin" "$DISTDIR" | ||
| cp -r "$SPARK_HOME/python" "$DISTDIR" | ||
| if command -v git && command -v cpio && git rev-parse --git-dir 2>/dev/null; then |
There was a problem hiding this comment.
To keep the console clean, how about redirecting the stdout/stderr of command -v to /dev/null?
There was a problem hiding this comment.
it follows the existing command -v pattern, e.g., in line 128 if [ $(command -v git) ]; then
There was a problem hiding this comment.
Got it. Let's fix all the occurrences of such pattern in the separate PR if we should do it.
| cp -r "$SPARK_HOME/bin" "$DISTDIR" | ||
| cp -r "$SPARK_HOME/python" "$DISTDIR" | ||
| if command -v git && command -v cpio && git rev-parse --git-dir 2>/dev/null; then | ||
| git ls-files -z "$SPARK_HOME/python" | cpio -pdm "$DISTDIR" |
There was a problem hiding this comment.
I'm concerned about whether cpio behaves the same way in GNU and BSD. Did you confirm this script works on macOS and Linux?
There was a problem hiding this comment.
my testing (also confirmed by asking LLM) shows the BSD cpio has different behavior when involving symlinks, but I think it does not affect the make-distribution.sh case
There was a problem hiding this comment.
I tried your change and I noticed:
- On Linux environment,
cpioonly copies.coveragercto$DISTDIR/python.-0option seems required, which should work withcpioon macOS. - If we run
make-distribution.shon a directory other than$SPARK_HOME, files underpythonare't copied to$DISTDIR.
| git ls-files -z "$SPARK_HOME/python" | cpio -pdm "$DISTDIR" | |
| (cd "$SPARK_HOME" && git ls-files -z python | cpio -0pdm "$DISTDIR") |
There was a problem hiding this comment.
@sarutak thanks for checking, yes, -0 should be used
Operation modifiers valid in copy-out and copy-pass modes:
-0, --null Filenames in the list are delimited by null
but I think cd is not required as it runs cd "$SPARK_HOME" in lien 169, I tested it by running the make-distribution.sh under $SPARK_HOME/common, it works fine, am I missing something?
There was a problem hiding this comment.
Sorry. It was a misunderstanding on my part about cd $SPARK_HOME. Adding -0 option to cpio is enough.
What changes were proposed in this pull request?
make-distribution.shcopies only git-tracked files forpythonfolder, whengitandcpiocommands are available and under a git repo, instead of rawcp.Why are the changes needed?
I find that sometimes
make-distribution.shproduces an unreasonably large tarball because it copies the entirepythonfolder to thedistdirectory, which may contain generated files, e.g., compiled PySpark docs.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Run
dev/make-distribution.shmanually.Also tested the performance of the new command, on macOS,
cpiois slightly slower than rawcp, but good enough.on Linux,
cpiois fasterWas this patch authored or co-authored using generative AI tooling?
Generated-by: DeepSeek V4 Pro.