assert: fix NotSubset error messages using %#v instead of %q (fixes #1800)#1888
Merged
Merged
Conversation
NotSubset was formatting the subset and list values with %%q, which
produces garbage output for non-string element types. For example,
a []bool{true} subset would render as [%%!q(bool=true)] instead of
[]bool{true}.
Subset already uses %%#v (Go-syntax representation) which works
correctly for all types. Align NotSubset to use the same format
for all five of its Fail call sites.
Updated the existing TooLongToPrint tests and the shared Subset/
NotSubset table-driven test to match the new %%#v representation.
Added TestNotSubsetFormatsNonStringElementsCorrectly as a direct
regression test for the reported case ([]bool).
Fixes stretchr#1800
Author
|
Friendly ping — this PR is ready for review. All tests pass ✅. Fixes the |
dolmen
approved these changes
May 11, 2026
This was referenced May 11, 2026
adithya-s-k
added a commit
to huggingface/Repo2RLEnv
that referenced
this pull request
May 21, 2026
Sweeping the 38 launch repos surfaced two leak patterns that slipped past
the v0.8.1 + first-pass strip:
- Manual close marker in the PR title: " (fixes #N)" — caught
stretchr/testify#1888.
- Dependabot release notes embedding `https://redirect.github.com/...`
URLs that point straight at the linked PR/issue — caught 6 cells
(gin, jsonschema×2, urfave/cli, expressjs, chronotope).
Title squash-suffix regex now also matches parenthesized closes-style
markers (`(closes|fixes|resolves|see|refs) #N` with optional comma list).
GH-URL regex now matches any `[subdomain.]github.com` host so redirector
domains are stripped too.
Re-ran the sweep after the fix — zero leaks remain across all 127
emitted instructions. Findings + per-pattern verification in
docs/release_notes/v0.8.3/findings-pr_diff.md.
Adds 2 tests to tests/test_pipeline_pr_diff.py covering both real-world
patterns. Full suite at 649 passing.
adithya-s-k
added a commit
to huggingface/Repo2RLEnv
that referenced
this pull request
May 26, 2026
… dataset (#40) * pr_diff: broaden instruction info-leak strip PR descriptions frequently link back to the answer the agent is supposed to produce — adding noise that lets a frontier model shortcut by fetching the linked artifact. We now strip: - Closes / Fixes / Resolves #N (already; extended to multi: '#1, #2') - See / Refs / Follow-up to #N - Markdown issue links [#N](url) - Bare github.com /pull/, /issues/, /commit/ URLs - Co-authored-by / Signed-off-by / Reviewed-by / Acked-by trailer lines - "(#N)" squash-merge suffix on the title 12 new unit tests in tests/test_pipeline_pr_diff.py covering each pattern + the end-to-end _build_instruction shape. Full suite at 647 passing. This is the v0.8.3 Arc 1 optimization landed alongside the full pr_diff sweep. * pr_diff: harden info-leak strip after sweep findings Sweeping the 38 launch repos surfaced two leak patterns that slipped past the v0.8.1 + first-pass strip: - Manual close marker in the PR title: " (fixes #N)" — caught stretchr/testify#1888. - Dependabot release notes embedding `https://redirect.github.com/...` URLs that point straight at the linked PR/issue — caught 6 cells (gin, jsonschema×2, urfave/cli, expressjs, chronotope). Title squash-suffix regex now also matches parenthesized closes-style markers (`(closes|fixes|resolves|see|refs) #N` with optional comma list). GH-URL regex now matches any `[subdomain.]github.com` host so redirector domains are stripped too. Re-ran the sweep after the fix — zero leaks remain across all 127 emitted instructions. Findings + per-pattern verification in docs/release_notes/v0.8.3/findings-pr_diff.md. Adds 2 tests to tests/test_pipeline_pr_diff.py covering both real-world patterns. Full suite at 649 passing. * pr_diff: handle markdown-link forms of issue refs and GH URLs Sergio's review on the PR surfaced two additional patterns that the first-pass strip didn't handle cleanly: 1. `Closes [#1234](url)` — the markdown-link form of a Closes/Fixes/ Resolves ref. The bare-#N regex didn't match, so the keyword was left orphaned in the output (only the [#N](url) part was stripped by the issue-link regex). 2. `[descriptive text](https://github.com/x/y/pull/N)` — markdown link whose URL points at a GH pull/issues/commit, with any link text (not just `[#N]`). The bare-URL regex stripped the URL but left empty `[text]()` brackets in the prose. Two new regexes — _CLOSES_MD_RE / _REFS_MD_RE for the closes-style markdown variants, _MD_GH_URL_RE for the descriptive-text variant — run BEFORE the piece-wise regexes so composite patterns are stripped whole rather than fragmented. 4 new unit tests cover both cases (single, multi-list, descriptive text, see-with-markdown). Suite at 638 passing. * pr_diff: emit Harbor-runnable env with SWE-RL diff-similarity verifier The whole point of Repo2RLEnv is verifiable RL envs that LLMs can solve. v0.8.1's pr_diff was text-only — instruction.md + solution/patch.diff, nothing harbor-runnable. The reward was supposed to be computed by the consumer externally. That fails the bar. This change makes pr_diff produce a fully Harbor-runnable env where the verifier IS the SWE-RL-style sequence-similarity score against the oracle diff. Same reward function as repo2rlenv.reward (in fact, the verifier's embedded Python is kept in lockstep with reward.py — pure stdlib, ~30 lines). New helpers in src/repo2rlenv/pipelines/pr_diff.py: - `build_pr_diff_environment_dockerfile(repo_url, base_commit, oracle_diff)` - FROM python:3.12-slim + git - shallow clone of the repo @ base_commit - base64-bakes the oracle diff into /verifier/oracle.patch - no bootstrap LLM — ~30s build per task - `build_pr_diff_eval_script(base_commit)` - captures git diff <base_commit> after the agent's edits - embeds the diff-similarity Python (base64) — needs only python3 + git - writes the score to /logs/verifier/reward.txt `PRDiffOptions.emit_harbor_env: bool = True` default-on. Set False to get the v0.8.1 text-only output (consumer computes reward). End-to-end smoke on pallets/click PR #3508: - harbor run -a oracle → reward 1.000 (28s) - harbor run -a claude-code -m anthropic/claude-sonnet-4-6 → reward 0.710 (4m32s) Both via harbor run on the emitted task as-is. The oracle case proves the verifier is wired correctly; the Sonnet case proves the agent path works end-to-end with a real partial-credit reward. 6 new unit tests on the Dockerfile + eval-script builders (base64 encoding, special-char patches, difflib in inline Python, reward.txt path). 1 existing test updated to reflect the new instruction wording. Suite at 644 passing. * pr_diff: 6-component reward + LLM judge + calibration metadata Replace the single-scalar diff-similarity reward with the SWE-RL-paper- style multi-component approach. Lifts pr_diff from "coarse training signal" to "evaluable single-task env" while keeping the cheap/multi- language property. New module: src/repo2rlenv/pipelines/_pr_diff_verifier.py - Pure-stdlib in-container verifier (read at gen time, base64-baked into tests/test.sh, decoded back to a file at run time). - Reviewable + unit-testable as ordinary Python — no more opaque base64-blob diffs in test.sh. 6 components (sum = 1.0 default weights): format_valid 0.05 (predicted text parses as unified diff) size_sanity 0.05 (min(o_loc, p_loc) / max — rampage guard) file_targeting 0.10 (F1, NOT Jaccard — recall matters more than punishing extras; explained in module) region_overlap 0.20 (predicted hunks overlap oracle hunks with 5-line slack — strongest spatial signal) similarity 0.20 (SequenceMatcher over +/- lines ONLY — fixes v0.8.1 context-credit inflation) llm_judge 0.40 (Haiku scores semantic correctness; graceful degradation on API failure, remaining weights re-normalize) Verifier outputs reward.txt (single float, harbor reads this) AND reward.json (full breakdown for downstream inspection / re-weighting). Gen-time additions in pr_diff.py: - Quality filter: drops test-only / docs-only / revert / trivially- small diffs / instruction-too-thin candidates before emission. - Calibration baseline: empty-patch reward stamped in task.toml.metadata.reward_calibration.baseline_reward. Consumers compute calibrated = (raw - baseline) / (1 - baseline) for cross-task comparability. - Difficulty bucket: trivial/small/medium/large by oracle LOC. - PRDiffOptions.min_loc_changed: int = 3 (filter knob). Dockerfile now pre-installs claude-code via npm at build time (apt + npm layers cacheable across all pr_diff tasks). This makes harbor's claude-code agent-setup robust at 25-parallel-container scale — no more per-container curl-install flakiness. End-to-end smoke on pallets/click PR #3508: - harbor run -a oracle: reward 1.000 (8s with cached layers) - harbor run -a claude-code -m claude-sonnet-4-6 --ve ANTHROPIC_API_KEY=$X: final 0.98 (5m), all 5 deterministic = 1.0, llm_judge = 0.95 Test coverage: 37 new tests in tests/test_pr_diff_verifier.py covering each component, judge graceful degradation, weight redistribution. Suite at 681 passing. * pr_diff: stage new files before git diff in verifier `git diff <base>` only sees TRACKED files. New files added by the oracle patch (or the agent) are untracked and silently absent from the predicted.patch, causing oracle runs to score < 1.0 for any PR that introduces new files. Found by the 25-env pilot: 3 of 13 oracle runs scored 0.4-0.79 instead of 1.0 — all 3 were PRs that added new files (gradio added 3 new files in a 4-file PR, scoring 0.40 instead of 1.0). Fix: `git add -A; git diff --cached <base>` — `-A` stages all new files so they appear in the predicted diff just like the oracle. * pr_diff: bake claude-code at build time, not at agent-setup Pilot at concurrency=25 (then 12) had 100% Sonnet failure with AgentSetupTimeoutError after 360s. Root cause: harbor's claude-code adapter runs `curl claude.ai/install.sh | bash` in EVERY container at agent-setup time. N parallel ~80MB downloads saturate local bandwidth and time out. Fix: pre-install claude-code in the Dockerfile at BUILD time using the same install.sh route harbor uses. Docker's content-addressable layer cache means the install runs ONCE per image-build and is shared across all per-task images (the install RUN line is identical between tasks). When harbor's adapter re-runs the install script per-container, it finds the binary already in /root/.local/bin and short-circuits the network step. Also drops the nodejs/npm route which was a dead end — harbor's adapter only uses npm on Alpine images; on python:3.12-slim it always uses curl install.sh regardless of pre-installed binaries. Single-task smoke after the fix: 0.858 reward in 3m57s (was 4m56s before the bake; same task, all 6 components fire including llm_judge=0.92). * pr_diff: retune verifier weights after 23-task pilot Sonnet pilot ran 23 tasks through the full env (oracle 21/21=1.000, sonnet mean 0.634, range 0.16-0.98 = healthy eval distribution). Fed the per-task component data to Sonnet for reward-engineering analysis; its grounded recommendations are now the defaults. Changes: format_valid 0.05 → 0.00 (was 1.0 on EVERY trial; pure dead weight) size_sanity 0.05 → 0.08 (useful outlier detector) file_targeting 0.10 → 0.12 (leading indicator, least correlated) region_overlap 0.20 → 0.20 (unchanged; strongest spatial signal) similarity 0.20 → 0.10 (~0.85 correlation with region_overlap; double-counts positional accuracy AND penalizes alternative implementations) llm_judge 0.40 → 0.50 (only semantic-independent signal; diverges informatively from text-based components on e.g. tokenizers/evaluate) TOTAL = 1.00 Also adds a catastrophic-size hard cap: if size_sanity < 0.10, the final reward is clamped to ≤ 0.40. Prevents a charitable judge from inflating the score on patches that are dramatically the wrong size (prettier hit 0.011 — Sonnet wrote ~5 lines vs oracle's 500-line release-notes dump; without the cap, judge=0.25 still pulled the final to 0.23). Two new tests pin the defaults — `_DEFAULT_WEIGHTS` must sum to 1.0, format_valid must stay 0.0, similarity < region_overlap. Suite at 683 passing. * docs: pr_diff now Harbor-runnable + 6-component reward - docs/pipelines/pr_diff.md rewritten end-to-end: - Multi-component reward (6 components + LLM judge + catastrophic-size cap) - Calibration baseline + difficulty bucket metadata - Updated info-leak strip (8 pattern families) - New options table (emit_harbor_env, min_loc_changed) - New skip-reason list (test-only, docs-only, revert, diff-too-small, thin-instruction) - Consumer-side now uses harbor run with claude-code; LLM-judge env-var path documented - Reference dataset link placeholder (filled after HF push) - docs/quickstart.md updated: pr_diff is now Harbor-runnable; show the full harbor run incantation for both oracle and Sonnet adapters. - docs/pipelines/README.md table: pr_diff Sandbox column is now "thin¹" with footnote explaining the python:3.12-slim env + LLM-as-judge at verify time. - README.md table mirrors the same. Test counts left generic in docs per repo convention (specifics live in PR bodies / release notes). * pr_diff: publish 100-env reference dataset + augmented HF card Dataset live at https://huggingface.co/datasets/AdithyaSK/repo2rlenv-pr-diff - 100 verified environments - 26 source repos (Tier A SWE-bench + Tier B HF ecosystem + Tier C multi-lang) - All structurally validated (`repo2rlenv validate` passes for every task) Augmented HF dataset card (src/repo2rlenv/hub.py:_build_dataset_card): - Supports multi-repo datasets — renders "Source repos (N)" list when the push spans more than one source repo (was: single repo only). - Detects whether tasks ship environment/Dockerfile and renders the harbor-runnable recipe (oracle + claude-code) accordingly. - Adds "How it was generated" section with reproduction recipe. - Adds the LLM-as-judge `--ve ANTHROPIC_API_KEY=$X` flag to the harbor-run example (was missing — verifier-side env wasn't documented). - Tags include the pipeline name so collections can filter. Registry-integration fast-path (src/repo2rlenv/registry/integration.py): - When every task's Dockerfile FROM ref is publicly pullable (e.g. `python:3.12-slim` for pr_diff's self-contained Dockerfile), skip the image-distribution step entirely. Was: every dataset with environment/ tried to push an image to a container registry and failed for pr_diff (no bootstrap upstream image to push). Docs: - docs/pipelines/pr_diff.md gets the published HF URL. - docs/release_notes/v0.8.3/findings-pr_diff.md rewritten with the 100-env numbers, the LLM-driven reward-weight retune story, and the 8-pattern info-leak strip. * pr_diff: drop claude-code from Dockerfile (wrong layer) Earlier commit (69d1631) baked claude-code into the task's environment/Dockerfile via curl claude.ai/install.sh at build time. That was wrong: the task spec is supposed to be agent-agnostic. The Dockerfile defines the ENVIRONMENT the task runs in (repo, verifier, tools) — it shouldn't pre-install a specific agent vendor's binary. Consequences of the bake: - ~80MB of claude-code in every image, even when the agent is openhands / codex / aider / etc. - A specific Anthropic CLI on PATH could interfere with what other agents do (unlikely but possible). - Cross-vendor contamination — the dataset implicitly assumes claude-code is the runner. The original problem the bake was supposed to solve: at concurrency≥8, N parallel `curl claude.ai/install.sh` calls during harbor's claude-code agent-setup saturate local bandwidth and time out. Correct fixes: 1. Lower concurrency to ≤5 (harbor's own default is 4). Confirmed working in the latest smoke: 0.975 reward in 2m33s with no bake. 2. Use --max-retries 2 (already in run_pilot.py). 3. File an upstream harbor issue for install-script caching. Single-task smoke after the revert: 0.975, 2m33s. The 100-env reference dataset has been re-generated and re-pushed to HF Hub with the clean Dockerfile. * pr_diff: drop duplicated weights, import _DEFAULT_WEIGHTS directly Addresses sergiopaniego's PR #40 review: the no-op calibration helper had a hand-copied weights dict that drifted out of sync with the retune in _pr_diff_verifier._DEFAULT_WEIGHTS. Import the single source of truth instead. The duplicate was harmless in practice (format_valid('') = 0.0 makes the baseline 0.0 with either weight set), but it's a latent footgun the next time the weights change. * docs: richer pipeline table + pr_diff spotlight + multi-agent examples README + docs/pipelines/README: - Expanded the pipelines table with a longer "What it produces" column and a per-pipeline "LLM use" classification (at synthesis / at bootstrap / at verify) — every pipeline calls an LLM somewhere, the table now makes the location explicit. - docs/pipelines/README.md adds a "Reference dataset" column linking the published HF dataset for `pr_diff`, plus a Spotlight section that walks the task layout + 6-component reward + reproduction recipe inline. Multi-agent examples: - quickstart, pr_diff doc, and Spotlight now show claude-code AND openhands invocations side-by-side, with the full Harbor agent catalogue (25+ harnesses) inlined as a comment block. The contract is agent-agnostic — claude-code is what we used to verify the reference dataset, not a requirement. findings-pr_diff: corrected the stale claim that we bake claude-code into the Dockerfile (we don't anymore — Harbor's agent adapter installs whatever runtime its agent needs at run time). * docs: drop references to gitignored launch-side scripts findings-pr_diff used to point at plans/v083_scripts/run_pilot.py for the generation recipe, but plans/ is gitignored — those paths can't be opened by anyone reading the PR. Rewrote the recipe in terms of the public `repo2rlenv generate` CLI so the published dataset is reproducible without internal tooling. Also dropped the analyze_pilot.py mention in _pr_diff_verifier.py's _DEFAULT_WEIGHTS comment — the rationale is the point, not the specific script that produced it. * pr_diff: address external audit — fast-path safety + private-repo guard Three correctness fixes flagged by an independent audit of PR #40: 1. Self-contained Dockerfile fast path was too broad. It skipped the image-push step for ANY non-local FROM ref, including private or unqualified images (my-bootstrap:latest, company/base:dev) that a consumer couldn't actually pull/rebuild. Narrowed to an explicit allowlist of known-public Docker Hub bases (python/node/golang/...). Anything else falls through to the normal push path, which surfaces a clear error rather than silently publishing a broken dataset. 2. The fast path left stale reproducibility metadata. The emitter seeds every Dockerfile task with mode=local_only, image_visibility=private; the fast path returned without rewriting it, so published pr_diff tasks advertised "private/local_only" despite being publicly rebuildable. Now rewrites each task.toml to mode=inline_dockerfile, image_visibility=public, with the Dockerfile sha256 stamped for traceability. 3. Private repos with emit_harbor_env=True silently produced envs that fail at consumer build time (the inlined `git clone` is unauthenticated). Now fails fast at run() with a clear message pointing at emit_harbor_env=False for text-only output. Also: corrected the verifier module docstring (said "5-component" + listed pre-retune weights) and removed four unused _NORMALIZE_RE_* constants left over from an earlier normalization path. Tests: +3 (public-base fast path rewrites metadata; non-allowlisted image surfaces an error; private+emit_harbor_env fails fast). The private e2e test now passes emit_harbor_env=False explicitly. * pr_diff: support private repos via GITHUB_TOKEN build arg (not a guard) Replaces the wrong "private + emit_harbor_env → error" guard from the previous commit. Private-repo support is a first-class goal of this repo, and emit_harbor_env=True is the default for a reason — the env is the whole point. Forbidding it for private sources defeats that. The actual bug the audit caught was that the emitted Dockerfile cloned over an unauthenticated URL, so private repos failed at consumer build time. Fix: the Dockerfile now declares `ARG GITHUB_TOKEN=` (empty default) and clones via an x-access-token URL when it's set, then resets the remote to the clean URL so the token never persists in the image's git config or any layer. Public repos need no arg. Consumers building a private-repo task pass: harbor run ... --build-arg GITHUB_TOKEN=$GITHUB_TOKEN This mirrors how bootstrap already handles private repos (host-side clone with the resolved token, never embedded). Documented in docs/reference/AUTH.md + docs/pipelines/pr_diff.md. * pr_diff: reconcile docs/code with the spec contracts A compliance pass against the repo's own contract docs (SPEC.md, CLAUDE.md, the pipeline pages) surfaced two divergences introduced by the pr_diff upgrade: 1. reward_kinds emitted "diff_similarity_multi_component", which is not in SPEC.md's reward-kind set and disagreed with the README table. The reward IS a diff_similarity reward — the 6 components are how it's scored, surfaced in /logs/verifier/reward.json, not a separate kind. Emit the documented "diff_similarity"; describe the multi-component scoring in prose. Updated pr_diff.md to match. 2. Multiple docs + docstrings still called pr_diff "text-only / no sandbox", which is now only true of *generation*. The emitted task runs in Docker by default. Reconciled the prose in SPEC.md, CLAUDE.md, the pr_diff.py module + class docstrings, and the harbor emitter docstring to distinguish sandbox-free generation from Docker-runnable consumption. No behavior change — verifier scoring + emitted files are unchanged; this aligns metadata + documentation with the spec.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
assert.NotSubsetwas formatting thesubsetandlistvalues with%q, which produces garbage output for non-string element types:assert.Subsetalready uses%#v(Go-syntax representation) and works correctly for all types. This PR alignsNotSubsetto use the same format across all five of itsFailcall sites.Changes
assertions.go: Replace%qwith%#vinNotSubsetfor:len()error message (1 site)assertions_test.go:TestNotSubsetWithSliceTooLongToPrintexpected strings to%#vformatTestNotSubsetWithMapTooLongToPrintexpected strings to%#vformatTestSubsetNotSubsettable-driven cases forNotSubsetpathsTestNotSubsetFormatsNonStringElementsCorrectlyas a direct regression testBefore / After
Fixes #1800
Follow-up to #1646 where
Subsethad been adapted