Skip to content

assert: fix NotSubset error messages using %#v instead of %q (fixes #1800)#1888

Merged
dolmen merged 1 commit into
stretchr:masterfrom
nghiack7:fix/not-subset-format
May 11, 2026
Merged

assert: fix NotSubset error messages using %#v instead of %q (fixes #1800)#1888
dolmen merged 1 commit into
stretchr:masterfrom
nghiack7:fix/not-subset-format

Conversation

@nghiack7
Copy link
Copy Markdown

@nghiack7 nghiack7 commented May 9, 2026

Summary

assert.NotSubset was formatting the subset and list values with %q, which produces garbage output for non-string element types:

Error: [%!q(bool=true)] is a subset of [%!q(bool=true)]

assert.Subset already uses %#v (Go-syntax representation) and works correctly for all types. This PR aligns NotSubset to use the same format across all five of its Fail call sites.

Changes

  • assertions.go: Replace %q with %#v in NotSubset for:
    • "unsupported type" error messages (2 sites)
    • "is a subset of" failure messages (2 sites)
    • inner len() error message (1 site)
  • assertions_test.go:
    • Update TestNotSubsetWithSliceTooLongToPrint expected strings to %#v format
    • Update TestNotSubsetWithMapTooLongToPrint expected strings to %#v format
    • Update shared TestSubsetNotSubset table-driven cases for NotSubset paths
    • Add TestNotSubsetFormatsNonStringElementsCorrectly as a direct regression test

Before / After

// Before (issue #1800)
assert.NotSubset(t, []bool{true}, []bool{true})
// Error: [%!q(bool=true)] is a subset of [%!q(bool=true)]

// After
// Error: []bool{true} is a subset of []bool{true}

Fixes #1800
Follow-up to #1646 where Subset had been adapted

NotSubset was formatting the subset and list values with %%q, which
produces garbage output for non-string element types.  For example,
a []bool{true} subset would render as [%%!q(bool=true)] instead of
[]bool{true}.

Subset already uses %%#v (Go-syntax representation) which works
correctly for all types.  Align NotSubset to use the same format
for all five of its Fail call sites.

Updated the existing TooLongToPrint tests and the shared Subset/
NotSubset table-driven test to match the new %%#v representation.
Added TestNotSubsetFormatsNonStringElementsCorrectly as a direct
regression test for the reported case ([]bool).

Fixes stretchr#1800
@dolmen dolmen added pkg-assert Change related to package testify/assert enhancement: output format Enhancement related to formatting of messages labels May 10, 2026
@nghiack7
Copy link
Copy Markdown
Author

Friendly ping — this PR is ready for review. All tests pass ✅. Fixes the %q formatting bug in NotSubset error messages (#1800).

@dolmen dolmen merged commit 246de45 into stretchr:master May 11, 2026
10 checks passed
adithya-s-k added a commit to huggingface/Repo2RLEnv that referenced this pull request May 21, 2026
Sweeping the 38 launch repos surfaced two leak patterns that slipped past
the v0.8.1 + first-pass strip:

  - Manual close marker in the PR title: " (fixes #N)" — caught
    stretchr/testify#1888.
  - Dependabot release notes embedding `https://redirect.github.com/...`
    URLs that point straight at the linked PR/issue — caught 6 cells
    (gin, jsonschema×2, urfave/cli, expressjs, chronotope).

Title squash-suffix regex now also matches parenthesized closes-style
markers (`(closes|fixes|resolves|see|refs) #N` with optional comma list).
GH-URL regex now matches any `[subdomain.]github.com` host so redirector
domains are stripped too.

Re-ran the sweep after the fix — zero leaks remain across all 127
emitted instructions. Findings + per-pattern verification in
docs/release_notes/v0.8.3/findings-pr_diff.md.

Adds 2 tests to tests/test_pipeline_pr_diff.py covering both real-world
patterns. Full suite at 649 passing.
adithya-s-k added a commit to huggingface/Repo2RLEnv that referenced this pull request May 26, 2026
… dataset (#40)

* pr_diff: broaden instruction info-leak strip

PR descriptions frequently link back to the answer the agent is supposed
to produce — adding noise that lets a frontier model shortcut by fetching
the linked artifact. We now strip:

  - Closes / Fixes / Resolves #N (already; extended to multi: '#1, #2')
  - See / Refs / Follow-up to #N
  - Markdown issue links [#N](url)
  - Bare github.com /pull/, /issues/, /commit/ URLs
  - Co-authored-by / Signed-off-by / Reviewed-by / Acked-by trailer lines
  - "(#N)" squash-merge suffix on the title

12 new unit tests in tests/test_pipeline_pr_diff.py covering each pattern
+ the end-to-end _build_instruction shape. Full suite at 647 passing.

This is the v0.8.3 Arc 1 optimization landed alongside the full pr_diff
sweep.

* pr_diff: harden info-leak strip after sweep findings

Sweeping the 38 launch repos surfaced two leak patterns that slipped past
the v0.8.1 + first-pass strip:

  - Manual close marker in the PR title: " (fixes #N)" — caught
    stretchr/testify#1888.
  - Dependabot release notes embedding `https://redirect.github.com/...`
    URLs that point straight at the linked PR/issue — caught 6 cells
    (gin, jsonschema×2, urfave/cli, expressjs, chronotope).

Title squash-suffix regex now also matches parenthesized closes-style
markers (`(closes|fixes|resolves|see|refs) #N` with optional comma list).
GH-URL regex now matches any `[subdomain.]github.com` host so redirector
domains are stripped too.

Re-ran the sweep after the fix — zero leaks remain across all 127
emitted instructions. Findings + per-pattern verification in
docs/release_notes/v0.8.3/findings-pr_diff.md.

Adds 2 tests to tests/test_pipeline_pr_diff.py covering both real-world
patterns. Full suite at 649 passing.

* pr_diff: handle markdown-link forms of issue refs and GH URLs

Sergio's review on the PR surfaced two additional patterns that the
first-pass strip didn't handle cleanly:

1. `Closes [#1234](url)` — the markdown-link form of a Closes/Fixes/
   Resolves ref. The bare-#N regex didn't match, so the keyword was
   left orphaned in the output (only the [#N](url) part was stripped
   by the issue-link regex).

2. `[descriptive text](https://github.com/x/y/pull/N)` — markdown
   link whose URL points at a GH pull/issues/commit, with any link
   text (not just `[#N]`). The bare-URL regex stripped the URL but
   left empty `[text]()` brackets in the prose.

Two new regexes — _CLOSES_MD_RE / _REFS_MD_RE for the closes-style
markdown variants, _MD_GH_URL_RE for the descriptive-text variant —
run BEFORE the piece-wise regexes so composite patterns are stripped
whole rather than fragmented.

4 new unit tests cover both cases (single, multi-list, descriptive
text, see-with-markdown). Suite at 638 passing.

* pr_diff: emit Harbor-runnable env with SWE-RL diff-similarity verifier

The whole point of Repo2RLEnv is verifiable RL envs that LLMs can solve.
v0.8.1's pr_diff was text-only — instruction.md + solution/patch.diff,
nothing harbor-runnable. The reward was supposed to be computed by the
consumer externally. That fails the bar.

This change makes pr_diff produce a fully Harbor-runnable env where the
verifier IS the SWE-RL-style sequence-similarity score against the
oracle diff. Same reward function as repo2rlenv.reward (in fact, the
verifier's embedded Python is kept in lockstep with reward.py — pure
stdlib, ~30 lines).

New helpers in src/repo2rlenv/pipelines/pr_diff.py:
- `build_pr_diff_environment_dockerfile(repo_url, base_commit, oracle_diff)`
  - FROM python:3.12-slim + git
  - shallow clone of the repo @ base_commit
  - base64-bakes the oracle diff into /verifier/oracle.patch
  - no bootstrap LLM — ~30s build per task
- `build_pr_diff_eval_script(base_commit)`
  - captures git diff <base_commit> after the agent's edits
  - embeds the diff-similarity Python (base64) — needs only python3 + git
  - writes the score to /logs/verifier/reward.txt

`PRDiffOptions.emit_harbor_env: bool = True` default-on. Set False to
get the v0.8.1 text-only output (consumer computes reward).

End-to-end smoke on pallets/click PR #3508:
- harbor run -a oracle → reward 1.000 (28s)
- harbor run -a claude-code -m anthropic/claude-sonnet-4-6 → reward 0.710 (4m32s)

Both via harbor run on the emitted task as-is. The oracle case proves
the verifier is wired correctly; the Sonnet case proves the agent path
works end-to-end with a real partial-credit reward.

6 new unit tests on the Dockerfile + eval-script builders (base64
encoding, special-char patches, difflib in inline Python, reward.txt
path). 1 existing test updated to reflect the new instruction wording.
Suite at 644 passing.

* pr_diff: 6-component reward + LLM judge + calibration metadata

Replace the single-scalar diff-similarity reward with the SWE-RL-paper-
style multi-component approach. Lifts pr_diff from "coarse training
signal" to "evaluable single-task env" while keeping the cheap/multi-
language property.

New module: src/repo2rlenv/pipelines/_pr_diff_verifier.py
  - Pure-stdlib in-container verifier (read at gen time, base64-baked
    into tests/test.sh, decoded back to a file at run time).
  - Reviewable + unit-testable as ordinary Python — no more opaque
    base64-blob diffs in test.sh.

6 components (sum = 1.0 default weights):
  format_valid     0.05   (predicted text parses as unified diff)
  size_sanity      0.05   (min(o_loc, p_loc) / max — rampage guard)
  file_targeting   0.10   (F1, NOT Jaccard — recall matters more
                          than punishing extras; explained in module)
  region_overlap   0.20   (predicted hunks overlap oracle hunks
                          with 5-line slack — strongest spatial signal)
  similarity       0.20   (SequenceMatcher over +/- lines ONLY —
                          fixes v0.8.1 context-credit inflation)
  llm_judge        0.40   (Haiku scores semantic correctness;
                          graceful degradation on API failure,
                          remaining weights re-normalize)

Verifier outputs reward.txt (single float, harbor reads this) AND
reward.json (full breakdown for downstream inspection / re-weighting).

Gen-time additions in pr_diff.py:
  - Quality filter: drops test-only / docs-only / revert / trivially-
    small diffs / instruction-too-thin candidates before emission.
  - Calibration baseline: empty-patch reward stamped in
    task.toml.metadata.reward_calibration.baseline_reward. Consumers
    compute calibrated = (raw - baseline) / (1 - baseline) for
    cross-task comparability.
  - Difficulty bucket: trivial/small/medium/large by oracle LOC.
  - PRDiffOptions.min_loc_changed: int = 3 (filter knob).

Dockerfile now pre-installs claude-code via npm at build time (apt + npm
layers cacheable across all pr_diff tasks). This makes harbor's
claude-code agent-setup robust at 25-parallel-container scale — no more
per-container curl-install flakiness.

End-to-end smoke on pallets/click PR #3508:
  - harbor run -a oracle: reward 1.000 (8s with cached layers)
  - harbor run -a claude-code -m claude-sonnet-4-6 --ve ANTHROPIC_API_KEY=$X:
    final 0.98 (5m), all 5 deterministic = 1.0, llm_judge = 0.95

Test coverage: 37 new tests in tests/test_pr_diff_verifier.py covering
each component, judge graceful degradation, weight redistribution.
Suite at 681 passing.

* pr_diff: stage new files before git diff in verifier

`git diff <base>` only sees TRACKED files. New files added by the
oracle patch (or the agent) are untracked and silently absent from
the predicted.patch, causing oracle runs to score < 1.0 for any PR
that introduces new files.

Found by the 25-env pilot: 3 of 13 oracle runs scored 0.4-0.79
instead of 1.0 — all 3 were PRs that added new files (gradio added
3 new files in a 4-file PR, scoring 0.40 instead of 1.0).

Fix: `git add -A; git diff --cached <base>` — `-A` stages all new
files so they appear in the predicted diff just like the oracle.

* pr_diff: bake claude-code at build time, not at agent-setup

Pilot at concurrency=25 (then 12) had 100% Sonnet failure with
AgentSetupTimeoutError after 360s. Root cause: harbor's claude-code
adapter runs `curl claude.ai/install.sh | bash` in EVERY container at
agent-setup time. N parallel ~80MB downloads saturate local bandwidth
and time out.

Fix: pre-install claude-code in the Dockerfile at BUILD time using the
same install.sh route harbor uses. Docker's content-addressable layer
cache means the install runs ONCE per image-build and is shared across
all per-task images (the install RUN line is identical between tasks).
When harbor's adapter re-runs the install script per-container, it
finds the binary already in /root/.local/bin and short-circuits the
network step.

Also drops the nodejs/npm route which was a dead end — harbor's
adapter only uses npm on Alpine images; on python:3.12-slim it always
uses curl install.sh regardless of pre-installed binaries.

Single-task smoke after the fix: 0.858 reward in 3m57s (was 4m56s
before the bake; same task, all 6 components fire including
llm_judge=0.92).

* pr_diff: retune verifier weights after 23-task pilot

Sonnet pilot ran 23 tasks through the full env (oracle 21/21=1.000,
sonnet mean 0.634, range 0.16-0.98 = healthy eval distribution). Fed
the per-task component data to Sonnet for reward-engineering analysis;
its grounded recommendations are now the defaults.

Changes:
  format_valid    0.05 → 0.00   (was 1.0 on EVERY trial; pure dead weight)
  size_sanity     0.05 → 0.08   (useful outlier detector)
  file_targeting  0.10 → 0.12   (leading indicator, least correlated)
  region_overlap  0.20 → 0.20   (unchanged; strongest spatial signal)
  similarity      0.20 → 0.10   (~0.85 correlation with region_overlap;
                                 double-counts positional accuracy AND
                                 penalizes alternative implementations)
  llm_judge       0.40 → 0.50   (only semantic-independent signal;
                                 diverges informatively from text-based
                                 components on e.g. tokenizers/evaluate)
  TOTAL = 1.00

Also adds a catastrophic-size hard cap: if size_sanity < 0.10, the
final reward is clamped to ≤ 0.40. Prevents a charitable judge from
inflating the score on patches that are dramatically the wrong size
(prettier hit 0.011 — Sonnet wrote ~5 lines vs oracle's 500-line
release-notes dump; without the cap, judge=0.25 still pulled the
final to 0.23).

Two new tests pin the defaults — `_DEFAULT_WEIGHTS` must sum to 1.0,
format_valid must stay 0.0, similarity < region_overlap. Suite at 683
passing.

* docs: pr_diff now Harbor-runnable + 6-component reward

- docs/pipelines/pr_diff.md rewritten end-to-end:
  - Multi-component reward (6 components + LLM judge + catastrophic-size cap)
  - Calibration baseline + difficulty bucket metadata
  - Updated info-leak strip (8 pattern families)
  - New options table (emit_harbor_env, min_loc_changed)
  - New skip-reason list (test-only, docs-only, revert, diff-too-small, thin-instruction)
  - Consumer-side now uses harbor run with claude-code; LLM-judge env-var path documented
  - Reference dataset link placeholder (filled after HF push)

- docs/quickstart.md updated: pr_diff is now Harbor-runnable; show the full
  harbor run incantation for both oracle and Sonnet adapters.

- docs/pipelines/README.md table: pr_diff Sandbox column is now "thin¹" with
  footnote explaining the python:3.12-slim env + LLM-as-judge at verify time.

- README.md table mirrors the same.

Test counts left generic in docs per repo convention (specifics live in PR
bodies / release notes).

* pr_diff: publish 100-env reference dataset + augmented HF card

Dataset live at https://huggingface.co/datasets/AdithyaSK/repo2rlenv-pr-diff
- 100 verified environments
- 26 source repos (Tier A SWE-bench + Tier B HF ecosystem + Tier C multi-lang)
- All structurally validated (`repo2rlenv validate` passes for every task)

Augmented HF dataset card (src/repo2rlenv/hub.py:_build_dataset_card):
- Supports multi-repo datasets — renders "Source repos (N)" list when the
  push spans more than one source repo (was: single repo only).
- Detects whether tasks ship environment/Dockerfile and renders the
  harbor-runnable recipe (oracle + claude-code) accordingly.
- Adds "How it was generated" section with reproduction recipe.
- Adds the LLM-as-judge `--ve ANTHROPIC_API_KEY=$X` flag to the harbor-run
  example (was missing — verifier-side env wasn't documented).
- Tags include the pipeline name so collections can filter.

Registry-integration fast-path (src/repo2rlenv/registry/integration.py):
- When every task's Dockerfile FROM ref is publicly pullable (e.g.
  `python:3.12-slim` for pr_diff's self-contained Dockerfile), skip the
  image-distribution step entirely. Was: every dataset with
  environment/ tried to push an image to a container registry and
  failed for pr_diff (no bootstrap upstream image to push).

Docs:
- docs/pipelines/pr_diff.md gets the published HF URL.
- docs/release_notes/v0.8.3/findings-pr_diff.md rewritten with the 100-env
  numbers, the LLM-driven reward-weight retune story, and the 8-pattern
  info-leak strip.

* pr_diff: drop claude-code from Dockerfile (wrong layer)

Earlier commit (69d1631) baked claude-code into the task's
environment/Dockerfile via curl claude.ai/install.sh at build time.
That was wrong: the task spec is supposed to be agent-agnostic. The
Dockerfile defines the ENVIRONMENT the task runs in (repo, verifier,
tools) — it shouldn't pre-install a specific agent vendor's binary.

Consequences of the bake:
- ~80MB of claude-code in every image, even when the agent is
  openhands / codex / aider / etc.
- A specific Anthropic CLI on PATH could interfere with what other
  agents do (unlikely but possible).
- Cross-vendor contamination — the dataset implicitly assumes
  claude-code is the runner.

The original problem the bake was supposed to solve: at concurrency≥8,
N parallel `curl claude.ai/install.sh` calls during harbor's claude-code
agent-setup saturate local bandwidth and time out. Correct fixes:

1. Lower concurrency to ≤5 (harbor's own default is 4). Confirmed
   working in the latest smoke: 0.975 reward in 2m33s with no bake.
2. Use --max-retries 2 (already in run_pilot.py).
3. File an upstream harbor issue for install-script caching.

Single-task smoke after the revert: 0.975, 2m33s. The 100-env
reference dataset has been re-generated and re-pushed to HF Hub with
the clean Dockerfile.

* pr_diff: drop duplicated weights, import _DEFAULT_WEIGHTS directly

Addresses sergiopaniego's PR #40 review: the no-op calibration helper
had a hand-copied weights dict that drifted out of sync with the
retune in _pr_diff_verifier._DEFAULT_WEIGHTS. Import the single source
of truth instead.

The duplicate was harmless in practice (format_valid('') = 0.0 makes
the baseline 0.0 with either weight set), but it's a latent footgun
the next time the weights change.

* docs: richer pipeline table + pr_diff spotlight + multi-agent examples

README + docs/pipelines/README:
- Expanded the pipelines table with a longer "What it produces" column
  and a per-pipeline "LLM use" classification (at synthesis / at
  bootstrap / at verify) — every pipeline calls an LLM somewhere, the
  table now makes the location explicit.
- docs/pipelines/README.md adds a "Reference dataset" column linking
  the published HF dataset for `pr_diff`, plus a Spotlight section
  that walks the task layout + 6-component reward + reproduction
  recipe inline.

Multi-agent examples:
- quickstart, pr_diff doc, and Spotlight now show claude-code AND
  openhands invocations side-by-side, with the full Harbor agent
  catalogue (25+ harnesses) inlined as a comment block. The contract
  is agent-agnostic — claude-code is what we used to verify the
  reference dataset, not a requirement.

findings-pr_diff: corrected the stale claim that we bake claude-code
into the Dockerfile (we don't anymore — Harbor's agent adapter
installs whatever runtime its agent needs at run time).

* docs: drop references to gitignored launch-side scripts

findings-pr_diff used to point at plans/v083_scripts/run_pilot.py for
the generation recipe, but plans/ is gitignored — those paths can't
be opened by anyone reading the PR. Rewrote the recipe in terms of
the public `repo2rlenv generate` CLI so the published dataset is
reproducible without internal tooling.

Also dropped the analyze_pilot.py mention in
_pr_diff_verifier.py's _DEFAULT_WEIGHTS comment — the rationale is
the point, not the specific script that produced it.

* pr_diff: address external audit — fast-path safety + private-repo guard

Three correctness fixes flagged by an independent audit of PR #40:

1. Self-contained Dockerfile fast path was too broad. It skipped the
   image-push step for ANY non-local FROM ref, including private or
   unqualified images (my-bootstrap:latest, company/base:dev) that a
   consumer couldn't actually pull/rebuild. Narrowed to an explicit
   allowlist of known-public Docker Hub bases (python/node/golang/...).
   Anything else falls through to the normal push path, which surfaces
   a clear error rather than silently publishing a broken dataset.

2. The fast path left stale reproducibility metadata. The emitter seeds
   every Dockerfile task with mode=local_only, image_visibility=private;
   the fast path returned without rewriting it, so published pr_diff
   tasks advertised "private/local_only" despite being publicly
   rebuildable. Now rewrites each task.toml to mode=inline_dockerfile,
   image_visibility=public, with the Dockerfile sha256 stamped for
   traceability.

3. Private repos with emit_harbor_env=True silently produced envs that
   fail at consumer build time (the inlined `git clone` is
   unauthenticated). Now fails fast at run() with a clear message
   pointing at emit_harbor_env=False for text-only output.

Also: corrected the verifier module docstring (said "5-component" +
listed pre-retune weights) and removed four unused _NORMALIZE_RE_*
constants left over from an earlier normalization path.

Tests: +3 (public-base fast path rewrites metadata; non-allowlisted
image surfaces an error; private+emit_harbor_env fails fast). The
private e2e test now passes emit_harbor_env=False explicitly.

* pr_diff: support private repos via GITHUB_TOKEN build arg (not a guard)

Replaces the wrong "private + emit_harbor_env → error" guard from the
previous commit. Private-repo support is a first-class goal of this
repo, and emit_harbor_env=True is the default for a reason — the env
is the whole point. Forbidding it for private sources defeats that.

The actual bug the audit caught was that the emitted Dockerfile cloned
over an unauthenticated URL, so private repos failed at consumer build
time. Fix: the Dockerfile now declares `ARG GITHUB_TOKEN=` (empty
default) and clones via an x-access-token URL when it's set, then
resets the remote to the clean URL so the token never persists in the
image's git config or any layer. Public repos need no arg.

Consumers building a private-repo task pass:
    harbor run ... --build-arg GITHUB_TOKEN=$GITHUB_TOKEN

This mirrors how bootstrap already handles private repos (host-side
clone with the resolved token, never embedded). Documented in
docs/reference/AUTH.md + docs/pipelines/pr_diff.md.

* pr_diff: reconcile docs/code with the spec contracts

A compliance pass against the repo's own contract docs (SPEC.md,
CLAUDE.md, the pipeline pages) surfaced two divergences introduced by
the pr_diff upgrade:

1. reward_kinds emitted "diff_similarity_multi_component", which is not
   in SPEC.md's reward-kind set and disagreed with the README table.
   The reward IS a diff_similarity reward — the 6 components are how
   it's scored, surfaced in /logs/verifier/reward.json, not a separate
   kind. Emit the documented "diff_similarity"; describe the
   multi-component scoring in prose. Updated pr_diff.md to match.

2. Multiple docs + docstrings still called pr_diff "text-only / no
   sandbox", which is now only true of *generation*. The emitted task
   runs in Docker by default. Reconciled the prose in SPEC.md,
   CLAUDE.md, the pr_diff.py module + class docstrings, and the
   harbor emitter docstring to distinguish sandbox-free generation
   from Docker-runnable consumption.

No behavior change — verifier scoring + emitted files are unchanged;
this aligns metadata + documentation with the spec.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement: output format Enhancement related to formatting of messages pkg-assert Change related to package testify/assert

Projects

None yet

Development

Successfully merging this pull request may close these issues.

assert.NotSubset incorrecly formats with %q

2 participants