All HF Hub posts

HannesVonEssen 
posted an update 3 days ago
📣 Add an architecture visualization to your model card!

🌟 For all creators out there: add a model visualization to your model card to capture your audience's attention!

🖱️ When clicked, it opens an interactive view with multiple levels of granularity!

1️⃣ Paste your model's URL at https://hfviewer.com/model-card-embed
2️⃣ Paste the generated code into your README.md!
3️⃣ ✨
juiceb0xc0de 
posted an update 3 days ago
Introducing the Gemma-4-E2B Brain Atlas, an interactive neural census of every layer, every head, and 16 behavior categories in Google's flagship 2B model. We ran 184,320 probe prompts across 35 layers × 8 components and mapped what came back.

The Brain Atlas is an interactive tool that lets you explore the internal behavior of Google's Gemma-4-E2B model layer by layer, head by head. Pick a behavior category, pick a layer, and see exactly which components light up and which go quiet. The dataset is fully queryable if you want to go deeper.

The mapping combines multiple single-direction techniques run in parallel across every layer and component: activation taxonomy (classifying each neuron by how broadly it fires across prompt categories), coactivation pair analysis (which neurons lock together and on what topics), F-stat behavioral separation (one-way ANOVA per feature across 16 behavior categories), per-head specificity scoring, and a full compliance probe pipeline using SVD, sparse decomposition, and variance analysis.
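
For readers who want to see what "F-stat behavioral separation" means in practice, here is a minimal sketch of a one-way ANOVA F statistic for a single feature. It is an illustration only, not the atlas pipeline: the function name `one_way_f` and the toy activations are assumptions, and each inner list stands for one feature's activations on prompts from one behavior category.

```python
from statistics import fmean

def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of floats)."""
    k = len(groups)                      # number of behavior categories
    n = sum(len(g) for g in groups)      # total number of observations
    grand = fmean(x for g in groups for x in g)
    # Between-group sum of squares: how far each category mean sits from the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: spread of activations inside each category.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Toy activations for one feature across three hypothetical categories.
toy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = one_way_f(toy)
```

A larger F means the category means are further apart relative to the noise inside each category, which is the sense in which a layer "separates" behaviors. The sketch assumes at least two groups and nonzero within-group variance.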

Here's what I found when I ran it.

The sharpest behavioral signal isn't at the output. It's at Layer 0. The up projection hits F=22.7, nearly 2x anything in the final third of the network. The model does its behavioral sorting when it has barely started, then spends the next 34 layers… doing what exactly?

The gate has a lifecycle. 70% dormant at L1, highest in the model. Brutal sparsification at L23–26 (>58% silent). Then reopens. The final five layers are the most alive gates anywhere. The model's last act is a gate flare.

Layer 4 routes 5 projections to dim 448. One layer. One dimension. That's a topology highway.

Zero specialist neurons. Not one. 1.2M neurons analyzed. None fires exclusively on a single category. This model distributes everything.
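
To make "specialist" concrete, here is a minimal sketch of one way to label a neuron by how many categories it fires on. The threshold value, the label names, and the `classify` helper are all assumptions for illustration; the atlas may define firing and specialization differently.

```python
def categories_fired(mean_act_by_category, threshold=0.1):
    """Return the categories whose mean activation magnitude exceeds the threshold."""
    return [c for c, a in mean_act_by_category.items() if abs(a) > threshold]

def classify(mean_act_by_category, threshold=0.1):
    """Label a neuron by the breadth of categories it fires on (hypothetical labels)."""
    fired = categories_fired(mean_act_by_category, threshold)
    if not fired:
        return "dormant"
    if len(fired) == 1:
        return "specialist"      # fires on exactly one category
    if len(fired) == len(mean_act_by_category):
        return "generalist"      # fires on every category
    return "partial"

# Hypothetical per-category mean activations for three neurons.
print(classify({"math": 0.9, "code": 0.0, "refusal": 0.02}))
print(classify({"math": 0.4, "code": 0.5, "refusal": 0.3}))
print(classify({"math": 0.0, "code": 0.0, "refusal": 0.0}))
```

Under this reading, "zero specialist neurons" would mean no neuron ever lands in the second branch.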

🧠 Space: juiceb0xc0de/gemma-4-e2b-brain-atlas
📊 Dataset (1.3M rows, fully queryable): juiceb0xc0de/gemma-4-e2b-atlas
RiverRider 
posted an update 1 day ago
Natural Language Autoencoders: A Window into Latent Structure

I introduced a concise mathematical formulation of the P versus NP question into the SRT-NLA-AV-v1 demonstration:

P vs NP asks whether every problem whose solution can be verified in polynomial time (NP) can also be solved in polynomial time (P). Integer factorization — given N = p·q where p and q are large primes (p < q) — is in NP but widely believed not to be in P.
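
As an aside, the verify/solve asymmetry in that prompt can be sketched in a few lines. This is an illustration only and is not part of the demo: the function names are made up, the verifier checks only that the claimed factors multiply to N (it does not test primality), and trial division stands in for "solving" because it is the simplest method, not the best known one.

```python
def verify_factorization(n, p, q):
    """Check a claimed certificate (p, q) for n with a single multiplication."""
    return 1 < p <= q and p * q == n

def factor_by_trial_division(n):
    """Find the smallest nontrivial factor by brute force.

    The loop runs up to sqrt(n) times, which is exponential in the bit length of n.
    """
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # no nontrivial factor found

n = 53 * 61
print(verify_factorization(n, 53, 61))   # cheap: one multiplication
print(factor_by_trial_division(n))       # costly for large n
```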

The resulting activation verbalization (best-of-N, reranked by AR fidelity) surfaced:

“This article originally appeared in the August 2016 edition of CACM. A new method of proving computational hardness of problems, known as multilinearization, can improve efficiency, reduce complexity and simplify proofs. In this article, I describe multilinearization and its application to several key problems, from the discrete logarithm and factoring to RSA and elliptic-curve discrete logarithms.”

What emerges is not a literal restatement, but a structured articulation of the model’s internal associations: hardness proofs, algebraic techniques, and the cryptographic implications that orbit this foundational question in computational complexity.

The demo offers a compelling interface for exploring these latent representations.

Explore it here:
RiverRider/srt-nla-av-v1-demo

Recommended: Best-of-N sampling with round-trip evaluation for highest fidelity.
espejelomar 
posted an update 1 day ago
Sharing WorldForge with @abdelstark

It's an open-source Python project for evaluating and replaying robotics and world-model workflows.

The useful part is not only calling a model. WorldForge records the run, validates action shapes, translates outputs into actions, and keeps replay artifacts you can inspect later.

The current demo uses LeRobot + LeWorldModel on PushT through the official loader:

stable_worldmodel.policy.AutoCostModel("pusht/lewm")

The harness also has replay-only paths for Cosmos-Policy and GR00T-style outputs, so you can inspect the provider contract from saved artifacts without keeping a GPU server online.

Try it:

pip install worldforge-ai
uv run --extra harness worldforge-harness --flow robotics-compare

Repo: https://github.com/AbdelStark/worldforge
Docs: https://abdelstark.github.io/worldforge/

Pre-1.0, MIT, and actively looking for contributors. Good areas:
- robotics provider adapters
- replay artifacts
- eval flows
- docs & first-run demos

Good first issues: https://github.com/AbdelStark/worldforge/contribute

If you're building robot policy evals or model adapters, would love a PR — or an issue describing what's missing.
Reubencf 
posted an update about 19 hours ago
I have improved my portfolio. Please check it out:
Reubencf/Portfolio
alvarobartt 
posted an update 2 days ago
Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!

TL;DR: MoEs can be misleading to reason about from active parameters alone, since each token only activates a subset of experts, while the serving setup still needs to account for the full resident memory footprint.

🧠 hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
🏗️ Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
⚡ Active parameter count isn't the same as memory footprint, especially for sparse architectures
📦 Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
📚 KV cache can still dominate depending on context length, batch size, and concurrency
🔀 Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
🚀 Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
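
The gap between resident and active memory in the points above can be sketched with back-of-the-envelope arithmetic. This is not hf-mem's code or output: the function names, the model sizes, and the 2-bytes-per-parameter default are assumptions chosen for illustration, and `expert_params` means the parameters of one routed expert summed over all layers.

```python
GIB = 1024 ** 3

def moe_weight_memory_gib(base_params, expert_params, num_experts,
                          experts_per_token, bytes_per_param=2):
    """Return (resident, active) weight memory in GiB for a MoE model."""
    # Resident: every expert must be loaded, even if a token uses only a few.
    resident = (base_params + num_experts * expert_params) * bytes_per_param / GIB
    # Active: only the experts a single token is routed to.
    active = (base_params + experts_per_token * expert_params) * bytes_per_param / GIB
    return resident, active

def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_len,
                 batch_size, bytes_per_elem=2):
    """KV cache size in GiB; the factor of 2 covers keys and values."""
    elems = 2 * num_layers * num_kv_heads * head_dim * context_len * batch_size
    return elems * bytes_per_elem / GIB

# Hypothetical model: 2B shared parameters, 16 experts of 0.5B each, 2 routed per token.
resident, active = moe_weight_memory_gib(2_000_000_000, 500_000_000, 16, 2)
cache = kv_cache_gib(num_layers=32, num_kv_heads=8, head_dim=128,
                     context_len=8192, batch_size=1)
print(f"resident {resident:.1f} GiB, active {active:.1f} GiB, KV cache {cache:.1f} GiB")
```

Scaling `context_len` or `batch_size` in the second function shows how the KV cache can overtake the weights, which is the point of the KV cache bullet above.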

Check the repository at https://github.com/alvarobartt/hf-mem
fffiloni 
posted an update 2 days ago
I built HF Radio on Hugging Face Spaces 📻
fffiloni/HF-Radio

A live community radio for AI-generated songs, powered by tracks created with ACE-Step.

You can tune in, discover community-made songs in many languages, vote on what sounds good, and mark your real favorites as Bangers.

The more people listen, vote, and create, the better the station gets.

Under the hood, it connects a few Hugging Face pieces together:

Spaces for the live app, HF buckets for community tracks, OAuth for signed-in listeners, server-side streaming with ffmpeg, hourly playlist refreshes, moderation, jingles, and community feedback loops.

It’s not just a playlist.

It’s a shared taste experiment:
new songs get a shot every hour, and the community helps decide what deserves another spin.

Come listen.
Find weird gems.
Support the Bangers.
Shape the radio.

→ fffiloni/HF-Radio
erikkaum 
posted an update 2 days ago
Releasing my first kernel 🔥 MaxSim

Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full similarity matrix. This kernel avoids that by using tiled scoring with simdgroup_matrix (Metal) and WMMA (CUDA).

The result is a 3–5× speedup compared to a naive PyTorch baseline 🔥
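
For anyone new to MaxSim, here is a rough pure-Python sketch of why the full similarity matrix is not needed. It is not the kernel: the real implementation runs on GPU matrix units, while this version only shows the idea of keeping a running maximum per query token over document tiles. The function names and the tile size are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_naive(query, doc):
    """Build the full Lq x Ld similarity matrix, then reduce it."""
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)

def maxsim_tiled(query, doc, tile=2):
    """Scan the document in tiles, keeping only one running max per query token."""
    best = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            for d in block:
                s = dot(q, d)
                if s > best[i]:
                    best[i] = s
    return sum(best)

# Toy token embeddings: 2 query tokens, 3 document tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
print(maxsim_naive(query, doc), maxsim_tiled(query, doc))
```

Both functions return the same score; the tiled one needs extra storage proportional to the number of query tokens rather than to the whole matrix.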

Benchmarks:
- SmallRerank (B=32, C=10): up to 3.2× (M3 Pro) / 2.8× (A100)
- HeavyRerank (B=32, C=100): up to 3.8× (M3 Pro) / 5.3× (A100)
- LongDocStress (Ld=1024): up to 6.2× (L4)

Try it out 👇
https://huggingface.co/kernels/erikkaum/maxsim
salma-remyx 
posted an update 3 days ago
The space of possible improvements for your AI model is large while evaluation is costly.

So I was excited to discover the ICML 2026 paper from Kobalczyk, Lin, Letham, Zhao, Balandat, and Bakshy titled "LILO: Bayesian Optimization with Natural Language Feedback."

The method learns efficiently from expert preferences, balancing exploration and exploitation in a principled way with Bayesian Optimization for expensive-to-evaluate black-box objectives.

Experimenting with the technique, I trained a Gaussian Process proxy model on the implicit preferences in the commit history of my VQASynth code repo.

The result: I used the model's preference scores to re-rank candidate papers recommended based on my interests in spatial reasoning and multimodal data synthesis.

Semantic relevance is a high-recall method for finding arXiv papers personalized to your interests. Adding contributor preferences, extracted from the merge history of your code, offers a high-precision filter.
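
The recall-then-precision idea can be sketched as a two-stage re-ranker. This is only an illustration of the shape of the pipeline, not the code I ran: the field names, the thresholds, the paper titles, and the scores are all made up, and in practice the preference score would come from the trained proxy model.

```python
def rerank(candidates, min_relevance=0.5, top_k=3):
    """Stage 1: keep semantically relevant papers. Stage 2: order by preference score."""
    recalled = [c for c in candidates if c["relevance"] >= min_relevance]
    ranked = sorted(recalled, key=lambda c: c["preference"], reverse=True)
    return ranked[:top_k]

# Hypothetical candidates with made-up scores.
papers = [
    {"title": "paper A", "relevance": 0.9, "preference": 0.2},
    {"title": "paper B", "relevance": 0.7, "preference": 0.8},
    {"title": "paper C", "relevance": 0.3, "preference": 0.9},  # dropped in stage 1
    {"title": "paper D", "relevance": 0.6, "preference": 0.5},
]
for paper in rerank(papers):
    print(paper["title"])
```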

So what's next? I'm using the model to synthesize a larger volume of preference data to finetune an open-weight coding model with DPO and LoRA.

"Tuning Coding Agents via Implicit Preference Distillation"

arXiv: https://arxiv.org/pdf/2510.17671
Substack: https://remyxai.substack.com/p/lilo-and-myx
VQASynth: https://github.com/remyxai/VQASynth
kanaria007 
posted an update 1 day ago
✅ Article highlight: Honest Benchmarking for Governed Intelligence Platforms (art-60-241, v0.1)

TL;DR:
This article argues that benchmark results should be published as bounded observations, not inflated into platform claims.

A governed benchmark should not quietly turn “we measured this result under these conditions” into “therefore this platform is more governed, safer, or more production-ready.” Honest benchmarking separates reproducibility, comparability, and disclosability—and keeps benchmark outcomes distinct from stronger governance or platform-readiness claims.

Read:
kanaria007/agi-structural-intelligence-protocols

Why it matters:
• prevents benchmark scores from being laundered into governance-readiness claims
• distinguishes reproducible results from truly comparable rankings
• makes public benchmark language respect disclosure floors and evidence class
• gives a clean way to publish strong numbers without overclaiming what they mean

What’s inside:
• the separation between reproducibility, comparability, and disclosability
• the rule that a benchmark result is not the same thing as a platform claim
• a benchmark disclosure profile that sets the publication floor
• a governed benchmark pack that binds runtime, toolchain, policy surface, evidence class, and results
• a comparability declaration and benchmark publication report that state what public reading is actually supportable

Key idea:
Do not say:

“we ranked higher, therefore we are better governed.”

Say:

“this governed benchmark pack produced these results under this disclosed runtime, toolchain, policy, and evidence surface; this comparability declaration defines what we are and are not fairly comparable to; and this publication report states exactly what public reading is supportable without inflating benchmark observations into stronger platform claims.”