Single-Cell Genomics Techniques

Summary

Single-cell genomics techniques allow researchers to study the genetic material of individual cells, giving a much clearer picture of cell diversity and function compared to traditional methods that analyze groups of cells together. These advanced tools reveal how each cell behaves and interacts within tissues, helping to uncover complexities in health, development, and disease.

Explore unique cell profiles: Use single-cell sequencing to reveal rare and important cell types that might be hidden in bulk analyses.
Map cell trajectories: Apply lineage and spatial mapping techniques to reconstruct how cells develop and change over time, offering insights into disease progression and tissue organization.
Integrate multiple data types: Combine genomic and epigenomic data from single cells to create a richer, multi-layered understanding of cell identity and behavior in both healthy and diseased states.

Summarized by AI based on LinkedIn member posts

Rujuta Shinde

AI × Genomics × Scientific Thinking | Bioinformatician @The Lundquist Institute | Turning Data into Discovery | Exploring tradeoffs, assumptions, real-world data work | Sharing what I learn along the way

6,145 followers 1y Edited
🧬 I’ve worked with bulk RNA-seq for a while, but now I’m diving deep into scRNA-seq, and things are starting to click.

When I joined The Lundquist Institute about three months ago, I began working on single-nucleus RNA-seq (snRNA-seq) data from Down syndrome lung samples. Although I’d done some scRNA-seq analysis in the past, most of my experience had been in bulk RNA-seq. So I decided to go back to the basics and truly understand the single-cell world from the ground up.

At first, it felt overwhelming. All the clustering, QC steps, and UMAPs felt like a maze of unfamiliar terms and tools. But over the last few weeks, I’ve been following tutorials and reading up, and slowly it’s starting to make sense.

🥤 The analogy that changed everything for me:
Imagine making a smoothie with all your fruits - mango, banana, strawberry. That’s bulk RNA-seq - you get the average flavor, but not what each fruit (cell) contributed. Now imagine tasting each fruit individually. That’s scRNA-seq - you get to see what each individual cell is expressing. 🍓🍌🥭
That clarity is what makes single-cell so powerful, especially in complex tissues and disease states.

🧭 The simplified scRNA-seq workflow:
1. Tissue dissociation → break tissue down into single cells
2. Single-cell isolation → using droplets, plates, or microwells
3. RNA capture & library prep → convert RNA into cDNA
4. Sequencing → generate transcriptomic reads
5. QC & downstream analysis → filter, cluster, and interpret

🔬 Technologies I’ve been learning about:
• Smart-seq (plate-based): Cells are sorted into individual wells using FACS. Great for full-length transcript capture but lower throughput.
• 10x Genomics (droplet-based): Cells and barcoded beads are encapsulated in droplets - super popular and high-throughput.
• SPLiT-seq (combinatorial indexing): Barcodes are added across rounds of pooling/splitting - no need for physical isolation.
• BD Rhapsody (microwell-based): Cells are captured in tiny wells with barcoded beads - scalable and relatively simple.

Learning like this, piece by piece, and connecting it to real datasets has made everything feel more meaningful and, honestly, exciting. Not every step is smooth. But that’s what makes the process worth sharing.

If you’re also diving into scRNA-seq, or have been through this learning curve, I’d love to hear what helped you most. :)
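The smoothie analogy can be made concrete with a few lines of Python. This is only a toy sketch, not something from the post: the gene values and cell names are made up, and the "5x the average" cutoff is an arbitrary illustrative threshold. It shows how one rare, strongly expressing cell disappears into a bulk average but stands out when cells are kept separate.

```python
from statistics import mean

# Toy expression of one hypothetical marker gene across 10 cells:
# 9 common cells barely express it, 1 rare cell expresses it strongly.
cells = {f"common_{i}": 1.0 for i in range(9)}
cells["rare_0"] = 91.0

# "Bulk" view: the whole sample collapses to one averaged value.
bulk_signal = mean(cells.values())

# "Single-cell" view: keep per-cell values and flag cells far above the average.
# The factor 5 is an arbitrary cutoff chosen for this illustration.
high_cells = [name for name, value in cells.items() if value > 5 * bulk_signal]

print(f"bulk average: {bulk_signal:.1f}")
print(f"cells far above average: {high_cells}")
```

The bulk value alone cannot distinguish "every cell expresses a little" from "one cell expresses a lot", which is the information the per-cell view recovers.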

Xinru Qiu, PhD

Computational Biologist | FM Evaluation & Perturbation Biology | Target Discovery from Single-Cell & Spatial Data | Causal Inference | Immunology & Inflammation

2,736 followers 6mo
🧮 Optimal Transport & Wasserstein Distance in Single-Cell Biology: A Mathematical Framework

I've just published a comprehensive guide covering the mathematical foundations and computational applications of optimal transport theory for single-cell data analysis.

📚 Key Method Categories:
📈 Trajectory Inference: Waddington-OT
🔗 Multi-Omics Alignment: SCOT, Pamona
🧬 Perturbation Prediction: CellOT
🌊 Stochastic Modeling: GENOT, PRESCIENT
🗺️ Atlas-Scale Integration: moscot

📈 Evolution Timeline:
2019: Waddington-OT (trajectory inference paradigm)
2020-2021: SCOT, PRESCIENT (multi-omics + generative approaches)
2022: Pamona, OT-scOmics (partial alignment + similarity metrics)
2023: CellOT, TIGON (neural OT, separating growth from transport)
2024: GENOT, GRouNdGAN (conditional flows, causal GAN)
2025: Labeled GWOT, moscot (label constraints, atlas-scale mapping)

📊 Resources Included:
Comparison table: 11 methods from trajectory inference to 1.7M-scale integration
Mathematical foundations: Monge-Kantorovich theory, entropic regularization, Gromov-Wasserstein
Software ecosystem guide (POT, moscot, Waddington-OT, SCOT, Pamona, CellOT)
Common pitfalls: data normalization, hyperparameter tuning (ε: 0.001-0.1), ground cost selection, validation strategies

This is part of my AI4Bio Learning Hub (https://lnkd.in/gS3ivaR5) where I share technical deep dives as a Computational Immunologist working at the intersection of single-cell genomics, AI, and therapeutic development.

📖 Full guide: https://lnkd.in/gMnaEZGA

💬 Spot an error? Have suggestions? Working on OT methods? I'd love your feedback to keep this resource accurate and comprehensive!

∫ Optimal Transport & Wasserstein Distance in Single-Cell Biology xqiu625.github.io
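To give a feel for the entropic regularization listed among the guide's mathematical foundations, here is a minimal pure-Python Sinkhorn sketch. It is not taken from the guide or from any of the listed packages: the function name `sinkhorn`, the toy 1-D "early" and "late" cell positions, and the choice ε = 0.1 (the upper end of the range quoted in the post, with costs scaled to [0, 1]) are all assumptions made for illustration.

```python
import math

def sinkhorn(a, b, cost, epsilon=0.1, n_iters=500):
    """Entropic-regularized optimal transport between histograms a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / epsilon); smaller epsilon gives a sharper plan
    # but can underflow numerically, which is one reason epsilon needs tuning.
    K = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns so the plan's marginals match a and b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    total_cost = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total_cost

# Toy data: three "early" and three "late" cells on a 1-D axis scaled to [0, 1].
early = [0.0, 0.5, 1.0]
late = [0.1, 0.6, 0.9]
cost = [[(x - y) ** 2 for y in late] for x in early]  # squared-distance ground cost
a = [1 / 3] * 3  # uniform mass over early cells
b = [1 / 3] * 3  # uniform mass over late cells

plan, total_cost = sinkhorn(a, b, cost)
for row in plan:
    print([round(p, 3) for p in row])
```

Each row of `plan` says how one early cell's mass is spread over late cells. For real data, a maintained library such as POT is the better choice; this sketch only shows the alternating row and column scaling at the core of the method.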

Jack (Jie) Huang MD, PhD

Chief Scientist | Founder and CEO | President at AASE | Vice President at ABDA | Visiting Professor | Editors

36,253 followers 1y
🟥 Lineage Mapping via Single-Cell ATAC-Seq and Methylome

A deeper understanding of how stem cells differentiate into different cell types requires tools that can capture the molecular mechanisms that guide these transitions. Because traditional lineage tracing methods rely primarily on genetic markers, they offer limited insight into the underlying molecular mechanisms. Combining single-cell ATAC sequencing and single-cell DNA methylation analysis, two powerful epigenomic technologies, can provide complementary perspectives on chromatin accessibility and DNA methylation at single-cell resolution.

Single-cell ATAC sequencing primarily reveals open chromatin regions and can identify active enhancers, promoters, and transcription factor binding sites that drive lineage-specific gene expression. At the same time, because DNA methylation patterns are inherited during cell division and are often lineage-specific, single-cell methylome analysis can provide a stable record of cell identity and lineage history. Combining these technologies gives a multi-layered approach to tracking cell fate trajectories and mapping differentiation hierarchies.

This integrated strategy has been shown to be particularly effective in dynamic systems such as hematopoiesis, where multipotent progenitor cells differentiate into a range of blood and immune cells. By analyzing thousands of single cells at different developmental stages, researchers can reconstruct branching lineage trees and pinpoint key regulatory events associated with fate decisions. This dual-omics approach can also reveal intermediate cell states and rare cell populations that transcriptomics alone may miss.

Beyond developmental biology, the method is being applied to disease areas such as leukemia and solid tumors, where abnormal lineage selection leads to pathological changes. It also provides a valuable benchmark for evaluating stem cell differentiation in regenerative medicine and cell therapy development.

In summary, the combination of single-cell ATAC-seq and methylome profiling provides a powerful toolkit for analyzing epigenetic control of lineage specification, allowing scientists to gain a deeper understanding of cell identity, memory, and fate transitions in health and disease states.

References
[1] Leif Ludwig et al., Cell 2019 (DOI: 10.1016/j.cell.2019.01.022)
[2] Hanqing Liu et al., Nature 2023 (https://lnkd.in/e77cmT3s)

#SingleCellEpigenomics #ATACseq #Methylome #LineageTracing #CellFateMapping #StemCellBiology #DevelopmentalBiology #Hematopoiesis #EpigeneticMemory #ChromatinAccessibility #RegenerativeMedicine #SingleCellOmics #PrecisionBiology #Epigenetics #CancerResearch #CSTEAMBiotech

Fabian Theis

Director @Helmholtz Munich’s Computational Health Department & Professor at TU Munich

13,633 followers 6mo
🧬 Our new paper “Nicheformer: a foundation model for single-cell and spatial omics” is out now in Nature Methods!
👉 Paper https://lnkd.in/dnGb5sPF

This work, led by Alejandro Tejada and Anna Schaar, introduces Nicheformer, a transformer-based foundation model that connects single-cell and spatial transcriptomics to better understand how cells are organized within tissues. Many thanks to everyone in the lab and to our collaborators who generously shared spatial datasets early on!

🔍 What we did
- We assembled SpatialCorpus-110M, a harmonized resource of about 57 million dissociated single cells and 53 million spatially resolved cells across 73 tissues from human and mouse.
- Nicheformer is pre-trained on single-cell RNA-seq data together with in silico dissociated spatial samples, learning broad gene-expression representations across tissues, species, and technologies.
- In a fine-tuning stage, the model is then trained on genuine spatial data (e.g., from brain, lung, or liver) to learn spatial context explicitly. We therefore recommend using the spatially fine-tuned versions when applying Nicheformer to specific tissues.
- The model enables niche label or tissue region prediction, and even cell density prediction from dissociated data, for instance revealing differential spatial signatures in lung cancer.

🧩 Why this matters
Single-cell omics has revolutionized molecular biology, but once cells are dissociated, the spatial relationships that define tissue structure are lost. Yet traces of spatial organization remain in gene expression, something previously leveraged by approaches such as CellPhoneDB, novoSpaRc, or NCEM. Nicheformer builds on this insight by embedding such traces in a foundation-model framework, allowing spatial structure to be re-inferred or transferred across datasets.

📈 Key results
- Nicheformer outperforms earlier foundation models (Geneformer, scGPT, UCE) and non-transformer embeddings such as autoencoders (scVI) or linear methods (PCA) on spatial and region-label prediction.
- Even linear probing of frozen Nicheformer embeddings captures stronger spatial signal than prior models.
- This also suggests that spatial context prediction is complex enough to benefit from such models, in contrast to potentially simpler tasks such as genetic perturbation modeling in cell lines, where simpler, often linear, models frequently remain competitive.
- The model can transfer spatial information onto dissociated scRNA-seq datasets, enabling contextualized analysis where spatial data are unavailable.
- Attention maps show interpretable layer-wise structure, from broad gene attention in early layers to context-specific focus later on, reflecting biologically relevant organization.

🔭 What’s next
By encoding spatial architecture during pretraining, we aim to further connect single-cell and spatial data to inform more integrative, multi-scale representations of tissue organization and disease.
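For readers unfamiliar with the term, "linear probing of frozen embeddings" means the pretrained model's outputs are treated as fixed features and only a linear classifier is trained on top. The sketch below illustrates that idea with a tiny logistic-regression probe in pure Python. It does not use Nicheformer or its API: the 2-D `embeddings`, the binary region `labels`, and the function names are hypothetical stand-ins for real model outputs and annotations.

```python
import math

def train_linear_probe(embeddings, labels, lr=0.5, epochs=200):
    """Fit logistic regression on fixed embeddings; the embeddings are never updated."""
    dim = len(embeddings[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of region 1
            g = p - y                        # gradient of the log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Return the predicted region label (0 or 1) for one embedding."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical frozen embeddings for six cells, with made-up tissue-region labels.
embeddings = [(0.1, 0.9), (0.2, 0.8), (0.0, 1.0), (0.9, 0.1), (0.8, 0.3), (1.0, 0.0)]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_linear_probe(embeddings, labels)
train_accuracy = sum(
    predict(weights, bias, x) == y for x, y in zip(embeddings, labels)
) / len(labels)
print(f"training accuracy of the probe: {train_accuracy:.2f}")
```

Because the probe itself is so simple, its accuracy mostly reflects how much spatial signal is already linearly readable from the embeddings, which is why it is a common way to compare representation models.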

Francesca Finotello

Associate Professor, Leading the group of Computational Biomedicine at the University of Innsbruck, Austria

1,357 followers 6mo
🚀 Preprint alert

omnideconv: a unifying framework for using and benchmarking single-cell-informed deconvolution of bulk RNA-seq

“Second-generation” deconvolution methods estimate cell-type composition from bulk RNA-seq by learning cell-type-specific signatures directly from scRNA-seq. This flexibility is powerful, but it also makes validation and application-specific optimization tricky, because signatures are derived "on the fly" from the input single-cell data.

To facilitate the usage, assessment, and optimization of these second-generation methods, we built omnideconv: an ecosystem of tools and resources to unlock deconvolution for any cell type, tissue, or organism. It includes:
- omnideconv (R package): uniform access to multiple second-gen methods
- deconvExplorer (web app): interactively inspect signatures & results
- deconvData: curated validation datasets
- SimBu (R package): simulate pseudo-bulk RNA-seq with controlled mixtures & realistic mRNA content
- deconvBench (Nextflow): comprehensive, reproducible benchmarking

Using this ecosystem, we assembled >5000 real and simulated RNA-seq samples with matched ground truth to benchmark methods across different scenarios and address key open challenges, e.g., how biological and technical biases affect performance.

📄 Expanded study: https://lnkd.in/dvi3c_7m
🌐 Project: https://omnideconv.org/

I hope omnideconv resources and study can be of help to a wide community! 🚀

Huge thanks to Lorenzo Merotto and Alexander Dietrich for leading this effort, and to Markus List for this fantastic collaboration! 💫
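The core idea behind signature-based deconvolution and pseudo-bulk simulation can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the omnideconv or SimBu API: the two signatures, the gene count, and the function names `mix` and `estimate_fraction` are hypothetical, and real methods handle many cell types, noise, and mRNA-content bias.

```python
def mix(signature_a, signature_b, fraction_a):
    """Simulate a pseudo-bulk profile as a weighted sum of two cell-type signatures."""
    return [fraction_a * x + (1.0 - fraction_a) * y
            for x, y in zip(signature_a, signature_b)]

def estimate_fraction(bulk, signature_a, signature_b):
    """Least-squares estimate of fraction_a, assuming the two fractions sum to 1."""
    diff = [x - y for x, y in zip(signature_a, signature_b)]
    residual = [b - y for b, y in zip(bulk, signature_b)]
    numerator = sum(r * d for r, d in zip(residual, diff))
    denominator = sum(d * d for d in diff)
    return min(1.0, max(0.0, numerator / denominator))  # clip to [0, 1]

# Hypothetical per-gene signatures; in practice these would be derived from scRNA-seq,
# for example by averaging the cells annotated to each type.
t_cell_signature = [10.0, 0.0, 5.0, 1.0]
tumor_signature = [0.0, 8.0, 5.0, 3.0]

# Known ground truth: 30% T cells, 70% tumor cells.
bulk = mix(t_cell_signature, tumor_signature, fraction_a=0.3)
estimated = estimate_fraction(bulk, t_cell_signature, tumor_signature)
print(f"estimated T-cell fraction: {estimated:.2f}")
```

Simulating mixtures with known fractions and checking how well they are recovered is the same ground-truth logic that benchmarking on pseudo-bulk data relies on, just at toy scale.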

Lei Guo

Computational Biologist at UT Southwestern Medical Center

8,224 followers 3w
New tutorial on NGS101.com 🚀

Part 13 of my single-cell RNA-seq series is live: RNA Velocity Analysis with scVelo
https://lnkd.in/gkTSa5qt

Every analysis we've done so far (clustering, cell type annotation, trajectory, CellChat) captured a static snapshot of gene expression. RNA velocity goes one step further and tells you where each cell is headed.

The secret? Unspliced pre-mRNA. When a gene is actively being turned on, unspliced RNA accumulates faster than it can be spliced. When a gene is shutting down, the opposite happens. By comparing spliced and unspliced counts for thousands of genes simultaneously, scVelo assigns each cell a velocity vector: a directional arrow showing its predicted future transcriptional state.

In Part 13 you'll learn how to:
→ Generate spliced and unspliced count matrices with STARsolo
→ Transfer cell type labels and UMAP coordinates from your Seurat object into Python
→ Understand AnnData, Python's answer to the Seurat object
→ Fit scVelo's dynamical model to recover transcription, splicing, and degradation rates per gene
→ Visualize RNA velocity as directional stream plots on your UMAP
→ Compute latent time, a root-free, kinetics-based pseudotime
→ Identify genes with significantly different transcriptional kinetics between conditions

This is also the first Python tutorial in the series. If you've only worked in R, don't worry: every step is explained from scratch and the biological context stays exactly the same.

---

🧪 Want to learn RNA-seq analysis hands-on? Cohort 1 of my Live RNA-seq Workshop for Absolute Beginners is now closed, but you can join the waitlist for the next cohort: https://lnkd.in/gXGEtBdH
Spots fill fast. Get on the list early.

#SingleCellRNAseq #RNAVelocity #scVelo #Bioinformatics #ComputationalBiology #NGS101 #scRNAseq #Python
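The spliced versus unspliced intuition can be sketched for a single gene in pure Python. Note that this uses the simpler steady-state ratio model from the original RNA velocity formulation, not scVelo's dynamical model and not the scVelo API; the counts and the function names `fit_gamma` and `velocities` are made up for illustration.

```python
def fit_gamma(spliced, unspliced):
    """Least-squares slope through the origin: unspliced is modeled as gamma * spliced."""
    numerator = sum(u * s for u, s in zip(unspliced, spliced))
    denominator = sum(s * s for s in spliced)
    return numerator / denominator

def velocities(spliced, unspliced, gamma):
    """Per-cell velocity for one gene: positive suggests induction, negative repression."""
    return [u - gamma * s for u, s in zip(unspliced, spliced)]

# Toy gene: four reference cells assumed to be at steady state (unspliced = 0.5 * spliced).
steady_spliced = [2.0, 4.0, 6.0, 8.0]
steady_unspliced = [1.0, 2.0, 3.0, 4.0]
gamma = fit_gamma(steady_spliced, steady_unspliced)

# Two query cells with the same spliced level but different unspliced levels.
query_spliced = [4.0, 4.0]
query_unspliced = [3.0, 1.0]  # excess unspliced vs. deficit of unspliced
gene_velocity = velocities(query_spliced, query_unspliced, gamma)
print(f"gamma = {gamma}, velocities = {gene_velocity}")
```

A cell with more unspliced RNA than the steady-state line predicts gets a positive velocity (the gene is being switched on), and one with less gets a negative velocity. Repeating this across thousands of genes is what yields the per-cell direction described in the post.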

ZHAOHUI (MARVIN) MAN

Bioinformatician | Data Scientist | Computational Biologist | Clinical Informatics | Open to H-1B Cap-Exempt & Global Opportunities (U.S. Preferred)

5,901 followers 3mo
🧬 New Breakthrough in Single-Cell RNA-seq Analysis!

Excited to share a powerful new tool for the single-cell community: scAURA, a graph-based contrastive learning framework that's transforming how we cluster single-cell transcriptomic data.

🎯 Why this matters:
Single-cell RNA-seq clustering remains challenging due to high dimensionality, sparsity, and technical noise. scAURA tackles these issues head-on with an innovative approach combining:
✅ Adaptive k-nearest neighbor graph construction
✅ Debiased contrastive learning with alignment & uniformity
✅ Self-supervised clustering optimization
✅ Robust handling of dropout noise

📊 Impressive results:
Tested across 18 benchmark datasets spanning 6 sequencing platforms, scAURA:
• Outperformed 13 SOTA methods in 9 datasets (ARI) and 8 datasets (NMI)
• Achieved best average ranking: 2.28 (ARI) and 2.39 (NMI)
• Maintained stable performance even under 50% simulated dropout
• Showed 31.4% improvement in challenging datasets like Camp-Brain

🔬 Real-world application:
In an Alzheimer's disease dataset, scAURA successfully:
• Identified distinct cell types (microglia, neurons, oligodendrocytes, etc.)
• Discovered novel cell type-specific marker genes
• Inferred potential transcriptional regulators (ZBTB43, ZNF711, ELK1, etc.)

🔗 Check it out:
Code & data: https://lnkd.in/egYQBMr2
Paper: https://lnkd.in/eiV7z-nz

Huge kudos to Jubair Ibn Malik Rifat, Sarthak Engala, and Serdar Bozdag from the University of North Texas for this excellent work!

#SingleCell #scRNAseq #Bioinformatics #MachineLearning #ComputationalBiology #DataScience #GraphNeuralNetworks #CellClustering #AlzheimersResearch
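The ARI figures quoted above refer to the adjusted Rand index, a standard score for comparing a predicted clustering against reference labels. The following is a minimal pure-Python sketch of that metric, not code from scAURA; the cell labels are invented, and in practice one would typically use an established implementation such as scikit-learn's.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells (needs >= 2 cells)."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))  # contingency table cells
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    index = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both labelings
        return 1.0
    return (index - expected) / (max_index - expected)

# Made-up reference annotation for six cells.
truth = ["T", "T", "T", "B", "B", "B"]
perfect = [1, 1, 1, 0, 0, 0]        # same grouping, different cluster names
one_mistake = [1, 1, 0, 0, 0, 0]    # one T cell assigned to the B cluster

ari_perfect = adjusted_rand_index(truth, perfect)
ari_mistake = adjusted_rand_index(truth, one_mistake)
print(f"perfect: {ari_perfect:.3f}, one mistake: {ari_mistake:.3f}")
```

ARI ignores cluster names and corrects for chance agreement, so a score of 1 means identical groupings and values near 0 mean no better than random assignment.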

Dino Di Carlo

14,888 followers 3mo
🎶 Much like people, cells are defined by how they influence others. The symphony of biology arises from a series of synchronized cellular duets.

So much of biology (and especially immunology) emerges from brief, context-dependent encounters: a dendritic cell priming a T cell, a cytotoxic T cell confronting a tumor cell, a stem cell listening to its niche, or a neuron negotiating with microglia. Yet single-cell RNA-seq mostly gives us solo recordings, and while spatial profiling shows who’s on stage together, proximity and co-expression don’t reliably reveal the duet itself: who changed whom, in what direction, and how quickly.

Today we’re sharing our Cell-Cell-seq preprint: a workflow to profile defined interacting cell pairs (“dyads”) at single-cell transcriptomic resolution, using Nanovials to confine two cells, synchronize contact onset, protect fragile conjugates, and plug directly into droplet-based scRNA-seq. https://lnkd.in/gaVZKG-H

In a prostate tumor–T cell model, we capture thousands of dyads and see broad functional and transcriptional heterogeneity, including transient activation programs such as immediate early genes (IEGs), that are often blurred in standard bulk co-culture. To make dyad transcriptomes interpretable (they’re inherently “mixed”), we introduce accessible analysis pieces, including pseudo-mixing (an empirical null distribution built from non-interacting transcriptomes combined in silico to form pseudo-dyads) and ccRepair (to correct compositional dilution while preserving true cross-cell coordination).

Launching the Billion Cell×Cell Project: I see this preprint as an early score for a much larger effort: building a causal atlas of cell–cell communication. This ambitious project will map out not just which cells are present or adjacent, but what actually changes when two cells meet (and how variable those changes are across many encounters). One duet at a time, you can start to reconstruct the symphony.

A deliberate goal here is lowering the barrier to entry:
• Nanovials are available through Partillion Bioscience
• The experimental workflow interfaces with standard droplet scRNA-seq pipelines (no bespoke instrument required)
• The computational approaches are generalizable and easily applied.

If your science depends on cell–cell encounters, from immune synapses to stem-cell niches, wound repair, or neuron–glia crosstalk, what duets should we score next? And beyond RNA, what matters most: proteins, chromatin, secreted signals, perturbations, or functional outcomes like killing, migration, or differentiation?

Huge thanks to our co-authors: SEVANA BAGHDASARIAN, Qingyang Wang, Justin Langerman, Zhiyuan Mao, Heather Wright, Caitlin Gee, Donghui Cheng, Jami McLaughlin, John K. Lee, Xiaojing Chen, K. Christopher Garcia, Jingyi Jessica Li, Owen N. Witte, Kathrin Plath

#Immunology #CancerResearch #ComputationalBiology #CellTherapy #OpenScience #SystemsBiology
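The pseudo-mixing idea described in the post, building a null distribution from non-interacting cells combined in silico, can be sketched roughly as follows. This is not the authors' implementation of pseudo-mixing or ccRepair: the three-gene count vectors, the observed value, and the function names are hypothetical, and the sketch assumes a dyad's counts are simply the sum of its two cells.

```python
import random

def pseudo_dyads(cells_a, cells_b, n_pairs, seed=0):
    """Sum counts of randomly paired, non-interacting cells to form in silico dyads."""
    rng = random.Random(seed)
    dyads = []
    for _ in range(n_pairs):
        a = rng.choice(cells_a)
        b = rng.choice(cells_b)
        dyads.append([x + y for x, y in zip(a, b)])
    return dyads

def empirical_p_value(observed, null_values):
    """Share of null values at least as large as observed, with a +1 correction."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)

# Made-up counts per cell for three genes: [tumor marker, T-cell marker, IEG].
tumor_cells = [[20, 0, 1], [18, 0, 2], [22, 0, 0]]
t_cells = [[0, 15, 1], [0, 12, 0], [0, 14, 2]]

null_dyads = pseudo_dyads(tumor_cells, t_cells, n_pairs=200)
null_ieg = [dyad[2] for dyad in null_dyads]  # IEG counts expected without interaction

observed_ieg = 9  # hypothetical IEG count measured in a real interacting dyad
p_value = empirical_p_value(observed_ieg, null_ieg)
print(f"empirical p-value for IEG induction: {p_value:.4f}")
```

The point of the null is that a real dyad is always a mixture of two transcriptomes, so a gene only counts as "changed by the encounter" if it exceeds what mixing two non-interacting cells would already produce.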

Olivier Elemento

Director, Englander Institute for Precision Medicine & Associate Director, Institute for Computational Biomedicine

10,493 followers 5mo
🔬 Why We Should Stop Throwing Away the "Clumps" in Single-Cell Analysis

To study tumors at single-cell resolution, scientists routinely dissociate tissue into individual cells using enzymes. Any clumps that survive get filtered out as "doublets" (assumed to be artifacts) before flow cytometry or sequencing.

A new Nature paper from Daniel Peeper's lab shows this filtering discards the most valuable cells. CD8+ T cells physically clustered with tumor cells or APCs are 9-fold more likely to be tumor-reactive than singlets. These aren't random aggregates; they're cells caught mid-synapse, actively engaging their targets.

The clustered T cells are enriched for CD39+PD-1+, the molecular signature of tumor reactivity. They maintain a TCF7+ stem-like phenotype, exactly the durable responders you want for adoptive cell therapy. And remarkably, they survive brief enzymatic digestion.

I think this reframes how we should approach TIL isolation for therapy. The authors show that simply relaxing doublet gates and expanding clusters separately yields a population with dramatically higher tumor-killing capacity.

It also makes a strong case for spatial methods. Technologies like Xenium, MERSCOPE, and CosMx now measure 20,000+ genes at single-cell resolution, plus protein panels, all while preserving tissue architecture and cell-cell contacts. No dissociation, no lost clusters, no guessing which cells were neighbors. I think we're approaching an inflection point where dissociation-based workflows become the exception rather than the rule.

📄 https://lnkd.in/e9BAWynu
Sofia Ibañez Molero, Johanna Veldman, Daniel Peeper (The Netherlands Cancer Institute)

🎯 Ming "Tommy" Tang

Director of Bioinformatics | Cure Diseases with Data | Author of From Cell Line to Command Line | AI x bioinformatics | >130K followers, >30M impressions annually across social platforms | Educator YouTube @chatomics

66,276 followers 11mo
Everyone talks about KNN like it’s a simple algorithm. But in practice, especially in single-cell RNA-seq, it’s art, not science.

1/ KNN (k-nearest neighbors) is simple: classify a point based on the majority of its k closest neighbors. Only one hyperparameter: k. So why is it tricky?

2/ Because choosing k isn’t obvious.
Small k = low bias, high variance
Large k = low variance, high bias
Classic tradeoff. No free lunch.

3/ In machine learning, we try to find a sweet spot where we don't overfit or underfit. But in real data, especially biological data, this balance is delicate.

4/ Take single-cell RNA-seq analysis: we build a kNN graph to model cell similarity. But this isn’t raw data; we build it on PCA-reduced space. Now we have two knobs:
how many PCs
what value of k

5/ Let’s talk PCs first.
Too few PCs → miss structure
Too many PCs → include noise
How many should you include? Default in Seurat is 50. But that's arbitrary. For a simple 3k PBMC dataset, 15 PCs may suffice. I have worked with really complicated neuron single-cell datasets using even 100 PCs.

6/ Some use elbow plots. Others use jackstraw or permutation tests. Here’s a post I wrote on that: https://lnkd.in/ejMqVxmT
No method is perfect. Some of it is still feeling.

7/ Okay, back to k. In Seurat, the default is k=20 (Scanpy's default is 15). But should you use that? It depends:
How many total cells?
How rare are the cell types?

8/ If you have a rare cell type with only 50 cells in your dataset... Using k=20 may be too aggressive. You’re pulling in neighbors from other cell types. That creates noise.

9/ But if your dataset has 100k cells, using k=5 may over-cluster. Your clusters may fragment.

10/ So what’s the answer? There is no one-size-fits-all. This is where art comes in. You try k=10, k=30, k=50. Look at the plots. Compare DE genes. Validate with known markers.

11/ Machine learning in bioinformatics isn’t just grid search + accuracy metrics. You need biological sanity. Ask:
Are rare populations preserved?
Do known markers align?
Does k affect downstream conclusions?

12/ Also, interpret carefully. KNN isn’t magic. It gives you proximity, not truth. It builds a scaffold. You have to decorate it with biological understanding.

13/ Key takeaways:
KNN is simple in theory, nuanced in practice
PCA + KNN = double dose of parameter tuning (I did not even talk about the resolution parameter...)
Always visualize, validate, and question defaults

14/ In science, we want everything to be rigorous. But some steps require judgment, not just stats. KNN in single-cell is one of them. Know the math. Trust your eyes. Ask good questions. That’s where the art lives.

I hope you've found this post helpful. Follow me for more. Subscribe to my FREE newsletter chatomics to learn bioinformatics https://lnkd.in/erw83Svn
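The point about rare cell types and large k can be checked on a toy example. The sketch below is an illustration only, not Seurat or Scanpy code: the 2-D coordinates stand in for a PCA-reduced space, the population sizes (50 abundant cells, 5 rare cells) are made up, and `knn_purity` is a hypothetical helper that measures what fraction of a rare cell's neighbors share its label.

```python
import math

def knn_purity(points, labels, target_label, k):
    """Mean fraction of each target cell's k nearest neighbors that share its label."""
    fractions = []
    for i, (p, label) in enumerate(zip(points, labels)):
        if label != target_label:
            continue
        # Distances to every other cell, sorted ascending; keep the k closest.
        neighbors = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        same = sum(1 for _, j in neighbors if labels[j] == target_label)
        fractions.append(same / k)
    return sum(fractions) / len(fractions)

# 50 abundant cells on a small grid, plus 5 rare cells in a tight group far away.
abundant = [(x * 0.1, y * 0.1) for x in range(10) for y in range(5)]
rare = [(5.0 + 0.01 * i, 5.0) for i in range(5)]
points = abundant + rare
labels = ["abundant"] * len(abundant) + ["rare"] * len(rare)

purity_small_k = knn_purity(points, labels, "rare", k=3)
purity_large_k = knn_purity(points, labels, "rare", k=10)
print(f"rare-cell neighbor purity: k=3 -> {purity_small_k:.2f}, k=10 -> {purity_large_k:.2f}")
```

Once k exceeds the number of other cells in the rare population, every rare cell is forced to connect to unrelated cells, no matter how well separated the group is. That is the mechanism behind the advice to try several values of k and check that rare populations survive.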
