Rethinking Monocular 3D through Spherical Vision

Monocular 3D estimation has long been limited by rigid assumptions about camera geometry. UniK3D redefines that boundary by viewing vision itself as a spherical function. Instead of predicting depth in flat pixel space, it models every possible camera through spherical harmonics, learning a continuous field of rays that can represent any projection from pinhole to panoramic. This removes the need for intrinsics or calibration and enables a single model to recover metric 3D geometry from any image. At its core, UniK3D is not just a model for depth but a way of expressing how cameras perceive space as smooth variations on a sphere. It shifts the problem from calibration to understanding.
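To make the ray-field idea concrete, here is a minimal NumPy sketch (not UniK3D's implementation): if a model predicts a per-pixel ray direction on the unit sphere plus a radial distance, metric 3D points follow directly, with no pinhole intrinsics needed at unprojection time. The angle ranges and resolution below are purely illustrative.

```python
# Sketch of the general idea: per-pixel unit rays + radial distances -> metric 3D points.
import numpy as np

def rays_from_angles(azimuth, polar):
    """Per-pixel unit ray directions from spherical angles (radians)."""
    x = np.sin(polar) * np.cos(azimuth)
    y = np.sin(polar) * np.sin(azimuth)
    z = np.cos(polar)                      # polar = angle from the optical axis
    return np.stack([x, y, z], axis=-1)    # (H, W, 3), unit norm by construction

def unproject(rays, radial_distance):
    """Metric 3D points: each unit ray scaled by its predicted radial distance."""
    return rays * radial_distance[..., None]

# Toy "camera": a 45-degree cone of rays around the optical axis, everything 2 m away.
H, W = 4, 6
polar, azimuth = np.meshgrid(np.linspace(0.0, np.pi / 4, H),
                             np.linspace(0.0, 2 * np.pi, W), indexing="ij")
points = unproject(rays_from_angles(azimuth, polar), np.full((H, W), 2.0))
print(points.shape)   # (4, 6, 3)
```

The same unprojection works whether the ray field describes a pinhole, a fisheye, or a full panorama, which is why a learned ray field can stand in for explicit calibration.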
Innovations in Depth Estimation Techniques
Summary
Innovations in depth estimation techniques are transforming how machines perceive the 3D structure of the world from images or video. Depth estimation is the task of determining how far objects in a scene are from the camera, a capability that is crucial for robotics, autonomous vehicles, augmented reality, and more.
- Explore new camera models: Consider emerging approaches that model camera vision with spherical functions or combine multiple optical systems to capture richer scene information.
- Utilize AI for transparency: Take advantage of modern video diffusion models and synthetic datasets to improve depth prediction for transparent objects like glass and water, which have traditionally been challenging for algorithms.
- Balance speed and quality: Weigh the trade-offs between faster video-based depth estimation methods and the consistency or accuracy of their results, and explore refinement steps if higher quality is needed.
-
Transparent objects are one of computer vision's longest-standing challenges. Refraction, reflection, and transmission break the assumptions behind stereo, time-of-flight, and even modern monocular depth systems. The result? Robots that hesitate and algorithms that hallucinate holes where glass should be.

But what if the solution has been hiding in plain sight? Modern video diffusion models generate incredibly convincing transparent objects—glass, water, plastic—seamlessly. They've already learned the physics. We just need to ask the right question. This paper does exactly that.

Introducing DKT: Diffusion Knows Transparency. Instead of building a depth estimator from scratch, the team starts with a video diffusion model and repurposes it using lightweight LoRA adapters. The model becomes a video-to-video translator: RGB in, depth and normals out. The secret sauce? Concatenating RGB and (noisy) depth latents in the DiT backbone, then co-training on a massive new synthetic dataset called TransPhy3D (11k sequences, 1.32M frames) plus existing data. This yields temporally consistent predictions for videos of any length, even in-the-wild.

The numbers are compelling:
- Zero-shot SOTA on ClearPose, DREDS, and TransPhy3D-Test
- A 1.3B parameter model running at 0.17s/frame
- Real robot grasping experiments show measurable gains

The broader claim is fascinating: generative priors can be repurposed, efficiently and label-free, into robust perception. Everything is publicly available on Hugging Face—models, interactive demos, the full dataset, and code. This feels less like a one-off paper and more like a template for the future: using foundation models that understand physics to bootstrap specialized perception systems.

Links:
Paper: https://lnkd.in/er_Y8Y_g
Demo: https://lnkd.in/eyXFrZvb
Models: https://lnkd.in/eZHjqx_n
Dataset: https://lnkd.in/eqh9xSMp
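For readers unfamiliar with the two ingredients named above, here is an illustrative PyTorch sketch, not the DKT implementation: channel-wise concatenation of RGB and noisy depth latents before a transformer layer, and a LoRA-style low-rank adapter on a frozen linear projection. All shapes, layer sizes, and names are assumptions for the toy example.

```python
# Illustrative only: latent concatenation + a LoRA adapter on a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # pretrained weights stay frozen
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)          # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Toy video latents laid out as (batch, channels, frames, height, width).
rgb_latent = torch.randn(1, 4, 8, 32, 32)
noisy_depth_latent = torch.randn(1, 4, 8, 32, 32)
x = torch.cat([rgb_latent, noisy_depth_latent], dim=1)   # 8-channel conditioned input

# Stand-in for a DiT input projection, wrapped with the trainable LoRA adapter.
proj = LoRALinear(nn.Linear(8, 64))
tokens = x.flatten(2).transpose(1, 2)                    # (1, num_tokens, channels)
out = proj(tokens)
print(out.shape)                                         # torch.Size([1, 8192, 64])
```

Only the small `down`/`up` matrices receive gradients, which is what makes repurposing a large pretrained video model cheap.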
-
Today I gave the 3D Estimation lecture as part of my course on Image Processing and Computer Vision. The core of the lecture focused on Structure from Motion (SfM), a method that jointly estimates 3D scene structure and camera poses from a collection of 2D images. The process typically begins with feature detection, followed by feature matching across image pairs. After estimating relative poses using epipolar constraints and RANSAC, initial 3D points are triangulated. As more images are added, incremental SfM tracks new features and re-optimizes the entire scene via bundle adjustment, minimizing reprojection error over camera intrinsics, extrinsics, and 3D point locations. The result is a sparse 3D point cloud and a globally consistent set of camera poses.

To connect the classical with the modern, I ended with an introduction to Gaussian Splatting, a recent technique for high-quality, real-time novel view synthesis. Gaussian Splatting uses a set of 3D Gaussians to represent the scene, where each Gaussian has a mean (3D position), covariance (defining its shape and orientation), opacity, and color. Given a novel camera viewpoint, the Gaussians are projected onto the image plane and rendered via a differentiable splatting process. The initial 3D positions of the Gaussians are typically obtained via SfM, after which the representation is optimized end-to-end using gradient-based methods to match rendered views with the training images.

In the context of autonomous driving, Gaussian Splatting can be used to build high-fidelity reconstructions of urban environments from multi-sensor data collected by vehicles to generate novel views for simulation, map verification, or synthetic training data generation. Gaussian Splatting illustrates how classical pipelines continue to provide essential foundations for modern, learning-based approaches.

#3destimation #novelviewsynthesis #computervision
Images borrowed from Kerbl et al. and huggingface
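As a companion to the pipeline described above, here is a minimal two-view sketch using OpenCV: feature detection, matching, essential-matrix estimation with RANSAC, pose recovery, and triangulation. It assumes known intrinsics `K` and two placeholder image paths, and it omits incremental tracking and bundle adjustment, so it is a teaching sketch rather than a full SfM system.

```python
# Minimal two-view structure-from-motion sketch with OpenCV (no bundle adjustment).
import cv2
import numpy as np

K = np.array([[700.0, 0, 320.0], [0, 700.0, 240.0], [0, 0, 1.0]])  # assumed intrinsics
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# 1) Feature detection and matching.
orb = cv2.ORB_create(4000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 2) Relative pose from epipolar geometry with RANSAC.
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)   # in practice, pass the inlier mask too

# 3) Triangulate an initial sparse point cloud (up to scale).
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])      # first camera at the origin
P2 = K @ np.hstack([R, t])                             # second camera from recovered pose
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # homogeneous 4xN
points3d = (pts4d[:3] / pts4d[3]).T
```

In incremental SfM this two-view seed is then grown image by image, with bundle adjustment re-optimizing all poses and points after each addition.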
-
Visualized with Rerun, I’ve integrated video-based depth estimation into my robot-training pipeline to make data collection as accessible as possible—without requiring specialized hardware. Traditionally, achieving accurate depth requires multiple calibrated, time‑synchronized cameras or expensive sensors like Intel RealSense, which are cost‑prohibitive and tricky to set up. With this new addition, I’m one step closer to supporting just a common phone or webcam.

Originally, I ran VGGT over multiview video sequences, which delivers good consistency across views but suffers from:
1. **Low throughput**. On an RTX 5090, processing a 40-second recording from eight 640×480 @ 30 FPS streams takes about 15 minutes—so a 5-minute recording requires nearly 2 hours of compute.
2. **Occasional catastrophic failures**, where VGGT’s predictions collapse and consistency is lost.

To speed things up, I evaluated Video Depth Anything (VDA), which trades off some accuracy for performance:
1. **5× faster**. VDA processes the same 5-minute video in ~20 minutes instead of 2 hours.
2. **Greater robustness**. I’ve seen no large-scale failures.

The downside is noticeably poorer multiview consistency and temporal stability. To bridge the gap, I’m exploring a splatting refinement step—using DN‑Splatter to optimize the initial depth maps with a photometric-rendering loss and a Pearson depth loss, which should improve consistency. But then we're back to trading off runtime; I'm still not sure what the right solution is.

Overall, this experiment is promising but not yet a substitute for dedicated depth sensors. A more reliable interim solution might be using an iPhone or iPad with LiDAR—still more accessible than multiple RealSense units, though less universal than plain webcams. I’ll release this code alongside the rest of our Gradio annotation pipeline. From here, I'll be testing on some self-collected data instead of using the wonderful baseline HOCAP dataset!
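For context on the Pearson depth loss mentioned for the splatting refinement, here is a small PyTorch sketch of one common formulation: it penalizes rendered depth that is decorrelated from a monocular depth prior, which makes it tolerant to the prior's unknown scale and shift. The exact loss used in DN‑Splatter may differ in detail.

```python
# Sketch of a Pearson-correlation depth loss for splatting refinement (illustrative).
import torch

def pearson_depth_loss(rendered: torch.Tensor, prior: torch.Tensor, eps: float = 1e-6):
    """1 - Pearson correlation between flattened depth maps (scale/shift tolerant)."""
    r = rendered.flatten() - rendered.mean()
    p = prior.flatten() - prior.mean()
    corr = (r * p).sum() / (r.norm() * p.norm() + eps)
    return 1.0 - corr

rendered = torch.rand(480, 640, requires_grad=True)   # depth rendered from the splats
prior = torch.rand(480, 640)                          # monocular depth prediction
loss = pearson_depth_loss(rendered, prior)
loss.backward()
```

Because only correlation is rewarded, the splat optimization can still settle on its own metric scale while inheriting the prior's relative structure.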
-
🚨SIGGRAPH 2024 Paper Alert 🚨

➡️Paper Title: Split-Aperture 2-in-1 Computational Cameras

🌟Few pointers from the paper

🎯While conventional cameras offer versatility for applications ranging from amateur photography to autonomous driving, computational cameras allow for domain-specific adaptation. Cameras with co-designed optics and image processing algorithms enable high-dynamic-range image recovery, depth estimation, and hyperspectral imaging by optically encoding scene information that is otherwise undetected by conventional cameras.

🎯However, this optical encoding creates a challenging inverse reconstruction problem for conventional image recovery, and often lowers the overall photographic quality. Thus, computational cameras with domain-specific optics have only been adopted in a few specialized applications where the captured information cannot be acquired in other ways.

🎯In this work, the authors investigated a method that combines two optical systems into one to tackle this challenge. They split the aperture of a conventional camera into two halves: one that applies an application-specific modulation to the incident light via a diffractive optical element to produce a coded image capture, and one that applies no modulation to produce a conventional image capture.

🎯Co-designing the phase modulation of the split aperture with a dual-pixel sensor allowed them to simultaneously capture these coded and uncoded images without increasing the physical or computational footprint.

🎯With an uncoded conventional image alongside the optically coded image in hand, they investigated image reconstruction methods that are conditioned on the conventional image, making it possible to eliminate artifacts and compute costs that existing methods struggle with.

🎯They assessed the proposed method with 2-in-1 cameras for optical high-dynamic-range reconstruction, monocular depth estimation, and hyperspectral imaging, comparing favorably to all tested methods in all applications.

🏢Organization: Princeton University, KAUST (King Abdullah University of Science and Technology), Saudi Arabia

🧙Paper Authors: Zheng Shi, Ilya Chugunov, Mario Bijelic, Geoffroi Côté, Jiwoon Yeom, Qiang Fu, Hadi AMATA, Wolfgang Heidrich, Felix Heide

1️⃣Read the Full Paper here: https://lnkd.in/gzDDpGgU
2️⃣Project Page: https://lnkd.in/gFkkqee2
3️⃣Code: https://lnkd.in/guw8AaBK

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊
🎵 Music by Yevgeniy Sorokin from Pixabay

Find this valuable 💎? ♻️REPOST and teach your network something new.

Follow me 👣, Naveen Manwani, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#SIGGRAPH2024
-
Last month, our team faced a nightmare scenario: our photogrammetry reconstruction of a historic cathedral completely failed. 150 high-res images, perfect lighting conditions, but the complex Gothic arches and repetitive stone patterns confused every algorithm we tried. Traditional photogrammetry relies on feature matching between images; when surfaces are repetitive or lack texture, the matching fails catastrophically. We were looking at weeks of manual cleanup or starting over.

𝗧𝗵𝗲𝗻 𝗜 𝗿𝗲𝗺𝗲𝗺𝗯𝗲𝗿𝗲𝗱 𝗗𝗲𝗽𝘁𝗵𝗔𝗻𝘆𝘁𝗵𝗶𝗻𝗴 𝘃2. Instead of requiring multiple images with overlapping features, this AI model extracts depth information from single photographs. What took our photogrammetry pipeline 6 hours and failed, DepthAnything v2 processed in 5 minutes per image.

𝗧𝗵𝗲 𝗯𝗿𝗲𝗮𝗸𝘁𝗵𝗿𝗼𝘂𝗴𝗵: 𝗔𝗜 𝘂𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝘀 𝗱𝗲𝗽𝘁𝗵 𝗹𝗶𝗸𝗲 𝗵𝘂𝗺𝗮𝗻𝘀 𝗱𝗼. While traditional algorithms look for matching pixels, DepthAnything v2 recognizes depth cues—shadows, perspective, relative object sizes, atmospheric effects. It works on challenging surfaces that break conventional methods.

Here's what's changing in 3D reconstruction:
→ 𝗦𝗶𝗻𝗴𝗹𝗲-𝗶𝗺𝗮𝗴𝗲 𝗽𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴 eliminates complex multi-view requirements
→ 𝗔𝗜 𝗱𝗲𝗽𝘁𝗵 𝗲𝘀𝘁𝗶𝗺𝗮𝘁𝗶𝗼𝗻 handles textureless and repetitive surfaces
→ 𝗜𝗻𝘀𝘁𝗮𝗻𝘁 𝗿𝗲𝘀𝘂𝗹𝘁𝘀 from what used to take hours or days
→ 𝗥𝗲𝘀𝗰𝘂𝗲 𝗰𝗮𝗽𝗮𝗯𝗶𝗹𝗶𝘁𝘆 for failed traditional reconstructions

𝗠𝘆 𝗽𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻: 𝗠𝗼𝗻𝗼𝗰𝘂𝗹𝗮𝗿 𝗱𝗲𝗽𝘁𝗵 𝗲𝘀𝘁𝗶𝗺𝗮𝘁𝗶𝗼𝗻 𝘄𝗶𝗹𝗹 𝗯𝗲𝗰𝗼𝗺𝗲 𝗮 𝗸𝗲𝘆 𝗺𝗲𝘁𝗵𝗼𝗱 𝗳𝗼𝗿 𝟯𝗗 𝗰𝗮𝗽𝘁𝘂𝗿𝗲 𝘄𝗶𝘁𝗵𝗶𝗻 𝟭𝟴 𝗺𝗼𝗻𝘁𝗵𝘀. The implications are massive. Architectural documentation, heritage preservation, e-commerce 3D models, AR content creation—all democratized through AI that understands spatial relationships like human vision. It's a fundamental shift from geometric matching to semantic understanding of 3D space.

Complete tutorial with Python implementation here: 👉 https://lnkd.in/ekridph7

What 3D reconstruction challenges are you facing that traditional methods can't solve?
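If you want to try single-image depth yourself, a minimal route is the Hugging Face `transformers` depth-estimation pipeline. The checkpoint name below is the Small Depth Anything V2 model commonly published under the `depth-anything` organization; treat the exact model id and file names as assumptions and adjust to your setup.

```python
# Minimal sketch: relative depth from a single photograph with Depth Anything V2.
from transformers import pipeline
from PIL import Image

pipe = pipeline(task="depth-estimation",
                model="depth-anything/Depth-Anything-V2-Small-hf")
image = Image.open("cathedral_arch.jpg")            # any single RGB photograph
result = pipe(image)
result["depth"].save("cathedral_arch_depth.png")    # PIL image of the predicted depth map
```

Note that the output is relative depth; turning it into a metric or mesh-ready reconstruction still needs a scale reference or a downstream fusion step.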
-
🌑🌑 Navigation In Complete Darkness Using A Monocular Camera 🌑🌑

Check out our latest work, "AsterNav: Autonomous Aerial Robot Navigation In Darkness Using Passive Computation," published in the IEEE Robotics and Automation Letters (IEEE RA-L) by Deepak Singh and Shreyas Khobragade, where we use nature-inspired custom apertures to obtain passive depth cues for accurate metric depth estimation using a cheap Raspberry Pi camera. No REAL data or Calibration NEEDED! #sim2real #computationalimaging Code is Open-Source! The paper videos are shot at max ISO using an $8,000 filming setup from Nikon!

📄 PDF: https://lnkd.in/gACQnWBn
🌎 Project Website: https://lnkd.in/gaUxxta2
💻 Code: https://lnkd.in/gtXVzjic
📹 Full Video: https://lnkd.in/gSXH5VyD

Autonomous aerial navigation in absolute darkness is crucial for post-disaster search and rescue operations, which often take place during disaster-zone power outages. Yet, due to resource constraints, tiny aerial robots, perfectly suited for these operations, are unable to navigate in the darkness to find survivors safely. In this paper, we present an autonomous aerial robot for navigation in the dark by combining an Infra-Red (IR) monocular camera with a large-aperture coded lens and structured light, without external infrastructure like GPS or motion capture. Our approach obtains depth-dependent defocus cues (each structured-light point appears as a pattern whose shape depends on depth), which act as a strong prior for our AsterNet deep depth estimation model. The model is trained in simulation by generating data using a simple optical model and transfers directly to the real world without any fine-tuning or retraining. AsterNet runs onboard the robot at 20 Hz on an NVIDIA Jetson Orin Nano. Furthermore, our network is robust to changes in the structured light pattern and the relative placement of the pattern emitter and IR camera, leading to simplified and cost-effective construction. We successfully evaluate and demonstrate our proposed depth navigation approach, AsterNav, using depth from AsterNet in many real-world experiments with only onboard sensing and computation, including dark matte obstacles and thin ropes (diameter 6.25 mm), achieving an overall success rate of 95.5% with unknown object shapes, locations, and materials. To the best of our knowledge, this is the first work on monocular, structured-light-based quadrotor navigation in absolute darkness.

P.S. Shreyas Khobragade is on the job market! Hire him :)

Kudos to Deepak Singh and Shreyas Khobragade for the amazing work. #pearwpi Worcester Polytechnic Institute #dronesforgood #searchandrescue #SAR
AsterNav: Autonomous Aerial Robot Navigation In Darkness Using Passive Computation (IEEE RA-L 2026)
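The depth-dependent defocus cue described above can be illustrated with a generic thin-lens circle-of-confusion calculation: a projected light dot blurs in proportion to how far it sits from the focus distance, so its blur size passively encodes depth. This is standard optics, not the paper's specific optical model or coded-aperture design.

```python
# Generic thin-lens illustration of why blur size of a projected dot encodes depth.
def circle_of_confusion(obj_dist, focus_dist, focal_length, aperture_diam):
    """Blur-circle diameter on the sensor for a point at obj_dist (all units meters)."""
    return (aperture_diam * abs(obj_dist - focus_dist) / obj_dist
            * focal_length / (focus_dist - focal_length))

# Large aperture (f/1.4, 8 mm lens) focused at 1 m: blur grows quickly away from focus.
for d in (0.5, 1.0, 2.0, 4.0):
    c = circle_of_confusion(d, focus_dist=1.0, focal_length=0.008,
                            aperture_diam=0.008 / 1.4)
    print(f"depth {d:.1f} m -> blur diameter {c * 1e6:.1f} um")
```

A coded aperture replaces this simple blur circle with a depth-dependent pattern, which is what gives the network a much stronger prior than plain defocus.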
-
I’m a huge fan of the work coming out of Princeton’s Computational Imaging Lab. One of my favorites is “Shakes on a Plane,” which estimates depth from a single camera burst (≈ 42 RAW frames) plus gyroscope data, without any pre-training or extra sensors.

They also learn an implicit image, and although it looks a bit like a NeRF variant, there’s no ray-marching here. Instead, they enforce photometric consistency via simple reprojection with learned camera poses, which drives the depth into a plausible state. Depth is represented as a global plane plus learned per-pixel offsets rather than a full 3D occupancy field, which is totally sensible in this context.

It’s admittedly slow right now (~15 min per burst) and relies on a couple of small MLPs, but that feels like an implementation choice, not a fundamental limitation.
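To make the "plane plus per-pixel offsets, supervised by reprojection" idea concrete, here is a toy PyTorch sketch under assumed shapes, intrinsics, and poses. It is not the paper's implementation (which learns poses and uses small MLPs); it only shows how such a depth parameterization can be driven by a photometric warp between two burst frames.

```python
# Toy sketch: depth = global plane + per-pixel offsets, optimized via reprojection.
import torch
import torch.nn.functional as F

H, W = 120, 160
K = torch.tensor([[150.0, 0, W / 2], [0, 150.0, H / 2], [0, 0, 1.0]])
u, v = torch.meshgrid(torch.arange(W, dtype=torch.float32),
                      torch.arange(H, dtype=torch.float32), indexing="xy")

plane = torch.nn.Parameter(torch.tensor([0.0, 0.0, 2.0]))   # depth ~ a*u + b*v + c
offsets = torch.nn.Parameter(torch.zeros(H, W))             # learned per-pixel residual

def depth_map():
    return plane[0] * u + plane[1] * v + plane[2] + offsets

def reproject(depth, R, t):
    """Back-project pixels with the current depth, transform, and project again."""
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)           # (H, W, 3)
    pts = (torch.linalg.inv(K) @ pix.reshape(-1, 3).T) * depth.reshape(1, -1)
    pts = R @ pts + t[:, None]
    proj = K @ pts
    return (proj[:2] / proj[2]).T.reshape(H, W, 2)                  # target pixel coords

# One step of a stand-in photometric loss between two burst frames.
frame_a, frame_b = torch.rand(H, W), torch.rand(H, W)
R, t = torch.eye(3), torch.tensor([0.01, 0.0, 0.0])                 # small hand-shake motion
coords = reproject(depth_map(), R, t)
grid = torch.stack([coords[..., 0] / (W - 1) * 2 - 1,
                    coords[..., 1] / (H - 1) * 2 - 1], dim=-1)[None]
warped = F.grid_sample(frame_b[None, None], grid, align_corners=True)
loss = (warped[0, 0] - frame_a).abs().mean()
loss.backward()   # gradients flow into the plane parameters and per-pixel offsets
```

The appeal of this parameterization is that the plane carries the coarse geometry while the offsets only need to explain small deviations, which keeps the optimization well behaved.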
-
Big news for the 3D computer vision community! 🙌 ByteDance released Depth Anything 3 on Hugging Face 🔥.

This is the world's most powerful model for 3D understanding: it predicts spatially consistent geometry (depth and ray maps) from an arbitrary number of visual inputs, with or without known camera poses. In other words, it allows you to reconstruct a 3D scene just from 2D inputs. DA3 extends monocular depth estimation to any-view scenarios, so the model can take in single images, multi-view images, and video.

Interestingly, the authors reveal two key insights:
- A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture is required.
- A single depth-ray representation objective is enough. The model does not require complex multi-task training.

Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series. Metric estimation, also called absolute estimation, determines distances in meters relative to the camera, whereas relative (monocular) depth estimation only determines how far pixels are relative to one another.

The authors also released a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering. DA3 sets a new state-of-the-art across all 10 tasks, surpassing the prior SOTA, Meta's VGGT, by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Furthermore, DA3 facilitates SLAM (Simultaneous Localization and Mapping) and 3D Gaussian Splatting by providing a robust and generalizable method for predicting spatially consistent geometry from various visual inputs.

Links:
- Models: https://lnkd.in/eFFHJhJx
- Paper: https://lnkd.in/ewtxy7p6
- Demo: https://lnkd.in/e7Qr3tnG
- Code: https://lnkd.in/e89B6JpR
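To see why per-view depth and ray maps are a convenient output, here is a small NumPy sketch, not DA3's API: given each view's depth map, per-pixel unit rays, and a camera-to-world pose, every view unprojects into its own frame and the poses place all points in one world coordinate system. The array names and shapes are assumptions for illustration.

```python
# Sketch: fuse per-view depth + ray maps into one world-frame point cloud.
import numpy as np

def fuse_views(depths, rays, cam_to_world):
    """depths: list of (H, W); rays: list of (H, W, 3) unit rays; cam_to_world: list of (4, 4)."""
    world_points = []
    for depth, ray, T in zip(depths, rays, cam_to_world):
        pts_cam = ray * depth[..., None]                        # (H, W, 3) in camera frame
        pts = pts_cam.reshape(-1, 3) @ T[:3, :3].T + T[:3, 3]   # rotate + translate to world
        world_points.append(pts)
    return np.concatenate(world_points, axis=0)

# Toy inputs: two tiny views looking straight ahead, the second shifted 0.5 m to the right.
H, W = 2, 3
ray = np.tile(np.array([0.0, 0.0, 1.0]), (H, W, 1))
T0, T1 = np.eye(4), np.eye(4)
T1[:3, 3] = [0.5, 0.0, 0.0]
cloud = fuse_views([np.full((H, W), 2.0)] * 2, [ray] * 2, [T0, T1])
print(cloud.shape)   # (12, 3)
```

This is also why such outputs plug naturally into SLAM or Gaussian Splatting pipelines, which consume posed point geometry.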
-
Depth estimation from a single RGB image is the perfect deep learning problem because it is geometrically ill-posed. Solving it requires tapping into world priors and learned expectations to hallucinate the most probable reality, just as humans do. In the past couple of years, discriminative and generative models have battled it out to see who can do it best, treating the task respectively as regression in one case and as image-to-image translation in the other. While both are fast and capable of generalizing in the wild because the pretrained features are rich, diffusion-based approaches tend to generate better details at the cost of inferior global geometry.

However, Pixel-Perfect Depth (NeurIPS 2025) showed that when relying on a latent diffusion process, there is simply no way to avoid flying points in the predictions, no matter how good the model is. In other words, despite the already existing superiority in the details, the predictive quality of generative depth models was always upper-bounded by the lossy compression of the VAE. Ouch!

The solution? Bringing back diffusion in pixel space and introducing key technical updates in the Diffusion Transformer architecture. In particular, their Semantics-Prompted DiT also allows integrating discriminative features into the denoising process, achieving the best of both worlds. The result is a generative model with incredible pixel-aligned depth and global geometry even better than the leading discriminative backbones.

Full breakdown of the architecture in the comments, enjoy!