Object-Centric Learning Methods in Robotics


Summary

Object-centric learning methods in robotics help robots understand, interact with, and manipulate individual objects by focusing on their properties and relationships, rather than relying solely on the robot’s overall surroundings. This approach allows robots to learn behaviors through observing and engaging with objects, making their actions more adaptable and intuitive in real-world environments.

  • Emphasize active interaction: Encourage robots to explore and learn about objects by interacting with them directly, rather than depending on static information or predefined models.
  • Prioritize diverse demonstrations: Gather a wide range of real-world scenarios and examples to teach robots how to handle different objects and tasks, improving their adaptability.
  • Use modular learning strategies: Break down complex tasks into smaller steps focused on object attributes and actions, helping robots build up their skills in a more manageable and scalable way.
  • View profile for Daily Papers

    Machine Learning Engineer at Hugging Face

    11,992 followers

    Humans can rearrange objects in cluttered spaces without GPS or 3D maps: we just look, push, and adjust. Teaching robots to do the same has remained a challenge. Researchers from NYU's AI4CE Lab are now sharing EgoPush, a framework that enables mobile robots to perform long-horizon, multi-object rearrangement using only a single egocentric camera. No global state estimation. No external tracking. Just pure visual perception.

    The core idea is elegantly human-inspired: instead of tracking absolute positions (which fail when objects move or hide), EgoPush learns relative spatial relationships between objects. Here's how it works:

    First, it designs an object-centric latent space that encodes relations rather than poses. A privileged RL teacher learns from sparse keypoints, but with a crucial twist: the teacher's observations are restricted to visually accessible cues. This forces it to develop active perception behaviors that the student policy can actually recover from its partial viewpoint.

    Second, for long-horizon credit assignment, EgoPush decomposes rearrangement into stage-level subproblems using temporally decayed completion rewards. This keeps learning focused and efficient.

    The results? EgoPush significantly outperforms end-to-end RL baselines in simulation and demonstrates zero-shot sim-to-real transfer on real hardware.

    What makes this particularly engaging is the interactive Unity WebGL demo the team built. You can explore the framework directly in your browser, watching how the robot reasons about object relationships in real time.

    Paper: https://lnkd.in/epCFCaP2
    Project page: https://lnkd.in/eUSPCEzX
    (Uploading trained models and datasets to Hugging Face would help the robotics community build on this work more easily!)
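The temporally decayed completion reward can be sketched in a few lines. This is a minimal illustration under my own assumptions (the class and parameter names are invented, not from the EgoPush paper): each stage pays a one-time bonus that shrinks the longer the episode has run, so the policy is pushed to finish stages quickly.

```python
class DecayedCompletionReward:
    """Pay a one-time, time-decayed bonus when each stage first completes."""

    def __init__(self, num_stages, base=1.0, decay=0.99):
        self.base = base
        self.decay = decay
        self.completed = [False] * num_stages
        self.t = 0  # elapsed environment steps

    def step(self, stage_done_flags):
        """Call once per env step with a done-flag per stage; returns reward."""
        self.t += 1
        reward = 0.0
        for i, done in enumerate(stage_done_flags):
            if done and not self.completed[i]:
                self.completed[i] = True
                # Earlier completion => larger bonus.
                reward += self.base * (self.decay ** self.t)
        return reward
```

Because each stage's bonus is paid exactly once, the agent cannot farm reward by toggling a stage, and the decay keeps credit concentrated near the actions that actually finished it.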

  • View profile for Yiannis Aloimonos

    Professor of Computational Vision and Intelligence at University of Maryland

    6,435 followers

    LLMs in the service of Active Perception

    Perception and action (e.g. planning and control) are fundamental to intelligence, yet they are often studied separately in robotics. Traditionally, perception transforms sensory signals into symbolic representations, while action relies on symbolic models to generate movement. This approach assumes perception provides a static 3D model for planning, but it does not reflect biological reality. In nature, perception and action co-evolve, forming an interwoven process. For example, when cutting a tomato, one must first perceive and locate the knife, then grasp the knife, then perceive the tomato, then bring the knife to the tomato, perform the cutting motion, and finally verify the cut.

    These interleaved sequences of perception and action can be captured in Perception-Action-Coordination (PAC) programs (consisting of both perceptual and motor functions), which serve as modular, compositional building blocks of intelligence, enabling counterfactual reasoning and control. By shifting from traditional planning to PAC programs, we can integrate active perception and visual reasoning, which Large Language Models (LLMs) can help structure. Unlike natural language, programs have a clear grammar, making them ideal for LLMs.

    A key application is in attribute learning, where robots learn about object properties (e.g., weight, size) by interacting with them rather than relying on static datasets, which do not scale. Vision-Language Models (VLMs) may align linguistic instructions with visual information but fail to grasp non-visual attributes like weight. To address this, the proposed framework combines LLMs, VLMs, and robotic control functions to generate PAC programs that actively explore object attributes. These programs invoke sensory, manipulation, and navigation functions, allowing robots to reason beyond visual perception and understand an object's properties through interaction.

    This approach moves towards more intelligent robots that can see, think, and act to explore their environments. The work, titled "Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs" by @Angelos Mavrogiannis, @Dehao Yuan, and @Yiannis Aloimonos, will be presented at ICRA 2025 in Atlanta.

    Project Page: https://lnkd.in/eff8grGq
    arXiv: https://lnkd.in/e2dDEFz3
    #AI #Robotics #LLM #VLM #ActivePerception
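A PAC program for discovering a non-visual attribute like weight might look like the sketch below. This is a hedged illustration: the robot API (`detect`, `grasp`, `read_force_sensor`) and the stub class are my own inventions standing in for the paper's Perception-Action APIs, not its actual interface.

```python
class StubRobot:
    """Minimal stand-in robot so the PAC program below runs end to end.
    A real system would bind these names to actual perception and control."""

    def __init__(self, scene):
        self.scene = scene  # obj_name -> {"pose": ..., "weight_kg": ...}
        self.holding = None

    def detect(self, name):              # perceptual function
        return self.scene[name]["pose"]

    def move_to(self, pose):             # motor function
        pass

    def grasp(self, name):               # motor function
        self.holding = name

    def read_force_sensor(self):         # non-visual sensory function
        return self.scene[self.holding]["weight_kg"]

    def release(self):
        self.holding = None


def discover_weight(robot, obj_name):
    """A PAC program: interleaved perception and action steps, ending with
    a non-visual sensor reading that a VLM alone could not provide."""
    pose = robot.detect(obj_name)   # perceive
    robot.move_to(pose)             # act
    robot.grasp(obj_name)           # act
    weight = robot.read_force_sensor()  # sense beyond vision
    robot.release()
    return weight
```

The point of the structure is that an LLM can emit programs like `discover_weight` against a fixed API grammar, composing perceptual and motor primitives it never has to implement itself.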

  • View profile for Supriya Rathi

    110k+ | India #1. World #10 | Physical-AI | Podcast Host - SRX Robotics | Connecting founders, researchers, & markets | DM to post your research | DeepTech

    112,779 followers

    Diverse in-flight object catching using a quadruped robot with a basket. The objective is to accurately predict the impact point, defined as the object's landing position. This task poses two key challenges: the absence of public datasets capturing diverse objects under unsteady aerodynamics, which are essential for training reliable predictors; and the difficulty of accurate early-stage impact point prediction when trajectories appear similar across objects.

    #research: https://lnkd.in/datda8aM
    #authors: Huy Nguyen Ngoc, Kazuki Shibata, Takamitsu Matsubara, Nara Institute of Science and Technology (NAIST), Robot Learning Laboratory

    To overcome these issues, they construct a real-world dataset of 8,000 trajectories from 20 objects, providing a foundation for advancing in-flight object catching under complex aerodynamics. They then propose the Discriminative Impact Point Predictor (DIPP), consisting of two modules: (i) a Discriminative Feature Embedding (DFE) that separates trajectories by dynamics to enable early-stage discrimination and generalization, and (ii) an Impact Point Predictor (IPP) that estimates the impact point from these features. Two IPP variants are implemented: a Neural Acceleration Estimator (NAE)-based method that predicts trajectories and derives the impact point, and a Direct Point Estimator (DPE)-based method that directly outputs it.

    Experimental results show that this dataset is more diverse and complex than existing datasets, and that this method outperforms baselines on both 15 seen and 5 unseen objects. Furthermore, they show that improved early-stage prediction enhances catching success in simulation and demonstrate the effectiveness of this approach through real-world experiments.
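As rough intuition for impact point prediction (not the paper's DIPP, which learns these quantities), here is a toy predictor that fits a constant-acceleration model to early trajectory samples and extrapolates to ground impact. Objects under unsteady aerodynamics are exactly where such a simple model breaks down, which is what motivates the learned approach.

```python
import numpy as np

def predict_impact_point(times, xs, zs):
    """Fit x(t) linearly and z(t) quadratically from early samples,
    then extrapolate to the ground-impact time where z = 0."""
    t = np.asarray(times, dtype=float)
    # Least-squares fit of vertical motion z(t) = a*t^2 + b*t + c.
    A = np.vstack([t**2, t, np.ones_like(t)]).T
    a, b, c = np.linalg.lstsq(A, np.asarray(zs, dtype=float), rcond=None)[0]
    # Linear fit of horizontal motion x(t) = d*t + e.
    d, e = np.polyfit(t, np.asarray(xs, dtype=float), 1)
    # Impact time: the larger real root of a*t^2 + b*t + c = 0.
    t_hit = max(r.real for r in np.roots([a, b, c]) if abs(r.imag) < 1e-9)
    return d * t_hit + e
```

For a ball with drag, spin, or tumbling, the fitted constant acceleration is wrong early in flight, so trajectories from different objects look alike; DFE's job in the paper is to disambiguate them from those same early observations.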

  • View profile for Ahsen Khaliq

    ML @ Hugging Face

    36,021 followers

    Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

    Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface for specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration, given a single static multi-view object scan.

    We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization, which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration's intended behavior while considering the robot's own morphological limits, rather than attempting to reproduce the hand's motion.

    We evaluate 4D-DPM's 3D tracking accuracy on ground-truth-annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average success rate of 87%, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models, without any task-specific training, fine-tuning, dataset collection, or annotation.
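The analysis-by-synthesis idea in 4D-DPM can be illustrated in miniature (this is my own toy sketch, not the paper's differentiable renderer): render a candidate part pose, compare it to the observation, and keep the pose that minimizes the error. Here "rendering" is just a 2D rotation and the search is brute force rather than gradient-based.

```python
import math

def fit_part_rotation(observed_pt, base_pt, steps=360):
    """Find the planar rotation of `base_pt` that best matches `observed_pt`."""

    def render(theta):
        # Stand-in for rendering: rotate the part's reference point by theta.
        x, y = base_pt
        return (x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta))

    # Brute-force search over candidate angles; the real method instead
    # backpropagates a feature-field rendering loss through the pose.
    candidates = (2.0 * math.pi * i / steps for i in range(steps))
    return min(candidates, key=lambda th: math.dist(render(th), observed_pt))
```

Repeating this fit per frame yields a part-motion trajectory, which is the shape of the output the robot planner consumes in RSRD.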

  • View profile for Aaron Prather

    Director, Robotics & Autonomous Systems Program at ASTM International

    84,858 followers

    Robot models, especially those trained with lots of data, have recently shown impressive skills in handling objects and moving around in real-world environments. Some studies have shown that with enough training data from a specific environment, robots can adapt their actions to different situations within that space. However, these robots often need extra fine-tuning when introduced to new environments, unlike language or vision models that can work immediately in new settings without additional adjustments.

    This new research conducted by New York University, Hello Robot Inc, and Meta introduces Robot Utility Models (RUMs), a new way to train and deploy robots that can adapt to new environments without any extra training. To build RUMs, the team developed tools to quickly gather data for tasks involving object manipulation, like opening drawers or picking up items. They used this data to train robots using multi-modal imitation learning and tested the system on a basic robot model called Hello Robot Stretch. The robots achieved an average success rate of 90% in new environments with new objects. The robots were trained for specific tasks such as opening cabinets, picking up napkins, and repositioning fallen items. These utility models also worked well with different robots and camera setups without needing more data or adjustments.

    Key lessons learned include the importance of quality training data, the need for diverse demonstrations, and the value of strategies that allow robots to retry tasks to improve their performance.

    📝 Research Paper: https://lnkd.in/dHN3CctB
    📊 Project Page: https://lnkd.in/dq3KC5nU
    #robotics #research
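One of the lessons above, letting robots retry failed attempts, is easy to sketch as a control-loop wrapper. This is a generic pattern, not code from the RUMs system; both callbacks are placeholders (a deployed system would use a learned success detector rather than a hand-written check).

```python
def run_with_retries(attempt_task, check_success, max_attempts=3):
    """Run `attempt_task` until `check_success` reports success or the
    attempt budget is exhausted. Returns the successful attempt number,
    or None if every attempt failed."""
    for attempt in range(1, max_attempts + 1):
        attempt_task()          # execute one full attempt of the skill
        if check_success():     # verify the outcome before moving on
            return attempt
    return None
```

Even this trivial wrapper changes the math: a skill with per-attempt success rate p succeeds within n attempts with probability 1 - (1 - p)^n, which is how retries lift end-to-end success rates.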

  • View profile for Sina Pourghodrat (PhD)

    Surgical Robotics Engineer

    9,449 followers

    🚀 Amazon FAR (Frontier AI & Robotics) introduces OmniRetarget: teaching humanoids to interact with objects and their environment, just like humans do.

    Here's the main idea (simplified): in robotics, teaching humanoids complex skills means showing them how humans move and interact, but simply copying human motions (or using them as kinematic references) doesn't work cleanly. Human body vs. robot body: not the same shape, not the same joints, not the same kinematics. On top of that, interactions (touching objects, walking on surfaces) are often lost or distorted during retargeting (the process of adapting human motions to robot bodies). OmniRetarget fixes this.

    What is OmniRetarget? A system that converts human motion + human scenes into robot-compatible motion while preserving interactions (contacts, spatial relations) with objects and terrain. It uses an interaction mesh to model where contacts happen (hand touching box, feet on ground) and keeps them consistent when mapping to a robot. From one demonstration (a recording of a human performing the task), it can generate many variations: different robots, object positions, terrains.

    Why is it better than older approaches? Older methods often ignore interaction preservation, leading to artifacts like foot sliding or unrealistic motions. OmniRetarget enforces both robot limits (joints, geometry) and real interactions (which part touches what) at the same time. It produces 8+ hours of high-quality trajectories, beating baselines in realism and consistency. Trained reinforcement learning (RL) policies can now perform long, complex tasks (up to 30 seconds) on a physical humanoid (Unitree G1).

    📖 Open-source contribution: they are releasing the OmniRetarget Dataset, over 8 hours of humanoid loco-manipulation and interaction data, freely available on Hugging Face: https://lnkd.in/eYBn2hfe

    Why does this matter? Robots don't just need to move, they must interact with the world. High-quality, interaction-aware data has been a major bottleneck. OmniRetarget makes this data available to the community, helping researchers and companies build humanoids that can operate in cluttered, object-rich environments.

    📖 Full paper: https://lnkd.in/ej2But4W
    👩💻 GitHub: https://lnkd.in/ejmUahtr
    👩🔬 Authors: Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C. Karen Liu, Rocky Duan, Guanya Shi

    Thank you, Lujie Yang, for giving permission to use the video. Video: the Unitree G1 humanoid carries a chair, climbs, leaps, and rolls, all in real time, using only its own body senses (no vision or LiDAR). A big step toward agile, human-like loco-manipulation.
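The contact-preservation idea at the heart of OmniRetarget can be illustrated with a simple checker. This is an illustrative sketch with invented names; the actual system enforces contacts inside an optimization, jointly with joint limits and collision constraints, rather than checking them after the fact.

```python
def contacts_preserved(demo_contacts, retargeted_frames, tol=0.02):
    """Verify that every demonstrated contact survives retargeting.

    demo_contacts: list of (frame_idx, body_name, contact_point_xyz).
    retargeted_frames: frame_idx -> {body_name: xyz after retargeting}.
    A contact counts as preserved if the robot body stays within `tol`
    meters of the demonstrated contact point in that frame."""
    for frame_idx, body, target in demo_contacts:
        pos = retargeted_frames[frame_idx][body]
        dist = sum((a - b) ** 2 for a, b in zip(pos, target)) ** 0.5
        if dist > tol:
            return False
    return True
```

Artifacts like foot sliding show up precisely as failures of this kind of check: the foot's retargeted position drifts away from the demonstrated ground-contact point across frames.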

  • View profile for Pascal Biese

    AI Lead at PwC </> Daily AI highlights for 80k+ experts 📲🤗

    84,840 followers

    Symbolic AI meets Deep RL: A new framework for interpretable agents

    Deep reinforcement learning's (RL's) reliance on shortcut learning often prevents generalization to even slightly different environments. Symbolic methods using object-centric states have tried to address this, but comparing them to pixel-based deep agents isn't entirely fair. SCoBots is a new framework that breaks down RL tasks into intermediate steps, using interpretable object-centric concepts to make decisions. This helps us understand why an agent takes certain actions.

    The key is combining the strengths of neural networks and symbolic AI:
    1. Learning object-centric representations from raw pixel data
    2. Applying object-centric RL
    3. Distilling policies into explainable rules

    The first end-to-end SCoBot was put to the test on Atari games, with promising results for both performance and interpretability. While there's still work to be done, SCoBots are a step towards RL agents we can understand and trust, even in complex environments.

    ↓ Liked this post? Join my newsletter with 50k+ readers that breaks down all you need to know about the latest LLM research: llmwatch.com 💡
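The three-step pipeline above can be caricatured in a few lines. This is my own toy example in a Pong-like setting, not the SCoBots code: object-centric state in, a human-readable relational concept out, and a decision rule you can actually inspect.

```python
def relational_concepts(objects):
    """Turn raw object positions into interpretable relational concepts.
    In a SCoBots-style agent these objects would come from a detector run
    on pixels; here they are given directly."""
    ball_x, _ = objects["ball"]
    paddle_x, _ = objects["paddle"]
    return {"ball_right_of_paddle": ball_x - paddle_x}

def distilled_policy(concepts):
    """An explainable rule: no network introspection needed to see why
    the agent moved."""
    dx = concepts["ball_right_of_paddle"]
    if dx > 0:
        return "RIGHT"
    if dx < 0:
        return "LEFT"
    return "NOOP"
```

The interpretability claim lives in the middle layer: because decisions are functions of named concepts rather than pixels, a shortcut (e.g. reacting to the score display instead of the ball) becomes visible and removable.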

  • View profile for Akshet Patel 🤖

    Robotics Engineer | Creator

    52,968 followers

    1. Scan
    2. Demo
    3. Track
    4. Render
    5. Train models
    6. Deploy

    What if robots could learn new tasks from just a smartphone scan and a single human demonstration, without needing physical robots or complex simulations?

    [⚡Join 2400+ Robotics enthusiasts - https://lnkd.in/dYxB9iCh]

    A paper by Justin Yu, Letian (Max) Fu, Huang Huang, Karim El-Refai, Rares Andrei Ambrus, Richard Cheng, Muhammad Zubair Irshad, and Ken Goldberg from the University of California, Berkeley and Toyota Research Institute introduces a scalable approach for generating robot training data without dynamics simulation or robot hardware: "Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware"

    • Utilises a smartphone-captured object scan and a single human demonstration video as inputs
    • Reconstructs detailed 3D object geometry and tracks 6-DoF object motion using 3D Gaussian Splatting
    • Synthesises thousands of high-fidelity, robot-agnostic demonstrations through photorealistic rendering and inverse kinematics
    • Generates data compatible with vision-language-action models and imitation learning policies
    • Demonstrates that models trained on this data can match the performance of those trained on 150 human teleoperation demonstrations
    • Achieves a 27× increase in data generation throughput compared to traditional methods

    This approach enables scalable robot learning by decoupling data generation from physical robot constraints. It opens avenues for democratising robot training data collection, allowing broader participation using accessible tools. If robots can be trained effectively without physical hardware or simulations, how will this transform the future of robotics?

    Paper: https://lnkd.in/emjzKAyW
    Project Page: https://lnkd.in/evV6UkxF
    #RobotLearning #DataGeneration #ImitationLearning #RoboticsResearch #ICRA2025
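The core trick of scaling one demonstration into many can be sketched as re-anchoring the demonstrated trajectory to newly sampled object poses. This is a deliberately simplified 2D-translation version of the idea under my own assumptions; the actual pipeline works with full 6-DoF poses, photorealistic rendering, and inverse kinematics.

```python
import random

def reanchor_demo(demo_traj, orig_obj_pose, new_obj_pose):
    """Shift a demonstrated gripper path so it acts on the object at its
    new pose (translation only, for illustration)."""
    dx = new_obj_pose[0] - orig_obj_pose[0]
    dy = new_obj_pose[1] - orig_obj_pose[1]
    return [(x + dx, y + dy) for x, y in demo_traj]

def generate_dataset(demo_traj, orig_obj_pose, n, rng=None):
    """Sample n random object placements and emit one synthetic demo each,
    all derived from a single human demonstration."""
    rng = rng or random.Random(0)
    poses = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    return [reanchor_demo(demo_traj, orig_obj_pose, p) for p in poses]
```

Because each synthetic demo is derived geometrically rather than re-collected, throughput is limited by rendering compute instead of human teleoperation time, which is where the reported 27× speedup comes from.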

  • View profile for Chris Paxton

    AI + Robotics Research Scientist

    8,795 followers

    Reasoning over long horizons would allow robots to generalize better to unseen environments and settings zero-shot. One mechanism for this kind of reasoning would be world models, but traditional video world models still tend to struggle with long horizons, and are very data-intensive to train. But what if, instead of predicting images of the future, we predicted just the symbolic information necessary for reasoning?

    Nishanth Kumar tells us about Pixels to Predicates, a method for symbol grounding which allows a VLM to plan sequences of robot skills to achieve unseen goals in previously unseen settings. To find out more, watch episode #44 of RoboPapers with Michael Cho and Chris Paxton now!

    Abstract: Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically, across experiments in both simulation and the real world, that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.

    Project Page: https://lnkd.in/e6CWZm8P
    arXiv: https://lnkd.in/esnEW5DR
    Watch now on...
    Substack: https://lnkd.in/eph7Ew7Y
    YouTube: https://lnkd.in/eZkkYN5T
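The test-time procedure, build a symbolic state and then search for a skill sequence that reaches the goal, can be sketched with a plain breadth-first search over predicate sets. This is my own minimal planner with invented skill and predicate names; in the paper, the predicates are proposed and evaluated by a VLM and the world model is learned, not hand-written.

```python
from collections import deque

def plan(initial_state, goal, skills):
    """BFS over symbolic states. `skills` maps name -> (preconditions,
    add_effects, delete_effects), each a set of predicate strings.
    Returns the shortest skill sequence making `goal` hold, or None."""
    start = frozenset(initial_state)
    goal = frozenset(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:           # every goal predicate holds
            return path
        for name, (pre, add, delete) in skills.items():
            if pre <= state:        # skill is applicable here
                nxt = frozenset((state - frozenset(delete)) | frozenset(add))
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [name]))
    return None
```

Because states are sets of predicates rather than images, the search space stays small enough that a novel goal in a novel scene can be solved zero-shot once the VLM has produced the initial symbolic description.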

  • View profile for Peter Farkas

    Robotics - Automation - Assembly > | Business Development | Sales | Channel Management

    7,879 followers

    Manipulating objects without grasping them is an essential component of human dexterity, referred to as non-prehensile manipulation.

    Learning Hybrid Actor-Critic Maps for 6D Non-Prehensile Manipulation
    CoRL 2023 (Oral Presentation)

    Non-prehensile manipulation may enable more complex interactions with the objects, but also presents challenges in reasoning about robotic gripper-object interactions. In this work, we introduce Hybrid Actor-Critic Maps for Manipulation (HACMan), a reinforcement learning approach for 6D non-prehensile manipulation of objects using point cloud observations. HACMan proposes a temporally-abstracted and spatially-grounded object-centric action representation that consists of selecting a contact location from the object point cloud and a set of motion parameters describing how the robot will move after making contact. We modify an existing off-policy RL algorithm to learn in this hybrid discrete-continuous action representation.

    GitHub code: https://lnkd.in/gRj4Cp7Z

    Wenxuan Zhou (1,2), Bowen Jiang (1), Fan Yang (1), Chris Paxton (2*), David Held (1*)
    1) Carnegie Mellon University
    2) Meta AI
    *Equal Advising
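HACMan's hybrid action representation can be illustrated at the interface level. This is a sketch with stand-in callables of my own; in the paper, the per-point scores come from a learned critic map over the point cloud and the motion parameters from a learned actor network.

```python
def select_hybrid_action(point_cloud, score_point, motion_params):
    """Hybrid discrete-continuous action: the discrete part picks the
    highest-scoring contact point on the object; the continuous part
    attaches motion parameters predicted for that point."""
    best_idx = max(range(len(point_cloud)),
                   key=lambda i: score_point(point_cloud[i]))
    contact = point_cloud[best_idx]
    return contact, motion_params(contact)
```

Grounding the discrete choice in the object's own point cloud is what makes the representation object-centric: the action space moves with the object instead of being fixed in the workspace.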
