EgoDemoGen: Egocentric Demonstration Generation for Viewpoint Generalization in Robotic Manipulation

UCAS1, CASIA2, GigaAI3, Tsinghua University4, X-Humanoid5, FiveAges6
*Equal Contribution    Corresponding Author

Abstract

Imitation-learning-based visuomotor policies have achieved strong performance in robotic manipulation, yet they often remain sensitive to egocentric viewpoint shifts. Unlike third-person viewpoint changes, which only move the camera, egocentric shifts simultaneously alter both the camera pose and the robot's action coordinate frame, making it necessary to jointly transfer action trajectories and synthesize the corresponding observations under novel egocentric viewpoints. To address this challenge, we present EgoDemoGen, a framework that generates paired observation–action demonstrations under novel egocentric viewpoints through two key components: 1) EgoTrajTransfer, which transfers robot trajectories to the novel egocentric coordinate frame through motion-skill segmentation, geometry-aware transformation, and inverse kinematics filtering; and 2) EgoViewTransfer, a conditional video generation model that fuses a novel-viewpoint reprojected scene video with a robot motion video rendered from the transferred trajectory to synthesize photorealistic observations, trained with a self-supervised double reprojection strategy that requires no multi-viewpoint data. Experiments in simulation and real-world settings show that EgoDemoGen consistently improves policy success rates under both standard and novel egocentric viewpoints, with absolute gains of +24.6% and +16.9% in simulation and +16.0% and +23.0% on the real robot, respectively. Moreover, EgoViewTransfer achieves superior video generation quality for novel egocentric observations.

Method

Overview

Overview of EgoDemoGen. Given source demonstrations from a standard egocentric viewpoint, we generate novel demonstrations through four steps: (1) sampling novel egocentric viewpoints via robot base motion \((\Delta x, \Delta y, \Delta \theta)\); (2) EgoTrajTransfer produces kinematically feasible action trajectories \(\tilde{Q}\) adapted to the novel egocentric coordinate frame, filtering out infeasible viewpoints via inverse kinematics; (3) EgoViewTransfer synthesizes photorealistic observation videos \(\tilde{V}\) from the novel egocentric viewpoint, depicting the transferred robot motion; (4) the generated demonstrations are combined with original data to train policies that generalize across egocentric viewpoints.
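Step (1) can be made concrete: because the egocentric camera is rigidly mounted to the robot, a sampled base motion \((\Delta x, \Delta y, \Delta \theta)\) lifts to a rigid transform that moves the camera pose directly. The sketch below is our own illustration, not the paper's code; the function names and the z-up world-frame convention are assumptions.

```python
import numpy as np

def base_motion_to_se3(dx, dy, dtheta):
    """Lift a planar base motion (dx, dy, dtheta) to a 4x4 rigid transform.

    Rotation is about the vertical z-axis; translation stays in the ground
    plane (assumed z-up world frame).
    """
    c, s = np.cos(dtheta), np.sin(dtheta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[0, 3], T[1, 3] = dx, dy
    return T

def novel_camera_pose(T_world_cam, dx, dy, dtheta):
    """Camera-to-world pose after the base moves by (dx, dy, dtheta).

    The egocentric camera is rigid with the base, so the base delta is
    simply composed with the original camera pose.
    """
    return base_motion_to_se3(dx, dy, dtheta) @ T_world_cam
```

In this convention, sampling a viewpoint amounts to drawing \((\Delta x, \Delta y, \Delta \theta)\) and composing the resulting transform with the recorded camera pose; viewpoints whose transferred trajectories fail inverse kinematics are then discarded, per step (2).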

EgoTrajTransfer

EgoTrajTransfer pipeline. Top: the source trajectory is segmented into motion segments (free-space movement) and skill segments (contact-rich manipulation) according to gripper states. Bottom: the transferred trajectory applies position scaling and orientation interpolation to motion segments and a rigid transformation to skill segments.
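The two segment types above call for different treatment: skill-segment end-effector poses must stay fixed in the world (contact geometry cannot change), so only their expression in the novel frame is remapped, while motion segments can be regenerated freely. The snippet below is a minimal sketch of that split; the linear position scaling and endpoint slerp are our assumptions about the interpolation, and the helper names are ours.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def transfer_skill_pose(T_world_ee, T_newbase_world):
    """Skill segments are contact-rich: the end-effector pose is kept fixed
    in the world frame, and only its expression in the novel base frame
    changes (a rigid transformation of the recorded pose)."""
    return T_newbase_world @ T_world_ee

def retime_motion_segment(p_start, p_end, R_start, R_end, n_steps, scale=1.0):
    """Regenerate a free-space motion segment: positions are linearly
    interpolated (with an optional scale factor) and orientations are
    slerped between the segment endpoints."""
    ts = np.linspace(0.0, 1.0, n_steps)
    positions = p_start + scale * np.outer(ts, p_end - p_start)
    key_rots = Rotation.from_quat(np.stack([R_start.as_quat(), R_end.as_quat()]))
    rotations = Slerp([0.0, 1.0], key_rots)(ts)
    return positions, rotations
```

Each regenerated waypoint would then pass through inverse kinematics; viewpoints for which no feasible joint solution exists are filtered out, as the overview describes.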

EgoViewTransfer

EgoViewTransfer pipeline. We synthesize novel-viewpoint observations in three stages. First, scene video preparation: the original video is reprojected to the novel viewpoint, the robot region is masked out, and the result is inpainted to obtain a clean background. Second, robot motion rendering: the robot's motion is rendered from the transferred trajectory using the URDF and camera parameters. Third, conditional video generation: the two videos are fused by a DiT-based diffusion model with dual-video conditioning.
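The geometric core of the first stage is a standard depth-based reprojection: back-project each source pixel with its depth, apply the relative camera transform, and re-project with the pinhole model. The sketch below shows only this single-view warping step under assumed shared intrinsics; the paper's self-supervised double reprojection training and the inpainting stage are not reproduced here.

```python
import numpy as np

def reproject_frame(depth, K, T_tgt_src):
    """Map source-view pixels to their coordinates in a novel target view.

    depth:     (H, W) depth map of the source view, in metres.
    K:         (3, 3) pinhole intrinsics (assumed shared by both views).
    T_tgt_src: (4, 4) rigid transform from source to target camera frame.
    Returns an (H, W, 2) array of target-pixel coordinates, which can
    drive a forward warp of the RGB frame.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    # Back-project to 3-D points in the source camera frame.
    pts_src = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    # Move the points into the target camera frame.
    pts_h = np.vstack([pts_src, np.ones((1, pts_src.shape[1]))])
    pts_tgt = (T_tgt_src @ pts_h)[:3]
    # Project with the pinhole model; clip guards against divide-by-zero.
    proj = K @ pts_tgt
    uv = (proj[:2] / np.clip(proj[2:], 1e-6, None)).T.reshape(H, W, 2)
    return uv
```

Pixels that land outside the target frame, or that are occluded after the viewpoint change, leave holes; in the pipeline above those regions are filled by masking and inpainting before the diffusion model fuses the scene and robot videos.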

Results

1. Novel Egocentric Videos Generated by EgoViewTransfer (Simulation).

Each video is organized column-wise: the first column shows the original egocentric observation, the second column shows the ground-truth novel egocentric video, the third and fourth columns show the novel-view scene video and robot video inputs, and the fifth column presents the novel egocentric video synthesized by EgoViewTransfer.

Adjust Bottle
Beat
Handover
Lift
Open
Place
Put

2. Novel Egocentric Videos Generated by EgoViewTransfer (Real World).

Each video is organized column-wise: the first column shows the original egocentric observation, the second and third columns show the novel-view scene video and robot video inputs, and the fourth column presents the novel egocentric video synthesized by EgoViewTransfer.

Box Microwave
Close Microwave
Pick Bowl
Place Bowl on Basket
Place Bowl on Plate

3. Visualization of Policy Execution in the Simulation.

Each row shows policy execution for one task across three viewpoints: the standard egocentric view and two random novel egocentric views.

Adjust Bottle

Standard View

Novel View 1

Novel View 2

Beat

Standard View

Novel View 1

Novel View 2

Handover

Standard View

Novel View 1

Novel View 2

Lift

Standard View

Novel View 1

Novel View 2

Open

Standard View

Novel View 1

Novel View 2

Place

Standard View

Novel View 1

Novel View 2

Put

Standard View

Novel View 1

Novel View 2


4. Visualization of Policy Execution in the Real World.

Each row shows policy execution for one task across three viewpoints: the standard egocentric view, a counterclockwise novel view, and a clockwise novel view.

Pick Bowl

Standard View

Novel View 1

Novel View 2

Place Bowl on Basket

Standard View

Novel View 1

Novel View 2

Place Bowl on Plate

Standard View

Novel View 1

Novel View 2

BibTeX

If you use our work in your research, please cite:

@article{xu2025egodemogen,
  title={EgoDemoGen: Novel Egocentric Demonstration Generation Enables Viewpoint-Robust Manipulation},
  author={Xu, Yuan and Yang, Jiabing and Wang, Xiaofeng and Chen, Yixiang and Zhu, Zheng and Fang, Bowen and Huang, Guan and Chen, Xinze and Ye, Yun and Zhang, Qiang and Li, Peiyan and Wu, Xiangnan and Wang, Kai and Zhan, Bing and Lu, Shuo and Liu, Jing and Liu, Nianfeng and Huang, Yan and Wang, Liang},
  journal={arXiv preprint arXiv:2509.22578},
  year={2025}
}