EgoDemoGen: Egocentric Demonstration Generation for Viewpoint Generalization in Robotic Manipulation

UCAS1, CASIA2, GigaAI3, Tsinghua University4, X-Humanoid5, FiveAges6
*Equal Contribution    Corresponding Author

Abstract

Imitation-learning-based visuomotor policies have achieved strong performance in robotic manipulation, yet they often remain sensitive to egocentric viewpoint shifts. Unlike third-person viewpoint changes, which only move the camera, egocentric shifts simultaneously alter both the camera pose and the robot's action coordinate frame, making it necessary to jointly transfer action trajectories and synthesize the corresponding observations under novel egocentric viewpoints. To address this challenge, we present EgoDemoGen, a framework that generates paired observation–action demonstrations under novel egocentric viewpoints through two key components: 1) EgoTrajTransfer, which transfers robot trajectories to the novel egocentric coordinate frame through motion-skill segmentation, geometry-aware transformation, and inverse kinematics filtering; and 2) EgoViewTransfer, a conditional video generation model that fuses a novel-viewpoint reprojected scene video with a robot motion video rendered from the transferred trajectory to synthesize photorealistic observations, trained with a self-supervised double reprojection strategy that requires no multi-viewpoint data. Experiments in simulation and real-world settings show that EgoDemoGen consistently improves policy success rates under both standard and novel egocentric viewpoints, with absolute gains of +24.6% and +16.9% in simulation and +16.0% and +23.0% on the real robot, respectively. Moreover, EgoViewTransfer achieves superior video generation quality for novel egocentric observations.

Method

Overview

Overview of EgoDemoGen. Given source demonstrations from a standard egocentric viewpoint, we generate novel demonstrations through four steps: (1) sampling novel egocentric viewpoints via robot base motion \((\Delta x, \Delta y, \Delta \theta)\); (2) EgoTrajTransfer produces kinematically feasible action trajectories \(\tilde{Q}\) adapted to the novel egocentric coordinate frame, filtering out infeasible viewpoints via inverse kinematics; (3) EgoViewTransfer synthesizes photorealistic observation videos \(\tilde{V}\) from the novel egocentric viewpoint, depicting the transferred robot motion; (4) the generated demonstrations are combined with original data to train policies that generalize across egocentric viewpoints.
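Step (1) can be made concrete: because the egocentric camera is rigidly mounted to the robot, a sampled base motion \((\Delta x, \Delta y, \Delta \theta)\) lifts to a rigid transform that moves the camera pose directly. The sketch below is our own illustration, not the paper's code; the function names and the z-up world-frame convention are assumptions.

```python
import numpy as np

def base_motion_to_se3(dx, dy, dtheta):
    """Lift a planar base motion (dx, dy, dtheta) to a 4x4 rigid transform.

    Rotation is about the vertical z-axis; translation stays in the ground
    plane (assumed z-up world frame).
    """
    c, s = np.cos(dtheta), np.sin(dtheta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[0, 3], T[1, 3] = dx, dy
    return T

def novel_camera_pose(T_world_cam, dx, dy, dtheta):
    """Camera-to-world pose after the base moves by (dx, dy, dtheta).

    The egocentric camera is rigid with the base, so the base delta is
    simply composed with the original camera pose.
    """
    return base_motion_to_se3(dx, dy, dtheta) @ T_world_cam
```

In this convention, sampling a viewpoint amounts to drawing \((\Delta x, \Delta y, \Delta \theta)\) and composing the resulting transform with the recorded camera pose; viewpoints whose transferred trajectories fail inverse kinematics are then discarded, per step (2).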

EgoTrajTransfer

EgoTrajTransfer pipeline. Top: the source trajectory is segmented into motion segments (free-space movement) and skill segments (contact-rich manipulation) according to gripper states. Bottom: the transferred trajectory applies position scaling and orientation interpolation to motion segments and a rigid transformation to skill segments.
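The two segment types above call for different treatment: skill-segment end-effector poses must stay fixed in the world (contact geometry cannot change), so only their expression in the novel frame is remapped, while motion segments can be regenerated freely. The snippet below is a minimal sketch of that split; the linear position scaling and endpoint slerp are our assumptions about the interpolation, and the helper names are ours.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def transfer_skill_pose(T_world_ee, T_newbase_world):
    """Skill segments are contact-rich: the end-effector pose is kept fixed
    in the world frame, and only its expression in the novel base frame
    changes (a rigid transformation of the recorded pose)."""
    return T_newbase_world @ T_world_ee

def retime_motion_segment(p_start, p_end, R_start, R_end, n_steps, scale=1.0):
    """Regenerate a free-space motion segment: positions are linearly
    interpolated (with an optional scale factor) and orientations are
    slerped between the segment endpoints."""
    ts = np.linspace(0.0, 1.0, n_steps)
    positions = p_start + scale * np.outer(ts, p_end - p_start)
    key_rots = Rotation.from_quat(np.stack([R_start.as_quat(), R_end.as_quat()]))
    rotations = Slerp([0.0, 1.0], key_rots)(ts)
    return positions, rotations
```

Each regenerated waypoint would then pass through inverse kinematics; viewpoints for which no feasible joint solution exists are filtered out, as the overview describes.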

EgoViewTransfer

EgoViewTransfer pipeline. We synthesize novel-viewpoint observations in three stages. First, scene video preparation: the original video is reprojected to the novel viewpoint, the robot region is masked out, and the result is inpainted to obtain a clean background. Second, robot motion rendering: the robot's motion is rendered from the transferred trajectory using the URDF and camera parameters. Third, conditional video generation: the two videos are fused by a DiT-based diffusion model with dual-video conditioning.
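The geometric core of the first stage is a standard depth-based reprojection: back-project each source pixel with its depth, apply the relative camera transform, and re-project with the pinhole model. The sketch below shows only this single-view warping step under assumed shared intrinsics; the paper's self-supervised double reprojection training and the inpainting stage are not reproduced here.

```python
import numpy as np

def reproject_frame(depth, K, T_tgt_src):
    """Map source-view pixels to their coordinates in a novel target view.

    depth:     (H, W) depth map of the source view, in metres.
    K:         (3, 3) pinhole intrinsics (assumed shared by both views).
    T_tgt_src: (4, 4) rigid transform from source to target camera frame.
    Returns an (H, W, 2) array of target-pixel coordinates, which can
    drive a forward warp of the RGB frame.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    # Back-project to 3-D points in the source camera frame.
    pts_src = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    # Move the points into the target camera frame.
    pts_h = np.vstack([pts_src, np.ones((1, pts_src.shape[1]))])
    pts_tgt = (T_tgt_src @ pts_h)[:3]
    # Project with the pinhole model; clip guards against divide-by-zero.
    proj = K @ pts_tgt
    uv = (proj[:2] / np.clip(proj[2:], 1e-6, None)).T.reshape(H, W, 2)
    return uv
```

Pixels that land outside the target frame, or that are occluded after the viewpoint change, leave holes; in the pipeline above those regions are filled by masking and inpainting before the diffusion model fuses the scene and robot videos.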

Results

1. Novel Egocentric Videos Generated by EgoViewTransfer (Simulation).

Each video is organized column-wise: the first column shows the original egocentric observation, the second column shows the ground-truth novel egocentric video, the third and fourth columns show the novel-view scene video and robot video inputs, and the fifth column presents the novel egocentric video synthesized by EgoViewTransfer.

Adjust Bottle
Beat
Handover
Lift
Open
Place
Put

2. Novel Egocentric Videos Generated by EgoViewTransfer (Real World).

Each video is organized column-wise: the first column shows the original egocentric observation, the second and third columns show the novel-view scene video and robot video inputs, and the fourth column presents the novel egocentric video synthesized by EgoViewTransfer.

Box Microwave
Close Microwave
Pick Bowl
Place Bowl on Basket
Place Bowl on Plate

3. Visualization of Policy Execution in the Simulation.

Each row shows policy execution for one task across three viewpoints: the standard egocentric view and two random novel egocentric views.

Adjust Bottle

Standard View

Novel View 1

Novel View 2

Beat

Standard View

Novel View 1

Novel View 2

Handover

Standard View

Novel View 1

Novel View 2

Lift

Standard View

Novel View 1

Novel View 2

Open

Standard View

Novel View 1

Novel View 2

Place

Standard View

Novel View 1

Novel View 2

Put

Standard View

Novel View 1

Novel View 2


4. Visualization of Policy Execution in the Real World.

Each row shows policy execution for one task across three viewpoints: the standard egocentric view, a counterclockwise novel view, and a clockwise novel view.

Pick Bowl

Standard View

Novel View 1

Novel View 2

Place Bowl on Basket

Standard View

Novel View 1

Novel View 2

Place Bowl on Plate

Standard View

Novel View 1

Novel View 2

BibTeX

If you use our work in your research, please cite:

@article{xu2025egodemogen,
  title={EgoDemoGen: Novel Egocentric Demonstration Generation Enables Viewpoint-Robust Manipulation},
  author={Xu, Yuan and Yang, Jiabing and Wang, Xiaofeng and Chen, Yixiang and Zhu, Zheng and Fang, Bowen and Huang, Guan and Chen, Xinze and Ye, Yun and Zhang, Qiang and Li, Peiyan and Wu, Xiangnan and Wang, Kai and Zhan, Bing and Lu, Shuo and Liu, Jing and Liu, Nianfeng and Huang, Yan and Wang, Liang},
  journal={arXiv preprint arXiv:2509.22578},
  year={2025}
}