Imitation-learning-based policies perform well in robotic manipulation, but they often degrade under egocentric viewpoint shifts when trained from a single egocentric viewpoint. To address this issue, we present EgoDemoGen, a framework that generates paired novel egocentric demonstrations by retargeting actions into the novel egocentric frame and synthesizing the corresponding egocentric observation videos with the proposed generative video repair model EgoViewTransfer, which is conditioned on a novel-viewpoint reprojected scene video and a robot-only video rendered from the retargeted joint actions. EgoViewTransfer is finetuned from a pretrained video generation model using a self-supervised double-reprojection strategy. We evaluate EgoDemoGen both in simulation (RoboTwin2.0) and on a real-world robot. After training with a mixture of EgoDemoGen-generated novel egocentric demonstrations and original standard egocentric demonstrations, policy success rate improves absolutely by +17.0% for the standard egocentric viewpoint and by +17.7% for novel egocentric viewpoints in simulation. On the real-world robot, the absolute improvements are +18.3% and +25.8%, respectively. Moreover, performance continues to improve as the proportion of EgoDemoGen-generated demonstrations increases, albeit with diminishing returns. These results demonstrate that EgoDemoGen provides a practical route to egocentric viewpoint-robust robotic manipulation.
Framework of EgoDemoGen.
Overview of EgoDemoGen. (1) Egocentric View Transform: a novel egocentric view is specified by a robot base motion \((\Delta x,\ \Delta y,\ \Delta \theta)\). (2) Action Retargeting: the original joint actions \(Q\) are retargeted into the novel robot base frame to yield kinematically feasible joint actions \(\tilde{Q}\). (3) Novel Egocentric Observations: starting from the original observation video \(V\), we mask out the robot, reproject the scene to the novel viewpoint, fill holes, and apply EgoViewTransfer to synthesize coherent observations \(\tilde{V}\). (4) Novel Demonstrations & Policy Training: we obtain aligned pairs \((\tilde{V},\ \tilde{Q})\) for training egocentric viewpoint-robust policies.
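As a rough illustration of step (3), the scene reprojection can be viewed as depth-based warping of the egocentric frame under the camera motion induced by the base motion \((\Delta x,\ \Delta y,\ \Delta \theta)\). The sketch below is ours, not the released code: it assumes a depth-aligned egocentric RGB-D frame, known intrinsics K, a camera rigidly mounted on the base, and a particular axis convention; the unfilled pixels it leaves behind are the holes that EgoViewTransfer is later asked to repair.

import numpy as np

def base_motion_to_camera_transform(dx, dy, dtheta):
    # Rigid transform of the camera induced by a planar base motion (dx, dy, dtheta).
    # Axis convention (camera z forward, x right, y down) is an assumption for illustration.
    c, s = np.cos(dtheta), np.sin(dtheta)
    R = np.array([[  c, 0.0,   s],
                  [0.0, 1.0, 0.0],
                  [ -s, 0.0,   c]])          # rotation about the camera's vertical axis
    t = np.array([dy, 0.0, dx])              # base forward/lateral motion mapped to camera axes
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def reproject_to_novel_view(rgb, depth, K, T_new_from_old):
    # Back-project pixels with depth, move them into the novel camera frame,
    # and re-project with the same intrinsics. Pixels with no source stay zero (holes).
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    valid = z > 0
    pix = np.stack([u.reshape(-1), v.reshape(-1), np.ones(H * W)], axis=0)
    pts = np.linalg.inv(K) @ (pix * z)                     # 3 x N points in the old camera
    pts_new = (T_new_from_old @ np.vstack([pts, np.ones((1, H * W))]))[:3]
    proj = K @ pts_new
    z_new = proj[2]
    keep = valid & (z_new > 1e-6)
    u2 = np.round(proj[0][keep] / z_new[keep]).astype(int)
    v2 = np.round(proj[1][keep] / z_new[keep]).astype(int)
    inb = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    out = np.zeros_like(rgb)
    out[v2[inb], u2[inb]] = rgb.reshape(-1, 3)[keep][inb]  # nearest-neighbour scatter, no z-buffering
    return out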
Architecture of EgoViewTransfer.
EgoViewTransfer. (a) Double reprojection: it simulates the artifacts and occlusions caused by a viewpoint change. The double-reprojected video is aligned with the original video, forming input/label pairs for training. (b) Architecture of EgoViewTransfer: the model takes a degraded scene video and a robot-only video as conditions and generates egocentric observation videos consistent with both inputs.
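A minimal sketch of how such self-supervised input/label pairs could be built, reusing base_motion_to_camera_transform from the sketch above; sample_base_motion and warp_rgbd are hypothetical helpers (warp_rgbd is assumed to return both the warped RGB and the warped depth), and this is illustrative rather than the authors' training code.

import numpy as np

def make_double_reprojection_pair(rgb, depth, K, sample_base_motion, warp_rgbd):
    # Warp a frame to a randomly sampled novel egocentric view, then warp it back.
    # The round trip accumulates the holes and disocclusions a real viewpoint change
    # would cause, while the result stays pixel-aligned with the clean original frame,
    # which therefore serves as the label.
    dx, dy, dtheta = sample_base_motion()                     # e.g. uniform ranges over base motions
    T_fwd = base_motion_to_camera_transform(dx, dy, dtheta)   # original view -> novel view
    T_bwd = np.linalg.inv(T_fwd)                              # novel view -> original view

    rgb_novel, depth_novel = warp_rgbd(rgb, depth, K, T_fwd)  # forward warp (degrades the frame)
    rgb_back, _ = warp_rgbd(rgb_novel, depth_novel, K, T_bwd) # warp back to the original view

    return rgb_back, rgb                                      # (degraded network input, clean label)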
Visualization of policy execution in simulation. The green boxes denote the standard egocentric view, and the red boxes denote a random novel egocentric view.
Visualization of policy execution in the real world. The green boxes denote the standard egocentric view, the red boxes denote the counterclockwise egocentric view, and the blue boxes denote the clockwise egocentric view.
Visualization of EgoViewTransfer in simulation. The green boxes denote the ground-truth (GT) video, the red boxes denote the video w/ EgoViewTransfer, and the blue boxes denote the video w/o EgoViewTransfer (naive composition).
Visualization of EgoViewTransfer in the real world. The red boxes denote the video w/ EgoViewTransfer, and the blue boxes denote the video w/o EgoViewTransfer (naive composition).
If you use our work in your research, please cite:
@article{xu2025egodemogen,
title={EgoDemoGen: Novel Egocentric Demonstration Generation Enables Viewpoint-Robust Manipulation},
author={Xu, Yuan and Yang, Jiabing and Wang, Xiaofeng and Chen, Yixiang and Zhu, Zheng and Fang, Bowen and Huang, Guan and Chen, Xinze and Ye, Yun and Zhang, Qiang and Li, Peiyan and Wu, Xiangnan and Wang, Kai and Zhan, Bing and Lu, Shuo and Liu, Jing and Liu, Nianfeng and Huang, Yan and Wang, Liang},
journal={arXiv preprint arXiv:2509.22578},
year={2025}
}