ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives

ICCV 2025 (Highlight)


Yuqian Fu1,
Runze Wang2,
Bin Ren1,3,4,
Guolei Sun5,
Biao Gong6,

Yanwei Fu2,
Danda Pani Paudel1,
Xuanjing Huang2,
Luc Van Gool1

1INSAIT, 2Fudan University, 3University of Trento, 4University of Pisa, 5ETH Zurich, 6Ant Group


We tackle the task of Ego-Exo Object Correspondence, recently proposed in Ego-Exo4D. Given object queries from one perspective (e.g., the ego view), the task is to predict the corresponding object masks in the other perspective (e.g., the exo view). Solving this task unlocks new possibilities in VR and robotics, e.g., enabling virtual agents or robots to execute ego-view actions by learning from exo-view demonstrations.
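For concreteness, here is a minimal, hypothetical sketch of the task interface (the function name and the model's call signature are our own illustration, not part of Ego-Exo4D or our released code): a query mask in the source view goes in, and the mask of the same object in the target view comes out.

  import numpy as np

  def ego_exo_object_correspondence(model, ego_image: np.ndarray,
                                    ego_query_mask: np.ndarray,
                                    exo_image: np.ndarray) -> np.ndarray:
      """Hypothetical interface for Ego-Exo Object Correspondence (Ego2Exo direction).

      Given a binary query mask of an object in the ego frame, predict the binary
      mask of the same object in the time-synchronized exo frame. The Exo2Ego
      direction simply swaps the roles of the two views.
      """
      # `model` stands for any cross-view segmentor; this call signature is assumed.
      exo_mask = model.predict(
          query_image=ego_image,      # (H_e, W_e, 3) ego frame
          query_mask=ego_query_mask,  # (H_e, W_e) binary mask of the queried object
          target_image=exo_image,     # (H_x, W_x, 3) exo frame
      )
      return exo_mask                 # (H_x, W_x) binary mask, empty if the object is not visible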

Video Demos on Ego-Exo4D

Brief Summary & Main Contributions

Despite the importance of this task, most existing segmentation models (e.g., Mask2Former, SAM, LISA) operate on single-view inputs, making them ill-suited for this cross-view setting. To address this, we:

  • Toward Ego-Exo Object Correspondence Task: We conduct an early exploration of this challenging task, analyzing its unique difficulties, constructing several baselines, and proposing a new method.
  • ObjectRelator Framework: We introduce ObjectRelator, a cross-view object segmentation method that combines MCFuse and XObjAlign (see the sketch after this list). MCFuse is the first to bring the text modality into this task and improves localization by fusing multimodal cues for the same object(s), while XObjAlign boosts robustness to appearance variations with an object-level consistency constraint.
  • New Testbed and SOTA Results: Alongside Ego-Exo4D, we present HANDAL-X as an additional benchmark. ObjectRelator achieves SOTA results on both datasets.
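To make the MCFuse idea more concrete, below is a minimal PyTorch sketch of gated text-visual condition fusion. It is an illustration under our own assumptions (module name, dimensions, and the cross-attention-plus-gate form), not the exact implementation released with the paper.

  import torch
  import torch.nn as nn

  class MCFuseSketch(nn.Module):
      """Minimal sketch of multimodal condition fusion (assumed design).

      Fuses a visual (mask-prompt) condition embedding with a text condition
      embedding of the same object via cross-attention plus a learnable gate.
      The actual MCFuse module in the paper may differ in detail.
      """
      def __init__(self, dim: int = 256, num_heads: int = 8):
          super().__init__()
          self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
          self.gate = nn.Parameter(torch.zeros(1))  # starts from the visual-only condition

      def forward(self, visual_cond: torch.Tensor, text_cond: torch.Tensor) -> torch.Tensor:
          # visual_cond: (B, Nq, C) object query embeddings from the mask prompt
          # text_cond:   (B, Nt, C) token embeddings of the object's text description
          attn_out, _ = self.cross_attn(query=visual_cond, key=text_cond, value=text_cond)
          # Gated residual fusion: text cues refine, but never replace, the visual condition.
          return visual_cond + torch.sigmoid(self.gate) * attn_out

  # Usage (shapes are illustrative):
  # fuse = MCFuseSketch(dim=256)
  # fused = fuse(torch.randn(2, 20, 256), torch.randn(2, 16, 256))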

Framework Overview

Ego2Exo is used as an example in the framework figure. Our method builds on the PSALM baseline (pink blocks) and tailors it for Ego-Exo Object Correspondence with two novel modules: Multimodal Condition Fusion (MCFuse) and Cross-View Object Alignment (XObjAlign); a rough sketch of the XObjAlign idea follows the figure. For more details, please refer to our paper.

Framework Figure
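As a hedged sketch of the XObjAlign consistency constraint (the pooling and the exact loss form below are our own assumptions, not necessarily the paper's formulation), paired ego/exo features of the same object instance can be pulled together as follows:

  import torch
  import torch.nn.functional as F

  def xobjalign_loss_sketch(ego_obj_feat: torch.Tensor, exo_obj_feat: torch.Tensor) -> torch.Tensor:
      """Sketch of an object-level cross-view consistency loss (assumed form).

      ego_obj_feat, exo_obj_feat: (B, C) pooled embeddings of the same object
      instance observed from the ego and exo views. Pulling them together
      encourages representations that stay stable under the large ego/exo
      appearance shift; the paper's actual XObjAlign loss may differ.
      """
      ego = F.normalize(ego_obj_feat, dim=-1)
      exo = F.normalize(exo_obj_feat, dim=-1)
      return (1.0 - (ego * exo).sum(dim=-1)).mean()  # cosine-distance alignment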

Main Results on Ego-Exo4D

We highlight that: 1) results are reported on the Val set, since ground truth for the test set is not available; 2) we construct a "Small TrainSet" (1/3 of the data) and a "Full TrainSet", and release both splits for the community, which is especially helpful for groups with limited GPU/storage resources; 3) our method clearly outperforms the baselines and competitors.

Results Figure

Visualization Results

Visualization results show that: 1) MCFuse enhances object localization by using text as an extra prompt; 2) XObjAlign improves the model's robustness under large view shifts.

Visualization Figure

More: We also adapt HANDAL-X, a benchmark featuring robot-friendly objects, as an additional testbed for cross-view object segmentation. For detailed results and more visualizations, please refer to our paper.

Video Demos on Our Adapted HANDAL-X and Human2Robot

Citations

Please consider citing us if you find our data, code, or models useful.

Also, feel free to reach out with questions, or if you are interested in working on this topic together. Thanks! :)

  @inproceedings{fu2024objectrelator,
      title={ObjectRelator: Enabling cross-view object relation understanding in ego-centric and exo-centric videos},
      author={Fu, Yuqian and Wang, Runze and Ren, Bin and Sun, Guolei and Gong, Biao and Fu, Yanwei and Paudel, Danda Pani and Huang, Xuanjing and Van Gool, Luc},
      booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
      year={2025}
  }

  @article{fu2025cross,
      title={Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025},
      author={Fu, Yuqian and Wang, Runze and Fu, Yanwei and Paudel, Danda Pani and Van Gool, Luc},
      journal={arXiv preprint arXiv:2506.05856},
      year={2025}
  }