Introduction

Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, which limits how well the methods generalize to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions.


Figure: Examples of video clips from the coMplex video Object SEgmentation (MOSEv2) dataset. The selected target objects are masked in orange ▇. The most notable features of MOSEv2 include challenges inherited from MOSEv1, such as the disappearance and reappearance of objects (①-⑩), small and inconspicuous objects (①,③,⑥), heavy occlusions, and crowded scenes (①,②), as well as newly introduced complexities, including adverse weather (⑥), low-light environments (⑤-⑦), multi-shot videos (⑧), camouflaged objects (⑤), non-physical objects such as shadows (④), and knowledge dependency (⑨,⑩). The goal of the MOSEv2 dataset is to provide a platform that promotes the development of more comprehensive and robust video object segmentation algorithms.

Statistics

Categories including novel and rare objects

Videos of dense scenarios

Objects with complex movement patterns

High-quality mask annotations

Table 1. Statistical comparison between MOSEv2 and existing video object segmentation and tracking datasets.


• “Annotations”: number of annotated masks or boxes.
• “Duration”: the total duration of the annotated videos, in minutes unless noted otherwise.
• “Disapp. Rate”: the fraction of objects that disappear in at least one frame.
• “Reapp. Rate”: the fraction of objects that previously disappeared and later reappear (a small sketch of how these two rates can be computed from per-frame masks follows these notes).
• “Distractors”: quantifies scene crowding as the average number of visually similar objects per target in the first frame.
*SA-V uses a combination of manual and automatic annotations.
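
A small, hedged sketch of how the disappearance and reappearance flags for a single object could be computed from its per-frame binary masks is given below (Python/NumPy); the dataset-level “Disapp. Rate” and “Reapp. Rate” are then the fractions of objects for which each flag is true. The function name and protocol are illustrative, not the official MOSEv2 statistics code.

import numpy as np

def disappearance_flags(masks):
    """masks: list of per-frame binary masks (H x W arrays) for one object.
    Returns (disappears, reappears). Illustrative only; the official MOSEv2
    protocol (e.g., handling of partial occlusion) may differ."""
    visible = np.array([m.any() for m in masks])   # per-frame visibility
    if not visible.any():                          # object never present
        return False, False
    first = int(np.argmax(visible))                # first frame it appears
    disappears = not visible[first:].all()         # absent in some later frame
    reappears = False
    if disappears:
        first_gone = first + int(np.argmax(~visible[first:]))
        reappears = visible[first_gone:].any()     # becomes visible again later
    return disappears, reappears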

Demo

Heavy occlusion
Frequent disapp. - reapp.
Complex movements
Small & crowded objects
Non-physical objects
Low-light environment
Camouflaged objects
Adverse weather
Multi-shot
Knowledge dependency
Long disappearance
Object deformation

Tasks & Evaluation

We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities.

Table 2. Benchmark results of semi-supervised (one-shot) VOS on MOSEv2.

Dataset

Dataset Download

The dataset is available for non-commercial research purposes only. Please use the following links to download.

Evaluation

Please submit your results on the Val set here:

We strongly encourage you to evaluate with the MOSEv2 dataset. MOSEv1 is for legacy support only and may be deprecated in the future.

Data

MOSEv2 is a comprehensive video object segmentation dataset designed to advance VOS methods under real-world conditions. It consists of 5,024 videos and 701,976 high-quality masks for 10,074 objects across 200 categories.

  • Following DAVIS, we use the Region Jaccard J, the Boundary F measure F, and their mean J&F as the evaluation metrics (a minimal sketch of these measures is given after this list).
  • For MOSEv2, a modified Boundary F measure (Ḟ) is used; J&Ḟd and J&Ḟr are employed to evaluate the results on disappearance and reappearance clips, respectively.
  • For the validation sets, the first-frame annotations are released to indicate the objects that are considered in evaluation.
  • The test set online evaluation server will be open during the competition period only.
    To ensure a fair comparison among participants and to avoid leaderboard overfitting through repeated trial and error, the test set is available only during official competition periods. Note that for each competition, the released test videos are randomly sampled from the full test set and will differ across competitions; this further ensures fairness and prevents overfitting to a fixed set.
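
For reference, a minimal sketch of the standard per-frame region and boundary measures is given below, assuming NumPy/SciPy binary masks. It approximates the DAVIS-style J and F only; reported MOSEv2 numbers should come from the official evaluation toolkit, which additionally implements the modified Ḟ and the J&Ḟd / J&Ḟr splits.

import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def region_jaccard(pred, gt):
    """Region similarity J: IoU of two binary masks (both empty counts as 1)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = (pred | gt).sum()
    return 1.0 if union == 0 else (pred & gt).sum() / union

def boundary_f(pred, gt, tol=2):
    """Boundary F measure: precision/recall of boundary pixels matched within
    `tol` pixels. A simplified approximation of the DAVIS F metric."""
    def contour(m):
        m = m.astype(bool)
        return m & ~binary_erosion(m)              # mask pixels touching background
    pb, gb = contour(pred), contour(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    if pb.sum() == 0 or gb.sum() == 0:
        return 0.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), bool)
    precision = (pb & binary_dilation(gb, struct)).sum() / pb.sum()
    recall = (gb & binary_dilation(pb, struct)).sum() / gb.sum()
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

J&F for a frame is the mean of the two measures; per-sequence scores average over frames and objects.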

Data Structure

 train/valid.tar.gz
│
├── Annotations
│ ├── video_name_1
│ │ ├── 00000.png
│ │ ├── 00001.png
│ │ └── ...
│ └── video_name_...
│   └── ...
│ 
└── JPEGImages
  ├── video_name_1
  │ ├── 00000.jpg
  │ ├── 00001.jpg
  │ └── ...
  └── video_name_...
    └── ...
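
A minimal loading sketch for this layout is shown below (Python/PIL). It assumes the DAVIS-style convention that each annotation PNG is a palette image whose pixel values are object IDs (0 for background) and that mask files share the frame file names; the root path is illustrative, and for validation videos only the first-frame annotation is provided, as noted above.

import os
import numpy as np
from PIL import Image

root = "MOSEv2/train"                                   # illustrative extraction path
for video in sorted(os.listdir(os.path.join(root, "JPEGImages"))):
    frame_dir = os.path.join(root, "JPEGImages", video)
    mask_dir = os.path.join(root, "Annotations", video)
    for name in sorted(os.listdir(frame_dir)):
        frame = np.array(Image.open(os.path.join(frame_dir, name)).convert("RGB"))
        mask_path = os.path.join(mask_dir, os.path.splitext(name)[0] + ".png")
        if os.path.exists(mask_path):                   # val split ships only first-frame masks
            mask = np.array(Image.open(mask_path))      # pixel value = object ID, 0 = background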

People

Henghui Ding

Fudan University

Kaining Ying

Fudan University

Chang Liu

SUFE

Shuting He

SUFE

Yu-Gang Jiang

Fudan University

Philip H.S. Torr

University of Oxford

Song Bai

Bytedance

Citation

Please consider citing MOSE if it helps your research.

@article{MOSEv2,
  title={{MOSEv2}: A More Challenging Dataset for Video Object Segmentation in Complex Scenes},
  author={Ding, Henghui and Ying, Kaining and Liu, Chang and He, Shuting and Jiang, Xudong and Jiang, Yu-Gang and Torr, Philip HS and Bai, Song},
  journal={arXiv preprint arXiv:2508.05630},
  year={2025}
}
@inproceedings{MOSE,
  title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
  author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
  booktitle={ICCV},
  year={2023}
}

Our related works on video segmentation:

@article{MeViSv2,
  title={MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation},
  author={Ding, Henghui and Liu, Chang and He, Shuting and Ying, Kaining and Jiang, Xudong and Loy, Chen Change and Jiang, Yu-Gang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025},
  publisher={IEEE}
}
@inproceedings{MeViS,
  title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
  author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
  booktitle={ICCV},
  year={2023}
}
@inproceedings{GRES,
  title={{GRES}: Generalized Referring Expression Segmentation},
  author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
  booktitle={CVPR},
  year={2023}
}
@article{VLT,
  title={{VLT}: Vision-language transformer and query generation for referring segmentation},
  author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2023},
  publisher={IEEE}
}