MOSE: Complex Video Object Segmentation Dataset

1Fudan University     2ByteDance     3SUFE     4Nanyang Technological University     5University of Oxford    

Figure 1. Examples of video clips from the coMplex video Object SEgmentation (MOSEv2) dataset. The selected target objects are masked in orange ▇. The most notable features of MOSEv2 include challenges inherited from MOSEv1, such as disappearance and reappearance of objects (①-⑩), small/inconspicuous objects (①,③,⑥), heavy occlusions, and crowded scenarios (①,②), as well as newly introduced complexities, including adverse weather conditions (⑥), low-light environments (⑤-⑦), multi-shot sequences (⑧), camouflaged objects (⑤), non-physical objects such as shadows (④), and knowledge dependency (⑨,⑩). The goal of the MOSEv2 dataset is to provide a platform that promotes the development of more comprehensive and robust video object segmentation algorithms.


News


Abstract

Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F ) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), scenarios requiring external knowledge, etc. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities.

MOSEv2: A More Challenging Dataset


MOSEv1 Visualization


Dataset Statistics

TABLE 1. Statistical comparison between MOSEv2 and existing video object segmentation and tracking datasets.
“Annotations”: the number of annotated masks or boxes. “Duration”: the total duration of annotated videos, in minutes unless otherwise noted. “Disapp. Rate”: the fraction of objects that disappear in at least one frame; “Reapp. Rate”: the fraction of objects that disappear and later reappear. “Distractors” quantifies scene crowding as the average number of visually similar objects per target in the first frame. *SA-V uses a combination of manual and automatic annotations.
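
The disappearance and reappearance statistics above can be reproduced from per-frame annotation masks. Below is a minimal sketch, assuming MOSE/DAVIS-style palette PNG masks where pixel value 0 is background and each object keeps a fixed integer ID; the function names and directory layout are ours, not part of the official toolkit.

import os
import numpy as np
from PIL import Image

def object_presence(ann_dir):
    # Return {obj_id: [visible per frame]} for one video whose mask PNGs live in ann_dir.
    frames = sorted(f for f in os.listdir(ann_dir) if f.endswith(".png"))
    presence = {}
    for t, name in enumerate(frames):
        mask = np.array(Image.open(os.path.join(ann_dir, name)))
        for obj_id in np.unique(mask):
            if obj_id == 0:
                continue  # background
            presence.setdefault(int(obj_id), [False] * len(frames))[t] = True
    return presence

def disapp_reapp_flags(presence):
    # Per-object flags: disappeared in at least one frame after first appearing,
    # and reappeared after such a disappearance.
    disapp, reapp = {}, {}
    for obj_id, seen in presence.items():
        first = seen.index(True)
        gone = False
        disapp[obj_id] = reapp[obj_id] = False
        for visible in seen[first:]:
            if not visible:
                gone = True
                disapp[obj_id] = True
            elif gone:
                reapp[obj_id] = True
    return disapp, reapp

The dataset-level “Disapp. Rate” and “Reapp. Rate” would then be the fraction of objects whose respective flag is True, aggregated over all videos.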

Experiments

We benchmark state-of-the-art methods to the best of our knowledge; please see the Dataset Report for details. If your method performs better, please feel free to contact us for benchmark evaluation, and we will update the results.

TABLE 2. Benchmark results of semi-supervised (one-shot) VOS on MOSEv2.

Downloads



The dataset is available on Hugging Face, OneDrive, Google Drive, and Baidu WangPan; please refer to MOSE-api for more details. For MOSEv2, please register on Codabench to access the download link.
🚀 Download the dataset using the gdown command:
📦 train.tar.gz 20.5 GB
  gdown https://drive.google.com/uc\?id\=ID_removed_to_avoid_overaccesses_get_it_by_yourself
📦 valid.tar.gz 3.61 GB
  gdown https://drive.google.com/uc\?id\=ID_removed_to_avoid_overaccesses_get_it_by_yourself
Tips: gdown may be temporarily throttled by Google Drive due to excessive downloads; you may wait 24 hours or download from the Google Drive page with a Google account. Please feel free to open an issue on MOSE-api.
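
For scripted downloads, gdown also provides a Python API; the snippet below is a minimal sketch, and the file ID placeholder must be replaced with the ID you obtain yourself (e.g., after registration). The output file name matches the archives listed above.

import tarfile
import gdown

# Placeholder only: substitute the real Google Drive file ID you obtained yourself.
FILE_ID = "REPLACE_WITH_YOUR_FILE_ID"

# Download train.tar.gz from Google Drive by file ID (may be throttled, see the tip above).
archive = gdown.download(id=FILE_ID, output="train.tar.gz", quiet=False)

# Unpack the archive into the current directory.
with tarfile.open(archive) as tar:
    tar.extractall(path=".")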

MOSE Evaluation



    ● Following DAVIS, we use Region Jaccard J, Boundary F measure F, and their mean J&F as the evaluation metrics (a simplified sketch of these metrics is given after this list).
    ● For MOSEv2, a modified Boundary F measure (Ḟ) is used; J&Ḟd and J&Ḟr are employed to evaluate the results on disappearance and reappearance clips, respectively.
    ● For the validation sets, the first-frame annotations are released to indicate which objects are considered in the evaluation.
    ● The validation set online evaluation server is [here for MOSEv1] / [here for MOSEv2] for daily evaluation.
    ● The test set online evaluation server will be open during the competition period only.
    To ensure fair comparison among participants and avoid leaderboard overfitting through repeated trial-and-error, the test set is only available during official competition periods. Please note that for each competition, the released testing videos are randomly sampled from the test set, and will not remain the same across different competitions. This further ensures fairness and prevents overfitting to a fixed set.
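
For reference, the per-frame Region Jaccard J is the intersection-over-union between the predicted and ground-truth masks, and the Boundary F measure is an F-score between the two mask boundaries. The sketch below is only illustrative: it uses a simple dilation-based boundary matching, whereas the official scores come from the evaluation server and its DAVIS-style toolkit (including the modified Ḟ for MOSEv2).

import numpy as np
from scipy import ndimage

def region_jaccard(pred, gt):
    # J: IoU between binary masks; both empty counts as a perfect match.
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def _boundary(mask):
    # One-pixel-wide boundary of a binary mask via erosion.
    mask = mask.astype(bool)
    return mask & ~ndimage.binary_erosion(mask)

def boundary_f(pred, gt, tol=2):
    # F: F-score between mask boundaries with a small pixel tolerance (approximate matching).
    pred_b, gt_b = _boundary(pred), _boundary(gt)
    if pred_b.sum() == 0 and gt_b.sum() == 0:
        return 1.0
    struct = ndimage.generate_binary_structure(2, 2)
    gt_dil = ndimage.binary_dilation(gt_b, structure=struct, iterations=tol)
    pred_dil = ndimage.binary_dilation(pred_b, structure=struct, iterations=tol)
    precision = (pred_b & gt_dil).sum() / max(pred_b.sum(), 1)
    recall = (gt_b & pred_dil).sum() / max(gt_b.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

J&F is then the mean of J and F, averaged over frames and objects.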

BibTeX

Please consider citing MOSE if it helps your research.
    @article{MOSEv2,
      title={{MOSEv2}: A More Challenging Dataset for Video Object Segmentation in Complex Scenes},
      author={Ding, Henghui and Ying, Kaining and Liu, Chang and He, Shuting and Jiang, Xudong and Jiang, Yu-Gang and Torr, Philip HS and Bai, Song},
      journal={arXiv preprint arXiv:2508.05630},
      year={2025}
    }
    
    @inproceedings{MOSE,
      title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
      author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
      booktitle={ICCV},
      year={2023}
    }
    

License

MOSE is licensed under a Creative Commons CC BY-NC-SA 4.0 License. The data of MOSE is released for non-commercial research purposes only.