MOSE: Complex Video Object Segmentation Dataset

1Fudan University     2ByteDance     3SUFE     4Nanyang Technological University     5University of Oxford    

Figure 1. Examples of video clips from the coMplex video Object SEgmentation (MOSEv2) dataset. The selected target objects are masked in orange ▇. The most notable features of MOSEv2 include challenges inherited from MOSEv1, such as disappearance and reappearance of objects (①-⑩), small/inconspicuous objects (①,③,⑥), heavy occlusions, and crowded scenarios (①,②), as well as newly introduced complexities, including adverse weather conditions (⑥), low-light environments (⑤-⑦), multi-shot sequences (⑧), camouflaged objects (⑤), non-physical objects such as shadows (④), and knowledge dependency (⑨,⑩). The goal of the MOSEv2 dataset is to provide a platform that promotes the development of more comprehensive and robust video object segmentation algorithms.


News


Abstract

Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F ) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), scenarios requiring external knowledge, etc. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities.

MOSEv2: A More Challenging Dataset


MOSEv1 Visualization


Dataset Statistics

TABLE 1. Statistical comparison between MOSEv2 and existing video object segmentation and tracking datasets.
“Annotations”: the number of annotated masks or boxes. “Duration”: the total duration of annotated videos, in minutes unless otherwise noted. “Disapp. Rate”: the fraction of objects that disappear in at least one frame; “Reapp. Rate”: the fraction of objects that disappear and later reappear. “Distractors” quantifies scene crowding as the average number of visually similar objects per target in the first frame. *SA-V uses a combination of manual and automatic annotations.
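
The disappearance and reappearance statistics above can be reproduced from per-frame annotation masks. Below is a minimal sketch, assuming MOSE/DAVIS-style palette PNG masks where pixel value 0 is background and each object keeps a fixed integer ID; the function names and directory layout are ours, not part of the official toolkit.

import os
import numpy as np
from PIL import Image

def object_presence(ann_dir):
    # Return {obj_id: [visible per frame]} for one video whose mask PNGs live in ann_dir.
    frames = sorted(f for f in os.listdir(ann_dir) if f.endswith(".png"))
    presence = {}
    for t, name in enumerate(frames):
        mask = np.array(Image.open(os.path.join(ann_dir, name)))
        for obj_id in np.unique(mask):
            if obj_id == 0:
                continue  # background
            presence.setdefault(int(obj_id), [False] * len(frames))[t] = True
    return presence

def disapp_reapp_flags(presence):
    # Per-object flags: disappeared in at least one frame after first appearing,
    # and reappeared after such a disappearance.
    disapp, reapp = {}, {}
    for obj_id, seen in presence.items():
        first = seen.index(True)
        gone = False
        disapp[obj_id] = reapp[obj_id] = False
        for visible in seen[first:]:
            if not visible:
                gone = True
                disapp[obj_id] = True
            elif gone:
                reapp[obj_id] = True
    return disapp, reapp

The dataset-level “Disapp. Rate” and “Reapp. Rate” would then be the fraction of objects whose respective flag is True, aggregated over all videos.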

Experiments

We benchmark state-of-the-art methods to the best of our knowledge; please see the Dataset Report for details. If your method performs better, please feel free to contact us for benchmark evaluation, and we will update the results.

TABLE 2. Benchmark results of semi-supervised (one-shot) VOS on MOSEv2.

Downloads



The dataset is available on Hugging Face, OneDrive, Google Drive, and Baidu WangPan; please refer to MOSE-api for more details. For MOSEv2, please register on Codabench to access the download link.
🚀 Download the dataset using the gdown command:
📦 train.tar.gz 20.5 GB
  gdown https://drive.google.com/uc\?id\=ID_removed_to_avoid_overaccesses_get_it_by_yourself
📦 valid.tar.gz 3.61 GB
  gdown https://drive.google.com/uc\?id\=ID_removed_to_avoid_overaccesses_get_it_by_yourself
Tips: gdown may be temporarily throttled by Google Drive due to excessive downloads; you may wait 24 hours or download from the Google Drive page with a Google account. Please feel free to open an issue on MOSE-api.
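
For scripted downloads, gdown also provides a Python API; the snippet below is a minimal sketch, and the file ID placeholder must be replaced with the ID you obtain yourself (e.g., after registration). The output file name matches the archives listed above.

import tarfile
import gdown

# Placeholder only: substitute the real Google Drive file ID you obtained yourself.
FILE_ID = "REPLACE_WITH_YOUR_FILE_ID"

# Download train.tar.gz from Google Drive by file ID (may be throttled, see the tip above).
archive = gdown.download(id=FILE_ID, output="train.tar.gz", quiet=False)

# Unpack the archive into the current directory.
with tarfile.open(archive) as tar:
    tar.extractall(path=".")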

MOSE Evaluation



    ● Following DAVIS, we use Region Jaccard J, Boundary F measure F, and their mean J&F as the evaluation metrics (a simplified sketch of these metrics is given after this list).
    ● For MOSEv2, a modified Boundary F measure (Ḟ) is used; J&Ḟd and J&Ḟr are employed to evaluate the results on disappearance and reappearance clips, respectively.
    ● For the validation sets, the first-frame annotations are released to indicate which objects are considered in the evaluation.
    ● The validation set online evaluation server is [here for MOSEv1] / [here for MOSEv2] for daily evaluation.
    ● The test set online evaluation server will be open during the competition period only.
    To ensure fair comparison among participants and avoid leaderboard overfitting through repeated trial-and-error, the test set is only available during official competition periods. Please note that for each competition, the released testing videos are randomly sampled from the test set, and will not remain the same across different competitions. This further ensures fairness and prevents overfitting to a fixed set.
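
For reference, the per-frame Region Jaccard J is the intersection-over-union between the predicted and ground-truth masks, and the Boundary F measure is an F-score between the two mask boundaries. The sketch below is only illustrative: it uses a simple dilation-based boundary matching, whereas the official scores come from the evaluation server and its DAVIS-style toolkit (including the modified Ḟ for MOSEv2).

import numpy as np
from scipy import ndimage

def region_jaccard(pred, gt):
    # J: IoU between binary masks; both empty counts as a perfect match.
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def _boundary(mask):
    # One-pixel-wide boundary of a binary mask via erosion.
    mask = mask.astype(bool)
    return mask & ~ndimage.binary_erosion(mask)

def boundary_f(pred, gt, tol=2):
    # F: F-score between mask boundaries with a small pixel tolerance (approximate matching).
    pred_b, gt_b = _boundary(pred), _boundary(gt)
    if pred_b.sum() == 0 and gt_b.sum() == 0:
        return 1.0
    struct = ndimage.generate_binary_structure(2, 2)
    gt_dil = ndimage.binary_dilation(gt_b, structure=struct, iterations=tol)
    pred_dil = ndimage.binary_dilation(pred_b, structure=struct, iterations=tol)
    precision = (pred_b & gt_dil).sum() / max(pred_b.sum(), 1)
    recall = (gt_b & pred_dil).sum() / max(gt_b.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

J&F is then the mean of J and F, averaged over frames and objects.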

BibTeX

Please consider citing MOSE if it helps your research.
    @article{MOSEv2,
      title={{MOSEv2}: A More Challenging Dataset for Video Object Segmentation in Complex Scenes},
      author={Ding, Henghui and Ying, Kaining and Liu, Chang and He, Shuting and Jiang, Xudong and Jiang, Yu-Gang and Torr, Philip HS and Bai, Song},
      journal={arXiv preprint arXiv:2508.05630},
      year={2025}
    }
    
    @inproceedings{MOSE,
      title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
      author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
      booktitle={ICCV},
      year={2023}
    }
    

License

MOSE is licensed under a Creative Commons CC BY-NC-SA 4.0 License. The data of MOSE is released for non-commercial research purposes only.