Mitigating background bias in self-supervised video representation learning

Research output: Contribution to journal › Article › peer-review

Abstract

This paper addresses self-supervised video representation learning focused on motion features, aiming to capture foreground motion while reducing reliance on background bias. Recent successful methods often employ instance discrimination approaches, which entail heavy computation and can make pretraining inefficient and exhaustive. To address this, we utilize the augmentation technique MAC (Mask-Augmentation teChnique). MAC blends foreground motion using frame-difference-based masks and sets up a pretext task of recognizing the applied transformation. By incorporating the task of predicting the correct blending multiplier at the pretraining stage, our model is compelled to encode motion-based features, which transfer successfully to downstream tasks such as action recognition. Moreover, we extend our approach within a joint contrastive framework, integrating additional tasks in the spatial and temporal domains to further enhance representation quality. Experimental results demonstrate that our method achieves superior performance on the UCF-101, HMDB51, and Diving-48 datasets under low-resource settings, and competitive results against instance discrimination methods under costly computation settings.
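The masking-and-blending idea described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the thresholding scheme, the discrete multiplier set, and the function names are all assumptions introduced here for clarity.

```python
import numpy as np

def frame_difference_mask(clip, threshold=0.1):
    """Binary foreground-motion mask from absolute frame differences.

    clip: float array of shape (T, H, W) with values in [0, 1].
    The threshold value is a hypothetical choice, not from the paper.
    """
    diff = np.abs(np.diff(clip, axis=0))            # (T-1, H, W)
    diff = np.concatenate([diff[:1], diff], axis=0)  # pad back to (T, H, W)
    return (diff > threshold).astype(clip.dtype)

def mac_blend(clip, multiplier):
    """Scale the masked foreground-motion regions by `multiplier`,
    leaving background pixels unchanged."""
    mask = frame_difference_mask(clip)
    return clip * (1 - mask) + clip * mask * multiplier

# Pretext-task setup (assumed): draw a multiplier from a discrete set and
# use its index as the classification target the encoder must predict.
multipliers = [0.25, 0.5, 1.0, 2.0]
rng = np.random.default_rng(0)
label = int(rng.integers(len(multipliers)))
clip = rng.random((8, 16, 16))
augmented = mac_blend(clip, multipliers[label])
```

During pretraining, a classification head on top of the video encoder would then be trained to recover `label` from `augmented`, which forces the encoder to attend to the motion regions rather than the static background.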
Original language: English
Article number: 55
Number of pages: 9
Journal: Signal, Image and Video Processing
Volume: 19
Issue number: 1
DOIs
Publication status: Published - 4 Dec 2024

Keywords

  • Contrastive learning
  • Motion-guidance
  • Self-supervised learning
  • Video action recognition
  • Video representation learning
