AMeFu-Net: Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition
ACM MM 2020 (oral)



Humans can easily recognize actions given only a few examples, while existing video recognition models still rely heavily on large-scale labeled data. This observation has motivated increasing interest in few-shot video action recognition, which aims at learning new actions from only very few labeled samples. In this paper, we propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net. Concretely, we tackle the few-shot recognition problem from three aspects: firstly, we alleviate this extremely data-scarce problem by introducing depth information as a carrier of the scene, which brings extra visual information to our model; secondly, we fuse the representation of the original RGB clip with multiple non-strictly corresponding depth clips sampled by our temporal asynchronization augmentation mechanism, which synthesizes new instances at the feature level; thirdly, a novel Depth Guided Adaptive Instance Normalization (DGAdaIN) fusion module is proposed to fuse the two-stream modalities efficiently. Additionally, to better mimic the few-shot recognition process, our model is trained in the meta-learning way. Extensive experiments on several action recognition benchmarks demonstrate the effectiveness of our model.


Task: few-shot video action recognition.

Challenges: labeled examples are limited, and few-shot learning (FSL) on videos is more complex than FSL on images.

Motivation: explore how multi-modality, especially the less-explored depth modality, can help with this problem.

AMeFu-Net Method

Two main insights:

  1. multi-modality fusion: design a more advanced method to fuse the RGB and depth features rather than naively concatenating them;
  2. temporal augmentation: utilize the asynchronization between depth and RGB clips to augment videos temporally.
The multi-modality fusion results in our novel depth guided adaptive fusion (DGAdaIN) module, while the temporal augmentation results in our temporal asynchronization augmentation strategy.

Our whole pipeline works as follows: for each RGB clip, we:

  1. sample another depth clip with our temporal asynchronization sampling;
  2. extract the RGB feature map and the depth feature map;
  3. use the DGAdaIN module to fuse the two features;
  4. train the FSL classifier via meta-learning.
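The four steps above can be sketched as a single forward pass. All callables below are hypothetical stand-ins for the paper's components (the real extractors are pretrained backbones, and the classifier is meta-trained):

```python
def amefu_forward(rgb_clip, video, extract_rgb, extract_depth,
                  sample_depth_clip, dgadain, classifier):
    """End-to-end forward sketch of AMeFu-Net; all callables are
    hypothetical placeholders, not the paper's actual modules."""
    depth_clip = sample_depth_clip(video)   # step 1: asynchronous depth clip
    f_rgb = extract_rgb(rgb_clip)           # step 2: RGB feature map
    f_depth = extract_depth(depth_clip)     # step 2: depth feature map
    fused = dgadain(f_rgb, f_depth)         # step 3: depth-guided fusion
    return classifier(fused)                # step 4: few-shot classifier
```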


Empirically, both the RGB feature extractor and the depth feature extractor are pretrained on their corresponding modalities using supervised classification tasks.

For the novel DGAdaIN, we draw inspiration from AdaIN, a flagship work in style transfer.


  • AdaIN takes a content image Ia and a style image Ib as input, and transfers the style of Ib onto Ia;
  • our DGAdaIN takes the RGB feature Irgb and the depth feature Id as input, and transfers the "style" of Id onto Irgb, while making the new "style" learnable.
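A minimal NumPy sketch of this idea follows. AdaIN computes σ(y)·(x − μ(x))/σ(x) + μ(y); in the DGAdaIN variant here, the scale and shift are instead *predicted* from the depth feature through learnable projections, which is what makes the "style" learnable. The projection shapes (`W_gamma`, `W_beta`) are our assumptions for illustration, not the paper's actual parameterization:

```python
import numpy as np

def dgadain(f_rgb, f_depth, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    """Depth-guided adaptive instance normalization (illustrative sketch).

    f_rgb, f_depth: (C,) feature vectors for one clip (after pooling).
    W_gamma, W_beta: (C, C) hypothetical learnable projections mapping the
    depth feature to the scale (gamma) and shift (beta) parameters.
    """
    # Normalize the RGB ("content") feature, AdaIN-style.
    mu, sigma = f_rgb.mean(), f_rgb.std()
    normed = (f_rgb - mu) / (sigma + eps)
    # The new "style" is predicted from the depth feature rather than
    # copied from its statistics, so it is learnable end-to-end.
    gamma = f_depth @ W_gamma + b_gamma
    beta = f_depth @ W_beta + b_beta
    return gamma * normed + beta
```

With identity `W_gamma` and zero `W_beta`, this reduces to scaling the normalized RGB feature channel-wise by the depth feature, which makes the depth-guided modulation easy to see.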

For the temporal asynchronization augmentation, we sample non-strictly matched RGB and depth clip pairs to augment the fused feature.



We compare with previous methods on Kinetics, and also report experiments on UCF101 and HMDB51 (results can be found in the paper).


We also provide visualization results, which show that with depth guidance the model recognizes the key content better.


Related Links