AMeFu-Net: Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition
ACM MM 2020 (oral)



Humans can easily recognize actions given only a few examples, while existing video recognition models still rely heavily on large-scale labeled data. This observation has motivated increasing interest in few-shot video action recognition, which aims at learning new actions from only very few labeled samples. In this paper, we propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net. Concretely, we tackle the few-shot recognition problem from three aspects: firstly, we alleviate this extremely data-scarce problem by introducing depth information as a carrier of the scene, which brings extra visual information to our model; secondly, we fuse the representation of the original RGB clip with multiple non-strictly corresponding depth clips sampled by our temporal asynchronization augmentation mechanism, which synthesizes new instances at the feature level; thirdly, a novel Depth Guided Adaptive Instance Normalization (DGAdaIN) fusion module is proposed to fuse the two-stream modalities efficiently. Additionally, to better mimic the few-shot recognition process, our model is trained in the meta-learning way. Extensive experiments on several action recognition benchmarks demonstrate the effectiveness of our model.


Task: few-shot video action recognition.

Challenges: labeled examples are limited, and few-shot learning (FSL) on videos is more complex than FSL on images.

Motivation: explore how multi-modality, especially the less-explored depth modality, can help with this problem.

AMeFu-Net Method

Two main insights:

  1. multi-modality fusion: design a more advanced method to fuse the RGB and depth features rather than naively concatenating them;
  2. temporal augmentation: utilize the asynchronization between depth and RGB clips to augment videos temporally.
The multi-modality fusion results in our novel depth guided adaptive fusion (DGAdaIN) module, while the temporal augmentation results in our temporal asynchronization augmentation strategy.

Our whole pipeline works as follows: for each RGB clip, we:

  1. sample another depth clip with our temporal asynchronization sampling;
  2. extract the RGB feature map and the depth feature map;
  3. use the DGAdaIN module to fuse the two features;
  4. train the FSL classifier via meta-learning.
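The four steps above can be sketched as a single forward pass. All callables below are hypothetical stand-ins for the paper's components (the real extractors are pretrained backbones, and the classifier is meta-trained):

```python
def amefu_forward(rgb_clip, video, extract_rgb, extract_depth,
                  sample_depth_clip, dgadain, classifier):
    """End-to-end forward sketch of AMeFu-Net; all callables are
    hypothetical placeholders, not the paper's actual modules."""
    depth_clip = sample_depth_clip(video)   # step 1: asynchronous depth clip
    f_rgb = extract_rgb(rgb_clip)           # step 2: RGB feature map
    f_depth = extract_depth(depth_clip)     # step 2: depth feature map
    fused = dgadain(f_rgb, f_depth)         # step 3: depth-guided fusion
    return classifier(fused)                # step 4: few-shot classifier
```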


Empirically, both the RGB feature extractor and the depth feature extractor are pretrained on their corresponding modalities using supervised classification tasks.

For the novel DGAdaIN, we draw inspiration from AdaIN, a flagship work in style transfer.


  • AdaIN takes a content image Ia and a style image Ib as input, and transfers the style of Ib onto Ia;
  • our DGAdaIN takes the RGB feature Irgb and the depth feature Id as input, and transfers the "style" of Id onto Irgb, while making the new "style" learnable.
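A minimal NumPy sketch of this idea follows. AdaIN computes σ(y)·(x − μ(x))/σ(x) + μ(y); in the DGAdaIN variant here, the scale and shift are instead *predicted* from the depth feature through learnable projections, which is what makes the "style" learnable. The projection shapes (`W_gamma`, `W_beta`) are our assumptions for illustration, not the paper's actual parameterization:

```python
import numpy as np

def dgadain(f_rgb, f_depth, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    """Depth-guided adaptive instance normalization (illustrative sketch).

    f_rgb, f_depth: (C,) feature vectors for one clip (after pooling).
    W_gamma, W_beta: (C, C) hypothetical learnable projections mapping the
    depth feature to the scale (gamma) and shift (beta) parameters.
    """
    # Normalize the RGB ("content") feature, AdaIN-style.
    mu, sigma = f_rgb.mean(), f_rgb.std()
    normed = (f_rgb - mu) / (sigma + eps)
    # The new "style" is predicted from the depth feature rather than
    # copied from its statistics, so it is learnable end-to-end.
    gamma = f_depth @ W_gamma + b_gamma
    beta = f_depth @ W_beta + b_beta
    return gamma * normed + beta
```

With identity `W_gamma` and zero `W_beta`, this reduces to scaling the normalized RGB feature channel-wise by the depth feature, which makes the depth-guided modulation easy to see.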

For the temporal asynchronization augmentation, we sample non-strictly matched RGB and depth clip pairs to augment the fused feature.



We compare with previous methods on Kinetics, and also report experiments on UCF101 and HMDB51 (results can be found in the paper).


We also provide visualization results, which show that with depth guidance the model recognizes the key content better.


Related Links