ORViT TimeSformer | 88.0 | Object-Region Video Transformers | - |
AIM (CLIP ViT-L/14, 32x224) | 90.6 | AIM: Adapting Image Models for Efficient Video Action Recognition | - |
RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | 84.2 | Relational Self-Attention: What's Missing in Attention for Video Understanding | - |