BLIP-2 ViT-L (zero-shot, 1K test set) | 88.6 | 98.9 | 97.6 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | - |
BLIP-2 ViT-G (zero-shot, 1K test set) | 89.7 | 98.9 | 98.1 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | - |