Emu | 36.57 | 18.19 | 28.90 | 28.24 | 14B | Emu: Generative Pretraining in Multimodality | - |
CogVLM-Chat | 47.88 | 28.75 | 36.75 | 37.16 | 17B | CogVLM: Visual Expert for Pretrained Language Models | - |
LLaVA-1.5 | 47.91 | 24.31 | 30.94 | 32.62 | 13B | Improved Baselines with Visual Instruction Tuning | - |
LLaMA-Adapter V2 | 46.12 | 22.08 | 28.70 | 30.46 | 7B | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | - |