Pushing Boundaries: Exploring Zero-Shot Object Classification with Large Multimodal Models
Published in the 2023 Tenth International Conference on Social Networks Analysis, Management and Security (SNAMS)
This conference paper explores the potential of Large Multimodal Models (LMMs) for zero-shot object classification by leveraging the synergy of language and vision models. LMMs, which couple vision encoders with Large Language Models (LLMs), are designed to handle both language and visual comprehension tasks, pushing the boundaries of AI-assisted capabilities.
The study benchmarks LMMs' performance on four diverse datasets: MNIST, Cats vs. Dogs, Hymenoptera (Ants vs. Bees), and Pox vs. Non-Pox skin images, achieving classification accuracies of 85%, 100%, 77%, and 79%, respectively, without any fine-tuning. The study then fine-tunes the model on a specialized dataset of images depicting faces of children with and without autism, where accuracy improved from 55% to 83% after fine-tuning.
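To give a concrete sense of how prompt-based zero-shot classification with an LMM can be set up, the sketch below uses an open LLaVA-style checkpoint through the Hugging Face transformers library. The model ID, prompt wording, and image file name are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of zero-shot image classification with an LMM.
# Assumptions (not from the paper): the llava-hf/llava-1.5-7b-hf checkpoint,
# the LLaVA-1.5 prompt template, and a local image file named "pet.jpg".
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Zero-shot classification: ask the model to pick one label, no fine-tuning.
prompt = "USER: <image>\nIs this image a cat or a dog? Answer with a single word.\nASSISTANT:"
image = Image.open("pet.jpg")  # hypothetical input from the Cats vs. Dogs task

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=10)
answer = processor.decode(output_ids[0], skip_special_tokens=True)
print(answer)  # decoded text ends with the model's one-word label, e.g. "Dog"
```

The decoded string contains the prompt followed by the model's one-word answer; comparing that answer against the ground-truth label over a dataset yields the kind of accuracy figures reported above.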
These findings highlight the versatility and transformative potential of LMMs in real-world applications, particularly in scenarios requiring zero-shot learning capabilities.
Recommended citation: Islam, A., Biswas, M. R., Zaghouani, W., Belhaouari, S. B., & Shah, Z. (2023). "Pushing Boundaries: Exploring Zero-Shot Object Classification with Large Multimodal Models." In 2023 Tenth International Conference on Social Networks Analysis, Management and Security (SNAMS) (pp. 1-5). IEEE. https://doi.org/10.1109/SNAMS60348.2023.10375440
Download Paper