Visual Species Recognition (VSR) is pivotal to biodiversity assessment and conservation, evolution research, and ecology and ecosystem management. Training a machine-learned model for VSR typically requires vast amounts of annotated images, yet species-level annotation demands domain expertise, so in practice domain experts can annotate only a few examples. Such limited labeled data motivates training an “expert” model via few-shot learning (FSL). Meanwhile, advanced Large Multimodal Models (LMMs) have demonstrated prominent performance on general recognition tasks. In this work, we investigate whether LMMs excel in the highly specialized VSR task and whether they outshine FSL expert models.
We conduct experiments on five challenging fine-grained biological datasets.
We show examples of test images from five VSR benchmarks, along with an expert model’s top-3 predicted species and softmax confidence scores. A reference image is provided for each predicted species. The prevalence of visually similar species among the top-3 predictions underscores the challenges of VSR. Notably, even when top-1 predictions are incorrect (marked by red boxes), the top-3 often contain the correct species (marked by green boxes). Importantly, an LMM can identify the correct ones through a post-hoc process!
We propose Post-hoc Correction (POC) to leverage LMMs to enhance few-shot VSR. Specifically, for a test image, the expert model predicts the top-k species along with their corresponding softmax confidence scores. Then, POC constructs a few-shot in-context prompt by supplementing the test image with the top-k species names, their confidences, and few-shot examples. Based on the given context, the LMM is instructed to re-rank the top-k species. Finally, the top-ranked species from its output is returned as the final prediction. POC effectively combines the expert model's proficiency in few-shot learning with the LMM's broad knowledge, leading to improved VSR performance without extra training, validation, or human intervention.
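The pipeline above can be sketched in a few lines of Python. This is a minimal sketch, not the released implementation: the `lmm_fn` callback, the logit interface, and the prompt wording are illustrative assumptions, and a full version would also attach the few-shot reference images mentioned above.

```python
import math

def poc_rerank(image, expert_logits, class_names, lmm_fn, k=3):
    """Post-hoc Correction (POC) sketch: re-rank an expert model's
    top-k species predictions with an LMM.

    `expert_logits` are the expert model's raw class scores for `image`;
    `lmm_fn(image, prompt) -> str` is a placeholder wrapper around
    whatever LMM API is used (e.g., GPT-4o or LLaVA).
    """
    # softmax confidences from the expert model's logits
    exps = [math.exp(z) for z in expert_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k candidate species with their confidence scores
    order = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    candidates = [(class_names[i], probs[i]) for i in order]

    # in-context prompt listing candidate names and confidences
    prompt = (
        "Identify the species in the image. Candidate species with "
        "expert-model confidences:\n"
        + "\n".join(f"- {name}: {p:.2f}" for name, p in candidates)
        + "\nRe-rank the candidates and answer with the single most "
        "likely species name."
    )

    answer = lmm_fn(image, prompt)
    # return the first candidate mentioned in the LMM's reply;
    # fall back to the expert's top-1 if the reply is unparseable
    for name, _ in candidates:
        if name.lower() in answer.lower():
            return name
    return candidates[0][0]
```

Keeping the LMM behind a plain callable keeps the sketch API-agnostic; the fallback to the expert's top-1 prediction means POC can never be derailed by an off-topic LMM reply.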
Remarkably, we show that POC serves as a simple and effective plug-in that significantly boosts various existing few-shot learning methods, including prompt learning, adapter learning, linear probing, and full finetuning, by up to 14%.
If you find our work useful, please consider citing our papers:
@article{liu2025poc,
  title={Surely Large Multimodal Models (Don’t) Excel in Visual Species Recognition?},
  author={Liu, Tian and Basu, Anwesha and Kong, Shu},
  journal={arXiv preprint arXiv:2512.15748},
  year={2025}
}
@inproceedings{liu2025few,
  title={Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning},
  author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}