Surely Large Multimodal Models (Don't) Excel in Visual Species Recognition?

1 Texas A&M University
2 University of Macau
* The first two authors contributed equally

tl;dr: We find that LMMs struggle in VSR, yet they can significantly boost
few-shot recognition performance through post-hoc correction.

Motivation

Visual Species Recognition (VSR) is pivotal to biodiversity assessment and conservation, evolution research, and ecology and ecosystem management. Training a machine-learned model for VSR typically requires vast amounts of annotated images. Yet, species-level annotation demands domain expertise, so in practice domain experts can annotate only a few examples per species. Such limited labeled data motivate training an “expert” model via few-shot learning (FSL). Meanwhile, advanced Large Multimodal Models (LMMs) have demonstrated prominent performance on general recognition tasks. In this work, we investigate whether LMMs excel in the highly specialized VSR task and whether they outshine FSL expert models.

Overview of Findings

Across five challenging fine-grained biological datasets, our experiments show that:

  1. LMMs (e.g., Qwen-2.5-VL-7B-Instruct), despite being pretrained on web-scale data, struggle in VSR and significantly underperform the FSL expert model, even with established Chain-of-Thought or Self-Verification prompting.
  2. A few-shot learned expert, e.g., one obtained by simply finetuning the visual encoder of a VLM such as OpenCLIP on the few-shot data, significantly outperforms LMMs in VSR (see the sketch after this list).
  3. Leveraging LMMs via Post-hoc Correction (POC), i.e., feeding the top-k predictions from the expert model to an LMM and asking it to select the most probable one, significantly improves few-shot VSR performance.
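
To illustrate finding 2, below is a minimal sketch of few-shot finetuning an OpenCLIP visual encoder with a linear classification head, in PyTorch using the open_clip library. The backbone name, pretraining tag, number of species, and hyperparameters are illustrative assumptions rather than the paper's exact recipe.

import torch
import torch.nn as nn
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pretrained OpenCLIP model (backbone and weights are assumptions).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model = model.to(device)

num_species = 200  # hypothetical number of species in the benchmark
classifier = nn.Linear(model.visual.output_dim, num_species).to(device)

# Finetune the visual encoder jointly with the linear head on few-shot data.
optimizer = torch.optim.AdamW(
    list(model.visual.parameters()) + list(classifier.parameters()), lr=1e-5
)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step over a batch of few-shot examples."""
    features = model.encode_image(images.to(device))
    loss = criterion(classifier(features), labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()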

Insights

Top-k predictions by expert often contain correct answer

We show examples of test images from five VSR benchmarks, along with an expert model’s top-3 predicted species and softmax confidence scores. A reference image is provided for each predicted species. The prevalence of visually similar species among the top-3 predictions underscores the challenge of VSR. Notably, even when the top-1 prediction is incorrect (marked by red boxes), the top-3 often contains the correct species (marked by green boxes). Importantly, an LMM can identify the correct one through a post-hoc process!
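
As a toy check of this observation, the snippet below measures how often the correct species appears among the expert's top-3 predictions (top-3 recall); the logits and labels are random placeholders standing in for real expert outputs.

import torch

logits = torch.randn(8, 200)          # placeholder expert logits for 8 test images
labels = torch.randint(0, 200, (8,))  # placeholder ground-truth species ids

probs = logits.softmax(dim=-1)
conf, topk = probs.topk(k=3, dim=-1)  # top-3 confidence scores and species indices

# Fraction of test images whose top-3 predictions contain the correct species.
top3_recall = (topk == labels.unsqueeze(-1)).any(dim=-1).float().mean().item()
print(f"top-3 recall: {top3_recall:.2%}")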

Solution

Harnessing LMMs via Post-hoc Correction (POC)

We propose Post-hoc Correction (POC) to leverage LMMs for enhancing few-shot VSR. Specifically, for a test image, the expert model predicts the top-k species along with their softmax confidence scores. POC then constructs a few-shot in-context prompt by supplementing the test image with the top-k species names, their confidences, and few-shot examples. Based on this context, the LMM is instructed to re-rank the top-k species. Finally, the top-ranked species in its output is returned as the final prediction.
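
The sketch below shows one plausible way to assemble the POC prompt and parse the LMM's answer; the prompt wording and the query_lmm chat interface are assumptions for illustration, not the paper's exact implementation.

def build_poc_prompt(topk_names, topk_confs, few_shot_examples):
    """Assemble the in-context prompt from the expert's top-k predictions."""
    candidates = "\n".join(
        f"{i + 1}. {name} (confidence: {conf:.2f})"
        for i, (name, conf) in enumerate(zip(topk_names, topk_confs))
    )
    return (
        "You are an expert in visual species recognition. "
        f"Reference examples for the candidate species:\n{few_shot_examples}\n"
        "An expert model predicted these candidate species for the attached "
        f"test image:\n{candidates}\n"
        "Re-rank the candidates from most to least probable, and output the "
        "most probable species name first."
    )

def poc_predict(test_image, topk_names, topk_confs, few_shot_examples, query_lmm):
    """Return the LMM's top-ranked species as the final POC prediction."""
    prompt = build_poc_prompt(topk_names, topk_confs, few_shot_examples)
    answer = query_lmm(images=[test_image], text=prompt)  # hypothetical LMM call
    # Take the candidate species mentioned earliest in the LMM's answer.
    positions = {name: answer.lower().find(name.lower()) for name in topk_names}
    found = [name for name, pos in positions.items() if pos != -1]
    if found:
        return min(found, key=lambda name: positions[name])
    return topk_names[0]  # fall back to the expert's top-1 prediction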

POC effectively combines the expert model's proficiency in few-shot learning with the LMM's broad knowledge, leading to improved VSR performance, without extra training, validation, or human intervention.

Results

POC significantly improves existing few-shot learning methods

[Figure: comparison with SOTA few-shot learning methods.]

Remarkably, we show that POC serves as a simple and effective plug-in that significantly boosts various existing few-shot learning methods by up to 14%, including prompt learning, adapter learning, linear probing, and full finetuning.

POC generalizes across different pretrained backbones and LMMs

  • Left: we compare the mean accuracy of POC over five benchmarks with few-shot experts learned with various pretrained backbones. Results show that POC consistently improves few-shot VSR across different backbones, with small standard deviations.
  • Right: we compare POC's accuracies across five benchmarks with different LMMs, including open-source ones like Qwen-2.5-VL-7B-Instruct and GLM-4.1V-9B-Thinking, and proprietary ones like GPT-5-Mini. Results show that POC yields consistent performance gains across different LMMs.

BibTeX

If you find our work useful, please consider citing our papers:


@article{liu2025poc,
  title={Surely Large Multimodal Models (Don't) Excel in Visual Species Recognition?},
  author={Liu, Tian and Basu, Anwesha and Kong, Shu},
  journal={arXiv preprint arXiv:2512.15748},
  year={2025}
}

@inproceedings{liu2025few,
  title={Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning},
  author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}