Few-Shot Recognition (FSR) aims to train a classification model with only a few labeled examples per concept of interest to a downstream task. Following an open-world philosophy, we exploit a Vision-Language Model (VLM), which is pretrained on open-world data, together with open data (e.g., the VLM's pretraining data) to facilitate few-shot learning. Our few-shot recognition setup is further motivated by data annotation applications, where annotation guidelines provide a few visual examples for each concept targeted by the downstream task. Hence, we explore methods that prioritize recognition accuracy without being limited to parameter-efficient finetuning (PEFT) techniques such as prompt tuning or adapter learning, both of which are popular in few-shot learning. In particular, we explore methods as simple as full-model finetuning over the few-shot data and/or retrieved open data. Through this process, we identify novel challenges and opportunities. On nine standard benchmark datasets, we show the following:
Figure: Left: the retrieved data exhibits different visual patterns (styles, backgrounds, resolutions, semantics, etc.) from the downstream few-shot data. Right: the retrieved data follows an imbalanced distribution, where some classes naturally appear less frequently than others. We retrieve relevant images from the LAION dataset following REAL.
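For illustration, below is a minimal sketch of how such caption-based retrieval could look. The metadata file name, column names, and concept synonyms are assumptions for the sketch; REAL's actual retrieval pipeline (synonym selection, matching rules, metadata format) may differ.

```python
# Minimal sketch of caption-based retrieval from LAION-style metadata.
# The file name, column names, and synonym lists below are illustrative only.
import re
import pandas as pd

def retrieve_open_data(metadata_path, concept_synonyms, per_class_cap=500):
    """Return {class_name: image URLs} whose captions mention any synonym."""
    df = pd.read_parquet(metadata_path)            # expects `caption` and `url` columns
    captions = df["caption"].fillna("").str.lower()
    retrieved = {}
    for class_name, synonyms in concept_synonyms.items():
        pattern = "|".join(re.escape(s.lower()) for s in synonyms)
        mask = captions.str.contains(pattern, regex=True)
        # Cap per-class retrieval to limit the imbalance noted in the figure above.
        retrieved[class_name] = df.loc[mask, "url"].tolist()[:per_class_cap]
    return retrieved

# Example with two downstream concepts and a few synonyms each (hypothetical).
concepts = {
    "tench": ["tench", "tinca tinca"],
    "goldfish": ["goldfish", "carassius auratus"],
}
# urls_by_class = retrieve_open_data("laion_metadata.parquet", concepts)
```

Matching captions by string keeps retrieval cheap (no need to encode millions of images), while the per-class cap bounds how skewed the retrieved distribution can become.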
Given a data annotation guideline consisting of a few annotated images per concept, SWAT retrieves open data relevant to the downstream concepts (e.g., from the VLM's pretraining dataset LAION) and then finetunes a pretrained VLM (e.g., OpenCLIP) with a stage-wise strategy: it first finetunes the entire model on the mix of retrieved and few-shot data, and then retrains the classifier on the few-shot data alone to mitigate the domain gap and imbalance of the retrieved data.
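As a rough illustration of this stage-wise strategy, here is a minimal sketch using OpenCLIP and PyTorch. The dataloaders (`mixed_loader` over retrieved + few-shot images, `fewshot_loader` over few-shot images only), the number of classes, and the hyperparameters are placeholders, not the released training code.

```python
# Minimal sketch of the two finetuning stages, assuming OpenCLIP (ViT-B/32)
# and two placeholder dataloaders yielding preprocessed (image, label) batches.
import torch
import torch.nn as nn
import open_clip

num_classes = 100  # number of downstream concepts (placeholder)

# Any OpenCLIP checkpoint works; this pretrained tag is just an example.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
visual = model.visual                                   # image encoder
classifier = nn.Linear(visual.output_dim, num_classes)  # linear classifier head

# Stage 1: finetune the entire image encoder + classifier on mixed data.
visual.train()
opt = torch.optim.AdamW(
    list(visual.parameters()) + list(classifier.parameters()), lr=1e-5)
for images, labels in mixed_loader:          # retrieved + few-shot images
    logits = classifier(visual(images))
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: freeze the encoder; retrain only the classifier on few-shot data
# to counteract the domain gap and imbalance of the retrieved data.
visual.eval()
for p in visual.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(classifier.parameters(), lr=1e-3)
for images, labels in fewshot_loader:        # few-shot images only
    with torch.no_grad():
        feats = visual(images)
    logits = classifier(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The second stage matters because, as the figure above suggests, the retrieved data differs in style and class balance from the downstream few-shot data; retraining only the classifier on few-shot data recalibrates the decision boundaries without discarding the representation learned in the first stage.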
Remarkably, we show that SWAT outperforms previous state-of-the-art few-shot recognition methods across these benchmarks.
If you find our work useful, please consider citing our papers:
@inproceedings{liu2025few,
title={Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning},
author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}
@inproceedings{parashar2024neglected,
title={The Neglected Tails in Vision-Language Models},
author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}