Motivated by the significant improvements that retrieval-augmented learning (RAL) brings to zero-shot recognition with Vision-Language Models (VLMs, e.g., OpenCLIP), we explore, for the first time, RAL for few-shot recognition by retrieving relevant images from the VLM's pretraining set. We identify novel challenges and opportunities:
Given a data annotation guideline consisting of few-shot images of the downstream concepts, SWAT first retrieves relevant images from the VLM's pretraining set (e.g., LAION), and then finetunes the VLM (e.g., OpenCLIP) following a stage-wise strategy:
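Concretely, stage 1 finetunes the model on the mix of retrieved and few-shot data, and stage 2 retrains the classifier on the few-shot data alone (see the ablation below). The following is a minimal PyTorch-style sketch of this two-stage loop, not the repository's implementation: `backbone`, `classifier`, the data loaders, and all hyperparameters are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def finetune_swat(backbone, classifier, mixed_loader, fewshot_loader,
                  epochs_stage1=10, epochs_stage2=10, device="cuda"):
    """Sketch of stage-wise finetuning: (1) finetune the whole model on
    retrieved + few-shot data, (2) retrain only the classifier on few-shot data."""
    backbone, classifier = backbone.to(device), classifier.to(device)

    # Stage 1: finetune backbone and classifier on retrieved + few-shot images.
    opt = torch.optim.AdamW(
        list(backbone.parameters()) + list(classifier.parameters()), lr=1e-5)
    for _ in range(epochs_stage1):
        for images, labels in mixed_loader:      # retrieved + few-shot batches
            images, labels = images.to(device), labels.to(device)
            loss = F.cross_entropy(classifier(backbone(images)), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: freeze the backbone and retrain the classifier on the
    # (balanced) few-shot data only.
    for p in backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(classifier.parameters(), lr=1e-4)
    for _ in range(epochs_stage2):
        for images, labels in fewshot_loader:    # few-shot batches only
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = backbone(images)
            loss = F.cross_entropy(classifier(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return backbone, classifier
```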
Left: we show that the retrieved data exhibits visual patterns (styles, backgrounds, resolutions, semantics, etc.) that differ from the downstream few-shot data. Right: the retrieved data follows an imbalanced distribution, as some classes naturally appear less frequently on the Internet.
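To quantify this imbalance, one can simply count how many images are retrieved per class and look at the least-frequent classes (the "rare" 10% referenced below). A small hedged sketch, where `retrieved_labels` is an assumed list of class names, one per retrieved image:

```python
from collections import Counter

def class_frequency_stats(retrieved_labels, rare_fraction=0.10):
    """Count retrieved images per class and return the least-frequent classes."""
    counts = Counter(retrieved_labels)
    ranked = sorted(counts.items(), key=lambda kv: kv[1])   # rarest first
    n_rare = max(1, int(len(ranked) * rare_fraction))
    rare_classes = [name for name, _ in ranked[:n_rare]]
    return counts, rare_classes

# Toy example: web data covers classes with very different frequencies.
counts, rare = class_frequency_stats(
    ["sparrow"] * 500 + ["warbler"] * 40 + ["petrel"] * 3, rare_fraction=0.34)
print(counts)   # Counter({'sparrow': 500, 'warbler': 40, 'petrel': 3})
print(rare)     # ['petrel']
```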
Across five standard benchmark datasets, we show that:
Surprisingly, finetuning on few-shot data already outperforms previous zero-shot and few-shot methods, without suffering from overfitting. In addition, SWAT significantly outperforms existing zero-shot and few-shot methods on standard benchmark datasets. We mark the accuracy improvements over CLAP in superscripts.
Across different finetuning scenarios in stage 1, SWAT effectively improves the recognition accuracy on both common and rare classes (the least frequent 10%). The improvement on rare classes is more significant than on common classes, confirming that SWAT mitigates imbalanced learning. We mark the accuracy improvements over the stage-1 model in superscripts and the standard deviation across three runs in subscripts.
Compared to the SOTA adapter-based method CLAP, finetuning the model on few-shot data yields a 5% accuracy improvement, and adding retrieved data further improves performance by 4%. Applying CutMix data augmentation provides an additional 2% improvement. Finally, retraining the classifier on few-shot data in stage 2 leads to another 1% improvement. We mark the accuracy improvement of each component (relative to the corresponding row above) in superscripts.
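For reference, CutMix pastes a random rectangle from a shuffled copy of the batch and mixes the labels in proportion to the pasted area. A minimal sketch of how it could be applied to a stage-1 training batch (illustrative only, not the repository's implementation):

```python
import torch
import torch.nn.functional as F

def cutmix(images, labels, num_classes, alpha=1.0):
    """Minimal CutMix: swap a random patch between paired samples and
    mix one-hot labels by the actual patch-area ratio."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0), device=images.device)

    H, W = images.shape[-2:]
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    lam_adj = 1 - (y2 - y1) * (x2 - x1) / (H * W)   # actual kept-area ratio

    onehot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam_adj * onehot + (1 - lam_adj) * onehot[perm]
    return mixed, mixed_labels
```

Recent PyTorch versions accept such soft label targets in `F.cross_entropy`, so the mixed batch can replace `(images, labels)` in the stage-1 loss without further changes.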
We estimate the compute cost using the Semi-Aves dataset (200 classes with 16 few-shot examples per class). All experiments are conducted on a single Quadro RTX 6000 (24GB) GPU, with 50GB of storage hosting the retrieved data for all five datasets. SWAT improves accuracy by more than 10% over CLAP at a very affordable retrieval and training cost. We mark the accuracy improvements over CLAP in superscripts.
If you find our work useful, please consider citing our papers:
@article{liu2024few,
title={Few-Shot Recognition via Stage-Wise Augmented Finetuning},
author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
journal={arXiv preprint arXiv:2406.11148},
year={2024}
}
@inproceedings{parashar2024neglected,
title={The Neglected Tails in Vision-Language Models},
author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}