Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning

Texas A&M University
University of Macau

tl;dr: We adapt a pretrained Vision-Language Model and repurpose its pretraining data to boost few-shot recognition performance.

Motivation

Few-Shot Recognition (FSR) aims to train a classification model with only a few labeled examples for each concept of interest in a downstream task. Embracing an open-world philosophy, we exploit a Vision-Language Model (VLM), which is pretrained in the open world, together with open data (e.g., the VLM's pretraining data) to facilitate few-shot learning.

Furthermore, our few-shot recognition setup is motivated by real-world data annotation applications, where annotation guidelines provide a few visual examples for each concept concerned by the downstream task. Hence, we prioritize recognition accuracy and do not limit ourselves to parameter-efficient finetuning (PEFT) techniques such as prompt tuning or adapter learning, both popular few-shot learning methods. Instead, we explore methods as simple as full-model finetuning on few-shot and/or retrieved data. Through this process, we identify novel challenges and opportunities.

Overview of Findings


Across nine standard benchmark datasets, we show that:

  1. Finetuning a VLM on large amounts of retrieved data barely surpasses state-of-the-art zero-shot methods, because the retrieved data exhibits domain gaps relative to the few-shot annotated data and follows an imbalanced distribution.
  2. Simply finetuning a VLM on the few-shot examples alone significantly outperforms prior FSR methods (by >3%), without suffering from overfitting. Moreover, finetuning on the mixed retrieved and few-shot data yields even better results.
  3. To mitigate the domain gaps and imbalanced distribution, we propose a simple yet effective method: Stage-Wise retrieval-Augmented fineTuning (SWAT), which resoundingly surpasses prior FSR methods by >6% in accuracy, with 20-30% accuracy improvements on challenging datasets such as Semi-Aves, Aircraft, and EuroSAT (see the paper for details).

Challenges

Retrieved data has domain gaps and follows an imbalanced distribution, degrading finetuning performance


Left: the retrieved data exhibits different visual patterns (styles, backgrounds, resolutions, semantics, etc.) compared to the downstream few-shot data. Right: the retrieved data follows an imbalanced distribution, where some classes naturally appear less frequently than others. We retrieve relevant images from the LAION dataset following REAL.
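For intuition, here is a minimal sketch of caption-based retrieval in the spirit of REAL, which string-matches concept names and their synonyms against pretraining captions. The caption file format, function name, and per-class cap below are illustrative assumptions, not the released implementation.

import json

def retrieve_by_caption(caption_file, concept_synonyms, per_class_cap=500):
    # Collect image URLs whose captions mention a downstream concept name.
    retrieved = {name: [] for name in concept_synonyms}
    with open(caption_file) as f:
        for line in f:
            entry = json.loads(line)  # e.g., {"url": ..., "caption": ...}
            caption = entry["caption"].lower()
            for name, synonyms in concept_synonyms.items():
                if len(retrieved[name]) < per_class_cap and \
                        any(s in caption for s in synonyms):
                    retrieved[name].append(entry["url"])
    return retrieved

# Example: two bird concepts with assumed (lowercase) synonym lists.
concepts = {
    "american_goldfinch": ["american goldfinch", "spinus tristis"],
    "house_sparrow": ["house sparrow", "passer domesticus"],
}
# urls = retrieve_by_caption("laion_captions.jsonl", concepts)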

Solution

Stage-Wise retrieval-Augmented fineTuning (SWAT)


Given a data annotation guideline consisting of few-shot annotated images, SWAT retrieves open data relevant to the downstream concepts (e.g., from the VLM's pretraining dataset LAION) and then finetunes a pretrained VLM (e.g., OpenCLIP) following a stage-wise strategy (a code sketch follows the list):

  • Stage 1: end-to-end finetuning on the mixed retrieved and few-shot data.
  • Stage 2: retraining the classifier solely on the balanced few-shot images.
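Below is a minimal PyTorch sketch of the two stages, assuming a feature-extracting visual encoder (e.g., OpenCLIP's image tower) topped with a linear classifier. The loader names, cross-entropy loss, and hyperparameters are illustrative assumptions rather than the exact recipe from the paper.

import torch
import torch.nn as nn

def finetune(encoder, classifier, loader, epochs, lr, freeze_encoder=False):
    # Optimize the classifier alone (stage 2) or jointly with the encoder (stage 1).
    params = list(classifier.parameters())
    if not freeze_encoder:
        params += list(encoder.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            if freeze_encoder:
                with torch.no_grad():  # keep the stage-1 encoder fixed
                    feats = encoder(images)
            else:
                feats = encoder(images)
            loss = criterion(classifier(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: end-to-end finetuning on mixed retrieved + few-shot data.
# finetune(encoder, classifier, mixed_loader, epochs=10, lr=1e-5)
# Stage 2: retrain only the classifier on the balanced few-shot images.
# finetune(encoder, classifier, fewshot_loader, epochs=10, lr=1e-3, freeze_encoder=True)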
SWAT effectively mitigates the domain gaps and imbalanced distribution of retrieved data, significantly outperforming previous FSR methods, as shown below.

Results

SWAT achieves state-of-the-art FSR performance

[Figure: comparison with state-of-the-art methods.]

Remarkably, we show that:

  • Few-shot finetuning already outperforms previous FSR methods by >3%, without overfitting.
  • SWAT largely outperforms existing zero-shot and few-shot methods by >6%.
We mark accuracy improvements over the previous state-of-the-art FSR method, CLAP, in superscripts.

SWAT mitigates domain gaps and imbalanced distribution


  • When finetuning on retrieved data only in the first stage, retraining the classifier on few-shot data yields significant improvements in both common- and rare-class accuracies, confirming that SWAT mitigates domain gaps.
  • Across all stage-1 training scenarios, stage-2 training improves rare classes much more than common classes, validating that SWAT mitigates imbalanced learning.
We mark accuracy improvements over the stage-1 model in superscripts and standard deviations across three runs in subscripts.
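As a concrete illustration of this breakdown, the sketch below splits per-class accuracy into common and rare groups by per-class retrieved-image counts; the threshold of 50 images is an assumed cutoff, not the paper's definition.

import numpy as np

def common_rare_accuracy(preds, labels, retrieved_counts, threshold=50):
    # Per-class accuracy, grouped by how many images were retrieved per class.
    common, rare = [], []
    for c in np.unique(labels):
        acc = float((preds[labels == c] == c).mean())
        (common if retrieved_counts[c] >= threshold else rare).append(acc)
    return np.mean(common), np.mean(rare)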

More examples of retrieved data for various downstream concepts


BibTeX

If you find our work useful, please consider citing our papers:


@inproceedings{liu2025few,
  title={Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning},
  author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}

@inproceedings{parashar2024neglected,
  title={The Neglected Tails in Vision-Language Models},
  author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}