Few-Shot Recognition via Stage-Wise Augmented Finetuning

Texas A&M University
University of Macau

tl;dr: We adapt a pretrained Vision-Language Model and repurpose its
pretraining data to boost few-shot recognition performance

Overview

Motivated by the significant improvements that retrieval-augmented learning (RAL) brings to zero-shot recognition with Vision-Language Models (VLMs, e.g., OpenCLIP), we explore, for the first time, RAL for few-shot recognition by retrieving relevant images from VLMs' pretraining set. We identify novel challenges and opportunities:

  1. Simply finetuning VLMs on a large amount of retrieved data barely surpasses state-of-the-art zero-shot methods, due to the imbalanced distribution of the retrieved data and its domain gap relative to the few-shot annotated data.
  2. Finetuning a VLM on few-shot examples alone significantly outperforms prior methods, and finetuning on the mix of retrieved and few-shot data yields even better results.
  3. To mitigate the imbalanced distribution and domain gap, we propose the Stage-Wise Augmented fineTuning (SWAT) method, which involves end-to-end finetuning on the mixed data in the first stage and retraining the classifier solely on the few-shot data in the second stage.
Our SWAT resoundingly outperforms prior works by ~10% in accuracy on standard benchmark datasets.

Stage-Wise Augmented fineTuning (SWAT)


Given a data annotation guideline consisting of few-shot images of downstream concepts, SWAT first retrieves relevant pretraining images from the VLM's pretraining set (e.g., LAION), and then finetunes the VLM (e.g., OpenCLIP) with a stage-wise strategy:

  • Stage 1: end-to-end finetuning on the mixed data of retrieved and few-shot images.
  • Stage 2: retraining the classifier solely on the few-shot images.
SWAT effectively mitigates the domain gap and imbalanced distribution of retrieved data, as illustrated below.
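To make the stage-wise strategy concrete, below is a minimal sketch of the two stages in PyTorch. It assumes a tiny generic encoder standing in for OpenCLIP's visual tower and random tensors standing in for the retrieved and few-shot images; all names, sizes, and hyperparameters are illustrative, not the paper's exact settings.

import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def make_fake_split(n, num_classes=10, dim=3 * 32 * 32):
    # Stand-in for a real image dataset (replace with retrieved / few-shot images).
    return TensorDataset(torch.randn(n, dim), torch.randint(0, num_classes, (n,)))

retrieved = make_fake_split(512)  # imbalanced, domain-gapped retrieved images
few_shot = make_fake_split(160)   # balanced, annotated few-shot images

encoder = nn.Sequential(nn.Linear(3 * 32 * 32, 256), nn.ReLU())  # stand-in for the VLM visual encoder
classifier = nn.Linear(256, 10)
model = nn.Sequential(encoder, classifier)

def train(model, loader, params, epochs, lr):
    opt = torch.optim.AdamW(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Stage 1: end-to-end finetuning on the mix of retrieved and few-shot data.
mixed = DataLoader(ConcatDataset([retrieved, few_shot]), batch_size=64, shuffle=True)
train(model, mixed, model.parameters(), epochs=5, lr=1e-4)

# Stage 2: freeze the encoder and retrain only the classifier on the few-shot data,
# countering the domain gap and class imbalance carried by the retrieved images.
for p in encoder.parameters():
    p.requires_grad_(False)
classifier.reset_parameters()  # resetting vs. warm-starting the head is a choice made here for illustration
train(model, DataLoader(few_shot, batch_size=64, shuffle=True), classifier.parameters(), epochs=5, lr=1e-3)

The key design choice is that stage 2 updates only the classifier on the balanced few-shot data, so the head is calibrated on the target domain while the encoder keeps the representation learned from the mixed data.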

Challenges

Retrieved data has a domain gap compared to downstream few-shot data and follows an imbalanced distribution


Left: we show that the retrieved data exhibits different visual patterns (styles, backgrounds, resolutions, semantics, etc.) compared to the downstream few-shot data. Right: the retrieved data follows an imbalanced distribution, where some classes naturally appear less frequently on the Internet.

Benchmarking SWAT


Across five standard benchmark datasets, we show that:

  • finetuning the visual encoder on few-shot data alone already outperforms previous methods.
  • finetuning on retrieved data alone barely surpasses zero-shot methods, due to the domain gap and imbalanced distribution.
  • SWAT achieves the best performance, with >10% accuracy improvements over previous methods.

SWAT outperforms SOTA zero-shot and few-shot methods by >10%


Surprisingly, finetuning on few-shot data already outperforms previous zero-shot and few-shot methods, without suffering from overfitting. In addition, SWAT outperforms existing zero-shot and few-shot methods by a large margin on standard benchmark datasets. We mark the accuracy improvements over CLAP in superscripts.

SWAT mitigates domain gap and imbalanced performance


Across different finetuning scenarios in stage 1, SWAT effectively improves the recognition accuracy on both common and rare classes (the least frequent 10%). The improvement on the rare classes is more significant than that on the common classes, confirming that SWAT mitigates imbalanced learning. We mark the accuracy improvements over the stage-1 model in superscripts and the standard deviation across three different runs in subscripts.
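For readers who want to reproduce this breakdown, a hedged sketch is given below. It assumes "rare" means the 10% of classes with the fewest retrieved images and that accuracy is averaged per class within each group; the predictions, labels, and per-class counts are hypothetical placeholders.

import numpy as np

def rare_common_accuracy(y_true, y_pred, class_counts, rare_frac=0.10):
    # Split classes by retrieval frequency and report mean per-class accuracy.
    num_classes = len(class_counts)
    order = np.argsort(class_counts)                  # least frequent classes first
    num_rare = max(1, int(round(rare_frac * num_classes)))
    rare_classes = set(order[:num_rare].tolist())

    per_class_acc = np.array([
        (y_pred[y_true == c] == c).mean() if np.any(y_true == c) else np.nan
        for c in range(num_classes)
    ])
    rare = np.nanmean(per_class_acc[list(rare_classes)])
    common = np.nanmean([per_class_acc[c] for c in range(num_classes) if c not in rare_classes])
    return rare, common

# Toy usage with random predictions over 200 classes (the size of Semi-Aves).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 200, size=5000)
y_pred = rng.integers(0, 200, size=5000)
counts = rng.integers(5, 500, size=200)               # retrieved images per class
print(rare_common_accuracy(y_true, y_pred, counts))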

Ablation study on important components of SWAT


Compared to the SOTA adapter-based method CLAP, finetuning the model on few-shot data yields a 5% accuracy improvement, and adding retrieved data further improves performance by 4%. Applying CutMix data augmentation provides an additional 2% improvement. Finally, retraining the classifier on few-shot data in stage 2 leads to another 1% improvement. We mark the accuracy improvements of each component (relative to the corresponding row above) in superscripts.
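For reference, CutMix (Yun et al., 2019) can be sketched in a few lines; the version below is a minimal stand-alone implementation with illustrative defaults, not necessarily the exact augmentation settings used in SWAT's stage-1 finetuning.

import torch

def cutmix(images, labels, alpha=1.0):
    # Paste a random crop from a shuffled copy of the batch and mix labels by area.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    h, w = images.shape[-2:]

    # Cut a box whose area is roughly (1 - lam), centered at a random location.
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)  # correct lam to the actual pasted area
    return mixed, labels, labels[perm], lam

The training loss then mixes the two targets as lam * CE(logits, y_a) + (1 - lam) * CE(logits, y_b).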

Comparing efficiency between SWAT and adapter-based methods


We estimate the compute cost using the Semi-Aves dataset (200 classes with 16 few-shot examples per class). All experiments are conducted on a single Quadro RTX 6000 (24GB) GPU, with 50GB of storage hosting the retrieved data for all five datasets. SWAT improves accuracy by >10% over CLAP with very affordable retrieval and training costs. We mark the accuracy improvements over CLAP in superscripts.

BibTeX

If you find our work useful, please consider citing our papers:

@article{liu2024few,
  title={Few-Shot Recognition via Stage-Wise Augmented Finetuning},
  author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
  journal={arXiv preprint arXiv:2406.11148},
  year={2024}
}

@inproceedings{parashar2024neglected,
  title={The Neglected Tails in Vision-Language Models},
  author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}