Motivated by the significant improvements that retrieval-augmented learning (RAL) brings to zero-shot recognition with Vision-Language Models (VLMs, e.g., OpenCLIP), we explore, for the first time, RAL for few-shot recognition by retrieving relevant images from the VLM's pretraining set. We identify novel challenges and opportunities:
Given a data annotation guideline consisting of few-shot images of the downstream concepts, SWAT first retrieves relevant images from the VLM's pretraining set (e.g., LAION), and then finetunes the VLM (e.g., OpenCLIP) following a stage-wise strategy:
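Concretely, stage 1 finetunes the model on the mix of retrieved and few-shot data, and stage 2 retrains the classifier on the few-shot data alone (see the ablation below). The following is a minimal PyTorch-style sketch of this two-stage loop, not the repository's implementation: `backbone`, `classifier`, the data loaders, and all hyperparameters are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def finetune_swat(backbone, classifier, mixed_loader, fewshot_loader,
                  epochs_stage1=10, epochs_stage2=10, device="cuda"):
    """Sketch of stage-wise finetuning: (1) finetune the whole model on
    retrieved + few-shot data, (2) retrain only the classifier on few-shot data."""
    backbone, classifier = backbone.to(device), classifier.to(device)

    # Stage 1: finetune backbone and classifier on retrieved + few-shot images.
    opt = torch.optim.AdamW(
        list(backbone.parameters()) + list(classifier.parameters()), lr=1e-5)
    for _ in range(epochs_stage1):
        for images, labels in mixed_loader:      # retrieved + few-shot batches
            images, labels = images.to(device), labels.to(device)
            loss = F.cross_entropy(classifier(backbone(images)), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: freeze the backbone and retrain the classifier on the
    # (balanced) few-shot data only.
    for p in backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(classifier.parameters(), lr=1e-4)
    for _ in range(epochs_stage2):
        for images, labels in fewshot_loader:    # few-shot batches only
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = backbone(images)
            loss = F.cross_entropy(classifier(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return backbone, classifier
```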
Left: we show that the retrieved data exhibits visual patterns (styles, backgrounds, resolutions, semantics, etc.) that differ from the downstream few-shot data. Right: the retrieved data follows an imbalanced distribution, as some classes naturally appear less frequently on the Internet.
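To quantify this imbalance, one can simply count how many images are retrieved per class and look at the least-frequent classes (the "rare" 10% referenced below). A small hedged sketch, where `retrieved_labels` is an assumed list of class names, one per retrieved image:

```python
from collections import Counter

def class_frequency_stats(retrieved_labels, rare_fraction=0.10):
    """Count retrieved images per class and return the least-frequent classes."""
    counts = Counter(retrieved_labels)
    ranked = sorted(counts.items(), key=lambda kv: kv[1])   # rarest first
    n_rare = max(1, int(len(ranked) * rare_fraction))
    rare_classes = [name for name, _ in ranked[:n_rare]]
    return counts, rare_classes

# Toy example: web data covers classes with very different frequencies.
counts, rare = class_frequency_stats(
    ["sparrow"] * 500 + ["warbler"] * 40 + ["petrel"] * 3, rare_fraction=0.34)
print(counts)   # Counter({'sparrow': 500, 'warbler': 40, 'petrel': 3})
print(rare)     # ['petrel']
```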
Across five standard benchmark datasets, we show that:
Surprisingly, finetuning on few-shot data already outperforms previous zero-shot and few-shot methods, without suffering from overfitting. In addition, SWAT significantly outperforms existing zero-shot and few-shot methods on standard benchmark datasets. We mark the accuracy improvements over CLAP in superscripts.
Across different finetuning scenarios in stage 1, SWAT effectively improves the recognition accuracy on both common and rare classes (the least frequent 10%). The improvement on rare classes is more significant than on common classes, confirming that SWAT mitigates imbalanced learning. We mark the accuracy improvements over the stage-1 model in superscripts and the standard deviation across three runs in subscripts.
Compared to the SOTA adapter-based method CLAP, finetuning the model on few-shot data yields a 5% accuracy improvement, and adding retrieved data further improves performance by 4%. Applying CutMix data augmentation provides an additional 2% improvement. Finally, retraining the classifier on few-shot data in stage 2 leads to another 1% improvement. We mark the accuracy improvement of each component (relative to the corresponding row above) in superscripts.
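For reference, CutMix pastes a random rectangle from a shuffled copy of the batch and mixes the labels in proportion to the pasted area. A minimal sketch of how it could be applied to a stage-1 training batch (illustrative only, not the repository's implementation):

```python
import torch
import torch.nn.functional as F

def cutmix(images, labels, num_classes, alpha=1.0):
    """Minimal CutMix: swap a random patch between paired samples and
    mix one-hot labels by the actual patch-area ratio."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0), device=images.device)

    H, W = images.shape[-2:]
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    lam_adj = 1 - (y2 - y1) * (x2 - x1) / (H * W)   # actual kept-area ratio

    onehot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam_adj * onehot + (1 - lam_adj) * onehot[perm]
    return mixed, mixed_labels
```

Recent PyTorch versions accept such soft label targets in `F.cross_entropy`, so the mixed batch can replace `(images, labels)` in the stage-1 loss without further changes.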
We estimate the compute cost using the Semi-Aves dataset (200 classes with 16 few-shot examples per class). All experiments are conducted on a single Quadro RTX 6000 (24GB) GPU, with 50GB of storage hosting the retrieved data for all five datasets. SWAT improves accuracy by more than 10% over CLAP at a very affordable retrieval and training cost. We mark the accuracy improvements over CLAP in superscripts.
If you find our work useful, please consider citing our papers:
@article{liu2024few,
title={Few-Shot Recognition via Stage-Wise Augmented Finetuning},
author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
journal={arXiv preprint arXiv:2406.11148},
year={2024}
}
@inproceedings{parashar2024neglected,
title={The Neglected Tails in Vision-Language Models},
author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}