Semi-supervised few-shot learning (SSFSL) formalizes real-world applications such as "auto-annotation": it aims to learn a model from a few labeled and abundant unlabeled examples in order to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. To achieve auto-annotation in the real world, we exploit finetuning VLMs and their pretraining data for SSFSL.
Across five challenging fine-grained SSL datasets, our experiments show that:
We run OpenCLIP ViT-B/32 on the unlabeled images of the semi-Aves dataset via zero-shot prompting with its 200 class names.
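For reference, below is a minimal sketch of this zero-shot step using the open_clip library; the pretrained tag (`laion2b_s34b_b79k`) and the prompt template are illustrative placeholders, not necessarily our exact configuration:

```python
import torch
import open_clip

# Load OpenCLIP ViT-B/32; the pretrained tag is an assumption (any LAION checkpoint works).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def zero_shot_classify(images, class_names):
    """Return predicted class indices for a batch of preprocessed images."""
    prompts = [f"a photo of a {name}." for name in class_names]
    with torch.no_grad():
        text_feats = model.encode_text(tokenizer(prompts))
        image_feats = model.encode_image(images)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        logits = 100.0 * image_feats @ text_feats.T  # scaled cosine similarity
    return logits.argmax(dim=-1)
```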
We apply temperatures to (1) sharpen the softmax probability distributions and (2) strengthen supervision signals. We illustrate our temperature tuning technique with FixMatch as an example. Specifically, we introduce two temperatures: (a) a loss temperature T_loss that sharpens the softmax probabilities when computing the cross-entropy loss, and (b) a confidence temperature T_conf that scales the softmax probabilities when deciding whether an unlabeled example's confidence exceeds the threshold for use in training. Tuning these two temperatures effectively mitigates the issues caused by the flat softmax probabilities of VLMs, leading to improved finetuning performance (see the sketch below).
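A minimal sketch of the unlabeled-data loss with these two temperatures, assuming standard FixMatch with weakly and strongly augmented views; the temperature values shown are illustrative defaults, not tuned hyperparameters:

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(logits_weak, logits_strong,
                            t_loss=0.07, t_conf=0.1, threshold=0.8):
    """FixMatch unlabeled loss with the two temperatures described above.

    T_conf sharpens the weak-view probabilities used for the confidence
    mask; T_loss sharpens the strong-view logits inside the cross-entropy.
    """
    with torch.no_grad():
        probs = F.softmax(logits_weak / t_conf, dim=-1)  # (b) confidence temperature
        conf, pseudo_labels = probs.max(dim=-1)
        mask = (conf >= threshold).float()               # which unlabeled samples to use
    loss = F.cross_entropy(logits_strong / t_loss,       # (a) loss temperature
                           pseudo_labels, reduction="none")
    return (loss * mask).mean()
```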
Besides Temperature Tuning, we also propose classifier initialization to mitigate the flat softmax issue. Specifically, we initialize the classification head by linear probing with few-shot examples, providing a better starting point for finetuning. Combining classifier initialization and temperature tuning, our final method SWIFT finetunes the entire VLM in a stage-wise manner.
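A minimal sketch of the classifier-initialization step, assuming pre-extracted (frozen) visual features for the few-shot examples; the optimizer and training schedule here are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_classifier_by_linear_probing(features, labels, num_classes,
                                      epochs=100, lr=1e-3):
    """Linear-probe a classifier on frozen few-shot features, then reuse
    its weights to initialize the classification head for finetuning."""
    head = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(head(features), labels).backward()
        opt.step()
    return head  # attach to the VLM's visual encoder before full finetuning
```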
We illustrate the impact of the loss temperature T_loss by few-shot finetuning the visual encoder of OpenCLIP ViT-B/32. Training loss (left) and test accuracy (right) over epochs show that finetuning without TT (i.e., T_loss=1.0) converges slowly (slow reduction in training loss and slow increase in test accuracy) due to weak supervision. In contrast, applying a loss temperature, either by fixing T_loss to a moderately small value (e.g., 0.1 or 0.07, solid lines) or by learning it dynamically (dashed lines), greatly accelerates convergence and improves test accuracy, demonstrating strengthened training supervision.
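For the learnable variant, one way to make T_loss trainable is to parameterize its inverse in log space, as CLIP does for its logit scale, so it stays positive during optimization; the sketch below is one such implementation under that assumption:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperedCrossEntropy(nn.Module):
    """Cross-entropy with a learnable loss temperature T_loss.

    The inverse temperature is stored in log space and optimized jointly
    with the model; the 0.07 init mirrors the fixed value discussed above.
    """
    def __init__(self, init_t=0.07):
        super().__init__()
        self.log_inv_t = nn.Parameter(torch.tensor(math.log(1.0 / init_t)))

    def forward(self, logits, targets):
        return F.cross_entropy(logits * self.log_inv_t.exp(), targets)
```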
We illustrate the impact of the confidence temperature T_conf by finetuning OpenCLIP ViT-B/32 with FixMatch on semi-Aves. The left figure shows that without TT (i.e., T_conf=1.0), the utilization of unlabeled data is zero under the default confidence threshold of 0.8, resulting in no accuracy gains over few-shot finetuning. However, reducing the confidence temperature, e.g., fixing T_conf to a moderately small value (0.1 or 0.07), significantly increases the utilization of unlabeled data and yields notable accuracy gains (right).
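A small helper (hypothetical, for illustration) for measuring this utilization rate, i.e., the fraction of unlabeled samples whose temperature-scaled confidence passes the threshold:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def unlabeled_utilization(logits_weak, t_conf, threshold=0.8):
    """Fraction of unlabeled samples passing the confidence threshold.

    With T_conf=1.0, flat VLM probabilities rarely exceed 0.8, so the
    utilization collapses toward zero; a smaller T_conf sharpens the
    distribution and admits more unlabeled data into training.
    """
    conf = F.softmax(logits_weak / t_conf, dim=-1).max(dim=-1).values
    return (conf >= threshold).float().mean().item()

# e.g., compare unlabeled_utilization(logits, 1.0) vs. unlabeled_utilization(logits, 0.1)
```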
If you find our work useful, please consider citing our papers:
```bibtex
@article{liu2025swift,
  title={Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective},
  author={Liu, Tian and Basu, Anwesha and Kong, Shu},
  journal={arXiv preprint arXiv:2512.10244},
  year={2025}
}

@inproceedings{liu2025few,
  title={Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning},
  author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}
```