Semi-supervised few-shot learning (SSFSL) formalizes real-world applications such as "auto-annotation": it aims to learn a model from a few labeled and abundant unlabeled examples in order to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. To achieve auto-annotation in the real world, we exploit finetuning VLMs and their pretraining data for SSFSL.
Across five challenging fine-grained SSL datasets, our experiments show that:
We run OpenCLIP ViT-B/32 on the unlabeled images of the semi-Aves dataset via zero-shot prompting with its 200 class names.
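For reference, below is a minimal sketch of this zero-shot step using the open_clip library; the pretrained tag (`laion2b_s34b_b79k`) and the prompt template are illustrative placeholders, not necessarily our exact configuration:

```python
import torch
import open_clip

# Load OpenCLIP ViT-B/32; the pretrained tag is an assumption (any LAION checkpoint works).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def zero_shot_classify(images, class_names):
    """Return predicted class indices for a batch of preprocessed images."""
    prompts = [f"a photo of a {name}." for name in class_names]
    with torch.no_grad():
        text_feats = model.encode_text(tokenizer(prompts))
        image_feats = model.encode_image(images)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        logits = 100.0 * image_feats @ text_feats.T  # scaled cosine similarity
    return logits.argmax(dim=-1)
```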
We apply temperatures to (1) sharpen the softmax probability distributions and (2) strengthen supervision signals. We illustrate our temperature tuning technique with FixMatch as an example. Specifically, we introduce two temperatures: (a) a loss temperature T_loss that sharpens the softmax probabilities when computing the cross-entropy loss, and (b) a confidence temperature T_conf that scales the softmax probabilities when deciding whether an unlabeled example's confidence exceeds the threshold for use in training. Tuning these two temperatures effectively mitigates the issues caused by the flat softmax probabilities of VLMs, leading to improved finetuning performance (see the sketch below).
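A minimal sketch of the unlabeled-data loss with these two temperatures, assuming standard FixMatch with weakly and strongly augmented views; the temperature values shown are illustrative defaults, not tuned hyperparameters:

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(logits_weak, logits_strong,
                            t_loss=0.07, t_conf=0.1, threshold=0.8):
    """FixMatch unlabeled loss with the two temperatures described above.

    T_conf sharpens the weak-view probabilities used for the confidence
    mask; T_loss sharpens the strong-view logits inside the cross-entropy.
    """
    with torch.no_grad():
        probs = F.softmax(logits_weak / t_conf, dim=-1)  # (b) confidence temperature
        conf, pseudo_labels = probs.max(dim=-1)
        mask = (conf >= threshold).float()               # which unlabeled samples to use
    loss = F.cross_entropy(logits_strong / t_loss,       # (a) loss temperature
                           pseudo_labels, reduction="none")
    return (loss * mask).mean()
```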
Besides Temperature Tuning, we also propose classifier initialization to mitigate the flat softmax issue. Specifically, we initialize the classification head by linear probing with few-shot examples, providing a better starting point for finetuning. Combining classifier initialization and temperature tuning, our final method SWIFT finetunes the entire VLM in a stage-wise manner.
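A minimal sketch of the classifier-initialization step, assuming pre-extracted (frozen) visual features for the few-shot examples; the optimizer and training schedule here are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_classifier_by_linear_probing(features, labels, num_classes,
                                      epochs=100, lr=1e-3):
    """Linear-probe a classifier on frozen few-shot features, then reuse
    its weights to initialize the classification head for finetuning."""
    head = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(head(features), labels).backward()
        opt.step()
    return head  # attach to the VLM's visual encoder before full finetuning
```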
We illustrate the impact of the loss temperature T_loss by few-shot finetuning the visual encoder of OpenCLIP ViT-B/32. Training loss (left) and test accuracy (right) over epochs show that finetuning without TT (i.e., T_loss=1.0) converges slowly (slow reduction in training loss and slow increase in test accuracy) due to weak supervision. In contrast, applying a loss temperature, either by fixing T_loss to a moderately small value (e.g., 0.1 or 0.07, solid lines) or by learning it dynamically (dashed lines), greatly accelerates convergence and improves test accuracy, demonstrating strengthened training supervision.
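For the learnable variant, one way to make T_loss trainable is to parameterize its inverse in log space, as CLIP does for its logit scale, so it stays positive during optimization; the sketch below is one such implementation under that assumption:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperedCrossEntropy(nn.Module):
    """Cross-entropy with a learnable loss temperature T_loss.

    The inverse temperature is stored in log space and optimized jointly
    with the model; the 0.07 init mirrors the fixed value discussed above.
    """
    def __init__(self, init_t=0.07):
        super().__init__()
        self.log_inv_t = nn.Parameter(torch.tensor(math.log(1.0 / init_t)))

    def forward(self, logits, targets):
        return F.cross_entropy(logits * self.log_inv_t.exp(), targets)
```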
We illustrate the impact of the confidence temperature T_conf by finetuning OpenCLIP ViT-B/32 with FixMatch on semi-Aves. The left figure shows that without TT (i.e., T_conf=1.0), the utilization of unlabeled data is zero under the default confidence threshold of 0.8, resulting in no accuracy gains over few-shot finetuning. However, reducing the confidence temperature, e.g., fixing T_conf to a moderately small value (0.1 or 0.07), significantly increases the utilization of unlabeled data and yields notable accuracy gains (right).
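A small helper (hypothetical, for illustration) for measuring this utilization rate, i.e., the fraction of unlabeled samples whose temperature-scaled confidence passes the threshold:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def unlabeled_utilization(logits_weak, t_conf, threshold=0.8):
    """Fraction of unlabeled samples passing the confidence threshold.

    With T_conf=1.0, flat VLM probabilities rarely exceed 0.8, so the
    utilization collapses toward zero; a smaller T_conf sharpens the
    distribution and admits more unlabeled data into training.
    """
    conf = F.softmax(logits_weak / t_conf, dim=-1).max(dim=-1).values
    return (conf >= threshold).float().mean().item()

# e.g., compare unlabeled_utilization(logits, 1.0) vs. unlabeled_utilization(logits, 0.1)
```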
If you find our work useful, please consider citing our papers:
```bibtex
@article{liu2025swift,
  title={Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective},
  author={Liu, Tian and Basu, Anwesha and Kong, Shu},
  journal={arXiv preprint arXiv:2512.10244},
  year={2025}
}

@inproceedings{liu2025few,
  title={Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning},
  author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}
```