🎶 and there are no cats in America, and the streets are paved with cheese 🎵
glad to hear she takes after her great-grandmother
Our method, ✨SPARC✨, significantly boosts performance on three different multilabel recognition datasets and nine different CLIP backbones, and complements the strengths of existing white-box and training-based methods. Looking forward to presenting it at CVPR!
We also find that CLIP scores are impacted by image- and prompt-level bias. Simple standardization is surprisingly effective at removing these biases and boosting multilabel recognition performance.
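A minimal sketch of the standardization idea (not the paper's exact procedure; the score matrix here is random stand-in data): z-score the image-prompt score matrix per image, then per prompt, so neither a "loud" image nor a "loud" prompt dominates.

```python
import numpy as np

# Stand-in CLIP scores: scores[i, j] = similarity of image i to prompt j.
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.25, scale=0.02, size=(4, 6))

# Remove image-level bias: standardize each row (per image).
z = (scores - scores.mean(axis=1, keepdims=True)) / scores.std(axis=1, keepdims=True)
# Remove prompt-level bias: standardize each column of the result (per prompt).
z = (z - z.mean(axis=0, keepdims=True)) / z.std(axis=0, keepdims=True)

print(z.mean(axis=0))  # per-prompt means are now ~0
```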
We find that the second-highest score provides a better signal, and in general we get our best results by adaptively fusing all of the ranks using the direction of maximum variance.
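One way to read "adaptively fusing all of the ranks using the direction of maximum variance" is a principal-component projection of the sorted score vectors; this is a hypothetical sketch on random data, with shapes and names that are my assumptions rather than the paper's code.

```python
import numpy as np

# scores[i, k]: compound-prompt scores for one class on image i (stand-in data).
rng = np.random.default_rng(1)
scores = rng.normal(size=(100, 5))

# Sort descending so columns are the 1st-highest, 2nd-highest, ... scores.
ranked = np.sort(scores, axis=1)[:, ::-1]
second_highest = ranked[:, 1]  # the simple "second score" signal

# Fuse all ranks: project centered rank vectors onto their direction of
# maximum variance (the top right-singular vector).
centered = ranked - ranked.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
fused = centered @ vt[0]  # one fused score per image
```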
How should we use these "compound" prompts? A natural choice would be to use the highest-scoring one, as it is likely the most descriptive. However, we find that this approach leads to false positives due to the "OR-gate" nature of CLIP scores.
Our question: How can we make VLMs better at multilabel recognition, without needing training or access to VLM internals?
Idea: Make each class's prompt more descriptive by pairing with classes that tend to co-occur. E.g., instead of "cat", try "cat and dog", "cat and bed", etc.
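The pairing idea can be sketched in a few lines; the class list, co-occurrence partners, and prompt template below are illustrative assumptions, not the paper's actual setup.

```python
# Assumed co-occurrence partners per class (illustrative only).
co_occurring = {
    "cat": ["dog", "bed"],
    "dog": ["cat", "car"],
    "bed": ["cat"],
    "car": ["dog"],
}

def compound_prompts(cls):
    """Return the plain prompt plus one compound prompt per co-occurring class."""
    prompts = [f"a photo of a {cls}"]
    prompts += [f"a photo of a {cls} and a {partner}"
                for partner in co_occurring[cls]]
    return prompts

print(compound_prompts("cat"))
# ['a photo of a cat', 'a photo of a cat and a dog', 'a photo of a cat and a bed']
```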
Looking forward to presenting our paper "✨SPARC✨: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models" at #CVPR2025 this Friday!
Check out our ✨code + paper + poster✨: github.com/kjmillerCURI...
A diagram with two example images, illustrating how using the highest score can lead to false-positives, while using the second-highest score can mitigate this problem.
✨ Our paper "SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models" has been accepted to #CVPR2025! ✨
arxiv.org/pdf/2502.16911
Huge thanks to my amazing coauthors Aditya Gangrade, Samarth Mishra, Kate Saenko, and Venkatesh Saligrama.