City Research Online - Parts of Speech–Grounded Subspaces in Vision-Language Models

Parts of Speech–Grounded Subspaces in Vision-Language Models

Oldfield, J., Tzelepis, C. ORCID: 0000-0002-2036-9089, Panagakis, Y., Nicolaou, M. & Patras, I. (2024). Parts of Speech–Grounded Subspaces in Vision-Language Models. In: Proceedings of the thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023). Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023), 10-16 Dec 23, New Orleans, USA.

Abstract

Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased towards specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP’s joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form while respecting the underlying geometry of the manifold on which the representations lie. What’s more, we show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances (e.g. artists’ painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists’ styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification.
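The abstract's idea of a subspace that captures variability for one part of speech while minimising variability for the rest can be illustrated with a small contrastive component analysis. The following is only a sketch of one plausible reading, not the paper's implementation: it assumes the objective is to maximise tr(Wᵀ(C_target − λ·C_rest)W) over orthonormal W, whose closed-form solution is the top-k eigenvectors of the difference matrix. The names `contrastive_subspace`, `project`, and `lam` are hypothetical, the data are synthetic rather than CLIP embeddings, and the manifold geometry mentioned in the abstract is ignored here.

```python
import numpy as np

def contrastive_subspace(X_target, X_rest, k, lam=1.0):
    """Return a (d, k) orthonormal basis W maximising
    tr(W^T (C_target - lam * C_rest) W).

    Assumption: uncentred second-moment matrices stand in for the
    'variability' of each group of embeddings.
    """
    C_target = X_target.T @ X_target / len(X_target)
    C_rest = X_rest.T @ X_rest / len(X_rest)
    M = C_target - lam * C_rest           # symmetric difference matrix
    _, eigvecs = np.linalg.eigh(M)        # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]        # keep the k largest

def project(Z, W):
    """Orthogonally project the rows of Z onto the subspace spanned by W."""
    return Z @ W @ W.T

# Synthetic stand-in data: the 'target' group varies only along axes 0-1,
# the 'rest' group varies only along axes 2-3.
rng = np.random.default_rng(0)
d = 8
X_target = np.zeros((200, d))
X_target[:, :2] = rng.normal(size=(200, 2))
X_rest = np.zeros((200, d))
X_rest[:, 2:4] = rng.normal(size=(200, 2))

W = contrastive_subspace(X_target, X_rest, k=2)
kept = project(X_target, W)      # target variation survives the projection
removed = project(X_rest, W)     # variation of the other group is suppressed
```

In this toy setting the learned basis spans the two axes along which only the target group varies, so projecting keeps the target rows unchanged and maps the other rows to (numerically) zero. Larger values of `lam` penalise directions shared with the other group more strongly.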

Publication Type:	Conference or Workshop Item (Paper)
Additional Information:	Copyright the authors. This paper will be presented at the Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023).
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Departments:	School of Science & Technology > Computer Science

Text - Accepted Version

Creators:	Oldfield, J.; Tzelepis, C. (ORCID: 0000-0002-2036-9089); Panagakis, Y.; Nicolaou, M.; Patras, I.
Event Title:	Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)
Event Type:	Conference
Event Location:	New Orleans, USA
Event Dates:	10-16 Dec 23
Status:	Published
Refereed:	Yes
Journal or Publication Title:	Proceedings of the thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)
URI:	https://openaccess.city.ac.uk/id/eprint/31574
Date available in CRO:	27 Oct 2023 14:07
Date deposited:	24 October 2023
Dates:	21 September 2023 (Accepted); 30 May 2024 (Published Online); 30 May 2024 (Published)