The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources, making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower costs. We therefore propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance -- and in certain cases outperform state-of-the-art methods -- in both image-text and audio-text retrieval, with orders of magnitude less compute and data: for example, we outperform CLIP on the Flickr30K text-to-image retrieval task with $\sim\!600\times$ fewer GPU days and $\sim\!80\times$ fewer image-text pairs. Additionally, we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones. Code is available at: https://github.com/layer6ai-labs/fusemix.

The paper introduces FuseMix, a multimodal augmentation scheme that operates on the latent spaces of pre-trained unimodal encoders and is used to learn a shared latent space across modalities such as images, text, and audio. Its key advantages are retrieval performance that is competitive with state-of-the-art methods and significantly lower compute and data requirements.
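
To make "augmentation in latent space" concrete, here is a minimal NumPy sketch. The summary above does not spell out the augmentation rule, so the sketch assumes a mixup-style interpolation between randomly paired samples, with the same mixing coefficient applied in both modalities so that the mixed latents remain a valid pair. The function name `fusemix_like_augment`, the `alpha` parameter, and the Beta distribution are illustrative assumptions, not the authors' API.

```python
import numpy as np

def fusemix_like_augment(z_a, z_b, alpha=1.0, rng=None):
    """Mix random pairs of precomputed latents (assumed mixup-style rule).

    z_a, z_b: arrays of shape (n, d_a) and (n, d_b); row i of each is a
    paired sample (e.g. an image latent and its caption latent) produced
    by frozen unimodal encoders.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = z_a.shape[0]
    # Choose a random partner for every sample.
    perm = rng.permutation(n)
    # One coefficient per sample, shared across both modalities (assumption).
    lam = rng.beta(alpha, alpha, size=(n, 1))
    mixed_a = lam * z_a + (1.0 - lam) * z_a[perm]
    mixed_b = lam * z_b + (1.0 - lam) * z_b[perm]
    return mixed_a, mixed_b, lam, perm
```

Because the function only touches cached latent vectors, the unimodal encoders never have to run during training in this sketch, which is one plausible source of the compute savings described above.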

Specifically, FuseMix outperforms CLIP, a prominent image-text retrieval model, on the Flickr30K text-to-image retrieval task while using approximately 600 times fewer GPU days and 80 times fewer image-text pairs during training. Additionally, the paper demonstrates how FuseMix can convert pre-trained text-to-image generative models into audio-to-image ones, showcasing its versatility.

The authors hypothesize that unimodal encoders pre-trained on large amounts of unimodal data provide an effective starting point for building multimodal models, at a much lower cost than training from scratch on massive datasets of paired inputs.
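
One common way to align two sets of frozen-encoder latents is a symmetric contrastive (InfoNCE-style) objective over a batch of paired embeddings. The sketch below is illustrative only: this summary does not name the training objective, so the loss, the `temperature` value, and the function names are assumptions rather than the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def symmetric_contrastive_loss(u, v, temperature=0.07):
    """Symmetric InfoNCE-style loss; row i of u is paired with row i of v."""
    u, v = l2_normalize(u), l2_normalize(v)
    logits = (u @ v.T) / temperature  # (n, n) pairwise similarity matrix

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The matching pair for row i sits in column i.
        return -np.mean(np.diag(log_probs))

    # Average the two retrieval directions (u -> v and v -> u).
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

In a setup like this, only small projection heads on top of the cached latents would need gradients, while the large unimodal encoders stay frozen. That is the intuition behind the lower training cost, under the assumptions stated above.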