LEARNING RICH FEATURES WITHOUT LABELS: CONTRASTIVE APPROACHES IN MULTIMODAL ARTIFICIAL INTELLIGENCE SYSTEMS
DOI: https://doi.org/10.55640/ijaair-v02i04-02
Keywords: Unsupervised learning, representation learning, contrastive learning, multimodal AI
Abstract
The burgeoning field of Multimodal Artificial Intelligence (AI) aims to develop systems capable of processing and understanding information from diverse sensory inputs, such as vision, language, and audio. A significant bottleneck in training these sophisticated models is the immense cost and effort associated with annotating vast quantities of multimodal data. Unsupervised representation learning offers a promising solution by enabling models to learn meaningful feature representations directly from unlabeled data. Among the myriad unsupervised techniques, contrastive learning has emerged as a particularly powerful paradigm, demonstrating remarkable success in both unimodal and, more recently, multimodal contexts. This article provides a comprehensive review of unsupervised representation learning with contrastive learning in multimodal AI systems. We elucidate the core principles of contrastive learning, its evolution from unimodal applications to cross-modal alignment, and its capacity to learn robust, transferable representations across heterogeneous data sources. By synthesizing key architectural designs, empirical successes, and applications, we highlight how contrastive learning facilitates better understanding, alignment, and fusion of information from different modalities. Furthermore, we discuss the inherent challenges, such as handling unaligned or sparse multimodal data, and outline critical future research directions towards building more versatile and data-efficient multimodal AI.
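To make the cross-modal alignment idea described above concrete, the following is a minimal, illustrative sketch (not taken from the article) of the symmetric InfoNCE objective used in CLIP-style image-text contrastive learning: embeddings of matched image-text pairs are pulled together, while every other pairing in the batch serves as a negative. The function name, embedding dimension, and temperature value are illustrative assumptions; in practice the embeddings would come from trained vision and text encoders rather than random tensors.

# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
# Illustrative only: encoder outputs are stood in by random tensors,
# and the temperature of 0.07 is a commonly used default, not a prescription.
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of aligned (image, text) pairs."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair for sample i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions (image-to-text and text-to-image) and average.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    # Toy usage: random embeddings standing in for encoder outputs.
    imgs = torch.randn(8, 512)   # e.g., from a vision encoder
    txts = torch.randn(8, 512)   # e.g., from a text encoder
    print(clip_style_contrastive_loss(imgs, txts).item())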
License
Copyright (c) 2025 Dr. Mei-Ling Zhou, Dr. Haojie Xu (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.