Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express

    Published 8/30/2024 by Cherag Aroraa, Tracy Holloway King, Jayant Kumar, Yi Lu, Sanat Sharma, Arvind Srikantan, David Uvalle, Josep Valls-Vargas, Harsha Vardhan

    Overview

    • Presents a novel approach for integrating sparse and dense embeddings in a multi-modal search system
    • Developed for Adobe Express, a popular design and publishing platform
    • Aims to improve search relevance and user experience by leveraging both textual and visual information

    Sparse embeddings have more dimensions than dense, but many are empty.

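    Dense Embedding Example

    Dim.    Img. 1    Img. 2
    1       0.11      1.23
    2       1.21      0.42
    3       0.15      0.53
    4       0.22      2.25
    ...
    2048    2.17      0.64

    Sparse Embedding Example (columns: Query, Img. 2, Img. 3, Img. 4; empty cells are omitted, so the column each value belongs to is not recoverable here)

    Dim.    Non-empty values
    1       1.12
    2       0.81
    3       1.16, 0.83, 0.64
    4       1.83
    ...
    8192    0.13, 0.01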

    Original caption: Table 1: Dense and Sparse representations of embeddings with sample scoring for sparse embeddings. Dense embeddings are shown with 2048 dimensions. Sparse embeddings have more dimensions (here 8192) but most of the dimensions have no values.
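    The difference between the two representations can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the type aliases, the function names, and the toy values are hypothetical, and the dot product is assumed as the scoring function because the summary does not spell out how the sample scores are computed.

```python
from typing import Dict, List

# Dense embedding: every dimension holds a value, so a plain list works.
DenseVector = List[float]
# Sparse embedding: store only the non-empty dimensions as {dimension: weight}.
SparseVector = Dict[int, float]


def dense_dot(a: DenseVector, b: DenseVector) -> float:
    """Dot product of two dense vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("dense vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product over the dimensions that are non-empty in both vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller mapping
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)


# Toy values for illustration only; they are not taken from Table 1.
query: SparseVector = {3: 1.0, 8192: 0.5}
img_a: SparseVector = {1: 2.0, 3: 0.5}
img_b: SparseVector = {2: 1.5, 8192: 0.2}

score_a = sparse_dot(query, img_a)  # only dimension 3 overlaps
score_b = sparse_dot(query, img_b)  # only dimension 8192 overlaps
```

    Storing only the non-empty dimensions means a sparse score touches just the dimensions the query and the item share, however large the nominal dimension count is.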

    Plain English Explanation

    The paper describes a technique for improving multi-modal search in Adobe Express, a design and publishing platform. The key idea is to combine two types of data representation: sparse embeddings that capture textual information and dense embeddings that represent visual features.

    By integrating these complementary embeddings, the system can more effectively understand the user's intent and retrieve the most relevant content, whether it's based on text, images, or a combination of both.

    This approach helps to improve the overall search quality and product discovery experience for Adobe Express users.

    Technical Explanation

    The paper presents a multi-modal search system that integrates sparse textual embeddings and dense visual embeddings to enhance search relevance. The sparse embeddings capture semantic relationships between textual elements, while the dense embeddings represent visual features of the content.

    The system first generates sparse and dense embeddings for the search query and the indexed content (e.g., design templates and images) using separate neural networks. It then applies a contextual integration module to combine these complementary representations, taking into account the relationships between the text and visual features.

    The integrated embeddings are then used to calculate the relevance score between the query and the indexed content, allowing the system to retrieve the most relevant results for the user's search.
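    As a rough illustration of scoring with both representations, the sketch below ranks items by a weighted sum of a dense cosine similarity and a sparse dot product. This summary does not describe how the contextual integration module actually combines the embeddings, so the weighted sum is a simple stand-in technique (linear score fusion), and the function names, the `alpha` weight, and the toy index are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Dense = List[float]
Sparse = Dict[int, float]


def cosine(a: Dense, b: Dense) -> float:
    """Cosine similarity between two dense vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def sparse_dot(a: Sparse, b: Sparse) -> float:
    """Dot product over the dimensions present in both sparse vectors."""
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)


def hybrid_score(q_dense: Dense, d_dense: Dense,
                 q_sparse: Sparse, d_sparse: Sparse,
                 alpha: float = 0.7) -> float:
    """Assumed fusion rule: alpha * dense score + (1 - alpha) * sparse score."""
    return alpha * cosine(q_dense, d_dense) + (1 - alpha) * sparse_dot(q_sparse, d_sparse)


def rank(query: Tuple[Dense, Sparse],
         index: Dict[str, Tuple[Dense, Sparse]],
         alpha: float = 0.7) -> List[Tuple[str, float]]:
    """Score every indexed item against the query and sort by descending score."""
    q_dense, q_sparse = query
    scored = [
        (name, hybrid_score(q_dense, d_dense, q_sparse, d_sparse, alpha))
        for name, (d_dense, d_sparse) in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy index for illustration only: name -> (dense vector, sparse vector).
index = {
    "template_a": ([1.0, 0.0], {1: 1.0}),  # matches on the dense side only
    "template_b": ([0.0, 1.0], {0: 1.0}),  # matches on the sparse side only
    "template_c": ([1.0, 1.0], {0: 1.0}),  # partial dense match plus sparse match
}
query = ([1.0, 0.0], {0: 1.0})
ranked = rank(query, index)
```

    In this toy index, the item that matches on both signals ranks above the items that match on only one. A real system would also need to bring the two scores onto comparable scales, since a sparse dot product is unbounded while cosine similarity is not.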

    The authors evaluate their approach on a large-scale dataset from Adobe Express and demonstrate significant improvements in search quality compared to existing methods that only use textual or visual features in isolation.

    Critical Analysis

    The paper provides a comprehensive solution for enhancing multi-modal search in the context of a design and publishing platform. The authors thoughtfully address the challenge of effectively integrating textual and visual information to improve the overall search experience.

    One potential limitation is the reliance on pre-trained networks for generating the sparse and dense embeddings. While this approach leverages existing models, it may be worth exploring end-to-end training of the entire system to further optimize the integration of the different modalities.

    Additionally, the authors could have delved deeper into the potential biases that may arise from the multi-modal representation and how they plan to mitigate such issues to ensure fair and inclusive search results.

    Overall, the paper presents a compelling and practical solution for enhancing multi-modal search in the context of a design and publishing platform, and the insights gained can be valuable for similar applications in other domains.

    Conclusion

    The paper introduces a novel approach for integrating sparse textual embeddings and dense visual embeddings to improve the relevance and user experience of multi-modal search in the Adobe Express platform. By leveraging the complementary information from both textual and visual features, the system can better understand the user's intent and retrieve the most relevant content, ultimately enhancing the overall search quality and product discovery capabilities.

    The techniques and insights presented in this work can be valuable for researchers and practitioners working on similar challenges in multi-modal information retrieval, particularly in the context of design and publishing platforms where both textual and visual elements play a crucial role in the user experience.

    Read original: arXiv:2408.14698



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

    Related Papers

    Compressible and Searchable: AI-native Multi-Modal Retrieval System with Learned Image Compression

    Jixiang Luo

    The burgeoning volume of digital content across diverse modalities necessitates efficient storage and retrieval methods. Conventional approaches struggle to cope with the escalating complexity and scale of multimedia data. In this paper, our proposed framework addresses this challenge by fusing AI-native multi-modal search capabilities with neural image compression. First, we analyze the intricate relationship between compressibility and searchability, recognizing the pivotal role each plays in the efficiency of storage and retrieval systems. We then use a simple adapter to bridge the features of Learned Image Compression (LIC) and Contrastive Language-Image Pretraining (CLIP) while retaining semantic fidelity and retrieval of multi-modal data. Experimental evaluations on Kodak datasets demonstrate the efficacy of our approach, showcasing significant enhancements in compression efficiency and search accuracy compared to existing methodologies. Our work marks a significant advancement towards scalable and efficient multi-modal search systems in the era of big data.

    4/17/2024

    Contextual Document Embeddings

    John X. Morris, Alexander M. Rush

    Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.

    11/11/2024

    Designing Interfaces for Multimodal Vector Search Applications

    Owen Pendrigh Elliott, Tom Hamer, Jesse Clark

    Multimodal vector search offers a new paradigm for information retrieval by exposing numerous pieces of functionality which are not possible in traditional lexical search engines. While multimodal vector search can be treated as a drop-in replacement for these traditional systems, the experience can be significantly enhanced by leveraging the unique capabilities of multimodal search. Central to any information retrieval system is a user who expresses an information need. Traditional user interfaces with a single search bar allow users to interact with lexical search systems effectively but are not necessarily optimal for multimodal vector search. In this paper we explore novel capabilities of multimodal vector search applications utilising CLIP models and present implementations and design patterns which better allow users to express their information needs and effectively interact with these systems in an information retrieval context.

    9/19/2024

    Enhancing Cross-Modal Contextual Congruence for Crowdfunding Success using Knowledge-infused Learning

    Trilok Padhi, Ugur Kursuncu, Yaman Kumar, Valerie L. Shalin, Lane Peterson Fronczek

    The digital landscape continually evolves with multimodality, enriching the online experience for users. Creators and marketers aim to weave subtle contextual cues from various modalities into congruent content to engage users with a harmonious message. This interplay of multimodal cues is often a crucial factor in attracting users' attention. However, this richness of multimodality presents a challenge to computational modeling, as the semantic contextual cues spanning across modalities need to be unified to capture the true holistic meaning of the multimodal content. This contextual meaning is critical in attracting user engagement as it conveys the intended message of the brand or the organization. In this work, we incorporate external commonsense knowledge from knowledge graphs to enhance the representation of multimodal data using compact Visual Language Models (VLMs) and predict the success of multi-modal crowdfunding campaigns. Our results show that external commonsense knowledge bridges the semantic gap between text and image modalities, and the enhanced knowledge-infused representations improve the predictive performance of models for campaign success over the baselines without knowledge. Our findings highlight the significance of contextual congruence in online multimodal content for engaging and successful crowdfunding campaigns.

    11/19/2024