A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models

    Read original: arXiv:2402.13636 - Published 6/18/2024 by Ashutosh Sathe, Prachi Jain, Sunayana Sitaram

    Overview

    • Proposes a unified framework and dataset for assessing gender, race, and age bias in vision-language models (VLMs) with respect to professions
    • Introduces a new dataset called Social Counterfactuals that includes visual scenes with diverse gender and racial representations
    • Evaluates several state-of-the-art vision-language models on the dataset to uncover biases

    Plain English Explanation

    This research paper presents a new way to measure societal bias, covering gender, race, and age, in AI systems that combine vision and language, such as image captioning or visual question answering models. The researchers created a dataset called Social Counterfactuals that contains images depicting diverse people in various scenarios.

    By evaluating how well-known vision-language models perform on this dataset, the researchers were able to identify biases in how the models perceive and describe people of different genders. For example, the models might be more likely to associate women with domestic tasks and men with leadership roles.

    The paper provides a framework for probing and understanding these biases, which is an important step towards building more equitable and inclusive AI systems. Related papers, including "Uncovering Bias in Large Vision-Language Models" and "Think Before You Act", also address bias in vision-language models.

    Technical Explanation

    The researchers introduce a new dataset called Social Counterfactuals that contains over 100,000 images depicting diverse people in a variety of scenarios. This dataset was designed to probe for intersectional biases related to gender, race, and social roles.

    They then evaluate several state-of-the-art vision-language models, including CLIP, VinVL, and VisualBERT, on this dataset. The models are tasked with generating captions that describe the visual scenes. By analyzing the generated captions, the researchers identify biases in how the models perceive and describe people of different genders. The paper's abstract states that the evaluation covers all supported inference modes of recent VLMs: image-to-text, text-to-text, text-to-image, and image-to-image.
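
    As a rough illustration of the caption analysis described above, the Python sketch below counts gendered terms in captions grouped by profession and computes a simple skew score. This is not the paper's metric or code: the word lists, the `gender_counts` and `gender_skew` functions, and the example captions are all hypothetical and chosen only to show the idea.

```python
import re
from collections import Counter

# Hypothetical word lists; a real study would use a vetted lexicon.
FEMALE_TERMS = {"woman", "women", "she", "her", "female", "girl"}
MALE_TERMS = {"man", "men", "he", "his", "him", "male", "boy"}


def gender_counts(captions):
    """Classify each caption as female, male, both, or neutral by its terms."""
    counts = Counter()
    for caption in captions:
        tokens = set(re.findall(r"[a-z']+", caption.lower()))
        has_female = bool(tokens & FEMALE_TERMS)
        has_male = bool(tokens & MALE_TERMS)
        if has_female and has_male:
            counts["both"] += 1
        elif has_female:
            counts["female"] += 1
        elif has_male:
            counts["male"] += 1
        else:
            counts["neutral"] += 1
    return counts


def gender_skew(captions):
    """Return (male - female) / (male + female); 0.0 if no gendered captions."""
    counts = gender_counts(captions)
    gendered = counts["male"] + counts["female"]
    if gendered == 0:
        return 0.0
    return (counts["male"] - counts["female"]) / gendered


# Made-up captions standing in for model outputs on profession images.
captions_by_profession = {
    "nurse": [
        "A woman checks a patient's chart.",
        "She adjusts an IV drip.",
        "A person in scrubs.",
    ],
    "engineer": [
        "A man inspects a circuit board.",
        "He reviews a blueprint.",
        "A woman tests a prototype.",
    ],
}

skews = {
    profession: gender_skew(captions)
    for profession, captions in captions_by_profession.items()
}
```

    Under these assumptions, a score near -1 means the captions for a profession mention only female terms, a score near +1 means only male terms, and a score near 0 means a balanced mix.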

    Related papers, including "Uncovering Bias in Large Vision-Language Models" and "No Filter", also explore techniques for uncovering and mitigating biases in vision-language models.

    Critical Analysis

    The researchers acknowledge that their dataset and evaluation framework have some limitations. For example, the dataset focuses on gender, race, and age biases with respect to professions, but does not address other forms of social bias, such as those related to class or disability.

    Additionally, the researchers note that the vision-language models they evaluated were not specifically trained to avoid gender biases. Future work could explore techniques for probing and mitigating intersectional biases during the training process.

    Overall, this research represents an important step towards understanding and addressing societal biases in vision-language models, which is crucial for developing more equitable and inclusive AI systems.

    Conclusion

    This paper presents a unified framework and dataset for assessing societal bias in vision-language models, covering gender, race, and age. By evaluating state-of-the-art models on the Social Counterfactuals dataset, the researchers were able to uncover biases in how these models perceive and describe people.

    The findings from this research can inform the development of more equitable and inclusive AI systems, which is an important goal for the field. Continued work in this area, such as probing and mitigating intersectional biases, will be crucial for ensuring that AI technologies benefit all members of society.



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


    Related Papers

    A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models

    Ashutosh Sathe, Prachi Jain, Sunayana Sitaram

    Vision-language models (VLMs) have gained widespread adoption in both industry and academia. In this study, we propose a unified framework for systematically evaluating gender, race, and age biases in VLMs with respect to professions. Our evaluation encompasses all supported inference modes of the recent VLMs, including image-to-text, text-to-text, text-to-image, and image-to-image. Additionally, we propose an automated pipeline to generate high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains, both in generated text and images. The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in vision-language models (VLMs). In our comparative analysis of widely used VLMs, we have identified that varying input-output modalities lead to discernible differences in bias magnitudes and directions. Additionally, we find that VLM models exhibit distinct biases across different bias attributes we investigated. We hope our work will help guide future progress in improving VLMs to learn socially unbiased representations. We will release our data and code.

    6/18/2024

    VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model

    Jie Zhang, Sibo Wang, Xiangkui Cao, Zheng Yuan, Shiguang Shan, Xilin Chen, Wen Gao

    The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are tempered by the outputs that often reflect biases, a concern not yet extensively investigated. Existing benchmarks are not sufficiently comprehensive in evaluating biases due to their limited data scale, single questioning format and narrow sources of bias. To address this problem, we introduce VLBiasBench, a benchmark aimed at evaluating biases in LVLMs comprehensively. In VLBiasBench, we construct a dataset encompassing nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status and two intersectional bias categories (race x gender, and race x social economic status). To create a large-scale dataset, we use Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with different questions to form 128,342 samples. These questions are categorized into open and close ended types, fully considering the sources of bias and comprehensively evaluating the biases of LVLM from multiple perspectives. We subsequently conduct extensive evaluations on 15 open-source models as well as one advanced closed-source model, providing some new insights into the biases revealing from these models. Our benchmark is available at https://github.com/Xiangkui-Cao/VLBiasBench.

    6/21/2024

    Vision-Language Models under Cultural and Inclusive Considerations

    Antonia Karamolegkou, Phillip Rust, Yong Cao, Ruixiang Cui, Anders Søgaard, Daniel Hershcovich

    Large vision-language models (VLMs) can assist visually impaired people by describing images from their daily lives. Current evaluation datasets may not reflect diverse cultural user backgrounds or the situational context of this use case. To address this problem, we create a survey to determine caption preferences and propose a culture-centric evaluation benchmark by filtering VizWiz, an existing dataset with images taken by people who are blind. We then evaluate several VLMs, investigating their reliability as visual assistants in a culturally diverse setting. While our results for state-of-the-art models are promising, we identify challenges such as hallucination and misalignment of automatic evaluation metrics with human judgment. We make our survey, data, code, and model outputs publicly available.

    7/9/2024

    Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective

    Zhaotian Weng, Zijun Gao, Jerone Andrews, Jieyu Zhao

    Vision-language models (VLMs) pre-trained on extensive datasets can inadvertently learn biases by correlating gender information with specific objects or scenarios. Current methods, which focus on modifying inputs and monitoring changes in the model's output probability scores, often struggle to comprehensively understand bias from the perspective of model components. We propose a framework that incorporates causal mediation analysis to measure and map the pathways of bias generation and propagation within VLMs. This approach allows us to identify the direct effects of interventions on model bias and the indirect effects of interventions on bias mediated through different model components. Our results show that image features are the primary contributors to bias, with significantly higher impacts than text features, specifically accounting for 32.57% and 12.63% of the bias in the MSCOCO and PASCAL-SENTENCE datasets, respectively. Notably, the image encoder's contribution surpasses that of the text encoder and the deep fusion encoder. Further experimentation confirms that contributions from both language and vision modalities are aligned and non-conflicting. Consequently, focusing on blurring gender representations within the image encoder, which contributes most to the model bias, reduces bias efficiently by 22.03% and 9.04% in the MSCOCO and PASCAL-SENTENCE datasets, respectively, with minimal performance loss or increased computational demands.

    7/4/2024