
Explaining CLIP’s performance disparities on data from blind/low vision users
- Daniela Massiceti,
- Camilla Longden,
- Agnieszka Slowik,
- Samuel Wills,
- Martin Grayson,
- Cecily Morrison
2024 Computer Vision and Pattern Recognition (CVPR) | Organized by IEEE/CVF
Large multi-modal models (LMMs) hold the potential to usher in a new era of automated visual assistance for people who are blind or low vision (BLV). Yet, these models have not been systematically evaluated on data captured by BLV users. We address this by empirically assessing CLIP, a widely-used LMM likely to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task, we find that their accuracy is 15 percentage points lower on average for images captured by BLV users than for web-crawled images. This disparity stems from CLIP’s sensitivities to 1) image content (e.g. not recognizing disability objects as well as other objects); 2) image quality (e.g. not being robust to lighting variation); and 3) text content (e.g. not recognizing objects described by tactile adjectives as well as visual ones). We delve deeper with a textual analysis of three common pre-training datasets: LAION-400M, LAION-2B and DataComp-1B, showing that disability content is rarely mentioned. We then provide three examples illustrating how these performance disparities extend to downstream models underpinned by CLIP: OWL-ViT, CLIPSeg and DALL-E 2. We find that few-shot learning with as few as 5 images can mitigate CLIP’s quality-of-service disparities for BLV users in some scenarios, which we discuss alongside a set of other possible mitigations.
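For readers less familiar with this evaluation setup, the sketch below shows what zero-shot CLIP classification looks like in practice, assuming the open-source open_clip package; the model variant, prompt template, class names, and image path are illustrative stand-ins rather than the paper’s exact configuration.

```python
# Minimal sketch of zero-shot CLIP classification (not the paper's exact setup).
# Assumes: pip install open_clip_torch torch pillow
import torch
import open_clip
from PIL import Image

# Load one illustrative CLIP variant; the paper evaluates 25 such variants.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Candidate labels, including disability and non-disability objects.
class_names = ["guide cane", "Braille keyboard", "TV remote", "leather bag"]
text = tokenizer([f"a photo of a {c}" for c in class_names])

# "example.jpg" is a placeholder for an image captured by a BLV user.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Predicted class:", class_names[probs.argmax().item()])
```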
Panel: Generative AI for Global Impact: Challenges and Opportunities
Hosted by Jacki O’Neill, with Sunayana Sitaram, Daniela Massiceti, and Tanuja Ganu at Microsoft Research Forum, Episode 3
Microsoft researchers discuss the challenges and opportunities of making AI more inclusive and impactful for everyone—from data that represents a broader range of communities and cultures to novel use cases for AI that are globally relevant.
Transcript
Panel: Generative AI for Global Impact: Challenges and Opportunities
JACKI O’NEILL: I’m delighted to be hosting what promises to be a really engaging panel today with three fabulous panelists. In my talk, I talked about the importance of building globally equitable generative AI systems for diverse communities and application areas, and I hope that I’ve convinced you all of the importance of doing this if generative AI is not going to compound existing systemic inequalities. In this panel, we’re going to dive much deeper into the application areas, the user populations, the problems, and the solutions of doing this with our three expert panelists: Sunayana Sitaram, Tanuja Ganu, and Daniela Massiceti. So without further ado, I’d like to ask each of the panelists to introduce themselves.
TANUJA GANU: Thank you, Jacki, and hello, everyone. My name is Tanuja Ganu, and I’m principal research engineering manager at Microsoft Research in India. My background is in applied AI, and my work is focused on developing and validating technologies that would drive positive change in society. I have been leading an incubation center in MSR India called SCAI—Societal Impact through Cloud and AI—and in the last 1½ years, I’ve been spending a lot of time on how we can take the potential of generative AI to empower every individual across the globe and catalyze change in domains like education. Thank you.
SUNAYANA SITARAM: Hi, everyone. I’m Sunayana Sitaram. I’m principal researcher at Microsoft Research India, and my background is in natural language processing. My research involves trying to make sure that large language models, or generative AI as they’re also known, work well for all languages and cultures. And over the last couple of years, my research group has really looked into how to evaluate how well these large language models are doing for different languages across the world, including languages that have smaller amounts of data compared to English but are still spoken by millions of people worldwide. Thank you.
DANIELA MASSICETI: Hi, everyone. My name is Daniela Massiceti, and I’m a senior researcher at Microsoft Research based in Australia. My background is in machine learning, but nowadays, I work much more at the intersection of machine learning and human-computer interaction, particularly looking at multi-modal models. These are models that work with both image and text input. And my main focus is, how do we ensure that these AI models or AI systems work well for the users who are in the tails of the user distribution? In particular, the research that I’ve done along with my team looks at people with disabilities, who will, of course, be major beneficiaries of these multi-modal models.
O’NEILL: Thank you so much. I’d like to start by asking you what you see as the core problems we face building equitable generative AI that works well for diverse communities and user groups. Tanuja, would you like to start us off?
GANU: Let me start off by saying that I feel this is an exciting time to be in technology, and I’m really thrilled with the remarkable progress and the vast potential of generative AI. We are already seeing successful deployments of generative AI in enterprise applications like GitHub Copilot for programmers or Office 365 Copilot for enterprise users, which are showing improved efficiency and quality as well as giving users the ability to focus more on their creative work. So the natural next question is, how can we take this power of generative AI and empower every individual across the globe—people who are coming from different nationalities, ethnicities, and cultures, with varied technology access and financial affordability as well? When we are looking at this technological evolution, I think it’s crucial that we prioritize, focus on, and address the digital divide and actively work to reduce this gap. Taking these points into account, there are a few sociotechnical challenges that we need to address if we want to make sure that generative AI technology truly works for every individual. The first important challenge is making sure that these technologies are able to provide seamless interaction across thousands of world languages. And it’s not only about language but also about incorporating and preserving cultural nuances across these different communities and user groups. The second important challenge is about designing for existing infrastructural constraints, for example supporting low-end mobile phones as the primary interface in some cases, dealing with low or intermittent network connectivity, and working within overall low affordability, especially when we are looking at the vast majority of populations from the Global South. The third important challenge is the varied access levels that depend on literacy as well as the access needs arising from disabilities. And the fourth important challenge is really overarching: how can we expand and revisit responsible AI and safe deployment principles, taking into account these culturally and linguistically varied user groups and including the dimensions of equity, access, and inclusion? So I think these are some of the important challenges.
O’NEILL: Thank you so much, Tanuja. I think you’ve really given us a great overview there. Daniela, I wonder if you could deep dive a bit on the accessibility questions that Tanuja raised.
MASSICETI: Yeah, sure thing, Jacki. So, yeah, I can definitely bring some perspectives here from the work that my team and I have done in the accessibility space. We know, as I said earlier, that these multi-modal models really hold the potential to transform assistive technologies for communities with disabilities. But up until now, very few works have actually quantified how well these models are going to work for these communities. And so a piece of work that we recently did, which was published in CVPR, aimed to do exactly this. Specifically, we looked at images and text captured by users who are blind and then evaluated how well CLIP, which is a very popular multi-modal model, actually works on their data. And I wanted to share three insights that came from this work which speak to the core challenges that I think lie ahead of us in realizing truly equitable AI systems.
So the first is that the datasets typically used to train these AI models do not include data from communities with disabilities. In our work, we analyzed three large-scale datasets that are typically used to pretrain these large multi-modal models, and we found that disability content—things like guide canes, Braille displays—is significantly underrepresented or actually just not present at all in these datasets. And so this means that any model trained on these datasets will perform poorly on any task that involves identifying, locating, or answering questions about any of these particular objects. And I don’t think this problem of data inclusion is limited to the blind and low-vision community; it applies to many, many marginalized communities who may not be included in these datasets. And the second core problem is that I think we’re moving toward this paradigm where we have a very small number of enormous models—these so-called foundation models—which are being widely used by many, many downstream models and applications. But if these foundation models don’t work well in the first instance for marginalized communities, then we have the potential to see this compounding essentially in any downstream application that uses these foundation models. And this is exactly what we saw in our CVPR work.
We identified that CLIP, as a base model, significantly underperforms on data from blind and low-vision users. But then when CLIP is embedded as a component in other models, these failures persist and in some cases are even amplified. So, for example, we looked at DALL-E 2, which uses a CLIP vision encoder under the hood, and we basically saw that it couldn’t generate any decent images of any of the disability objects we tested. You know, when we asked it for a guide cane, it gave us very funky-looking walking sticks. And when we asked it for Braille keyboards, it again gave us these random arrangements of white dots on a page.
And the final core problem I’ll reflect on is that I think we don’t often embed ourselves deeply enough in marginalized communities to really understand the ways that AI models need to work for these communities. So, for example, one of the findings in our CVPR paper was that CLIP has trouble recognizing objects if users describe them by their material rather than their color. A user might say “find my leather bag” rather than “find my brown bag.” And we only really knew to test for this because our team collectively has 20-plus years of experience in working with the blind and low-vision community and knows that users often use these material-based descriptions when they’re talking about their objects. Without this insight, we would never have uncovered this particular failure mode. So I think that to achieve truly equitable AI models, we really need to deeply embed ourselves in the communities that we’re working with.
O’NEILL: Thank you, Daniela. So Sunayana, Daniela’s given us a really good overview of the challenges with the multi-modal models and the image models. I know that your research is primarily thinking about how different language communities can interact with these language models. I’m wondering, what do you see as the problems for making these models work well for anyone, anywhere, whatever language they speak?
SITARAM: Right. So as Daniela mentioned, there is a data divide, right, even when it comes to languages because most language models today are trained predominantly on data that comes from the web. And we know that not all languages and cultures are equally represented on the web, right. So at the very first step of the pipeline, you now have this inequity because of the different representation of different languages and cultures. But I think that’s not the only problem. There are a lot of other decisions that are taken during the model-building process which could also influence downstream performance. So, for example, in some of our research earlier last year, which was published in EMNLP, we found that the tokenizer, which is the component that actually breaks words down into smaller pieces, doesn’t work equally well for all languages, and that actually has a significant impact on downstream performance. So things like this, you know, decisions that are taken during the model-building process, can also really influence the performance. And finally, you know, one of the biggest challenges I see—and I may be a little biased because this is my area of research—is that, you know, we are not able to actually evaluate these models across all languages and cultures well. And this is because of a variety of reasons, including the fact that, you know, not too many benchmarks exist with sufficient linguistic and cultural diversity. But because we are not doing a good job of evaluation, we don’t even know how well these models work for different languages and cultures. And so I think, you know, beyond data, there are many other challenges that need to be addressed in order to make these models actually work for all languages and cultures.
O’NEILL: Yeah, thank you so much. I think it’s really clear from your answers what the biggest challenges are for making these technologies work, both at the societal level and at the level of the actual models themselves, you know, whether they’re vision or multi-modal models or language models, and we know that this has a direct impact on various user populations. As Tanuja mentioned in the beginning, you know, we’re seeing a lot of enterprise applications and enterprise technologies being developed, whether that’s for helping you code or ideate or answer emails. But are there other user populations who could really benefit from applications of generative AI that work well? Tanuja?
GANU: Yeah, so I think there are a lot of interesting and impactful applications which are emerging for generative AI in domains like education, health care, and agriculture. So let me give you an example from our work in education, where we are developing an AI assistant called Shiksha copilot that provides agency to teachers in public schools in India for generating personalized and engaging learning experiences, like activities, assessments, and teaching material for their students. What is important here is that the content generated is completely grounded in the local curriculum and the interaction is completely in the local language, which is Kannada in this particular case. It’s also important that the content preserves cultural and local norms. So let’s take the example of a teacher teaching the components of food or a balanced diet as the topic. It should include examples from the local diet and cuisine, maybe biryani or ragi mudde, which is made from finger millet. It’s additionally important that the teacher is able to generate and use the lesson plans on a mobile phone or desktop, whichever resources are available to them, and that they are able to use Shiksha copilot in classrooms where AV systems might not be available. So they can generate the lesson plan on the phone, take it to the classroom, and use it completely offline. These are all the challenges that we discussed earlier; they become really important when we are doing these kinds of real-world deployments. With Shiksha copilot, we have completed a successful small pilot with 50 teachers in India, and now we are gearing up towards a scaled pilot with a thousand teachers. And I feel that applications like these can have a really transformative effect on the education system and create a positive impact for students and teachers across the globe.
O’NEILL: Thank you. Daniela, for the accessibility populations, what type of applications and populations are important in this space?
MASSICETI: Yeah, sure thing. So an estimated 1.3 billion people—around 16 percent of the global population—live with some level of disability today. So I think it’s really exciting to see these generative AI applications coming online for these communities, and our team has done, as you may already have gathered, a lot of work with the blind and low-vision community. And so I wanted to call out a couple of promising generative AI applications for this particular community. The first is Microsoft’s own, actually: Seeing AI. Seeing AI is a mobile app for users who are blind and low vision, and they’re really leading the charge in innovating new assistive user experiences using models like GPT-4. So, for example, they’ve built in features which allow users to ask really detailed questions about a document they’ve scanned as well as get beautifully detailed captions or descriptions of photos that they’ve taken. And you can really see the impact of these. For example, when you’re visiting a museum, you can snap a picture and get these beautiful descriptions of the artworks that are around you. I’ll also call out the partnership announced last year between Be My Eyes and OpenAI. Be My Eyes is a video-calling app which connects blind users with sighted volunteers when they need help on a particular task. So, for example, they snap a picture of a packet of potatoes or a packet of tomatoes and then ask the sighted volunteer if they’re out of date. And the promise with the OpenAI partnership is that perhaps at some point in the future, these sighted volunteers may be replaced by a model like GPT-4 with vision, enabling pretty much instantaneous and fully automated assistance for blind users anywhere in the world. So I think that’s really exciting. And in fact, I—along with some other colleagues at Microsoft Research—worked very closely with OpenAI and teams across Microsoft to red team the GPT-4 with vision model and really ensure that it met Microsoft’s high bar before it was publicly released. And I think this is a really tangible demonstration of Microsoft’s commitment to delivering safe and responsible AI technologies to its customers.
O’NEILL: Thank you so much. So, given these large populations who could really benefit, how do we go about building solutions for them that actually work?
GANU: So maybe I will take this. Given that we are working with really diverse populations, I think it’s extremely useful to work with a user-centered or participatory design approach and collect the voices of the users, especially marginalized and underserved communities, right from the start at design time. It’s also important, while we are dealing with this nascent or emerging technology, that we have the right safeguards when deploying the system and that we are able to collect feedback at every stage of deployment, such as using an expert-in-the-loop kind of deployment, where the expert has the ability to verify as well as override the responses as and when required. To give an example, this was one of the conscious decisions when we started working with Shiksha copilot: to start with the teachers and not with the students first, where the teacher is the expert in the loop, and we can extend the benefits of the technology to the students through teachers to start with and eventually go to the students directly.
Also, while we are working and looking at various applications at population scale, as I mentioned earlier, in domains like agriculture, education, health care, and others, what we are seeing is that there are common problems or universal challenges which are repeated across all of these domains. As Sunayana talked about earlier, multilingual interaction is a huge problem across all domains. The other important problem is that most of the knowledge base that is required for grounding or generating these AI experiences is non-digitally native and multi-modal. So how to extract information from this multi-modal, non-digitally-native content is a challenge across these different domains. So what we are doing as part of our project, which is called Project VeLLM, which stands for “uniVersal Empowerment with Large Language Models,” is building a versatile platform, which you can think of as building blocks or a tool set providing all these different functionalities which are common across these different applications. And now other developers do not have to start from scratch. They can use these building blocks and create their equitable AI experiences rapidly across different domains.
SITARAM: Generalizing a little bit from what Tanuja just said about expert in the loop, I think that, you know, one of the solutions that we’ve been using is to actually design with “human in the loop” in mind because we know that these technologies are not perfect. And so, you know, we really want to figure out ways in which humans and AI systems can work together in order to create the most effective outcome. And in our research, we’ve actually been doing this for evaluation of, you know, multilingual scenarios. So, for example, we know that, you know, large language models can do a good job of evaluation, but we also know that they don’t do a very good job on some languages and along some dimensions, right. So those languages and those dimensions should ideally be left to a human to do, whereas for the ones that we are very confident that the LLM is doing a good job, we can actually rely on it more with some human oversight in order to scale up the process of evaluation. So this idea of actually using humans and AI together and designing for this kind of hybrid system, I think, is really crucial. And, of course, we need to keep revisiting this design as these AI systems become more and more capable.
MASSICETI: Yeah, so many points I can agree with there and build on. I think what’s common with both Tanuja’s and Sunayana’s answers is really this need to, kind of, bring models and humans together. And I think one real limitation we’ve seen in our work across many of the models we’ve worked with is that they really often generate quite generic responses, you know. So if you prompt an LLM to write you an email, the tone and style don’t quite feel like yours. And so as we look to this next decade of generative AI solutions, I really hope we’re going to see more personalized AI models and solutions come through much more strongly, solutions where you as the user have much more control, much more agency, around how your model works for you. And I think that’s another example of how human users and the AI model need to come together in order to create something even more powerful. And I think this is going to be even more important for marginalized communities, whose needs often differ a lot from, kind of, the average or the generic needs.
And to bring one concrete example to the table, our team has been building a personalizable object recognizer over the last year. Here, a blind user can pretty much teach the object recognizer their personal objects, things like their sunglasses, their partner’s sunglasses, maybe their favorite T-shirt. They do this by taking short videos of these objects, and the personalized recognizer can then help them locate these things at any point in the future. And so in this sense, it’s really an example of a human-in-the-loop paradigm, where the user is given the agency to personalize their AI system to meet their exact needs. So, yeah, it’s really exciting. This feature has actually just been released in Seeing AI, and so we’re really keen to begin imagining how we might see more personalizable generative AI experiences for users in the near future.
O’NEILL: Yeah, I really love that idea. I think we would all benefit from more personalized AI, even when you’re just trying to craft an email or something like that. The challenge people often face is it doesn’t really sound like them.
MASSICETI: Exactly.
O’NEILL: And then if you have to edit it too much, then, you know, you reduce the benefit. So I think there’s so many areas right across the board where personalization could help. So finally, as we’re coming to a close, I really would love to finish by asking each of you what you think the biggest research questions that are still open are, what the biggest gaps are, and how you would advise the research community to go about solving them.
MASSICETI: Yeah, it’s a big, big question. I’ll maybe take a stab first. So I think a couple of us have already touched on this point before, but the data divide, I think, is really a big, big challenge. You know, the fact that data is widely available for some communities but then totally absent or very sparse for others. And I think this is one of the biggest hurdles we need to address as a research community in order to really move the needle on equitable AI because it impacts everything from the way that we can train models to, as Sunayana said, how we can evaluate these models as well. But I want to, kind of, call out that even though we’ve identified the problem—we, kind of, know what the problem is; you know, we need to include data from these communities—I think there are just so many open questions around how we actually do this well and how we actually do this right. And so I want to bring up two specific challenges or open questions that I feel are very prevalent.
The first is, what do equitable paradigms actually look like when we’re collecting data from or about a marginalized community? These communities, as we know, have often historically been exploited. And so we really need to find fair ways of not only involving these communities in these data collection efforts but also compensating them for their efforts as these models are then trained on this data and then deployed and used more broadly. The second open question, I think, is that we really need deep technical innovation in adapting models to new data. You know, we’ve obviously seen a lot of adaptation methods coming online—fine-tuning, LoRA—and they do really well at, kind of, adapting these models to new datasets and tasks. But what we’re seeing in our current experiments is that these approaches don’t work so well when the new data that’s coming in is very different from the pretraining dataset. In one particular example, we gave Stable Diffusion 10 training images of a funky-looking cat statue, and it learned it really well, and it could generate actually really realistic images of this statue. But when we did the same for a guide cane, Stable Diffusion still couldn’t generate realistic-looking images of guide canes. And so I think, as a research community, we really need to build a deeper understanding of how we get models to learn new concepts, even when those concepts aren’t well represented in the pretraining datasets.
O’NEILL: Thanks so much, Daniela. Tanuja, is there anything you want to add?
GANU: So for me, it feels like we are just beginning to scratch the surface, and there is a lot more work underway across the dimensions of culture, cost, human values, cognition, universal access, and many other dimensions. So while the journey is long and we are trying to solve some of these hard and important problems, it is important that we continue to make progress systematically and iteratively and that we continue to collect critical feedback at each of these stages. We definitely need to do a lot more work looking at different types of models: large language models for more complex tasks, but can we also look at smaller language models, especially given the infrastructural challenges I discussed earlier? How can we use a combination of these models? How can we generate and collect data from different cultures and involve these communities in doing so? Because much of this cultural knowledge is implicit and not documented, how we learn it is also an important question. And I think collaboration is the key here. It’s important that we involve experts from multiple disciplines, user communities, researchers, and policymakers and accelerate progress in the right direction. We are already doing some of these collaborations with academia and NGOs, with programs like Microsoft Research AI & Society Fellows, and some existing collaborations with our community and partners in India and Africa. But I think we’ll just need to continue doing more of this and continue making steady progress on this important problem.
SITARAM: I completely agree with what both Daniela and Tanuja said. And talking more about the language and culture aspect, I think we need to figure out a way to involve these local communities in the design and training as well as evaluation phases of model building. And we need to do this at scale if we really want to reach all languages, all cultures, etc., right. So I think that is the thing that we really need to figure out how to do. So there are a couple of projects that we’ve been working on that have attempted to do this. One of them is called DOSA, where we collected a dataset of cultural artifacts from different users in India. And this was meant to be a participatory design approach where people would tell us what cultural artifacts were really important to them, and then we would collect this data from the ground up and try to evaluate whether LLMs did a good job or not, right. That’s one example. The other project that we’ve been working on is called Pariksha, where we employ workers from this ethical data company called Karya to do evaluation of Indian language models. So here we’re really asking the users, who speak multiple languages, to tell us whether these models work for them or not. And so I feel like we need to figure out more ways in which we can involve these local communities but at scale so that we can really impact the model-building process and then so that we can actually make these models work well for everybody.
O’NEILL: I couldn’t agree with you more, Sunayana. I think involving user communities in technology design in general is one of the most important things that we can do, and this is even more so with underserved communities. I would just like to add something to that, though, which is that we really need multidisciplinary research that goes beyond anything that we’ve done before, involving researchers and practitioners and community members. And it’s important to remember that machine learning engineers and researchers on their own can’t solve the problem of building globally equitable generative AI. This is something that we really need to do at a large scale. We need to transcend disciplinary boundaries if we’re going to build technology that really works for everyone, everywhere. And on that note, I’d like to say thank you to the panelists. It’s been a great discussion, and thank you to the audience.
MASSICETI: Thanks very much.
GANU: Thank you so much.
SITARAM: Thank you.

Insights into the Challenges and Opportunities of Large Multi-Modal Models for Blind and Low Vision Users: CLIP
Presented by Daniela Massiceti at Microsoft Research Forum, Episode 3
Daniela Massiceti delves into the transformative potential of multimodal models such as CLIP for assistive technologies. Focusing specifically on the blind/low-vision community, the talk explores how far these models currently are from realizing this potential and the advancements needed to bridge the gap.
Transcript
Insights into the Challenges and Opportunities of Large Multi-Modal Models for Blind and Low Vision Users: A Case Study on CLIP
DANIELA MASSICETI: Hi there. My name is Daniela Massiceti, and I’m a senior researcher at Microsoft Research Cambridge. Today, I will be sharing our recent CVPR paper, which examines the challenges and opportunities of large multi-modal models for blind and low-vision users.
Today’s AI models hold incredible potential for assisting the blind community—from text recognition to object identification to question answering. Apps like Seeing AI are already deploying some of these AI features. But there is potential for much more. And I think this is hinted at by the recent partnership between OpenAI and Be My Eyes, with the promise that one day, human assistance could be replaced by AI agents that provide instantaneous assistance to blind users around the world. But despite their potential, no works have really looked at, well, how well do these models actually work on image and text data captured by blind users? And we know from the literature that this data is likely to be out of distribution or different in a number of ways. For example, blind users use a range of quite specialized assistive objects. They also are more likely to capture images with quality variation, things like camera blur and occlusion. And they’re also more likely to make use of non-visual vocabulary, for example, describing their objects by their physical rather than their visual properties.
Our work, therefore, set out to remedy this. Specifically, we systematically evaluated 25 variants of the CLIP model on data from blind and low-vision users. CLIP is one of today’s most widely used multi-modal models. It has over 15,000 citations and 75 million downloads. We used the ORBIT and the VizWiz-Classification datasets. Both of these were collected by blind users through real-world assistive applications. And we inspected CLIP’s performance both directly, on a zero-shot image classification task, and through examining the performance of models that use CLIP as a component, which is very widely done in the community. I unfortunately don’t have time to go into all the details of our work, but I will share our top three findings with you. First, we confirmed that CLIP does indeed underperform on data that is captured by blind and low-vision users. Second, these disparities trickle down to models that use CLIP as a component. And third, these disparities stem from the fact that disability content is significantly underrepresented, and sometimes missing completely, from the datasets that are used to pretrain these large models. I’ll now dive into our three findings in a bit more detail.
So for the first finding, we found that CLIP underperforms on objects, image quality, and language that is typically used by blind users. On object type, CLIP recognizes disability objects like a Braille keyboard, for example, up to 28 percentage points less accurately than common objects like a TV remote. On image quality, CLIP is up to 23 percentage points more sensitive to images with things like camera blur and lighting compared to images that don’t have these quality issues. And on language, CLIP recognizes objects that are described by their material—so, for example, a leather boot—up to 12 percentage points less accurately than objects described by their color—for example, a brown boot. And we know that blind users rely heavily on this tactile rather than visual language.
Towards our second finding, we examined three models that use CLIP under the hood—an object detection model, an image segmentation model, and an image generation model—and found that all three struggle with disability content. For example, DALL-E 2, which relies on a CLIP vision encoder, cannot generate common disability objects like guide canes and Braille keyboards. Instead, as you can see here, it gives us very strange-looking walking canes and lots and lots of randomly placed white dots. In comparison, DALL-E 2 generated really high-quality and realistic images for almost all of the non-disability objects that we tested.
And then towards our third and final finding, we really wanted to understand where these performance disparities were stemming from. And so we quantified just how prevalent disability content is in three popular datasets that are commonly used to pretrain these large models: LAION-400M, LAION-2B, and DataComp-1B. Specifically, we counted how many times objects are mentioned in these datasets’ captions and found that disability objects appear up to 17 times less frequently than non-disability objects across all three datasets.
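As a rough illustration of the kind of caption analysis described here, the sketch below counts how often a handful of object terms appear in a file of dataset captions; the file name and the object lists are hypothetical, and the paper’s actual matching procedure may differ.

```python
# Hedged sketch: count object mentions in a dump of pre-training captions.
# "captions.txt" (one caption per line) and the term lists are illustrative.
import re
from collections import Counter

disability_objects = ["guide cane", "braille keyboard", "braille display"]
non_disability_objects = ["tv remote", "coffee mug", "backpack"]
terms = disability_objects + non_disability_objects

counts = Counter()
with open("captions.txt", encoding="utf-8") as f:
    for line in f:
        caption = line.lower()
        for term in terms:
            # Whole-phrase match to avoid counting substrings of other words.
            if re.search(r"\b" + re.escape(term) + r"\b", caption):
                counts[term] += 1

for term in terms:
    print(f"{term}: {counts[term]}")
```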
So as you can see, our work has identified a clear gap in current models’ capabilities for blind users, and this could have very real consequences if these models are then integrated into assistive technologies for the blind and low-vision community. So what should we, as a research community, be doing about it? First, I think more work is needed to understand how models come to learn or adapt to long-tailed data. Some of our early results show that few-shot learning approaches hold some promise, but they don’t always work, especially in more challenging settings, for example, when objects appear in highly cluttered scenes. And second, I think it’s important for us to really focus on including more disability content in these large-scale pretraining datasets. Our team is currently working on developing equitable and fair practices alongside disabled communities to source data that is truly representative of their needs. And so with that, I will wrap up.
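As a sketch of the few-shot mitigation mentioned above, the code below fits a simple linear probe on frozen CLIP image features using a handful of images per object; the data-loading helper and the model variant are assumptions for illustration, not the paper’s exact method.

```python
# Hedged sketch: few-shot adaptation via a linear probe on frozen CLIP features.
# load_few_shot_images() is a hypothetical helper returning a list of
# preprocessed image tensors (e.g. ~5 per object class) and integer labels;
# in practice it would apply the `preprocess` transform loaded below.
import torch
import open_clip
from sklearn.linear_model import LogisticRegression

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

def embed(images):
    # Encode and L2-normalize a batch of preprocessed images with frozen CLIP.
    with torch.no_grad():
        feats = model.encode_image(torch.stack(images))
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy()

train_images, train_labels = load_few_shot_images("train")  # hypothetical
test_images, test_labels = load_few_shot_images("test")     # hypothetical

probe = LogisticRegression(max_iter=1000)
probe.fit(embed(train_images), train_labels)
print(f"Few-shot probe accuracy: {probe.score(embed(test_images), test_labels):.2%}")
```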
Thank you to all the people behind this work and thank you for listening.
