Position: To Make Text-to-Image Models that Work for Marginalized Communities, We Need New Measurement Practices for the Long Tail
- Nari Johnson
- Hamna
- Deepthi Sudharsan
- Theo Holroyd
- Samantha Dalal
- Siobhan Mackenzie Hall
- Jennifer Wortman Vaughan
- Daniela Massiceti
- Cecily Morrison
While the capabilities of frontier text-to-image (T2I) models are rapidly improving, these models often fail to represent the low-data, long-tail concepts that matter to historically marginalized communities. Effective measurement is a critical first step towards identifying and addressing these errors, yet little work has validated whether existing T2I evaluation metrics work for the long tail. In this paper, we draw upon two community-based case studies to identify challenges with applying best practices for validating T2I metrics using human preference data. We show that available approaches to creating and validating evaluation metrics break down when applied to tail concepts for two reasons: annotation requires community knowledge, which is difficult to scale (scaling community annotations), and it is challenging to obtain images that span a range from good to bad (shades of bad). We take the position that methodological innovation is needed to develop measurement practices that work for the long tail. We outline directions for future work that move beyond traditional approaches to measurement and towards new ways of centering community expertise throughout the measurement process.