Discover how Microsoft is learning from other domains to advance evaluation and testing as a pillar of AI governance.
Generative AI presents a unique challenge and opportunity to reexamine governance practices for the responsible development, deployment, and use of AI. To advance thinking in this space, Microsoft has tapped into the experience and knowledge of experts across domains—from genome editing to cybersecurity—to investigate the role of testing and evaluation as a governance tool. AI Testing and Evaluation: Learnings from Science and Industry, hosted by Microsoft Research’s Kathleen Sullivan, explores what the technology industry and policymakers can learn from these fields and how that might help shape the course of AI development.
Episodes
Introducing ‘AI Testing and Evaluation: Learnings from Science and Industry’
Amanda Craig Deckard | June 23, 2025
In the introductory episode of this new series, host Kathleen Sullivan and Senior Director Amanda Craig Deckard explore Microsoft’s efforts to draw on the experience of other domains to help advance the role of AI testing and evaluation as a governance tool.

Guest

Amanda Craig Deckard
Amanda Craig Deckard is senior director of public policy in Microsoft’s Office of Responsible AI, where she leads efforts to strengthen AI governance as a foundation for trust and innovation.
Episode 1 | AI Testing and Evaluation: Learnings from genome editing
Alta Charo, Daniel Kluttz | June 30, 2025
Bioethics and law expert Alta Charo explores the value of regulating technologies at the application level and the role of coordinated oversight in genome editing, while Microsoft GM Daniel Kluttz reflects on Charo’s points, drawing parallels to AI governance.
Guests

Alta Charo
Alta Charo, the Warren P. Knowles Professor Emerita of Law and Bioethics, is a biotechnology policy and ethics consultant who has been at the forefront of the field for decades.

Daniel Kluttz
Daniel Kluttz is a partner general manager in Microsoft’s Office of Responsible AI, where he leads the group’s Sensitive Uses and Emerging Technologies program.
Episode 2 | AI Testing and Evaluation: Learnings from pharmaceuticals and medical devices
Daniel Carpenter, Timo Minssen, Chad Atalla | July 7, 2025
Professors Daniel Carpenter and Timo Minssen explore the evolving regulation of pharmaceuticals and medical devices, including the role of clinical trials, while Microsoft applied scientist Chad Atalla shares where AI governance stakeholders might find inspiration in these fields.
Guests

Daniel Carpenter
Daniel Carpenter is the Allie S. Freed Professor of Government and chair of the Department of Government at Harvard University. His research spans social and political science, including pharmaceutical regulation.

Timo Minssen
Timo Minssen is a law professor at the University of Copenhagen, where he leads the Center for Advanced Studies in Bioscience Innovation Law. He specializes in legal aspects of biomedical innovation.

Chad Atalla
Chad Atalla is a senior applied scientist in Microsoft Research New York City’s Sociotechnical Alignment Center, where they contribute to research and practical solutions for responsible AI.
Episode 3 | AI Testing and Evaluation: Learnings from cybersecurity
Ciaran Martin, Tori Westerhoff | July 14, 2025
Drawing on his previous work as the UK’s cybersecurity chief, Professor Ciaran Martin explores differentiated standards and public-private partnerships in cybersecurity, and Microsoft’s Tori Westerhoff examines the insights through an AI red-teaming lens.
Guests

Ciaran Martin
Ciaran Martin is a professor of practice in the management of public organizations at the University of Oxford. Previously, he was the founding chief executive of the UK’s National Cyber Security Centre.

Tori Westerhoff
A principal director on the Microsoft AI Red Team, Tori Westerhoff leads AI security and safety red team operations and dangerous capability testing, directly informing company leadership.
Up next
Episode 4 | AI Testing and Evaluation: Reflections
Amanda Craig Deckard | July 21, 2025
In the series finale, Amanda Craig Deckard returns to examine what Microsoft has learned about testing as a governance tool. She also explores the roles of rigor, standardization, and interpretability in testing and what’s next for Microsoft’s AI governance work.
Series contributors: Neeltje Berger, Tetiana Bukhinska, David Celis Garcia, Matt Corwine, Amanda Craig Deckard, Kristina Dodge, Chris Duryee, Milan Gandhi, Ann Griffin, Alyssa Hughes, Gretchen Huizinga, Matthew McGinley, Amanda Melfi, Joe Plummer, Brenda Potts, Kathleen Sullivan, Amber Tingle, Kathleen Toohill, Craig Tuschoff, Sarah Wang, Brian Wesolowski, and Katie Zoller.