Savannah Thais, a particle physicist by training, highlighted a “huge measurement crisis in AI” during an AI & Equality Open Studio. She emphasized the mischaracterization of AI model capabilities due to poorly designed benchmarks, citing ImageNet as an example. While ImageNet is a standard for computer vision, its focus on animals and vehicles means models excelling on it are specialized in those categories, not broader computer vision tasks. Thais also pointed out the lack of scientific rigor in AI measurements, stemming from conflicting “ground truth” definitions and insufficient consultation with domain experts, leading to inconsistencies across research. She criticized the “inherent unknowability” attributed to AI systems, where “magical thinking” replaces precise scientific measurement.

Dangers of Poor Measurement
Thais argues that treating AI as a “magic solution” rather than a scientific artifact is dangerous. This approach risks flawed research and false conclusions about problem-solving. More significantly, it carries substantial societal dangers. Inaccurate AI models can entrench biases and inequality, impacting decision processes, resource allocation, and concentrating power among AI developers.
Qualities of Good Measurement
According to Thais, good measurement should possess three key qualities:
- Precision and Robustness: This includes a clear understanding of uncertainty and the reliability of the measurement, acknowledging that real-world data often yields approximate results.
- Domain-Informed: Measurements must be grounded in knowledge of the system being measured to avoid inaccuracies or violations of causality.
- Implementability: Real-world data is often messy and incomplete, requiring creative approaches to measurement given available information, such as limited access to proprietary model data.
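The first quality, precision and robustness, can be made concrete with standard uncertainty quantification. As a minimal sketch (not from the talk), the snippet below bootstraps a confidence interval around a benchmark accuracy score, so a reported number like "72% accurate" comes with an honest error bar; the data and function name are illustrative.

```python
import random

def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Estimate a (1 - alpha) confidence interval for benchmark accuracy
    by resampling per-item correctness labels with replacement."""
    rng = random.Random(seed)
    n = len(correct)
    accs = []
    for _ in range(n_resamples):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        accs.append(sum(sample) / n)
    accs.sort()
    lo = accs[int((alpha / 2) * n_resamples)]
    hi = accs[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, (lo, hi)

# Toy per-item results: 1 = model answered correctly, 0 = incorrect.
results = [1] * 72 + [0] * 28
acc, (lo, hi) = bootstrap_accuracy_ci(results)
```

Reporting the interval rather than the point estimate alone makes clear how much of a model's apparent edge on a benchmark could be resampling noise.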
Examples of Thais’s Work
- Public Understanding of AI: An experiment designed to probe the general public’s understanding of AI revealed that most people equate AI primarily with ChatGPT or large language models, indicating a significant need for broader technical literacy.
- Cybersecurity Policy Characterization: This project used language models to analyze cybersecurity policy recommendations. Thais’s role focused on ensuring the model’s generalization across various recommendation types through statistical analysis and expert consultation, understanding how model performance varies with factors like sentence length or the presence of modal verbs.
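The generalization check described above amounts to stratifying model performance by input features. As a hedged sketch (the data and helper names are hypothetical, not the project's actual code), the snippet below groups per-sentence results by whether a modal verb is present and reports accuracy per group:

```python
from statistics import mean

def accuracy_by_bucket(records, key):
    """Group per-sentence model results by a feature (e.g. presence of a
    modal verb) and report accuracy within each group."""
    groups = {}
    for rec in records:
        groups.setdefault(key(rec), []).append(rec["correct"])
    return {k: mean(v) for k, v in sorted(groups.items())}

MODALS = {"shall", "should", "must", "may", "might", "could", "would"}

def has_modal(sentence):
    return any(w.lower().strip(".,") in MODALS for w in sentence.split())

# Hypothetical labeled outputs: a sentence, and whether the model's
# classification matched the expert label.
records = [
    {"sentence": "Agencies must patch critical systems.", "correct": 1},
    {"sentence": "Patch critical systems.", "correct": 1},
    {"sentence": "Vendors should disclose breaches promptly.", "correct": 0},
    {"sentence": "Disclose breaches promptly.", "correct": 1},
]

by_modal = accuracy_by_bucket(records, key=lambda r: has_modal(r["sentence"]))
```

A large accuracy gap between buckets would flag exactly the kind of uneven generalization Thais's statistical analysis was designed to catch.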
- Public Health Data Leveraging: In a public health project, her team analyzed sex education policies and mapped them to CDC survey data to assess how often schools follow policy requirements. This crucial step supports causal analysis of policy impact and highlights how creative data source leveraging can enable good measurement.
The Human Rights Benchmark Project
A crucial ongoing project for Thais, with Women at the Table and AI & Equality, is the development of a human rights benchmark for large language models. This initiative is vital because AI models are increasingly involved in high-stakes processes that directly impact human rights. Currently, no benchmark exists to assess whether LLMs understand human rights legal frameworks.
The project involves extensive collaboration with AI researchers and human rights stakeholders to develop a dataset that probes LLMs’ knowledge of human rights. Initially, the focus is on due process, health, social security, privacy, and freedom from discrimination. The benchmark taxonomizes human rights tasks, such as identifying issues and applicable laws, and employs various prompt types, including multiple-choice, classification, and open-ended text generation.
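A taxonomy like the one described above pairs each probed right with a task type and a prompt format. The schema below is a hypothetical sketch of what one benchmark item might look like; the field names and category strings are illustrative, not the project's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    """One item in a (hypothetical) human rights benchmark for LLMs."""
    right: str            # e.g. "due process", "privacy", "health"
    task: str             # e.g. "identify_issue", "cite_applicable_law"
    prompt_type: str      # "multiple_choice" | "classification" | "open_ended"
    prompt: str
    choices: list = field(default_factory=list)   # empty for open-ended items
    reference: str = ""   # expert-written reference answer, if any

item = BenchmarkItem(
    right="due process",
    task="identify_issue",
    prompt_type="multiple_choice",
    prompt="A city replaces human judges with an automated sentencing system. "
           "Which human rights concern, if any, does this raise?",
    choices=["No rights implicated", "Due process concerns",
             "Only privacy concerns"],
    reference="Due process concerns",
)
```

Keeping the right, task, and prompt type explicit on every item lets evaluators break scores down along each axis of the taxonomy rather than reporting a single aggregate number.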
Initial testing, such as with a prompt about AI replacing judges, showed that models failed to recognize due process violations, suggesting they may not have internalized human rights principles. Thais emphasized opportunities for further involvement, seeking input on concerns and scenarios, and inviting participation in prompt review and model evaluation for this critical endeavor.
About Savannah Thais
Dr. Savannah Thais is an Associate Research Scientist in the Columbia Data Science Institute with a focus on machine learning (ML). She is interested in complex system modeling, in understanding what types of information are measurable or modelable, and in what impacts designing and performing measurements have on systems and societies. This work is informed by her background in high energy particle physics and incorporates traditional scientific experiment design components such as uncertainty quantification, experimental blinding, and decorrelation/de-biasing methods. Her recent work has focused on geometric deep learning, methods to incorporate physics-based inductive biases into ML models, regulation of emerging technology, social determinants of health, and community education.
Dr. Thais is the founder and Research Director of Community Insight and Impact, a non-profit organization focused on data-driven community needs assessments for vulnerable populations and effective resource allocation. She is passionate about the impacts of science and technology on society and is a strong advocate for improving access to scientific education and literacy, community-centered technology development, and equitable data practices. She was the ML Knowledge Convener for the CMS Experiment from 2020-2022, currently serves on the Executive Board of Women in Machine Learning and the Executive Committee of the APS Group on Data Science, and is a Founding Editor of the Springer AI Ethics journal.
Dr. Thais received her PhD in Physics from Yale University in 2019 and was a postdoctoral researcher at Princeton University from 2019-2022.