Testing AI Safety: Why Current Guardrails Fail to Stop Social Bias with Anna-Maria Gueorgiueva | AI & Equality Pub-Talk
How do large language models understand the lived experiences of stigmatized groups, and when does this understanding diverge from the human perspective? Can such divergence lead to bias, and if so, do our existing safety tools help mitigate it? This work investigated open-source language models for bias against 93 stigmatized groups, finding that certain stigmatized identities (especially those deemed by humans to be 'threatening', such as having HIV or a criminal record) experience significantly more bias than others.