Large Language Models Produce Moral Rationalizations Tailored to Political Groups
University of California, Davis
Do LLMs reproduce the moral biases associated with social groups?
To avoid harms and steer these systems toward beneficial use, we need to understand whether LLMs reproduce moral biases.
Moral Foundations Theory (MFT)
MFT originally posited five moral foundations: Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, and Sanctity/Degradation; Liberty/Oppression was later proposed as a sixth. The foundations are commonly grouped into two clusters (also written as a mapping in the sketch below):
Individualizing: Care/Harm, Liberty/Oppression, Fairness/Cheating
Binding: Loyalty/Betrayal, Authority/Subversion, Sanctity/Degradation
Prior MFT research finds that US liberals rely mostly on the individualizing foundations, while conservatives draw more evenly on both the individualizing and binding foundations.
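For reference, the grouping can be written as a small Python mapping (a minimal sketch; the names and structure are illustrative, not taken from the paper's code):

```python
# Moral Foundations Theory groupings as presented above (illustrative structure only).
MORAL_FOUNDATIONS = {
    "individualizing": ["care/harm", "liberty/oppression", "fairness/cheating"],
    "binding": ["loyalty/betrayal", "authority/subversion", "sanctity/degradation"],
}
```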
General Idea: Large Language Models produce moral content that reflects the moral biases of social groups when conditioned with social group identity
This Research: LLMs produce moral content that reflects the moral biases of US liberals and conservatives when conditioned with liberal/conservative political identity
RQ1: Can LLMs produce high-quality moral rationalizations?
RQ3: Do LLMs use moral foundations differently when prompted with different political identities?
RQ6: How does moral mimicry relate to model size?
Goal: Elicit moral “rationalization-like artifacts” from the model.
A prompt consists of three parts embedded in a template (a construction sketch follows the example below):
Scenario
Identity
Stance
Template:
As a {identity}, here are the moral arguments for why {scenario} is {stance}:
Prompt:
As a conservative, here are the moral arguments for why stealing from the store is immoral:
Completion:
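A minimal sketch of how such prompts might be constructed (the template string mirrors the one above; `generate_completion` is a hypothetical stand-in for whichever LLM API, e.g. GPT-3 or OPT, is actually called):

```python
# Build identity-conditioned prompts for eliciting moral rationalizations.
TEMPLATE = "As a {identity}, here are the moral arguments for why {scenario} is {stance}:"

def build_prompt(identity: str, scenario: str, stance: str) -> str:
    """Fill the template with a political identity, a moral scenario, and a stance."""
    return TEMPLATE.format(identity=identity, scenario=scenario, stance=stance)

prompt = build_prompt("conservative", "stealing from the store", "immoral")
# completion = generate_completion(prompt)  # hypothetical call to the chosen LLM
```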
GPT-3
Commercial model family from OpenAI
OPT [7]
Publicly available model family from Meta AI
RQ1: Can LLMs produce rationalization-like artifacts of sufficient quality for us to care?
RQ3: Do LLMs use moral foundations differently when prompted with different political identities? (moral mimicry; a scoring sketch follows this list)
RQ6: How is the moral mimicry capability affected by model size?
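The scoring procedure is not spelled out in this section; one hedged illustration of how foundation usage in a completion could be quantified is a simple lexicon count (the word lists below are tiny stand-ins, not the actual Moral Foundations Dictionary):

```python
from collections import Counter

# Tiny illustrative stand-ins for per-foundation word lists (not the real MFD lexicon).
FOUNDATION_LEXICON = {
    "care/harm": {"harm", "hurt", "suffer", "protect"},
    "fairness/cheating": {"fair", "unfair", "cheat", "justice"},
    "loyalty/betrayal": {"loyal", "betray", "community"},
    "authority/subversion": {"law", "obey", "authority", "order"},
    "sanctity/degradation": {"pure", "degrade", "sacred"},
    "liberty/oppression": {"freedom", "liberty", "oppress"},
}

def foundation_counts(completion: str) -> Counter:
    """Count words associated with each moral foundation in a completion."""
    tokens = completion.lower().split()
    counts = Counter()
    for foundation, words in FOUNDATION_LEXICON.items():
        counts[foundation] = sum(tok.strip(".,;:!?") in words for tok in tokens)
    return counts
```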
Hand-label 100 examples of completions from GPT-3-175B for three text qualities (a tabulation sketch follows this list):
Relevance
Coherence
Stance Agreement
Separate results by normative and non-normative stance
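A minimal sketch of how the hand labels might be tabulated by stance type (the column names and data below are hypothetical, not the paper's actual annotations):

```python
import pandas as pd

# Hypothetical hand-label table: one row per completion with binary quality judgments.
labels = pd.DataFrame(
    {
        "stance_type": ["normative", "non-normative", "normative"],
        "relevant": [1, 1, 0],
        "coherent": [1, 0, 1],
        "stance_agreement": [1, 0, 1],
    }
)

# Proportion of completions judged relevant/coherent/stance-agreeing, by stance type.
summary = labels.groupby("stance_type")[["relevant", "coherent", "stance_agreement"]].mean()
print(summary)
```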
[Figure: evaluation results shown separately for the OPT and GPT model families]