Project Overview
This project investigates the utility of Topological Data Analysis (TDA) for analyzing the results of contrastive editing algorithms applied to large language models.
Contrastive Editing Algorithms
Contrastive editing algorithms search for prompts that induce adversarial model behavior.
For example, one might use a contrastive editing algorithm to find prompts that elicit impolite responses from an LLM such as OpenAI's ChatGPT in a dialog setting.
Understanding the Challenge
Practitioners who build LLMs, and the applications that depend on them, have an interest in understanding which parts of the prompt space induce undesired behavior.
However, the prompt space is vast, and for any target behavior there are numerous prompts that induce it.
Addressing the Challenge with TDA
This project explores TDA as a source of analysis methods that abstract over the thousands of prompts found for a given target behavior. The end goal is to enable practitioners to characterize the shape of the adversarial prompt space.
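As a rough illustration of what "shape" can mean here, the sketch below computes 0-dimensional persistent homology, one of the simplest TDA summaries, for a small point cloud. It tracks how connected components merge as a distance threshold grows. This is only a minimal sketch under stated assumptions, not the project's actual pipeline: the function name `h0_persistence` and the 2-D `toy_embeddings` are hypothetical stand-ins for real prompt embeddings, and Euclidean distance is an assumed choice of metric.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return the scales at which connected components merge.

    Every point starts as its own component (born at scale 0). Processing
    pairwise distances in increasing order, each edge that joins two
    different components "kills" one of them; the edge length is that
    component's death scale. This is Kruskal's algorithm with union-find,
    which yields the 0-dimensional persistence of a Vietoris-Rips filtration.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merge the two components
            deaths.append(dist)      # one component dies at this scale
    return deaths


# Hypothetical 2-D stand-ins for prompt embeddings: two tight groups far apart.
toy_embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
]

deaths = h0_persistence(toy_embeddings)

# Components still alive at a given scale: one per point, minus earlier merges.
scale = 1.0
alive = len(toy_embeddings) - sum(d <= scale for d in deaths)
print(f"merge scales: {[round(d, 3) for d in deaths]}")
print(f"components at scale {scale}: {alive}")
```

In this toy input, four merges happen at a scale of roughly 0.1 and the final merge happens only near 7. That large gap is the kind of signal that would suggest two well-separated groups of prompts. For real embeddings, a dedicated TDA library would likely be a better fit than this hand-rolled version, particularly for higher-dimensional features such as loops.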