Summary writer: Carl Spivey
Paper title: Getting started with the DEFCON 31 AI Village Red Team Dataset
Author(s): n/a
Link: Colab nb
Overview:
This post covers preliminary data exploration and findings for the dataset published on Hugging Face. A starter notebook that we made for analyzing the dataset can be found here. We cover our methods for cleaning the dataset and extracting features, along with a few of the data analysis techniques we tried.
Introduction
The AI Village Red Team Challenge featured 8 LLMs from NVIDIA, Meta, OpenAI, Anthropic, Cohere, Google, Hugging Face, and Stability.ai, and drew 2244 participants. There were 21 challenges in total, ranging from bad math to network security, where the participant tried to break the LLM in some way. We took this dataset, cleaned it, and extracted features that we thought could be of use. We then tried some preliminary data analysis techniques to show some examples of how to use the data. Although we have no significant findings, the purpose of this article and the accompanying notebook is to give others a starting point from which to continue.
Body
Data
Of 6384 submissions, 2702 were accepted for an acceptance rate of 45.39%. The data contained 17.3k entries, each with 8 features: category name, challenge name, contestant message, conversation, submission message, user justification, submission grade, and conversation length.
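As a starting point, here is a minimal loading sketch. The dataset id below is a hypothetical placeholder; substitute the actual id from the Hugging Face page linked above:

```python
from datasets import load_dataset

# Hypothetical dataset id -- replace with the real one from Hugging Face.
ds = load_dataset("ai-village/defcon31-redteam", split="train")
df = ds.to_pandas()

print(df.shape)             # expect roughly 17.3k rows and 8 columns
print(df.columns.tolist())
```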
Dataset Cleaning and Feature Extraction
To begin cleaning the dataset, we dropped any entries that were not submitted or did not receive a score. We also dropped entries without a category or challenge name. Each conversation was originally a list of dictionaries whose key-value pairs mapped a speaker to their text, so we converted the conversation to a string containing only the user's turns. Since we are trying to gain insight into how users elicit bad LLM behavior, keeping the model's responses felt like cheating.
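A rough sketch of these cleaning steps in pandas, continuing from the loading sketch above. The column names (`submission_grade`, `category_name`, `challenge_name`, `conversation`) and the `user` speaker label are assumptions about the schema, not documented field names:

```python
# Keep only graded submissions that have both labels.
# All column names here are assumptions about the schema.
df = df[df["submission_grade"].notna()]
df = df.dropna(subset=["category_name", "challenge_name"])

def user_turns_only(conversation):
    """Flatten a list of {speaker: text} dicts into the user's turns only."""
    turns = []
    for message in conversation:
        for speaker, text in message.items():
            if speaker == "user":  # assumed speaker label
                turns.append(text)
    return "\n".join(turns)

df["user_text"] = df["conversation"].apply(user_turns_only)
```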
We then created two new features using other models. First, to categorize each attack by technique, we sent each conversation to Llama 3 and asked it to classify the attack as one of the techniques found here; the resulting technique became another feature of the dataset. Second, the text conversations were sent to bge-m3 and embedded into vectors of length 1024, and these vectors became a further feature. As one final cleaning step, we dropped any entries whose technique was not among the techniques we supplied in the Llama 3 prompt.
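A hedged sketch of this feature extraction. The technique names below are placeholders (the real taxonomy is the one linked above), `llm` is a stub for whatever Llama 3 endpoint you have, and the bge-m3 call uses the sentence-transformers client, though the post does not say which client was used:

```python
from sentence_transformers import SentenceTransformer

# Placeholder technique names; substitute the taxonomy linked in the post.
TECHNIQUES = {"authority endorsement", "logical appeal", "emotional appeal"}

PROMPT = (
    "Classify the attack in the conversation below as exactly one of these "
    "techniques: {techniques}.\n\nConversation:\n{text}\n\nTechnique:"
)

def llm(prompt: str) -> str:
    """Stub: wire this to your Llama 3 endpoint of choice."""
    raise NotImplementedError

def classify_technique(text):
    answer = llm(PROMPT.format(techniques=", ".join(sorted(TECHNIQUES)), text=text))
    answer = answer.strip().lower()
    return answer if answer in TECHNIQUES else None

# bge-m3 produces 1024-dimensional dense embeddings.
embedder = SentenceTransformer("BAAI/bge-m3")
df["embedding"] = list(embedder.encode(df["user_text"].tolist()))

df["technique"] = df["user_text"].apply(classify_technique)
df = df[df["technique"].notna()]  # drop off-taxonomy labels (final cleaning step)
```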
Visualizations in a 2D Space
To visualize the embedding space, we used t-SNE to reduce the embeddings to 2D vectors, one per submission. We then plotted the submissions on a scatter plot and colored them in various ways to look for interesting structure. A few of our results can be seen below.
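A minimal version of that plot using scikit-learn's t-SNE, assuming the `embedding` and `technique` columns built in the sketches above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X = np.vstack(df["embedding"].to_numpy())  # shape (n_samples, 1024)
xy = TSNE(n_components=2, random_state=0).fit_transform(X)

# Color by technique; any categorical column can be swapped in here.
codes = df["technique"].astype("category").cat.codes
plt.scatter(xy[:, 0], xy[:, 1], c=codes, s=5, cmap="tab20")
plt.title("t-SNE of bge-m3 conversation embeddings")
plt.show()
```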
Heatmap Visualization
Lastly, we looked at the interaction between the persuasion technique used, the challenge name, and success. Counting successes for each (technique, challenge) pair and sorting the rows and columns for readability gives the resulting heatmap.
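A sketch of that heatmap with pandas and seaborn; treating a `submission_grade` of "accepted" as success is an assumption about the grade encoding:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# "accepted" as the success marker is an assumption about the grade values.
success = df[df["submission_grade"] == "accepted"]

# Count successes per (technique, challenge) pair.
counts = pd.crosstab(success["technique"], success["challenge_name"])

# Sort rows and columns by their totals so the dense cells cluster together.
counts = counts.loc[counts.sum(axis=1).sort_values(ascending=False).index,
                    counts.sum(axis=0).sort_values(ascending=False).index]

sns.heatmap(counts, cmap="viridis")
plt.show()
```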
Between the lines
That's all, folks! Thank you for looking at this data with us, and I hope you can take this notebook in a direction that gives you meaningful insight into the data. If all else fails, manual inspection of the successful attempts is bound to give you a laugh.