Summary writer: Carl Spivey
Paper title: Getting started with the DEFCON 31 AI Village Red Team Dataset
Author(s): n/a
Link: Colab nb
Overview:
This post covers preliminary data exploration and findings for the dataset published on Hugging Face. A starter notebook that we made for analyzing the dataset can be found here. We cover our methods for cleaning the dataset and extracting features, along with a few of the data analysis techniques we tried.
Introduction
The AI Village Red Team Challenge featured 8 LLMs from NVIDIA, Meta, OpenAI, Anthropic, Cohere, Google, Hugging Face, and Stability.ai, and drew 2244 participants. There were 21 challenges in total, ranging from bad math to network security, where the participant tried to break the LLM in some way. We took this dataset, cleaned it, and extracted features that we thought could be of use. We then tried some preliminary data analysis techniques to show some examples of how to use the data. Although we have no significant findings, the purpose of this article and the accompanying notebook is to give others a starting point from which to continue.
Body
Data
Of 6384 submissions, 2702 were accepted for an acceptance rate of 45.39%. The data contained 17.3k entries, each with 8 features: category name, challenge name, contestant message, conversation, submission message, user justification, submission grade, and conversation length.
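As a starting point, here is a minimal loading sketch. The dataset id below is a hypothetical placeholder; substitute the actual id from the Hugging Face page linked above:

```python
from datasets import load_dataset

# Hypothetical dataset id -- replace with the real one from Hugging Face.
ds = load_dataset("ai-village/defcon31-redteam", split="train")
df = ds.to_pandas()

print(df.shape)             # expect roughly 17.3k rows and 8 columns
print(df.columns.tolist())
```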
Dataset Cleaning and Feature Extraction
To begin cleaning the dataset, we dropped any entries that were not submitted or did not receive a score. We also dropped entries without a category or challenge name. Each conversation was originally a list of dictionaries whose key-value pairs mapped a speaker to their text, so we converted the conversation to a string containing only the user's turns. Since we are trying to gain insight into how users elicit bad LLM behavior, keeping the model's responses felt like cheating.
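A rough sketch of these cleaning steps in pandas, continuing from the loading sketch above. The column names (`submission_grade`, `category_name`, `challenge_name`, `conversation`) and the `user` speaker label are assumptions about the schema, not documented field names:

```python
# Keep only graded submissions that have both labels.
# All column names here are assumptions about the schema.
df = df[df["submission_grade"].notna()]
df = df.dropna(subset=["category_name", "challenge_name"])

def user_turns_only(conversation):
    """Flatten a list of {speaker: text} dicts into the user's turns only."""
    turns = []
    for message in conversation:
        for speaker, text in message.items():
            if speaker == "user":  # assumed speaker label
                turns.append(text)
    return "\n".join(turns)

df["user_text"] = df["conversation"].apply(user_turns_only)
```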
We then created two new features using other models. First, to categorize each attack by technique, we sent each conversation to Llama 3 and asked it to classify the attack as one of the techniques found here; the resulting technique became another feature of the dataset. Second, the text conversations were sent to bge-m3 and embedded into vectors of length 1024, and these vectors became a further feature. As one final cleaning step, we dropped any entries whose technique was not among the techniques we supplied in the Llama 3 prompt.
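A hedged sketch of this feature extraction. The technique names below are placeholders (the real taxonomy is the one linked above), `llm` is a stub for whatever Llama 3 endpoint you have, and the bge-m3 call uses the sentence-transformers client, though the post does not say which client was used:

```python
from sentence_transformers import SentenceTransformer

# Placeholder technique names; substitute the taxonomy linked in the post.
TECHNIQUES = {"authority endorsement", "logical appeal", "emotional appeal"}

PROMPT = (
    "Classify the attack in the conversation below as exactly one of these "
    "techniques: {techniques}.\n\nConversation:\n{text}\n\nTechnique:"
)

def llm(prompt: str) -> str:
    """Stub: wire this to your Llama 3 endpoint of choice."""
    raise NotImplementedError

def classify_technique(text):
    answer = llm(PROMPT.format(techniques=", ".join(sorted(TECHNIQUES)), text=text))
    answer = answer.strip().lower()
    return answer if answer in TECHNIQUES else None

# bge-m3 produces 1024-dimensional dense embeddings.
embedder = SentenceTransformer("BAAI/bge-m3")
df["embedding"] = list(embedder.encode(df["user_text"].tolist()))

df["technique"] = df["user_text"].apply(classify_technique)
df = df[df["technique"].notna()]  # drop off-taxonomy labels (final cleaning step)
```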
Visualizations in a 2D Space
To visualize the embedding space, we used t-SNE to reduce the embeddings to 2D vectors, one per submission. We then plotted the submissions on a scatter plot and colored them in various ways to look for interesting structure. A few of our results can be seen below.
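A minimal version of that plot using scikit-learn's t-SNE, assuming the `embedding` and `technique` columns built in the sketches above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X = np.vstack(df["embedding"].to_numpy())  # shape (n_samples, 1024)
xy = TSNE(n_components=2, random_state=0).fit_transform(X)

# Color by technique; any categorical column can be swapped in here.
codes = df["technique"].astype("category").cat.codes
plt.scatter(xy[:, 0], xy[:, 1], c=codes, s=5, cmap="tab20")
plt.title("t-SNE of bge-m3 conversation embeddings")
plt.show()
```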
Heatmap Visualization
Lastly, we looked at the interaction between the persuasion technique used, the challenge name, and success. Counting successes for each (technique, challenge) pair and sorting the rows and columns for readability gives the resulting heatmap.
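A sketch of that heatmap with pandas and seaborn; treating a `submission_grade` of "accepted" as success is an assumption about the grade encoding:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# "accepted" as the success marker is an assumption about the grade values.
success = df[df["submission_grade"] == "accepted"]

# Count successes per (technique, challenge) pair.
counts = pd.crosstab(success["technique"], success["challenge_name"])

# Sort rows and columns by their totals so the dense cells cluster together.
counts = counts.loc[counts.sum(axis=1).sort_values(ascending=False).index,
                    counts.sum(axis=0).sort_values(ascending=False).index]

sns.heatmap(counts, cmap="viridis")
plt.show()
```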
Between the lines
That's all, folks! Thank you for looking at this data with us, and I hope you can take this notebook in a direction that gives you meaningful insight into the data. If all else fails, manual inspection of the successful attempts is bound to give you a laugh.