Merge branch 'main' of github.com:jondurbin/airoboros
j-durbin committed Aug 3, 2023
2 parents 8433793 + 3a19314 commit b1458ef
Showing 2 changed files with 86 additions and 3 deletions.
35 changes: 32 additions & 3 deletions README.md
@@ -48,16 +48,45 @@ You can override the `topic_prompt` string in the configuration to use a differe

https://bmc.link/jondurbin

ETH
0xce914eAFC2fe52FdceE59565Dd92c06f776fcb11

BTC
bc1qdwuth4vlg8x37ggntlxu5cjfwgmdy5zaa7pswf

## Models (research use only):

### gpt-4 versions

#### llama-2 base model

Latest version (2.0 / m2.0 datasets)
* [airoboros-l2-7b-gpt4-2.0](https://huggingface.co/jondurbin/airoboros-l2-7b-gpt4-2.0)
* [airoboros-l2-7b-gpt4-m2.0](https://huggingface.co/jondurbin/airoboros-l2-7b-gpt4-m2.0)
* [airoboros-l2-13b-gpt4-2.0](https://huggingface.co/jondurbin/airoboros-l2-13b-gpt4-2.0)
* [airoboros-l2-13b-gpt4-m2.0](https://huggingface.co/jondurbin/airoboros-l2-13b-gpt4-m2.0)

Previous generation (1.4.1 dataset)
* [airoboros-l2-70b-gpt4-1.4.1](https://huggingface.co/jondurbin/airoboros-l2-70b-gpt4-1.4.1)
* [airoboros-l2-13b-gpt4-1.4.1](https://huggingface.co/jondurbin/airoboros-l2-13b-gpt4-1.4.1)
* [airoboros-l2-7b-gpt4-1.4.1](https://huggingface.co/jondurbin/airoboros-l2-7b-gpt4-1.4.1)

#### original llama base model

Latest version (2.0 / m2.0 datasets)
* [airoboros-33b-gpt4-2.0](https://huggingface.co/jondurbin/airoboros-33b-gpt4-2.0)
* [airoboros-33b-gpt4-m2.0](https://huggingface.co/jondurbin/airoboros-33b-gpt4-m2.0)

Previous generation (1.4.1 dataset)
* [airoboros-65b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-65b-gpt4-1.4)
* [airoboros-33b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.4)
* [airoboros-mpt-30b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-mpt-30b-gpt4-1p4-five-epochs)
* [airoboros-13b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-13b-gpt4-1.4)
* [airoboros-7b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-7b-gpt4-1.4)
* *older versions on HF as well*

#### mpt-30b base model
* [airoboros-mpt-30b-gpt4-1.4](https://huggingface.co/jondurbin/airoboros-mpt-30b-gpt4-1p4-five-epochs)

### gpt-3.5-turbo versions
* [airoboros-gpt-3.5-turbo-100k-7b](https://huggingface.co/jondurbin/airoboros-gpt-3.5-turbo-100k-7b)
* [airoboros-13b](https://huggingface.co/jondurbin/airoboros-13b)
@@ -71,8 +100,8 @@ https://bmc.link/jondurbin
* [airoboros-gpt4-1.2](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.2)
* [airoboros-gpt4-1.3](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.3)
* [airoboros-gpt4-1.4](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4)
* [airoboros-gpt4-1.4.1 (recommended)](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1)

* [airoboros-gpt4-2.0 (June only GPT4)](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-2.0)
* [airoboros-gpt4-m2.0 (recommended)](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-m2.0)

## Coming soon

54 changes: 54 additions & 0 deletions scripts/convert_to_conversation.py
@@ -0,0 +1,54 @@
import re
import json
import uuid
inputs = [json.loads(line) for line in open("instructions.jsonl").readlines()]

def split_response(instruction, response):
    if '</s>' not in response:
        return [
            {
                "from": "human",
                "value": instruction,
            },
            {
                "from": "gpt",
                "value": response,
            },
        ]
    parts = response.split('</s>')
    user = [instruction]
    assistant = []
    for idx in range(len(parts)):
        part = parts[idx]
        if idx == 0:
            assistant.append(part)
            continue
        match = re.match(r'^\s*USER:(.*?)ASSISTANT:(.*)\s*$', part, re.DOTALL)
        if not match:
            return None
        user.append(match.group(1).strip())
        assistant.append(match.group(2).strip())
    conv = []
    for idx in range(len(user)):
        conv.append({
            "from": "human",
            "value": user[idx],
        })
        conv.append({
            "from": "gpt",
            "value": assistant[idx]
        })
    return conv

conversations = []
for row in inputs:
    conversation = split_response(row['instruction'], row['response'])
    if not conversation:
        print("Bad format, skipping...")
        continue
    conversations.append({
        "id": str(uuid.uuid4()),
        "conversations": conversation,
    })
with open("as_conversations.json", "w") as outfile:
    outfile.write(json.dumps(conversations, indent=2))
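
To illustrate what the script does to a single row, here is a minimal, self-contained sketch of the same splitting logic applied to in-memory data. The helper name `split_turns` and the sample row are hypothetical and not part of the commit; the regex and the `</s>` delimiter are copied from the script above.

```python
import re

# Same pattern as the script: each segment after the first must look like
# "USER: <text> ASSISTANT: <text>".
TURN_RE = re.compile(r'^\s*USER:(.*?)ASSISTANT:(.*)\s*$', re.DOTALL)

def split_turns(instruction, response):
    """Hypothetical helper mirroring split_response.

    Returns a list of {"from", "value"} dicts, or None if any follow-up
    segment does not match the USER/ASSISTANT pattern.
    """
    first, *rest = response.split('</s>')
    conv = [
        {"from": "human", "value": instruction},
        {"from": "gpt", "value": first},
    ]
    for part in rest:
        match = TURN_RE.match(part)
        if not match:
            return None
        conv.append({"from": "human", "value": match.group(1).strip()})
        conv.append({"from": "gpt", "value": match.group(2).strip()})
    return conv

# Made-up sample row in the shape the script reads from instructions.jsonl.
sample = {
    "instruction": "Name a primary color.",
    "response": "Red.</s> USER: Name another. ASSISTANT: Blue.",
}
turns = split_turns(sample["instruction"], sample["response"])
for turn in turns:
    print(f'{turn["from"]}: {turn["value"]}')

# A response that ends with the delimiter leaves an empty final segment.
trailing = split_turns("Hi", "Hello!</s>")
print(trailing)
```

One consequence of this logic, as far as the sketch shows: a response that ends with `</s>` produces an empty final segment, which fails the regex, so the whole row is rejected and the script would print "Bad format, skipping..." for it.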
