GPT-2 tokenizer

GPT-2 tokenizer written in C, based on the video by Andrej Karpathy.

Not implemented:

  • the <|endoftext|> special token
  • the full token-splitting regex: only part of the pattern is implemented, because POSIX regular expressions do not support negative lookahead
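
To illustrate the regex limitation, here is a minimal sketch of splitting with POSIX extended regular expressions. It is not the code in this repository: the pattern, `split_pieces`, and `PIECE_MAX` are hypothetical names made up for this example. It assumes ASCII-only character classes in place of `\p{L}` and `\p{N}`, and it simply drops the `\s+(?!\S)` alternative of the GPT-2 pattern, since POSIX ERE has no lookahead.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define PIECE_MAX 64 /* hypothetical per-piece buffer size */

/* Hypothetical POSIX ERE approximation of the GPT-2 split pattern.
 * ASCII classes stand in for \p{L} and \p{N}; the \s+(?!\S)
 * alternative is omitted because ERE cannot express lookahead. */
static const char *SPLIT_PATTERN =
    "'(s|t|re|ve|m|ll|d)"
    "| ?[[:alpha:]]+"
    "| ?[[:digit:]]+"
    "| ?[^[:space:][:alpha:][:digit:]]+"
    "|[[:space:]]+";

/* Splits text into pieces and copies up to max_pieces of them into out.
 * Pieces longer than PIECE_MAX - 1 bytes are truncated in the copy.
 * Returns the number of pieces stored, or -1 if the pattern fails to compile. */
static int split_pieces(const char *text, char out[][PIECE_MAX], int max_pieces)
{
    assert(text != NULL && out != NULL);

    regex_t re;
    if (regcomp(&re, SPLIT_PATTERN, REG_EXTENDED) != 0)
        return -1;

    int count = 0;
    const char *cursor = text;
    regmatch_t m;

    while (*cursor != '\0' && count < max_pieces &&
           regexec(&re, cursor, 1, &m, 0) == 0) {
        size_t len = (size_t)(m.rm_eo - m.rm_so);
        if (len == 0)
            break; /* guard against an empty match looping forever */
        if (len >= PIECE_MAX)
            len = PIECE_MAX - 1;
        memcpy(out[count], cursor + m.rm_so, len);
        out[count][len] = '\0';
        count++;
        cursor += m.rm_eo; /* continue right after this match */
    }

    regfree(&re);
    return count;
}
```

With this sketch, "hello  world" (two spaces) splits into "hello", "  " and "world". The original GPT-2 pattern uses the lookahead to leave one space attached to the next word, giving "hello", " " and " world", so the two approaches can disagree on runs of whitespace.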

To troubleshoot memory allocation, build with AddressSanitizer and run the tokenizer on a sample input: gcc tokenizer.c decoder.c encoder.c -o tokenizer -fsanitize=address && ./tokenizer < input.txt