dmathijs/gpt2-c-tokenizer

GPT-2 tokenizer

A GPT-2 tokenizer based on Andrej Karpathy's video on tokenization.

Not implemented:

  • the <|endoftext|> special token
  • the full token-splitting regex: only part of the pattern is supported, because POSIX regular expressions have no negative lookahead
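To illustrate the second limitation, here is a minimal sketch of pre-tokenization with POSIX extended regular expressions. It is not taken from this repository: the pattern, the function name split_chunks, and the size limits are all assumptions made for the example. The pattern follows the GPT-2 split pattern but leaves out the `\s+(?!\S)` branch, and it approximates `\p{L}` and `\p{N}` with `[[:alpha:]]` and `[[:digit:]]`, which cover only ASCII in the default C locale.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define MAX_CHUNKS 64     /* hypothetical limit for this sketch */
#define MAX_CHUNK_LEN 64  /* hypothetical limit for this sketch */

/* Assumed POSIX ERE approximation of the GPT-2 split pattern.
 * The original's `\s+(?!\S)` branch is omitted: POSIX has no lookahead. */
static const char *SPLIT_PATTERN =
    "'s|'t|'re|'ve|'m|'ll|'d"
    "| ?[[:alpha:]]+"
    "| ?[[:digit:]]+"
    "| ?[^[:space:][:alpha:][:digit:]]+"
    "|[[:space:]]+";

/* Splits text into chunks by repeatedly matching SPLIT_PATTERN.
 * Returns the number of chunks written, or -1 if the pattern fails to compile.
 * Chunks longer than MAX_CHUNK_LEN - 1 bytes are truncated in this sketch. */
int split_chunks(const char *text, char chunks[][MAX_CHUNK_LEN], int max_chunks)
{
    regex_t re;
    regmatch_t m;
    int count = 0;
    const char *p = text;

    if (regcomp(&re, SPLIT_PATTERN, REG_EXTENDED) != 0)
        return -1;

    while (*p != '\0' && count < max_chunks) {
        if (regexec(&re, p, 1, &m, 0) != 0)
            break;                      /* no further match */
        if (m.rm_eo == m.rm_so)
            break;                      /* guard against an empty match */

        size_t len = (size_t)(m.rm_eo - m.rm_so);
        if (len >= MAX_CHUNK_LEN)
            len = MAX_CHUNK_LEN - 1;
        memcpy(chunks[count], p + m.rm_so, len);
        chunks[count][len] = '\0';
        count++;

        p += m.rm_eo;                   /* continue after this match */
    }

    regfree(&re);
    return count;
}
```

For the input "Hello world's  end 42!" (two spaces before "end"), this sketch produces the chunks "Hello", " world", "'s", "  ", "end", " 42", "!". With the lookahead branch, the GPT-2 pattern would instead split the double space as " " followed by " end", which is the kind of difference the partial regex introduces.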

To troubleshoot memory allocation, build with AddressSanitizer and run: gcc tokenizer.c decoder.c encoder.c -o tokenizer -fsanitize=address && ./tokenizer < input.txt

About

GPT2 tokenizer written in C