Commit

Added bash script to prepare corpus for benchmark
petr-tik committed Dec 4, 2017
1 parent b09fbb1 commit 1864b37
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions prepare_text.sh
@@ -0,0 +1,5 @@
head -n 50000 /usr/share/dict/words | tail -n 20000 | tr -d "[A-Z|']" | iconv -f utf8 -t ascii//TRANSLIT | uniq | head -n 18000 > clean_words

shuf clean_words > random_words

head -n 80000 /usr/share/dict/words | tail -n 1000 | tr -d "[A-Z|']" | iconv -f utf8 -t ascii//TRANSLIT | uniq | head -n 800 > missing_words
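
A minimal sketch of what the pipeline stages above do, run on a tiny inline sample rather than /usr/share/dict/words. The helper names `window` and `strip_chars` are hypothetical and not part of the commit, and the `iconv` transliteration step is left out because the sample is plain ASCII.

```shell
# Hypothetical helper, not part of the commit: read stdin and print lines
# (END - COUNT + 1)..END, mirroring `head -n END | tail -n COUNT`.
window() {
    head -n "$1" | tail -n "$2"
}

# Same character set as the script. tr treats the set literally, so
# "[A-Z|']" deletes capitals and apostrophes, and also [, ] and |.
strip_chars() {
    tr -d "[A-Z|']"
}

# Eight sample lines; the window below selects lines 3..6 (ab, Ab, b, Bab).
sample=$(printf '%s\n' Aa "a's" ab Ab b Bab c d)

# Stripping capitals turns the window into ab, b, b, ab.
# uniq only collapses ADJACENT duplicates, so the second "ab" survives.
result=$(printf '%s\n' "$sample" | window 6 4 | strip_chars | uniq)

# sort -u removes every duplicate, at the cost of reordering the list.
deduped=$(printf '%s\n' "$sample" | window 6 4 | strip_chars | sort -u)

printf 'uniq:\n%s\n' "$result"
printf 'sort -u:\n%s\n' "$deduped"
```

On this sample `uniq` leaves `ab`, `b`, `ab`, while `sort -u` leaves `ab`, `b`. Whether the same effect matters for the real word list depends on its contents, which this sketch does not check.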