Algorithms and models for Bilibili Search Engine (blbl.top).
Train a sentencepiece model from video texts.
python -m models.sentencepiece.train -m sp_507m_400k_0.9995_0.9 -ec -vs 400000 -cc 0.9995 -sf 0.9 -e
Test:
python -m models.sentencepiece.train -m sp_507m_400k_0.9995_0.9 -t
Merge sentencepiece models trained on different datasets.
python -m models.sentencepiece.merge
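This README does not describe how models.sentencepiece.merge combines models. One common approach to merging tokenizers is to keep the first vocabulary intact and append only the pieces that it lacks. The sketch below illustrates that idea on plain (piece, score) lists; merge_vocabs and the sample pieces are hypothetical and are not taken from this repository.

```python
def merge_vocabs(base, extra):
    """Append pieces from `extra` that are missing in `base`.

    Both arguments are lists of (piece, score) tuples. Order and scores
    of `base` are preserved; duplicates in `extra` are skipped.
    This is an assumed merge strategy, not the repository's actual one.
    """
    merged = list(base)
    seen = {piece for piece, _ in base}
    for piece, score in extra:
        if piece not in seen:
            merged.append((piece, score))
            seen.add(piece)
    return merged


# Hypothetical vocabularies from two models trained on different datasets.
vocab_a = [("game", -2.0), ("play", -3.5)]
vocab_b = [("play", -3.1), ("music", -4.0)]
merged = merge_vocabs(vocab_a, vocab_b)
```

Keeping the base scores for shared pieces is a design choice of this sketch; a real merge might instead rescale or re-estimate scores across datasets.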
Tokenize video texts from the database and save them to parquet files. Used by datasets.videos.freq and models.fasttext.train.
python -m datasets.videos.cache -ec -dn video_texts_tid_all -fw 200 -bw 100 -bs 10000
Count video term frequencies from the database or parquet files, and save them to csv and pickle. Used by models.fasttext.train.
Specify a region by tid:
python -m datasets.videos.freq -o video_texts_freq_tid_17_nt -dn "video_texts_tid_17" -tid 17 -nt
All regions:
python -m datasets.videos.freq -o video_texts_freq_tid_all_nt -dn "video_texts_tid_all" -nt
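To make the frequency step more concrete, here is a minimal sketch of counting terms over already tokenized texts and writing the result to csv and pickle. It uses only the Python standard library; the function names, file layout, and sample tokens are assumptions for illustration and do not reflect the actual implementation of datasets.videos.freq.

```python
import csv
import pickle
from collections import Counter


def count_term_freqs(tokenized_docs):
    """Count how often each term occurs across all tokenized documents."""
    freqs = Counter()
    for tokens in tokenized_docs:
        freqs.update(tokens)
    return freqs


def save_freqs(freqs, csv_path, pickle_path):
    """Write (term, count) rows by descending count, plus a pickled dict."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["term", "count"])  # hypothetical column names
        writer.writerows(freqs.most_common())
    with open(pickle_path, "wb") as f:
        pickle.dump(dict(freqs), f)


# Hypothetical tokenized video texts (one token list per video).
docs = [["game", "live", "game"], ["music", "game"]]
freqs = count_term_freqs(docs)
```

Calling save_freqs(freqs, "freq.csv", "freq.pkl") would then produce both output files; the csv is convenient for inspection, while the pickle loads back directly as a dict.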