bili-search-algo

Algorithms and models for Bilibili Search Engine (blbl.top).

models.sentencepiece.train

Train a sentencepiece model on video texts:

python -m models.sentencepiece.train -m sp_507m_400k_0.9995_0.9 -ec -vs 400000 -cc 0.9995 -sf 0.9 -e

Test the trained model:

python -m models.sentencepiece.train -m sp_507m_400k_0.9995_0.9 -t
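For orientation, the training step can be approximated directly with the `sentencepiece` Python package. This is a minimal sketch, not the code behind `models.sentencepiece.train`. It assumes that `-vs`, `-cc` and `-sf` correspond to the trainer options `vocab_size`, `character_coverage` and `shrinking_factor` (inferred from the values in the model name), and the input file `video_texts.txt` is a hypothetical plain-text dump with one text per line.

```python
def build_trainer_kwargs(input_path, model_prefix, vocab_size=400_000,
                         character_coverage=0.9995, shrinking_factor=0.9):
    """Collect keyword arguments for sentencepiece.SentencePieceTrainer.train.

    The defaults mirror the values in the example command above; the mapping
    from the CLI flags to these option names is an assumption.
    """
    return {
        "input": input_path,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        "character_coverage": character_coverage,
        "shrinking_factor": shrinking_factor,
    }


def train(input_path, model_prefix, **overrides):
    """Train a model and return the path of the resulting .model file."""
    # Imported lazily so build_trainer_kwargs works without the package.
    import sentencepiece as spm

    kwargs = build_trainer_kwargs(input_path, model_prefix, **overrides)
    spm.SentencePieceTrainer.train(**kwargs)
    return f"{model_prefix}.model"


def tokenize(model_path, text):
    """Load a trained model and split text into subword pieces."""
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_path)
    return sp.encode(text, out_type=str)
```

Calling `train("video_texts.txt", "sp_507m_400k_0.9995_0.9")` and then `tokenize("sp_507m_400k_0.9995_0.9.model", some_text)` would roughly correspond to the train and test commands above, provided the corpus is large enough to support the requested vocabulary size.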

Merge sentencepiece models trained on different datasets:

python -m models.sentencepiece.merge

Tokenize video texts from the database and save them to parquet files. Used by datasets.videos.freq and models.fasttext.train.

python -m datasets.videos.cache -ec -dn video_texts_tid_all -fw 200 -bw 100 -bs 10000

Count video term frequencies from the database or parquet files, and save them to CSV and pickle files. Used by models.fasttext.train.

For a specific region (tid):

python -m datasets.videos.freq -o video_texts_freq_tid_17_nt -dn "video_texts_tid_17" -tid 17 -nt

All regions:

python -m datasets.videos.freq -o video_texts_freq_tid_all_nt -dn "video_texts_tid_all" -nt
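The counting step itself is simple to illustrate. The sketch below is not the implementation of `datasets.videos.freq`; it only shows one way to count term frequencies over already-tokenized texts and write them to a CSV file and a pickle file. The function names, the `term`/`freq` column names and the `.csv`/`.pkl` suffixes are assumptions made for illustration.

```python
import csv
import pickle
from collections import Counter
from pathlib import Path


def count_term_freqs(token_lists):
    """Count how often each term occurs across an iterable of token lists."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    return counter


def save_freqs(counter, output_prefix):
    """Write the counts to <prefix>.csv (most frequent first) and <prefix>.pkl."""
    csv_path = Path(f"{output_prefix}.csv")
    pkl_path = Path(f"{output_prefix}.pkl")

    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["term", "freq"])
        writer.writerows(counter.most_common())

    # Store a plain dict so the pickle loads without any project-specific classes.
    with pkl_path.open("wb") as f:
        pickle.dump(dict(counter), f)

    return csv_path, pkl_path
```

For example, `save_freqs(count_term_freqs(token_lists), "video_texts_freq_demo")` writes both files next to each other; the CSV is convenient for inspection, while the pickle is quicker to load from a downstream training script.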
