Add test and incorporate char.tsv to improve syllable error rate #6
@@ -4,6 +4,7 @@
t2s = OpenCC('t2s').convert

os.system('wget -nc https://raw.githubusercontent.com/rime/rime-cantonese/5b6d334/jyut6ping3.dict.yaml')
How much will the results be affected if we eliminate the downstream file and grab all files from upstream? Presuming we are going with this, let’s make upstream a submodule instead.
I think a submodule would be hard to maintain.
How so? Isn't a submodule just a SHA, the same as if we manually updated the link in this file from time to time?
I see. Then it may not be a problem.
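To make the "a submodule is just a SHA" point concrete, here is a minimal sketch that keeps the pinned commit in one constant, so bumping upstream means changing a single value. The URL and commit come from the diff above; the names `RIME_CANTONESE_COMMIT`, `dict_url`, and `fetch_if_missing` are hypothetical and not part of the PR.

```python
import os
import urllib.request

# Pinned upstream commit, copied from the URL in the diff above.
RIME_CANTONESE_COMMIT = "5b6d334"
DICT_NAME = "jyut6ping3.dict.yaml"


def dict_url(commit: str = RIME_CANTONESE_COMMIT, name: str = DICT_NAME) -> str:
    """Build the raw.githubusercontent.com URL for a file at a pinned commit."""
    return f"https://raw.githubusercontent.com/rime/rime-cantonese/{commit}/{name}"


def fetch_if_missing(path: str = DICT_NAME) -> str:
    """Download the dictionary only if it is not already present (like `wget -nc`)."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(dict_url(), path)
    return path
```

Either approach pins the same thing; the difference is only whether Git or this script records the SHA.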
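This PR seems quite large. Where do those txt files come from? How accurate are the romanization annotations?
@laubonghaudoi The accuracy is exactly what is written in the Benefits section of the PR description.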
Where do those two *_base files come from? Are they the results before normalization?
Also, I prefer not to include generated files. As for the other two results files, just upload them or send them to us.
reference = remove_ng_onset(normalize_nei_to_ni(reference))
hypothesis = remove_ng_onset(normalize_nei_to_ni(hypothesis))
if reference == hypothesis or \
        diff_by_tone_only(reference, hypothesis) or \
        diff_by_a(reference, hypothesis):
    continue
I suggest letting jiwer perform these normalizations by passing the third and fourth parameters to jiwer.wer. Currently, sentences that differ only by tone or -a are included in neither the correct nor the wrong sentences file.
I filtered those sentences out because many times they are only stylistic differences in the romanization or sometimes personal preferences for 變調. However, it's generally hard to tell whether it's a true error or just stylistic difference so I think it's important to consider those differences in the calculation of WER.
I view the output sentences as an overview for humans to find and fix common error patterns. Since we already know that some stylistic differences do occur and are many times false alarms, I think we can safely filter them out in the sentences files to help us focus on more pressing and easier-to-fix issues.
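The filtering described in this thread can be sketched roughly as follows. The helper bodies are not shown in the diff, so the implementations of `remove_ng_onset` and `diff_by_tone_only` here are guesses based on their names. `strip_tones`, `classify`, and the separate "stylistic" bucket are hypothetical additions for illustration. `normalize_nei_to_ni` and `diff_by_a` are left out because their behaviour cannot be inferred from the excerpt.

```python
import re


def remove_ng_onset(jyutping: str) -> str:
    """Guess: drop a syllable-initial 'ng' before a vowel ('ngo5' -> 'o5').

    The syllabic nasal (e.g. 'ng5') and final '-ng' are left untouched.
    """
    return re.sub(r"\bng(?=[aeiou])", "", jyutping)


def strip_tones(jyutping: str) -> str:
    """Hypothetical helper: remove the tone digits 1-6."""
    return re.sub(r"[1-6]", "", jyutping)


def diff_by_tone_only(reference: str, hypothesis: str) -> bool:
    """Guess: True when the two strings differ, but only in tone digits."""
    return reference != hypothesis and strip_tones(reference) == strip_tones(hypothesis)


def classify(reference: str, hypothesis: str) -> str:
    """Sort a sentence pair into 'correct', 'stylistic', or 'wrong'."""
    reference = remove_ng_onset(reference)
    hypothesis = remove_ng_onset(hypothesis)
    if reference == hypothesis:
        return "correct"
    if diff_by_tone_only(reference, hypothesis):
        return "stylistic"
    return "wrong"
```

Writing the "stylistic" pairs to their own file would be one way to keep them out of the wrong sentences file without dropping them from the output entirely.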
If you have time, could you use ...
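No, what I don't understand is why several large txt files need to be added to this package. They are not data the program depends on and are only used for benchmarking, so why add them to the package? What I meant by asking about accuracy is this: I see that the txt files include both "wrong" and "wrong base", so does that mean the data was annotated by humans rather than produced by this program? And if it was annotated by humans, wouldn't it have its own accuracy rate?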
Sorry for the confusion. The base files are the output of the version before this PR while the non-base files are the output after this PR. Those files are meant to ease human inspection and shouldn't be packaged for a release. I'm not too familiar with how Python packages are released in general, so feel free to delete those files or put them into gitignore, etc.
Then add them to .gitignore and delete them as well.
OK, done.
This PR features 2 additions:
1. test/test.py outputs the correct and incorrect sentences as text files for inspection and also gives an averaged syllable error rate over the entire corpus.
2. preprocess.py is changed so that the most frequent default (預設) jyutpings for characters can overwrite uncommon pronunciations in the jyut6ping3.dict.yaml file.
Benefits
After addition 2, the syllable error rate fell by almost 20% in relative terms, from 7.33% to 5.88%.
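For reference, a syllable error rate of this kind can be sketched as an edit distance over space-separated syllables divided by the reference length. This is an illustrative stand-in, not the code in test/test.py; the function names are hypothetical, and whether the PR averages per sentence or pools counts over the corpus is not shown here. The last helper checks the "almost 20%" figure: (7.33 - 5.88) / 7.33 is about 0.198.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def syllable_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over syllables, normalized by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one syllable")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before
```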
Future work
May need proper word segmentation instead of longest prefix match for more accurate handling of polyphones.
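As an illustration of the longest prefix match mentioned here, the sketch below greedily takes the longest lexicon entry at each position. It is not the package's implementation; `longest_prefix_match` and `TOY_LEXICON` are hypothetical, and the toy entries are illustrative rather than taken from jyut6ping3.dict.yaml.

```python
# Toy lexicon for illustration only; 行 is a polyphone whose reading
# depends on the word it appears in.
TOY_LEXICON = {
    "我": "ngo5",
    "行": "haang4",
    "路": "lou6",
    "行路": "haang4 lou6",
    "銀行": "ngan4 hong4",
}


def longest_prefix_match(text: str, lexicon: dict) -> str:
    """Romanize `text` by greedily matching the longest lexicon entry."""
    max_len = max(map(len, lexicon), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in lexicon:
                out.append(lexicon[chunk])
                i += length
                break
        else:
            # Unknown character: pass it through unchanged.
            out.append(text[i])
            i += 1
    return " ".join(out)
```

Because the match is greedy, the reading of a polyphone is fixed by whichever entry happens to start earliest and run longest, which is where a proper word segmenter could do better.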