Add test and incorporate char.tsv to improve syllable error rate #6

AlienKevin · 2023-08-19T19:22:46Z

This PR makes two additions:

1. Adds a simple test using human-annotated Jyutping sentences from words.hk. The test module tests/test.py writes the correct and incorrect sentences to text files for inspection and reports the syllable error rate averaged over the entire corpus.
2. Incorporates char.tsv in preprocess.py so that the most frequent default (預設) Jyutping for each character can overwrite uncommon pronunciations in the jyut6ping3.dict.yaml file.

Benefits

After the second addition, the syllable error rate fell from 7.33% to 5.88%, a relative reduction of almost 20%.
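
For readers unfamiliar with the metric, here is a minimal sketch of how a syllable error rate and the relative reduction quoted above can be computed. It is a plain-Python stand-in, not the code in this PR (which relies on jiwer), and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of syllables."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def syllable_error_rate(reference, hypothesis):
    """Edits needed to turn hypothesis into reference, per reference syllable."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one syllable")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Fraction by which an error rate dropped, relative to the old rate."""
    return (before - after) / before


# One wrong tone out of three syllables gives an error rate of 1/3.
print(syllable_error_rate("nei5 hou2 maa3", "nei5 hou3 maa3"))
# The figures from this PR: (7.33 - 5.88) / 7.33 is a little under 0.20.
print(round(relative_reduction(7.33, 5.88), 4))
```

Note that the drop is 1.45 percentage points in absolute terms; "almost 20%" refers to the relative change.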

Future work

Handling polyphones more accurately may require proper word segmentation instead of the current longest-prefix match.
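
To make the limitation concrete, here is a rough sketch of greedy longest-prefix matching. The function name and the four-entry toy lexicon are assumptions for illustration only; the real converter and its dictionary are not shown in this PR.

```python
def longest_prefix_convert(text, lexicon):
    """Greedily convert text using the longest lexicon entry at each position."""
    max_len = max(map(len, lexicon), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in lexicon:
                out.extend(lexicon[chunk].split())
                i += n
                break
        else:
            # Unknown character: pass it through unchanged.
            out.append(text[i])
            i += 1
    return " ".join(out)


# Toy lexicon (illustrative only): 行 is a polyphone.
toy_lexicon = {
    "行": "haang4",
    "銀": "ngan4",
    "路": "lou6",
    "銀行": "ngan4 hong4",
}

print(longest_prefix_convert("銀行", toy_lexicon))  # multi-character entry wins
print(longest_prefix_convert("行路", toy_lexicon))  # falls back to single characters
```

The greedy match picks the right reading only when the longest entry at each position happens to coincide with the intended word boundary, which is why a proper segmenter may do better.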

graphemecluster · 2023-08-19T19:38:11Z

preprocess.py

@@ -4,6 +4,7 @@
 t2s = OpenCC('t2s').convert

 os.system('wget -nc https://raw.githubusercontent.com/rime/rime-cantonese/5b6d334/jyut6ping3.dict.yaml')


How much would the results be affected if we eliminated the downstream file and grabbed all files from upstream? Assuming we go with this, let's make upstream a submodule instead.

ayaka14732 · 2023-08-22

I think a submodule would be hard to maintain.

graphemecluster · 2023-08-22

How so? Isn't a submodule just a SHA, the same as if we manually updated the link in this file regularly?

ayaka14732 · 2023-08-23

I see. Then it may not be a problem.
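
For comparison with the submodule idea, the "pinned link" approach already used in preprocess.py can be sketched in pure Python as follows. The helper names are hypothetical; only the repository, revision, and file name come from the diff above.

```python
import os
import urllib.request


def pinned_raw_url(repo, revision, path):
    """Build a raw.githubusercontent.com URL pinned to one revision."""
    return f"https://raw.githubusercontent.com/{repo}/{revision}/{path}"


def fetch_if_missing(url, dest):
    """Download url to dest unless dest already exists (similar to `wget -nc`)."""
    if os.path.exists(dest):
        return False
    urllib.request.urlretrieve(url, dest)
    return True


# The revision is the only thing that changes when upstream is bumped,
# much like the commit recorded by a submodule.
url = pinned_raw_url("rime/rime-cantonese", "5b6d334", "jyut6ping3.dict.yaml")
print(url)
```

Either way, updating upstream means changing a single commit identifier; the difference is whether it lives in the script or in the repository's submodule metadata.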

laubonghaudoi · 2023-08-20T00:50:57Z

This PR looks quite large. Where do those txt files come from? How accurate are the annotations?

graphemecluster · 2023-08-20T07:54:45Z

@laubonghaudoi The accuracy is exactly what is written under Benefits above.

graphemecluster

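Where do the two *_base files come from? Are they the results before normalization?
Also, I would prefer not to include generated files. As for the other two results files, you can just upload them here or send them to us.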

graphemecluster · 2023-08-20T08:02:16Z

tests/test.py

+            reference = remove_ng_onset(normalize_nei_to_ni(reference))
+            hypothesis = remove_ng_onset(normalize_nei_to_ni(hypothesis))
+            if reference == hypothesis or \
+                diff_by_tone_only(reference, hypothesis) or \
+                diff_by_a(reference, hypothesis):
+                    continue


I suggest letting jiwer perform these normalizations by passing them as the third and fourth parameters to jiwer.wer. Currently, sentences that differ only by tone or -a are included in neither the correct nor the wrong sentences file.

AlienKevin · 2023-09-07

I filtered those sentences out because the differences are often only stylistic choices in the romanization, or sometimes personal preferences for 變調 (changed tones). However, it is generally hard to tell whether a difference is a true error or just a stylistic one, so I think it's important to still count those differences in the WER calculation.

I view the output sentences as an overview for humans to find and fix common error patterns. Since we already know that stylistic differences do occur and are often false alarms, I think we can safely filter them out of the sentences files to help us focus on more pressing and easier-to-fix issues.
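
One possible way to reconcile the two positions is to sort each sentence pair into three buckets instead of two, so that tone-only differences are neither lost nor mixed in with clear errors. The sketch below is an assumption, not the PR's code: remove_ng_onset and diff_by_tone_only are named in the diff, but their bodies are not shown, so these implementations are guesses, and classify is a hypothetical helper.

```python
import re


def strip_tones(jyutping):
    """Drop the trailing tone digit (1-6) from every syllable."""
    return re.sub(r"(?<=[a-z])[1-6]\b", "", jyutping)


def diff_by_tone_only(reference, hypothesis):
    """True if the strings differ, but only in their tone digits."""
    return (reference != hypothesis
            and strip_tones(reference) == strip_tones(hypothesis))


def remove_ng_onset(jyutping):
    """Drop an initial ng- before a vowel; syllabic ng (e.g. ng5) is kept."""
    return re.sub(r"\bng(?=[aeiou])", "", jyutping)


def classify(reference, hypothesis):
    """Return 'correct', 'stylistic', or 'wrong' for one sentence pair."""
    reference = remove_ng_onset(reference)
    hypothesis = remove_ng_onset(hypothesis)
    if reference == hypothesis:
        return "correct"
    if diff_by_tone_only(reference, hypothesis):
        return "stylistic"
    return "wrong"


print(classify("ngo5 hai6", "o5 hai6"))
print(classify("jat1 go3", "jat1 go2"))
print(classify("ngan4 hong4", "ngan4 hang4"))
```

With a third bucket written to its own file, the wrong-sentences file stays focused on easier-to-fix issues while the tone-only cases remain available for inspection.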

graphemecluster · 2023-08-20T08:10:42Z

If you have time, could you use ToJyutping.get_jyutping to tally which characters are wrong most often?
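
A tally like this could be sketched as below. To keep the example self-contained, the converter is passed in as a `predict` callable that returns one (character, jyutping) pair per character; whether ToJyutping exposes exactly that shape is an assumption here, and `tally_character_errors` and the stub predictor are hypothetical. It also assumes each sample contains only characters that have a reading, so text and reference align one to one.

```python
from collections import Counter


def tally_character_errors(samples, predict):
    """Count, per character, how often the predicted reading differs.

    samples: iterable of (text, reference_jyutping) pairs
    predict: callable mapping text to a list of (char, jyutping) pairs
    """
    errors = Counter()
    for text, reference in samples:
        ref_syllables = reference.split()
        predicted = predict(text)
        if len(predicted) != len(ref_syllables):
            continue  # cannot align character by character; skip this sample
        for (char, hyp), ref in zip(predicted, ref_syllables):
            if hyp != ref:
                errors[char] += 1
    return errors


# Stub predictor standing in for the real converter (illustrative readings).
stub_readings = {"行": "haang4", "銀": "ngan4"}


def stub_predict(text):
    return [(c, stub_readings.get(c)) for c in text]


samples = [("銀行", "ngan4 hong4"), ("行", "haang4")]
print(tally_character_errors(samples, stub_predict).most_common(5))
```

Sorting the counter with most_common gives a quick shortlist of characters whose default reading is worth reviewing.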

laubonghaudoi · 2023-08-20T17:11:45Z

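No, what I don't understand is why several large txt files need to be added to this package. They are not data the program depends on and are only used for benchmarking, so why add them to the package? What I meant by asking about the accuracy is this: I see that the txt files include both wrong and wrong base, so does that mean the data was annotated by humans rather than being the output of this program? And if it was annotated by humans, there would be an accuracy rate, right?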

AlienKevin · 2023-09-07T14:41:55Z

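> What I meant by asking about the accuracy is this: I see that the txt files include both wrong and wrong base, so does that mean the data was annotated by humans rather than being the output of this program? And if it was annotated by humans, there would be an accuracy rate, right?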

Sorry for the confusion. The base files are the output of the version before this PR while the non-base files are the output after this PR. Those files are meant to ease human inspection and shouldn't be packaged for a release. I'm not too familiar with how Python packages are released in general, so feel free to delete those files or put them into gitignore, etc.

laubonghaudoi · 2023-09-07T16:05:14Z

Then add them to .gitignore and delete them as well.

AlienKevin · 2023-09-08T21:29:57Z

> Then add them to .gitignore and delete them as well.

OK, done.

AlienKevin added 2 commits August 19, 2023 14:34

Add jyutping tests

4d02371

Use rime-cantonese-upstream char.csv; decreases syllable error rate f…

630faed

…rom 7.33% to 5.88%

AlienKevin mentioned this pull request Aug 19, 2023

「湯」默認讀音係棄用音 (The default reading of 「湯」 is a deprecated pronunciation) #5

Closed

graphemecluster requested review from ayaka14732, laubonghaudoi and graphemecluster August 19, 2023 19:28

graphemecluster reviewed Aug 19, 2023

graphemecluster reviewed Aug 20, 2023

Untrack test outputs

7333a7d
