-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store subword indices for quality scores in Response #357
Conversation
The bindings still expose the byteranges as they previously did, because otherwise I would also have to expose AnnotatedText etc. which is apparently not something we're doing right now. Because the bindings create a new score vector on the fly, you'll have to delete it when you no longer use it, unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving a few queries below. Mozilla and QE have negotiated some properties (meaningful units of QE = word, etc.). There was some threshold based operation at extension discussion as well. This was before HTML, the assumptions were plaintext QE - so we may revisit.
Requesting @mfomicheva, @abarbosa94 and @felipesantosk to take a look, since we're modifying the parts.
src/tests/common-impl.cpp
Outdated
for (const auto &wordByteRange : sentenceQualityEstimate.wordByteRanges) { | ||
for (const auto &wordRange : sentenceQualityEstimate.wordRanges) { | ||
const ByteRange wordByteRange{response.target.wordAsByteRange(sentenceIdx, wordRange.begin).begin, | ||
response.target.wordAsByteRange(sentenceIdx, wordRange.end).begin}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a possibility wordByteRange
ends up including an HTML opening/closing tag if two subwords end up with an HTML something in between?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. They are completely unsynchronized and they have to be for better or worse. I imagine HTML rendering will use something like #358 by copying the QE score tags.
} | ||
|
||
return words; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's happening here? There was a "words as meaningful QE units" notion before, are we losing this with the HTML integrated QE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The QualityEstimator class still works with this assumption, subwords are grouped into words in mapWords
:
bergamot-translator/src/translator/quality_estimator.cpp
Lines 241 to 268 in bce30cf
std::vector<SubwordRange> mapWords(const std::vector<float>& logProbs, const AnnotatedText& target, | |
const size_t sentenceIdx) { | |
// Ignore empty target | |
if ((logProbs.size() < 2) || (target.numWords(sentenceIdx) == 0)) { | |
return {}; | |
} | |
// It is expected that translated words will have at least one word | |
std::vector<SubwordRange> wordIndices(/*numWords=*/1); | |
/// The LogisticRegressorQualityEstimator model ignores the presence of the EOS token, and hence we only need to | |
/// iterate n-1 positions. | |
for (size_t subwordIdx = 0; subwordIdx < (logProbs.size() - 1); ++subwordIdx) { | |
ByteRange subword = target.wordAsByteRange(sentenceIdx, subwordIdx); | |
const char firstLetter = target.text.at(subword.begin); | |
// if the first character is whitespace, it's a beginning of a new word | |
if (isspace(firstLetter)) { | |
wordIndices.back().end = subwordIdx; | |
wordIndices.emplace_back(); | |
wordIndices.back().begin = subwordIdx; | |
} | |
} | |
wordIndices.back().end = logProbs.size() - 1; | |
return wordIndices; | |
} |
The bit of code I removed here turned those groupings into a single byterange. That has been moved to the WASM bindings bit since that's the last possible moment to do it.
I think most clients will find it useful to do this themselves by accessing the AnnotatedText's data. But since we don't expose those to Javascript, and I didn't want to start a half-assed attempt at reworking the Javascript API, I moved the functionality above to here:
bergamot-translator/wasm/bindings/response_bindings.cpp
Lines 33 to 52 in bce30cf
std::vector<SentenceQualityScore> getQualityScores(const Response& response) { | |
std::vector<SentenceQualityScore> scores; | |
scores.reserve(response.qualityScores.size()); | |
for (size_t sentenceIdx = 0; sentenceIdx < response.qualityScores.size(); ++sentenceIdx) { | |
std::vector<ByteRange> wordByteRanges; | |
wordByteRanges.reserve(response.qualityScores[sentenceIdx].wordRanges.size()); | |
for (auto&& word : response.qualityScores[sentenceIdx].wordRanges) { | |
wordByteRanges.emplace_back(); | |
wordByteRanges.back().begin = response.target.wordAsByteRange(sentenceIdx, word.begin).begin; | |
wordByteRanges.back().end = response.target.wordAsByteRange(sentenceIdx, word.end).begin; | |
} | |
scores.emplace_back(SentenceQualityScore{response.qualityScores[sentenceIdx].wordScores, std::move(wordByteRanges), | |
response.qualityScores[sentenceIdx].sentenceScore}); | |
} | |
return scores; | |
} |
Note that this does not work for HTML as the closing tags of the previous word will be placed before the spaces, so the spaces won't be skipped unfortunately.
Closing this in favour of #358 (which also includes part of this pull request) |
Fixes #355
This change stores the quality score per subword range instead of byte range in
Response
. When inserting HTML into the translation output, the byte offsets change but the subword offsets don't.I added a type for
SubwordRange
(previously just an alias forByteRange
) to avoid any confusion between the two. Now you can't compare or interchange them, even though they're both just ranges.The WASM bindings still produce a SentenceQualityScore that works exactly the same as now, i.e. with ByteRanges. In #355 I mused that Javascript could access the byteranges per word through
response.target.wordAsByteRange()
but neitherresponse.target
nor the typeAnnotatedText
are currently exposed through the bindings. So to not introduce much more new API at this point, I madegetQualityScores()
return the same thing it always did.Downside of this approach is that you'll need to
delete()
that returned object in Javascript land. I've changedworker.js
to do this.