-
Notifications
You must be signed in to change notification settings - Fork 146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The unnatural encoding of current implementation #54
Comments
Have this problem ever been addressed? Apart from this the charsets are not stable between different training datasets, yielding incompatible models. |
I suggest checking out the paper and repo I cite in #62. It also has pretrained models if you need that. |
@pechersky do you accept pull requests? I've made some improvements to your |
Yeah, go ahead and make a PR.
…On Thu, Mar 30, 2017 at 8:48 AM, Eli ***@***.***> wrote:
@pechersky <https://github.com/pechersky> do you accept pull requests?
I've made some improvements to your preprocessing routine and the CLI.
Most importantly, I changed the parsing scheme to address the issues
mentioned here.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#54 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFGDhiUCRPs3JDSigy6wG3O-EXw4DTFSks5rq6SYgaJpZM4LapUj>
.
|
After testing, I found that the procedure of building a list of the unique characters used in the dataset (The "charset") is wired. Current encoding will make the resulting output much fragile, because we didn't avoid the situation of Cl interpreted as "C", "l". For example, we should treat 'Cl' as independent character rather than 'C' and 'l' directly. It chemically unreasonable to see 'l' along.
The text was updated successfully, but these errors were encountered: