Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for non-ASCII characters #5

Open
sykesd opened this issue Aug 15, 2017 · 4 comments
Open

Support for non-ASCII characters #5

sykesd opened this issue Aug 15, 2017 · 4 comments

Comments

@sykesd
Copy link

sykesd commented Aug 15, 2017

Thanks for this library. It looks fantastic.

However, it appears that the current implementation does not support any characters outside of the ASCII 0-127 range. Specifically, this condition in EdgeBag.get(char c) seems to trigger if a character with code > 127 appears in the input text:

    public Edge get(char c) {
        if (c != (char) (byte) c) {
            throw new IllegalArgumentException("Illegal input character " + c + ".");
        }
...

I am happy to dig in and try and implement support for at least the normal Java char range of characters, but before I do I was wondering if there is any inherent reason for the current limitation?

My application that I am considering this library for is part of search function over a large text index, and I need to support multiple languages most of which use characters outside the range currently supported.

@abahgat
Copy link
Owner

abahgat commented Aug 17, 2017 via email

@sykesd
Copy link
Author

sykesd commented Aug 17, 2017

Thanks for replying.

I ended up forking your repo and implementing support for Unicode. It still needs some more testing involving actual surrogate pairs to ensure correctness, but for now it seems to work.

If you would like me to complete it and create a PR for you, let me know.

For now I have left your EdgeBag implementation, just modified it to allow any valid Unicode code point.

@abahgat
Copy link
Owner

abahgat commented Sep 6, 2017

Thanks, and apologies for the late reply.
I'm curious, do you have a sense of how the memory footprint of your solution compares with respect to replacing EdgeBag with a HashMap<Char, Edge>?

@sykesd
Copy link
Author

sykesd commented Sep 7, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants