Support for non-ASCII characters #5

sykesd · 2017-08-15T08:12:34Z

Thanks for this library. It looks fantastic.

However, the current implementation does not appear to support characters outside the ASCII range (0-127). Specifically, this condition in `EdgeBag.get(char c)` seems to trigger whenever a character with a code above 127 appears in the input text:

    public Edge get(char c) {
        if (c != (char) (byte) c) {
            throw new IllegalArgumentException("Illegal input character " + c + ".");
        }
...
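
To make the effect of that condition concrete, here is a minimal, self-contained sketch. The class and method names (`AsciiCheckDemo`, `survivesByteRoundTrip`) are made up for illustration and are not part of the library; only the cast expression is taken from the snippet above.

```java
// Hypothetical demo class, not part of the library.
final class AsciiCheckDemo {
    // Negation of the quoted condition: true when the char is unchanged
    // after being narrowed to byte and widened back to char.
    static boolean survivesByteRoundTrip(char c) {
        return c == (char) (byte) c;
    }

    public static void main(String[] args) {
        System.out.println(survivesByteRoundTrip('a'));      // true  (0x61)
        System.out.println(survivesByteRoundTrip('\u007f')); // true  (highest ASCII code)
        System.out.println(survivesByteRoundTrip('\u00e9')); // false (é: byte is -23, widened back to 0xFFE9)
    }
}
```

The cast to `byte` keeps only the low 8 bits and the cast back sign-extends them, so a value such as 0xE9 comes back as 0xFFE9 and the comparison fails.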

I am happy to dig in and try to implement support for at least the normal Java char range, but before I do, I was wondering whether there is any inherent reason for the current limitation.

The application I am considering this library for is part of a search function over a large text index, and I need to support multiple languages, most of which use characters outside the range currently supported.

abahgat · 2017-08-17T01:00:58Z

Thanks! For context: when I originally developed this, I was targeting a more specific use case, a low-memory environment with a large number of ASCII strings to index, so using a very specialized (compact) `EdgeBag` instead of a more general (but larger) `Map` made sense at the time. I later changed `EdgeBag` to implement `Map`, so if you were to use (for example) a `HashMap` to keep track of edges, you would hopefully get the behavior you need, possibly at the cost of more memory.
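
As a rough illustration of the `HashMap` suggestion, the sketch below keys edges by `Character`. It makes assumptions: the nested `Edge` class is a hypothetical stand-in for the library's own `Edge`, and how the map would actually be wired into a node is not shown because the thread does not describe it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the library's code.
final class HashMapEdgesSketch {
    /** Stand-in for the library's Edge class; only a label is modelled. */
    static final class Edge {
        final String label;

        Edge(String label) {
            this.label = label;
        }
    }

    /** A general-purpose edge map: any char is a valid key, ASCII or not. */
    static Map<Character, Edge> newEdgeMap() {
        return new HashMap<>();
    }

    public static void main(String[] args) {
        Map<Character, Edge> edges = newEdgeMap();
        edges.put('a', new Edge("abc"));
        edges.put('\u00e9', new Edge("\u00e9t\u00e9")); // 'é' is accepted without any range check
        System.out.println(edges.get('\u00e9').label);
    }
}
```

Each entry in a `HashMap<Character, Edge>` carries a boxed key and an entry object, which is where the extra memory mentioned above would come from.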

sykesd · 2017-08-17T01:36:25Z

Thanks for replying.

I ended up forking your repo and implementing support for Unicode. It still needs some more testing involving actual surrogate pairs to ensure correctness, but for now it seems to work.

If you would like me to complete it and create a PR for you, let me know.

For now I have kept your `EdgeBag` implementation and only modified it to allow any valid Unicode code point.
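
For readers unfamiliar with why surrogate pairs need separate testing: a code point above U+FFFF occupies two `char` values in a Java `String`, so input has to be walked by code point rather than by `char`. The sketch below shows one way to do that with standard `Character` methods; the class and method names are hypothetical and this is not the code from the fork.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the fork.
final class CodePointWalk {
    /** Returns the code points of text, treating each surrogate pair as one unit. */
    static List<Integer> codePointsOf(String text) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);   // combines a high/low surrogate pair if present
            result.add(cp);
            i += Character.charCount(cp);   // advances by 1 or 2 chars
        }
        return result;
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, which is stored as the pair D83D DE00.
        List<Integer> cps = codePointsOf("a\uD83D\uDE00");
        System.out.println(cps.size());                      // 2
        System.out.println(Integer.toHexString(cps.get(1))); // 1f600
    }
}
```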

abahgat · 2017-09-06T02:26:21Z

Thanks, and apologies for the late reply.
I'm curious: do you have a sense of how the memory footprint of your solution compares with replacing `EdgeBag` with a `HashMap<Character, Edge>`?

sykesd · 2017-09-07T01:16:06Z

No, I did not do an evaluation. I found it easier to just convert your existing code to work with int instead of byte. Memory footprint is acceptable for my purpose, even if not necessarily optimal.
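
To illustrate the general idea of a compact bag keyed by `int` code points, here is a minimal sketch using parallel sorted arrays and binary search. It is an assumption about what such a structure could look like, not the fork's code or the library's `EdgeBag`; the class name `IntKeyedBag` is made up.

```java
import java.util.Arrays;

// Hypothetical sketch of a compact, int-keyed bag; not the fork's implementation.
final class IntKeyedBag<V> {
    private int[] keys = new int[0];          // code points, kept sorted
    private Object[] values = new Object[0];  // values[i] belongs to keys[i]

    void put(int codePoint, V value) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Illegal code point " + codePoint + ".");
        }
        int idx = Arrays.binarySearch(keys, codePoint);
        if (idx >= 0) {               // key already present: overwrite
            values[idx] = value;
            return;
        }
        int at = -(idx + 1);          // insertion point that keeps keys sorted
        int[] newKeys = new int[keys.length + 1];
        Object[] newValues = new Object[values.length + 1];
        System.arraycopy(keys, 0, newKeys, 0, at);
        System.arraycopy(values, 0, newValues, 0, at);
        newKeys[at] = codePoint;
        newValues[at] = value;
        System.arraycopy(keys, at, newKeys, at + 1, keys.length - at);
        System.arraycopy(values, at, newValues, at + 1, values.length - at);
        keys = newKeys;
        values = newValues;
    }

    @SuppressWarnings("unchecked")
    V get(int codePoint) {
        int idx = Arrays.binarySearch(keys, codePoint);
        return idx >= 0 ? (V) values[idx] : null;
    }

    int size() {
        return keys.length;
    }
}
```

A structure like this stores four bytes per key instead of one, but avoids the per-entry objects and boxed keys of a `HashMap`.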
