Support for non-ASCII characters #5
Thanks!
For context: when I originally developed this, I was targeting a more
specific use case, namely a low-memory environment with a large number of
ASCII strings to index. Using a very specialized, compact `EdgeBag` instead
of a more general (but larger) `Map` made sense at that time.
I later changed `EdgeBag` to implement a `Map`, so if you were to use (for
example) a `HashMap` for keeping track of edges, you would hopefully get
the behavior you need at the cost of possibly needing more memory.
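
To illustrate the trade-off described above, here is a minimal sketch of keeping edges in a `HashMap` keyed by character. `SketchEdge` and `MapEdgeStore` are hypothetical stand-ins invented for this example; they are not the library's actual `Edge` or `EdgeBag` classes, and the real integration point may look different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the library's Edge type; only a label is modeled.
class SketchEdge {
    final String label;

    SketchEdge(String label) {
        this.label = label;
    }
}

// Hypothetical edge container backed by a general-purpose Map.
// Any char value is accepted as a key, at the cost of boxing each key
// into a Character and paying HashMap's per-entry overhead.
class MapEdgeStore {
    private final Map<Character, SketchEdge> edges = new HashMap<>();

    void put(char c, SketchEdge e) {
        edges.put(c, e);
    }

    // Returns the edge stored under c, or null if there is none.
    SketchEdge get(char c) {
        return edges.get(c);
    }

    int size() {
        return edges.size();
    }

    public static void main(String[] args) {
        MapEdgeStore store = new MapEdgeStore();
        store.put('a', new SketchEdge("abc"));
        store.put('\u00e9', new SketchEdge("\u00e9cole")); // U+00E9, outside ASCII
        store.put('\u4e2d', new SketchEdge("\u4e2d\u6587")); // U+4E2D, a CJK character
        System.out.println(store.get('\u00e9').label);
        System.out.println(store.get('z')); // no edge stored under 'z'
    }
}
```

The point of the sketch is only that a `Map` places no restriction on the key range; how much extra memory it needs compared with a compact array-based structure would have to be measured.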
…On Tue, Aug 15, 2017 at 4:12 AM sykesd ***@***.***> wrote:
Thanks for this library. It looks fantastic.
However, it appears that the current implementation does not support any
characters outside of the ASCII 0-127 range. Specifically, this condition
in EdgeBag.get(char c) seems to trigger if a character with code > 127
appears in the input text:
    public Edge get(char c) {
        if (c != (char) (byte) c) {
            throw new IllegalArgumentException("Illegal input character " + c + ".");
        }...
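
For reference, the guard above relies on Java's primitive conversions: narrowing a `char` to `byte` keeps only the low 8 bits, and converting the `byte` back to `char` sign-extends it. The small sketch below (class and method names are made up for illustration) evaluates the same expression on a few sample values.

```java
class AsciiCheckDemo {
    // The guard's expression, inverted: true when c survives a
    // char -> byte -> char round trip unchanged.
    static boolean survivesByteRoundTrip(char c) {
        return c == (char) (byte) c;
    }

    public static void main(String[] args) {
        // U+0041 and U+007F fit in 7 bits; U+0080 and U+00E9 have the
        // high bit of the low byte set; U+4E2D needs more than 8 bits.
        char[] samples = { 'A', '\u007f', '\u0080', '\u00e9', '\u4e2d', '\uff80' };
        for (char c : samples) {
            System.out.printf("U+%04X -> %b%n", (int) c, survivesByteRoundTrip(c));
        }
    }
}
```

One side effect of the sign extension is that values in the range U+FF80 to U+FFFF also survive the round trip, so the expression by itself is not a strict "0 to 127" test.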
I am happy to dig in and try to implement support for at least the normal
Java `char` range of characters, but before I do, I was wondering whether
there is any inherent reason for the current limitation?
The application I am considering this library for is part of a search
function over a large text index, and I need to support multiple languages,
most of which use characters outside the range currently supported.
Thanks for replying. I ended up forking your repo and implementing support for Unicode. It still needs some more testing involving actual surrogate pairs to ensure correctness, but for now it seems to work. If you would like me to complete it and create a PR for you, let me know. For now I have left your |
Thanks, and apologies for the late reply. |
No, I did not do an evaluation. I found it easier to just convert your
existing code to work with int instead of byte. Memory footprint is
acceptable for my purpose, even if not necessarily optimal.
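
As background on working with `int` values and on the surrogate pairs mentioned earlier in the thread: a Java `String` stores UTF-16 code units, so a character outside the Basic Multilingual Plane occupies two `char` values but a single `int` code point. The sketch below uses only standard `String` and `Character` methods; the class and helper names are arbitrary, and it does not reflect how the fork is actually implemented.

```java
class CodePointDemo {
    // Number of Unicode code points in s (a surrogate pair counts once).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Expands s into an int[] of code points, the unit an int-keyed
    // structure would consume instead of individual char values.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, written as its surrogate pair.
        String s = "a\uD83D\uDE00";
        System.out.println("chars: " + s.length());
        System.out.println("code points: " + codePointLength(s));
        for (int cp : toCodePoints(s)) {
            System.out.printf("U+%04X supplementary=%b%n",
                    cp, Character.isSupplementaryCodePoint(cp));
        }
    }
}
```

Iterating by code point rather than by `char` is one way to make sure a surrogate pair is never split across two edges.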
…On Wednesday, September 6, 2017, Alessandro Bahgat ***@***.***> wrote:
Thanks, and apologies for the late reply.
I'm curious, do you have a sense of how the memory footprint of your
solution compares with replacing `EdgeBag` with a `HashMap<Character,
Edge>`?