Pure Scala IDNA implementation #422
Comments
I'd love a pure Scala implementation that's used by JVM/JS/Native - preferably without any dependencies. We could do this either inline here or as a standalone zero dependency project under typelevel. |
http://www.unicode.org/reports/tr46/ is a good reference, including the mapping table and instructions for how to implement the conversions. |
I completely forgot that @isomarcte was also working on this issue for cats-uri. Did you get anywhere interesting with it? Here are some messages from Discord.
The security issues are very real. For example see libuv/libuv#2046. FWIW I have a branch that AFAIU implements "non-strict" IDNA 2008. Where by non-strict, I mean it doesn't reject invalid IDNs, but it does encode valid ones correctly to the best of my knowledge/understanding. This is fairly easy because IDNA 2008 doesn't specify a mapping step so it's basically just a Punycode implementation. https://github.com/armanbilge/ip4s/tree/feature/idna To make it strict we basically just need to load in this table of valid/invalid code points which would not be hard at all. But note that IDNA 2008 rejects all sorts of stuff. Consider:
All four of those are valid under IDNA 2003 and UTS46 but rejected by IDNA 2008. That's because IDNA 2008 rejects capital letters, emojis, and the
So even though IDNA 2008 is 12 years old, it just seems impractical to implement it without the UTS46 mapping step. At the very least it would be a breaking change compared to the IDNA 2003 currently used on the JVM. Moreover, getting IDNs rejected for innocent stuff like capital letters seems extremely annoying (although IDNA 2008 has good reasons).
As @mpilquist points out, implementing UTS46 is more-or-less straightforward algorithmically: it just requires loading a bunch of code points and mappings and following the steps. This requires some cleverness for both the source generator (so as not to generate huge bytecode / JS sources) and for the in-memory data structure (so that it's memory/time efficient).
I spent a stupid amount of time last week studying the UTS46 code generators (all written in Python) for the various languages I linked above and trying to prototype my own. FTR, if we just tweak one of those Python scripts to emit Scala instead of JavaScript or whatever, it's probably not too bad. I got a bit tangled up trying to rewrite the generator in Scala (so it can be part of the sbt build) and trying to exorcise all the mutation from it. I also became frustrated because all those languages still rely on additional Unicode data and normalization algorithms that we currently don't have on Scala Native. |
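To make the "in-memory data structure" point above concrete, here is one possible shape for it: sorted, non-overlapping code point ranges held in parallel arrays and queried by binary search. This is only a sketch under stated assumptions. The `RangeTableSketch` object, its `Status` type, and the handful of rows are hypothetical and hand-written for illustration; they are not the real UTS46 table and not what any of the linked generators produce. Treating unlisted code points as disallowed is also just a simplification for this sketch.

```scala
object RangeTableSketch {
  sealed trait Status
  case object Valid extends Status
  case object Mapped extends Status
  case object Ignored extends Status
  case object Disallowed extends Status

  // Illustrative rows only (inclusive ranges, sorted by start, non-overlapping).
  // A real table would be produced by a source generator from the UTS46 data file.
  private val rows: Array[(Int, Int, Status)] = Array(
    (0x30, 0x39, Valid),  // ASCII digits
    (0x41, 0x5A, Mapped), // ASCII capital letters (mapped to lowercase)
    (0x61, 0x7A, Valid),  // ASCII lowercase letters
    (0xAD, 0xAD, Ignored) // soft hyphen
  )

  // Parallel arrays keep the data compact and make binary search straightforward.
  private val starts: Array[Int] = rows.map(_._1)
  private val ends: Array[Int] = rows.map(_._2)
  private val statuses: Array[Status] = rows.map(_._3)

  /** Looks up the status of a code point in O(log n) time. */
  def lookup(codePoint: Int): Status = {
    val i = java.util.Arrays.binarySearch(starts, codePoint)
    // On a miss, binarySearch returns -(insertionPoint) - 1; the only range that
    // can contain the code point is the one just before the insertion point.
    val idx = if (i >= 0) i else -i - 2
    if (idx >= 0 && codePoint <= ends(idx)) statuses(idx)
    else Disallowed // assumption of this sketch: anything unlisted is disallowed
  }
}
```

A mapping table would additionally need to store the replacement code points for `Mapped` ranges; that part is left out here to keep the lookup idea visible.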
Ah, this seems like a relevant branch: |
👋 Yeah, after bootstring it seems pretty simple to get to IDNA2003/UTS46/IDNA2008. The bootstring implementation on that branch works, but needs a cleanup pass and some benchmarks. I was waiting to discuss this until I had a full IDNA2003/UTS46/IDNA2008 implementation done, but I don't think bootstring/IDNA should live in cats-uri. I was thinking either, two repos, one for bootstring one for IDNA/UTS46 or just one repo, maybe
@armanbilge I don't know how far along this is or how it compares to what I have on the cats-uri branch. How do you want to proceed here? FWIW, cats-uri is reasonably far along, but blocked by issues in case-insensitive. I think we've got those mostly fixed, I've just not been sure what to do about the glob matcher. I'm not certain a correct implementation of this can exist, as it assumes that any two Unicode strings that are equal ignoring case have the same UTF-16 char length, which is not true (big old quagmire in there). If we were okay dropping the globbing, I could get case-insensitive 2.x.x wrapped pretty quickly. Thoughts? |
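A small illustration of the length problem mentioned above, assuming JVM `String` semantics: uppercasing "ß" yields "SS", so two strings that agree after case conversion can differ in UTF-16 length. The `CaseLengthSketch` object is a hypothetical name used only for this example, and it says nothing about how the glob matcher in case-insensitive is actually written.

```scala
import java.util.Locale

object CaseLengthSketch {
  // "straße" written with an escape so the source stays ASCII-only.
  val lower: String = "stra\u00dfe"

  // Full-string uppercasing expands U+00DF (ß) to the two characters "SS".
  val upper: String = lower.toUpperCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    println(s"${lower.length} chars -> ${upper.length} chars") // lengths differ: 6 vs 7
    println(upper == "STRASSE".toUpperCase(Locale.ROOT))       // same string after uppercasing
  }
}
```

Any matcher that walks two such strings position by position, assuming index `i` in one corresponds to index `i` in the other, will misalign after the expanding character.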
@isomarcte thank you so much for chiming in! That's fantastic news :)
I just did a dumb Java -> Scala translation of the Punycode implementation in icu4j. What you have is better ... much better 😆 With that said, I think creating a
I don't think we need IDNA 2003 really. UTS46+IDNA 2008 are good though. So sounds like you are planning to work on this? No rush, just happy someone has a plan for it :)
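For readers following along, the Punycode (bootstring) encoding step discussed above can be sketched roughly as below. This follows the encoder pseudocode in RFC 3492, but it is not the code from either branch: `PunycodeSketch` and its helpers are hypothetical names, overflow detection is omitted, and there is no validation or mapping at all, so it is "non-strict" in the sense used earlier in this thread.

```scala
object PunycodeSketch {
  // Bootstring parameters for Punycode (RFC 3492, section 5).
  private val Base = 36
  private val TMin = 1
  private val TMax = 26
  private val Skew = 38
  private val Damp = 700
  private val InitialBias = 72
  private val InitialN = 128

  // Digit values 0..25 map to 'a'..'z', 26..35 map to '0'..'9'.
  private def digit(d: Int): Char =
    if (d < 26) ('a' + d).toChar else ('0' + (d - 26)).toChar

  // Bias adaptation (RFC 3492, section 6.1).
  private def adapt(delta0: Int, numPoints: Int, first: Boolean): Int = {
    var delta = if (first) delta0 / Damp else delta0 / 2
    delta += delta / numPoints
    var k = 0
    while (delta > ((Base - TMin) * TMax) / 2) {
      delta /= (Base - TMin)
      k += Base
    }
    k + ((Base - TMin + 1) * delta) / (delta + Skew)
  }

  // Splits a string into code points, combining surrogate pairs.
  private def codePoints(s: String): Vector[Int] = {
    val b = Vector.newBuilder[Int]
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)
      b += cp
      i += Character.charCount(cp)
    }
    b.result()
  }

  /** Encodes a single label, without the "xn--" prefix. Overflow checks are omitted. */
  def encode(label: String): String = {
    val cps = codePoints(label)
    val basic = cps.filter(_ < 0x80)
    val out = new StringBuilder
    basic.foreach(cp => out.append(cp.toChar))
    if (basic.nonEmpty) out.append('-')
    var n = InitialN
    var delta = 0
    var bias = InitialBias
    var h = basic.length // number of code points handled so far
    while (h < cps.length) {
      val m = cps.filter(_ >= n).min // next code point to insert
      delta += (m - n) * (h + 1)
      n = m
      cps.foreach { c =>
        if (c < n) delta += 1
        else if (c == n) {
          // Emit delta as a generalized variable-length integer.
          var q = delta
          var k = Base
          var done = false
          while (!done) {
            val t = if (k <= bias) TMin else if (k >= bias + TMax) TMax else k - bias
            if (q < t) done = true
            else {
              out.append(digit(t + (q - t) % (Base - t)))
              q = (q - t) / (Base - t)
              k += Base
            }
          }
          out.append(digit(q))
          bias = adapt(delta, h + 1, h == basic.length)
          delta = 0
          h += 1
        }
      }
      delta += 1
      n += 1
    }
    out.toString
  }
}
```

With this sketch, `"xn--" + PunycodeSketch.encode("bücher")` gives `xn--bcher-kva`. A real ToASCII operation would also leave ASCII-only labels unencoded and, for UTS46, run the mapping and normalization steps first.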
This is great to hear, thanks for all your work on that project :) regarding the glob issues, would you mind opening an issue on case-insensitive about it? |
I like the way the plan is shaping up! If we have a typelevel/idna repo, would we need anything in ip4s to integrate with it? Or would we just deprecate |
I was imagining that typelevel/idna wouldn't introduce any new datatypes but provide implementations of the ip4s/jvm/src/main/scala/com/comcast/ip4s/IDNCompanionPlatform.scala Lines 22 to 26 in d84a662
But maybe @isomarcte has something else in mind? :) |
Oh I see, okay! |
My general approach has been to have both pure functions on the underlying type, as you have above @armanbilge, and also to have newtypes, e.g.
sealed abstract class IDNA2008 {
  def asUnicode: String
  def asAscii: String
}
object IDNA2008 {
  def fromAscii(value: String): Either[String, IDNA2008] = decode(value)
  def fromUnicode(value: String): Either[String, IDNA2008] = encode(value)
  def toAscii(unicode: String): Either[String, String] = encodeRaw(unicode)
  def toUnicode(ascii: String): Either[String, String] = decodeRaw(ascii)
}
Something like that, and then let users decide how deep they want to get into newtypes. So I think it should be easy to integrate it. |
I extracted a new repository from @isomarcte's bootstring branch and got it publishing snapshots. https://github.com/typelevel/idna4s (apologies the "4s" just slipped out 😜 ) Everything subject to change/bikeshed of course :) I wanted to get something up so that we have some options for cross-building ip4s to Scala Native. I think as a short-term solution we could "implement" Furthermore, there aren't real bincompat concerns, considering:
|
The forthcoming Scala Native cross-build is currently using icu4c for IDNA. Although this works, it introduces various build complexities (see scala-native/scala-native#2778) and requires downstreams to have icu4c installed if they (indirectly) call IDNA related methods.
This motivates a pure Scala IDNA implementation.
Does it make sense to do it in ip4s, or an external dependency (hosted where)?
There is plenty of prior art to mimic. They all seem to use some form of source generator.
Should we swap out the JVM and JS implementations as well?
Critically, the current JS implementation is broken/wrong, because it uses punycode.js without nameprep.
The RFC specifically says:
https://www.ietf.org/rfc/rfc3490.txt
See also Indicate nameprep support mathiasbynens/punycode.js#40.
In theory we could shell out to an npm package but dealing with npm from Scala.js is terrible.
https://www.npmjs.com/package/idna-uts46
Meanwhile, the JVM implementation is using JDK APIs that are based on IDNA 2003 which IIUC is deprecated in favor of IDNA 2008. We could add a dependency to icu4j.
Thoughts? Since nobody is complaining about the existing broken/deprecated implementations, do we care enough to do this right?