Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pure Scala IDNA implementation #422

Open
armanbilge opened this issue Aug 15, 2022 · 11 comments
Open

Pure Scala IDNA implementation #422

armanbilge opened this issue Aug 15, 2022 · 11 comments

Comments

@armanbilge
Copy link
Contributor

armanbilge commented Aug 15, 2022

The forthcoming Scala Native cross-build is currently using icu4c for IDNA. Although this works, it introduces various build complexities (see scala-native/scala-native#2778) and requires downstreams to have icu4c installed if they (indirectly) call IDNA related methods.

This motivates a pure Scala IDNA implementation.

  1. Does it make sense to do it in ip4s, or an external dependency (hosted where)?
    There is plenty of prior art to mimick. They all seem to use some form of source generator.

  2. Should we swap out the JVM and JS implementations as well?

Thoughts? Since nobody is complaining about the existing broken/deprecated implementations, do we care enough to do this right?

@mpilquist
Copy link
Member

I'd love a pure Scala implementation that's used by JVM/JS/Native - preferably without any dependencies. We could do this either inline here or as a standalone zero dependency project under typelevel.

@mpilquist
Copy link
Member

http://www.unicode.org/reports/tr46/ is a good reference, including the mapping table and instructions for how to implement the conversions.

@armanbilge
Copy link
Contributor Author

I completely forgot that @isomarcte was also working on this issue for cats-uri. Did you get anywhere interesting with it? Here are some messages from Discord.

I'm starting to prototype some of the IDNA2003/IDNA2008/UTS 46 stuff on a cats-uri branch. We'll see where that goes.

Punting where ip4s integrates for a moment, I think we may want to pull out IDNA/UTS 46 code into a new micro library. It could be shared between cats-uri and ip4s.

If we want to support IDNA2008, then we might have to have a pure implementation 😢 .

We could probably target IDNA2003 and platform out, but big DNS implementations like bind are already supporting it, and not supporting it can have both correctness and security issues...(according to UTS 46).
I mean
IDNA2008 (which is from 2010 🤦‍♂️) is already 12 years old at this point

The security issues are very real. For example see libuv/libuv#2046.


FWIW I have a branch that AFAIU implements "non-strict" IDNA 2008. Where by non-strict, I mean it doesn't reject invalid IDNs, but it does encode valid ones correctly to the best of my knowledge/understanding. This is fairly easy because IDNA 2008 doesn't specify a mapping step so it's basically just a Punycode implementation.

https://github.com/armanbilge/ip4s/tree/feature/idna

To make it strict we basically just need to load in this table of valid/invalid code points which would not be hard at all.
https://www.unicode.org/Public/idna/idna2008derived/Idna2008-14.0.0.txt

But note that IDNA 2008 rejects all sorts of stuff. Consider:

Example.com
comcast。com
i❤.ws
☃.net

All four of those are valid under IDNA 2003 and UTS46 but rejected by IDNA 2008.

https://util.unicode.org/UnicodeJsps/idna.jsp?a=Example.com%0D%0Acomcast%E3%80%82com%0D%0Ai%E2%9D%A4.ws%0D%0A%E2%98%83.net

That's because IDNA 2008 rejects capital letters, emojis, and the (I think). Hell, there's even a real website at https://☃.net aka https://xn--n3h.net/. I have no idea what happens to it in an IDNA 2008 world.

So even though IDNA 2008 is 12 years old it just seems impractical to implement it without the UTS46 mapping step. At the very least it would be a breaking change compared to the IDNA 2003 currently used on the JVM. Moreover getting IDNs rejected for innocent stuff like capital letters seems extremely annoying (although IDNA 2008 has good reasons).


As @mpilquist points out implementing UTS46 is more-or-less straightforward algorithmically, it just requires loading a bunch of codepoints and mappings and following the steps. This requires some cleverness for both the source-generator (so as not to generate huge bytecode / JS sources) and for the in-memory data structure (so that its memory/time efficient).

I spent a stupid amount of time last week studying the UTS46 code-generators (all written in Python) for the various languages I linked above and trying to prototype my own. FTR If we just tweak one of those Python scripts to make Scala instead of JavaScript or whatever it's probably not too bad.

I got a bit tangled up trying to rewrite the generator in Scala (so it can be part of the sbt build) and trying to exorcise all the mutation from it. I also became frustrated because still all those languages rely on additional unicode data and normalizing algorithms that we currently don't have on Scala Native.

@armanbilge
Copy link
Contributor Author

Ah, this seems like a relevant branch:
https://github.com/isomarcte/cats-uri/tree/bootstring

@isomarcte
Copy link

isomarcte commented Aug 26, 2022

👋 Yeah, after bootstring it seems pretty simple to get to IDNA2003/UTS46/IDNA2008.

The bootstring implementation on that branch works, but needs a cleanup pass and some benchmarks. I was waiting to discuss this until I had a full IDNA2003/UTS46/IDNA2008 implementation done, but I don't think bootstring/IDNA should live in cats-uri. I was thinking either, two repos, one for bootstring one for IDNA/UTS46 or just one repo, maybe typelevel/idna and we could have bootstring and idna modules. Of the two, I like the latter option of just having a single new repo that both cats-uri and ip4vs can use. Technically bootstring is a general purpose algorithm, but I don't know why anyone would use it for anything other than IDNA (I'm personally aware of no other uses.).

FWIW I have a branch that AFAIU implements "non-strict" IDNA 2008. Where by non-strict, I mean it doesn't reject invalid IDNs, but it does encode valid ones correctly to the best of my knowledge/understanding. This is fairly easy because IDNA 2008 doesn't specify a mapping step so it's basically just a Punycode implementation.

https://github.com/armanbilge/ip4s/tree/feature/idna

@armanbilge I don't know how far along this is or how it compares to what I have on the cats-uri branch. How do you want to proceed here?

FWIW, cats-uri is reasonably far along, but blocked by issues in case-insensitive. I think we've got those mostly fixed, I've just not been sure what to do about the glob matcher. I'm not certain a correct implementation of this can exist, as it assumes that any two caseless unicode strings have the same UTF-16 char length, which is not true (big old quagmire in there). If we were okay dropping the globbing, I could get case-insensitive 2.x.x wrapped pretty quick.

Thoughts?

@armanbilge
Copy link
Contributor Author

@isomarcte thank you so much for chiming in! That's fantastic news :)

I don't know how far along this is or how it compares to what I have on the cats-uri branch. How do you want to proceed here?

I just did a dumb Java -> Scala translation of the Punycode implementation in icu4j. What you have is better ... much better 😆

With that said, I think creating a typelevel/idna repo that can be shared makes sense to me. I can set this up and migrate your bootstring work, how does that sound?

I was waiting to discuss this until I had a full IDNA2003/UTS46/IDNA2008 implementation done

I don't think we need IDNA 2003 really. UTS46+IDNA 2008 are good though. So sounds like you are planning to work on this? No rush, just happy someone has a plan for it :)


FWIW, cats-uri is reasonably far along

This is great to hear, thanks for all your work on that project :) regarding the glob issues, would you mind opening an issue on case-insensitive about it?

@mpilquist
Copy link
Member

I like the way the plan is shaping up! If we have a typelevel/idna repo, would we need anything in ip4s to integrate with it? Or would we just deprecate com.comcast.ip4s.IDN in favor of some new type in typelevel/idna?

@armanbilge
Copy link
Contributor Author

I was imagining that typelevel/idna wouldn't introduce any new datatypes but provide implementations of the toAscii and toUnicode methods. In which case we would keep the IDN type here, just swap out the implementations here.

private[ip4s] def toAscii(value: String): Option[String] =
Try(java.net.IDN.toASCII(value)).toOption
private[ip4s] def toUnicode(value: String): String =
java.net.IDN.toUnicode(value)

But maybe @isomarcte has something else in mind? :)

@mpilquist
Copy link
Member

Oh I see, okay!

@isomarcte
Copy link

isomarcte commented Aug 26, 2022

My general approach has been to have both pure functions on the underlying type, as you have above @armanbilge, and also to have newtypes, e.g.

sealed abstract class IDNA2008 {
  def asUnicode: String
  def asAscii: String
}

object IDNA2008 {
  def fromAscii(value: String): Either[String, IDNA2008] = decode(value)
  def fromUnicode(value: String): Either[String IDNA2008] = encode(value)
  def toAscii(unicode: String): Either[String, String] = decodeRaw(value)
  def toUnicode(ascii: String): Either[String, String] = encodeRaw(value)
}

Something like that, and then let the user's decide how deep they want to get into newtypes. So I think it should be easy to integrate it.

@armanbilge
Copy link
Contributor Author

armanbilge commented Aug 26, 2022

I extracted a new repository from @isomarcte's bootstring branch and got it publishing snapshots.

https://github.com/typelevel/idna4s (apologies the "4s" just slipped out 😜 )

Everything subject to change/bikeshed of course :) I wanted to get something up so that we have some options for cross-building ip4s to Scala Native.

I think as a short-term solution we could "implement" IDN on Native just Punycode ... this is what the JS implementation is currently doing here /shrug. I'm no expert but I don't think there are security issues: IIUC without the mapping step, some otherwise valid IDNs would be transformed to ASCII that are invalid/impossible domains. Bad UX though.

Furthermore, there aren't real bincompat concerns, considering:

  1. low-adoption of Native
  2. unlikeliness of idna4s to be involved in a triangle at this early stage
  3. that the Native ecosystem will break in the near future anyway with the 0.5.x release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants