-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[stdlib] Un-deprecate String.__iter__()
#3984
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Your example only works, because 🔥 is one codepoint. Most emojis have Extend and ZWJ characters. |
Chris Lattner also said, that grapheme clusters should be default |
@leb-kuchen my tasty Christmas cookie friend, grapheme cluster indexing is very expensive even more so than unicode codepoint. IMO we should find a way to do it customizable through parameters. Python uses unicode indexing by default, and a lot of code in production works using that assumption. My main problem is going for byte indexing as default which will break a lot of Python code, and is very unintuitive for non ascii languages (since you have to explain utf8 to explain why taking the first x characters of a set of multi byte sequences gives back wrong results).
Don't get me wrong, I do agree they would be great as default (in theory), just that they are very expensive compared to just checking the first byte.
TL;DR: IMO we should support all 3 types of indexing, just using unicode codepoints as default. |
I’ve subscribed to follow Mojo’s development. Is there an option or indexing feature between the 3 - codepoint indexing, byte indexing, and grapheme cluster indexing or is that super niche. |
I would be more concerned about the tables, than about the iteration speed. If graphemes are not needed, than they should just not be used. Chars are based on bytes and graphemes are based on chars.
You don't need to check for a grapheme boundary, char boundaries are sufficient. Sure you can index into a grapheme boundary, but the string is still valid utf8. |
@rcghpge That would be a nice to have, but no we don't have those options currently. But I do think we can make it by parametrizing String with it's indexing style, or having an iterator that returns
@leb-kuchen I don't follow the logic. I can't say for certain that graphemes aren't used, but I don't think most code really needs that as default (as I said it also has a high cost [1] ).
Yes but you wouldn't be indexing by grapheme then. Say you want to shuffle some emojis, it wouldn't work correctly. [1]: Disclaimer: I haven't really looked that deep into how to implement graphemes, this is mostly based on reading parts of the spec here and there. |
Graphemes are just a standard, which is good enough for most languages. If you index into a grapheme boundary, the new string may have changed its meaning. The first and last grapheme cluster may be changed. However it is not catastrophic and may even be correct. When you look at the Cluster Boundary Rules, you will see GB9c, GB11 and GB12/GB13 can be expensive to check, if there is no state. |
Ok I didn't know it had legacy and extended versions, and that one can partially support sections of them. I'm not sure if partially supporting some will please the people who'd actually need this.
I'll summon @mzaks who AFAIK in other discussions on indexing etc. had some ideas for grapheme support (and has actually needed it), and also implemented a unicode codepoint indexing scheme keeping state |
Most of unicode is based on codepoints, so to check for case you only chars. Regarding performance, in the table are 1851 entries. Then it is fast, but the user experience because you have to move the cursor 6 times and it's zsh. |
I am summoned :) So in my opinion, a string has three requirements, which sound trivial but are actually quite complex, because of all the Unicode madness:
Those three are the building block to things like getting a sub string, performing search etc... CPython standard library decided to solve this by having a tagged union, where they analyze the string bytes, find the widest element, widen all elements to this width (1, 2, 3, 4 bytes), and then all 3 requirements listed above become trivial as everything is uniform. This however does not work for grapheme clusters, as grapheme cluster width can be even more irregular and this would blow up the memory if you would want all elements of a string to be uniform. PyPi has a different approach, it iterates over the string and computes the boundary of every Nth (I think N=32) element. The boundaries are stored in the string so is the count of the elements. This way it is trivial to answer the length question, iteration is same as the initial scan, and the index is O(1) to lookup where we need to start the iteration and then iterate till the required index which is always < 32 and hence can be considered O(1). The PyPi strategy is implemented for unicode code points, but can be easily done for grapheme cluster as the only difference is the iteration strategy. In my opinion we need a string data structure which can do both, where the user can define the strategy through a compile time parameter. But should it be par of standard library? I don't know. Half a year ago I wrote this one https://gist.github.com/mzaks/78f7d38f63fb234dadb1dae11f2ee3ae which was a POC for small strings optimization and code point indexing. I didn't take the time to implement grapheme based iteration though. |
@mzaks I thought they did UTF-32 🤯
My only problem with a state-full approach is that I'm not sure it's worth it for small strings. The footprint of each string is much bigger once it has to keep state which reduces the available size for small string optimization. The algorithm I was thinking about for indexing goes like this: use the given index and count the amount of utf8 continuation bytes up to that point, shift right by that amount. That algo is pretty fast, I implemented it with SIMD for longer sequences but sequential ops (can be pipelined better) for less than 8 bytes.
100% agree, I'm writing up a proposal. I'll ping everyone I know works with Strings in Mojo currently, IMO this is something that we need to define ASAP in a way in which most agree before we break too much code. |
Sheesh. I read into the Unicode project (when it was first proposed). This goes back back. Might want to talk to the MAX team. There are methods/features in low-level ML/DL that could help iron this out for Mojo. |
CrazyString (POC I mentioned above) is index + small strings optimization. Small strings, up to 22 bytes are stored inline and do not have an index. Strings between 22 and 32 bytes are stored with a heap allocation, but still do not need an index. Strings longer than 32 bytes (capacity) are stored with an index the index is adopted to the capacity so you pay only for what is needed and I do only one heap allocation. For more details you can read the code or watch a video I recorded. https://youtu.be/31Ka0bUTo2U?si=-aPrkYRxFcWbbEux |
Grapheme clusters can be arbitrarily long, e.g. if there are Extend characters. For Utf8 every code point is 1 to 4 bytes long. @martinvuyk, advancing code points can be done efficiently , but for graphemes it is a different story. You also have to consider that '\r' and '\n' are both are both graphemes , but "\r\n" is one grapheme cluster. This means you can append '\n' to a string and the grapheme cluster count does not change. Backspace for instance often deletes code points. If you look at Swift's String type. |
Generally if string is an owning data structure the data needs to be copied anyways, so the bytes need to travel through the CPU, hence it should be fine to count the user perceived chars, or the code points while you are copying. This is btw. what I did wrong in my POC, I do memcopy and then I do indexing / counting, where combining the two should be more efficient. In my POC I allow users to not create index, if they don't want to pay for the memory overhead. But generally the overhead is also string capacity dependent. For n bytes, I reserve |
Uhm, I know that? I implemented and optimized our current iterator. And yes graphemes complicate things.
@mzaks That is the case for |
For StringSlice there are 2 options IMHO:
Speaking of immutable, it is actually interesting to investigate how much do we gain from having a mutable string as we do know. It's great for in place mutation where we know that the capacity will not be exhausted and we do not need to perform a copy anyways. But it comes with a price! Is cost benefit equation well balanced though? |
I like this idea in the sense of prioritizing
I've also been curious to see whether we will evolve *: this constructor directly reserves the amount of bytes specified, it might not make much sense if something like #3988 is implemented. Or maybe the interpretation would depend on encoding, and the null terminator get automatically added 🤷♂️. But, if that is the only way in which we truly make good use of |
I don't agree with the decision to deprecate this. There is nothing ambiguous about iterating over unicode characters, which is what Python does. For grapheme cluster iterating it's another topic (it's so niche that it can be implemented in its own custom type, or its own iterator that uses the default unicode character one underneath).
This API is very important for allowing pythonic code which is easy to read and reason about, and very fast to write with the builtin
__iter__()
. There is also the fact that this API will be needed for generic iterator support.Potential issues I'm trying to guess about why this decision was made
I see this as unnecessary complexity. If more types of iterators are required (eg. #3858), they can fan out from the default "easy path" one (
_StringSliceIter
).Edit after reading the changelog: The
Char
method is not fully fleshed out yet and IMO introduces more complications than simply saying: "this method returns slices of strings, use the Char() constructor if you wish for instances of the type". So my above opinion still stands.I'll even say I'd like to move the
chars()
method over toStringSliceIter
so that the instance of the iterator then has a method.chars()
that returns aCharIter
. This makes much more sense to me from a design perspective than augmentingString
's public API surface area even more.example using PR #3858 and #3700