
[stdlib] Un-deprecate String.__iter__() #3984

Open · wants to merge 6 commits into main
Conversation

@martinvuyk (Contributor) commented Feb 1, 2025

I don't agree with the decision to deprecate this. There is nothing ambiguous about iterating over unicode characters, which is what Python does. Grapheme cluster iteration is another topic: it's niche enough that it can be implemented in its own custom type, or as its own iterator that uses the default unicode character one underneath.

This API is very important for allowing Pythonic code that is easy to read and reason about, and very fast to write with the builtin __iter__(). There is also the fact that this API will be needed for generic iterator support.

Potential issues I'm trying to guess about why this decision was made

  • If the difference between value vs reference is the issue, then I don't see any. It's just a matter of building an instance of the value-semantic type inside the body of the loop.
  • If the issue is about allowing in-place mutation which Python doesn't, then I'd say those are some of the tiny differences between the languages that are easy to learn.

I see this as unnecessary complexity. If more types of iterators are required (e.g. #3858), they can fan out from the default "easy path" one (_StringSliceIter).

Edit after reading the changelog: the Char method is not fully fleshed out yet and IMO introduces more complications than simply saying: "this method returns slices of strings; use the Char() constructor if you want instances of that type". So my opinion above still stands.

I'll even say I'd like to move the chars() method over to StringSliceIter, so that the instance of the iterator then has a method .chars() that returns a CharIter. This makes much more sense to me from a design perspective than augmenting String's public API surface even more.

Example using PR #3858 and #3700:

```mojo
data = String("123 \n🔥")
for c in data:
    ...  # StringSlice by default
for c in iter(data).chars():
    ...  # Char type instances
for c in iter(data).split():
    ...  # StringSlice split by all whitespace
for c in iter(data).splitlines():
    ...  # StringSlice split by all newline characters
```

@leb-kuchen

Your example only works because 🔥 is one codepoint. Most emojis have Extend and ZWJ characters.
Iterating over characters is niche, and so are extended grapheme clusters. But graphemes probably would be the best default.
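To make that point concrete, here is a quick check in Python (used here only because its unicode semantics are well known; the example string is illustrative): a ZWJ emoji sequence is one user-perceived character but many codepoints.

```python
# A "family" emoji: WOMAN + ZWJ + WOMAN + ZWJ + GIRL + ZWJ + BOY.
# One grapheme cluster on screen, seven codepoints underneath.
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family))                  # 7 codepoints
print(len(family.encode("utf-8")))  # 25 bytes
```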

@gryznar (Contributor) commented Feb 3, 2025

Chris Lattner also said that grapheme clusters should be the default.

@martinvuyk (Author)

@leb-kuchen my tasty Christmas cookie friend, grapheme cluster indexing is very expensive, even more so than unicode codepoint indexing. IMO we should find a way to make it customizable through parameters. Python uses unicode indexing by default, and a lot of code in production works under that assumption. My main problem is going for byte indexing as the default, which would break a lot of Python code and is very unintuitive for non-ASCII languages (since you have to explain UTF-8 to explain why taking the first x characters of a set of multi-byte sequences gives back wrong results).

But graphemes probably would be the best default.

Don't get me wrong, I do agree they would be great as default (in theory), just that they are very expensive compared to just checking the first byte.

  • For iteration: it would also have to do a memory read based on the number of bytes to check whether there is a continuation, which is an "unpredictable" memory read (unless we find a way to signal to the compiler that the range of indices (1-3) is probably in L1 cache).
  • For slicing: not only do you need to check UTF-8 continuation bytes, you also need to check each multi-byte sequence's last byte to see if it's a grapheme, then do the same for the length of the slice.

TL;DR: IMO we should support all 3 types of indexing, just using unicode codepoints as default.
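For contrast, a small Python illustration of why byte indexing as the default is unintuitive (Python shown because its codepoint semantics are what the PR wants to match):

```python
s = "año"                # 'ñ' (U+00F1) takes 2 bytes in UTF-8
print(s[:2])             # codepoint slicing: "añ"

raw = s.encode("utf-8")  # b'a\xc3\xb1o': 4 bytes for 3 characters
print(raw[:2])           # byte slicing cuts 'ñ' in half: b'a\xc3'
try:
    raw[:2].decode("utf-8")
except UnicodeDecodeError:
    print("the byte-sliced prefix is not valid UTF-8 on its own")
```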

@rcghpge commented Feb 3, 2025

I’ve subscribed to follow Mojo’s development. Is there an option or indexing feature among the 3 (codepoint indexing, byte indexing, and grapheme cluster indexing), or is that super niche?

@leb-kuchen commented Feb 3, 2025

  • For iteration: it would also have to do a memory read based on the number of bytes to check whether there is a continuation, which is an "unpredictable" memory read (unless we find a way to signal to the compiler that the range of indices (1-3) is probably in L1 cache).

I would be more concerned about the tables than about the iteration speed. If graphemes are not needed, then they should just not be used. Chars are based on bytes, and graphemes are based on chars.

For slicing: not only do you need to check UTF-8 continuation bytes, you also need to check each multi-byte sequence's last byte to see if it's a grapheme, then do the same for the length of the slice.

You don't need to check for a grapheme boundary; char boundaries are sufficient. Sure, you can index into a grapheme boundary, but the string is still valid UTF-8.
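A Python sketch of that distinction (example string is illustrative): slicing at a codepoint boundary always yields valid text, but it can still split a grapheme cluster.

```python
# "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT:
# two codepoints, one grapheme cluster.
s = "e\u0301"
print(len(s))   # 2
print(s[:1])    # "e" -- valid text, but the accent was split off
print(s[1:])    # a lone combining mark, also valid text on its own
```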

@martinvuyk (Author)

Is there an option or indexing feature among the 3 (codepoint indexing, byte indexing, and grapheme cluster indexing), or is that super niche?

@rcghpge That would be a nice-to-have, but no, we don't have those options currently. I do think we can get there by parametrizing String with its indexing style, or by having an iterator that returns StringSlice graphemes instead of unicode codepoints. My personal preference would be the latter.

If graphemes are not needed, then they should just not be used. Chars are based on bytes, and graphemes are based on chars.

@leb-kuchen I don't follow the logic. I can't say for certain that graphemes aren't used, but I don't think most code really needs them as the default (as I said, it also has a high cost [1]).

You don't need to check for a grapheme boundary; char boundaries are sufficient. Sure, you can index into a grapheme boundary, but the string is still valid UTF-8.

Yes but you wouldn't be indexing by grapheme then. Say you want to shuffle some emojis, it wouldn't work correctly.

[1]: Disclaimer: I haven't really looked that deep into how to implement graphemes, this is mostly based on reading parts of the spec here and there.

@leb-kuchen commented Feb 4, 2025

Yes but you wouldn't be indexing by grapheme then. Say you want to shuffle some emojis, it wouldn't work correctly.

Graphemes are just a standard, which is good enough for most languages. If you index into a grapheme boundary, the new string may have changed its meaning: the first and last grapheme cluster may be changed. However, it is not catastrophic and may even be correct. When you look at the Grapheme Cluster Boundary Rules, you will see that GB9c, GB11 and GB12/GB13 can be expensive to check if there is no state.
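GB12/GB13 (regional-indicator pairs) also give a concrete example of how codepoint-boundary slicing can silently change meaning. A Python illustration (the flag choice is arbitrary):

```python
# Flags are pairs of regional indicators:
# 🇺🇸 = U+1F1FA U+1F1F8, 🇫🇷 = U+1F1EB U+1F1F7.
flags = "\U0001F1FA\U0001F1F8\U0001F1EB\U0001F1F7"  # 🇺🇸🇫🇷
print(len(flags))    # 4 codepoints, but only 2 grapheme clusters
# Slicing at an odd codepoint offset re-pairs the indicators:
middle = flags[1:3]  # U+1F1F8 U+1F1EB ("S" + "F"), which a renderer
print(middle)        # may display as an entirely different flag
```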

@martinvuyk (Author)

Yes but you wouldn't be indexing by grapheme then. Say you want to shuffle some emojis, it wouldn't work correctly.

Graphemes are just a standard, which is good enough for most languages. If you index into a grapheme boundary, the new string may have changed its meaning: the first and last grapheme cluster may be changed. However, it is not catastrophic and may even be correct.

Ok I didn't know it had legacy and extended versions, and that one can partially support sections of them. I'm not sure if partially supporting some will please the people who'd actually need this.

When you look at the Grapheme Cluster Boundary Rules, you will see that GB9c, GB11 and GB12/GB13 can be expensive to check if there is no state.

I'll summon @mzaks, who AFAIK had some ideas for grapheme support in other discussions on indexing etc. (and has actually needed it), and who also implemented a unicode codepoint indexing scheme that keeps state.

@leb-kuchen

I don't follow the logic. I can't say for certain that graphemes aren't used, but I don't think most code really needs them as the default (as I said, it also has a high cost [1]).

Most of unicode is based on codepoints, so to check for case you only need chars.
But if you need to display a part of the string, then you should probably use graphemes.

Regarding performance: the table has 1851 entries.
For each entry you need the lower and upper bound, so 4 bytes each. With a stride you probably only need 4 bytes if the remaining bits of Char are used.
So that is 7.2 KiB to 14.5 KiB for the tables in each codegen unit. Of course you have more branches, thus mispredictions, and more memory traffic because of failed prefetching and table lookups. However, CR, LF and control characters in ASCII can be optimized.

Then it is fast, but the user experience suffers because you have to move the cursor 6 times, and it's zsh.

@mzaks (Contributor) commented Feb 5, 2025

I am summoned :)

So in my opinion, a string has three requirements, which sound trivial but are actually quite complex because of all the Unicode madness:

  1. Give me the length of the string
  2. Iterate over the string
  3. Index into the string

Those three are the building blocks for things like getting a substring, performing search, etc.

The CPython standard library decided to solve this with a tagged union: they analyze the string bytes, find the widest element, widen all elements to that width (1, 2, or 4 bytes), and then all 3 requirements listed above become trivial because everything is uniform. This however does not work for grapheme clusters, as grapheme cluster width can be even more irregular, and the memory would blow up if you wanted all elements of a string to be uniform.
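That width selection is CPython's PEP 393 "flexible string representation", which picks 1, 2, or 4 bytes per element. A minimal sketch of the decision (the function name is mine, not CPython's API):

```python
def pep393_width(s: str) -> int:
    """Bytes per element the flexible string representation would use:
    wide enough for the widest codepoint in the string."""
    widest = max(map(ord, s), default=0)
    if widest < 0x100:
        return 1   # Latin-1 compact storage
    if widest < 0x10000:
        return 2   # UCS-2 storage
    return 4       # UCS-4 storage

print(pep393_width("hello"))  # 1
print(pep393_width("héllo"))  # 1 (é is U+00E9, still < 0x100)
print(pep393_width("日本"))    # 2
print(pep393_width("🔥"))     # 4
```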

PyPy has a different approach: it iterates over the string and records the boundary of every Nth (I think N=32) element. The boundaries are stored in the string, and so is the count of the elements. This way it is trivial to answer the length question, iteration is the same as the initial scan, and indexing is an O(1) lookup of where to start iterating plus an iteration to the required index, which is always < 32 and hence can be considered O(1).

The PyPy strategy is implemented for unicode code points, but can easily be done for grapheme clusters, as the only difference is the iteration strategy.
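A toy Python sketch of that boundary-index scheme, assuming valid UTF-8 input (N, the function names, and all details are illustrative, not PyPy's actual implementation):

```python
N = 32  # record a byte offset every N codepoints

def _advance(data: bytes, i: int) -> int:
    """Byte offset of the next codepoint, assuming valid UTF-8."""
    b = data[i]  # the leading byte encodes the sequence length
    return i + (1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4)

def build_index(data: bytes) -> list[int]:
    """Byte offsets of codepoints 0, N, 2N, ... (one initial scan)."""
    offsets, count, i = [0], 0, 0
    while i < len(data):
        i = _advance(data, i)
        count += 1
        if count % N == 0 and i < len(data):
            offsets.append(i)
    return offsets

def byte_offset(data: bytes, cp_index: int, offsets: list[int]) -> int:
    """O(1) table lookup plus at most N-1 steps of iteration."""
    i = offsets[cp_index // N]
    for _ in range(cp_index % N):
        i = _advance(data, i)
    return i

data = "a🔥b日".encode("utf-8")
idx = build_index(data)
print(byte_offset(data, 2, idx))  # 5: 'a' is 1 byte, '🔥' is 4
```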

In my opinion we need a string data structure which can do both, where the user defines the strategy through a compile-time parameter. But should it be part of the standard library? I don't know.

Half a year ago I wrote this one https://gist.github.com/mzaks/78f7d38f63fb234dadb1dae11f2ee3ae which was a POC for small string optimization and code point indexing. I didn't take the time to implement grapheme-based iteration though.

@martinvuyk (Author)

The CPython standard library decided to solve this with a tagged union: they analyze the string bytes, find the widest element, widen all elements to that width (1, 2, or 4 bytes)

@mzaks I thought they did UTF-32 🤯

PyPy has a different approach: it iterates over the string and records the boundary of every Nth (I think N=32) element. The boundaries are stored in the string, and so is the count of the elements. This way it is trivial to answer the length question, iteration is the same as the initial scan, and indexing is an O(1) lookup of where to start iterating plus an iteration to the required index, which is always < 32 and hence can be considered O(1).
[...]
Half a year ago I wrote this one https://gist.github.com/mzaks/78f7d38f63fb234dadb1dae11f2ee3ae which was a POC for small strings optimization and code point indexing. I didn't take the time to implement grapheme based iteration though.

My only problem with a stateful approach is that I'm not sure it's worth it for small strings. The footprint of each string is much bigger once it has to keep state, which reduces the available size for small string optimization. The algorithm I was thinking about for indexing goes like this: take the given index, count the number of UTF-8 continuation bytes up to that point, and shift right by that amount. That algo is pretty fast; I implemented it with SIMD for longer sequences, but with sequential ops (which can be pipelined better) for fewer than 8 bytes.
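A scalar Python sketch of that continuation-byte counting (the real version is SIMD; this only shows the logic, and the names are mine):

```python
def is_continuation(b: int) -> bool:
    # UTF-8 continuation bytes have the form 0b10xxxxxx.
    return (b & 0xC0) == 0x80

def nth_codepoint_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th codepoint: walk the buffer, skipping
    continuation bytes; leading bytes mark codepoint starts."""
    seen = 0
    for i, b in enumerate(data):
        if not is_continuation(b):
            if seen == n:
                return i
            seen += 1
    raise IndexError(n)

data = "a🔥b".encode("utf-8")
print(nth_codepoint_offset(data, 1))  # 1 (start of '🔥')
print(nth_codepoint_offset(data, 2))  # 5 (start of 'b')
```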

In my opinion we need a string data structure which can do both, where the user defines the strategy through a compile-time parameter. But should it be part of the standard library? I don't know.

100% agree; I'm writing up a proposal. I'll ping everyone I know who works with Strings in Mojo currently. IMO this is something we need to define ASAP, in a way most agree on, before we break too much code.

@rcghpge commented Feb 5, 2025

Sheesh. I read into the Unicode project (when it was first proposed). This goes way back. Might want to talk to the MAX team. There are methods/features in low-level ML/DL that could help iron this out for Mojo.

@mzaks (Contributor) commented Feb 5, 2025

CrazyString (the POC I mentioned above) is index + small string optimization. Small strings, up to 22 bytes, are stored inline and do not have an index. Strings between 22 and 32 bytes are stored with a heap allocation, but still do not need an index. Strings longer than 32 bytes (capacity) are stored with an index; the index is adapted to the capacity so you only pay for what is needed, and I do only one heap allocation. For more details you can read the code or watch a video I recorded: https://youtu.be/31Ka0bUTo2U?si=-aPrkYRxFcWbbEux

@leb-kuchen commented Feb 5, 2025

PyPy has a different approach: it iterates over the string and records the boundary of every Nth (I think N=32) element. The boundaries are stored in the string, and so is the count of the elements. This way it is trivial to answer the length question, iteration is the same as the initial scan, and indexing is an O(1) lookup of where to start iterating plus an iteration to the required index, which is always < 32 and hence can be considered O(1).

The PyPy strategy is implemented for unicode code points, but can easily be done for grapheme clusters, as the only difference is the iteration strategy.

In my opinion we need a string data structure which can do both, where the user defines the strategy through a compile-time parameter. But should it be part of the standard library? I don't know.

Grapheme clusters can be arbitrarily long, e.g. if there are Extend characters. For UTF-8, every code point is 1 to 4 bytes long. @martinvuyk, advancing code points can be done efficiently, but for graphemes it is a different story. You also have to consider that '\r' and '\n' are both graphemes, but "\r\n" is one grapheme cluster. This means you can append '\n' to a string that ends in '\r' and the grapheme cluster count does not change. Backspace, for instance, often deletes code points.

If you look at Swift's String type: appending is relatively efficient, but indexing requires calculating an index, i.e. iterating. I think it stores the cluster count, so that getting the "length", and comparing strings with unequal cluster counts under Unicode normalization, is O(1).

@mzaks (Contributor) commented Feb 6, 2025

Grapheme clusters can be arbitrarily long, e.g. if there are Extend characters. For UTF-8, every code point is 1 to 4 bytes long. @martinvuyk, advancing code points can be done efficiently, but for graphemes it is a different story. You also have to consider that '\r' and '\n' are both graphemes, but "\r\n" is one grapheme cluster. This means you can append '\n' to a string that ends in '\r' and the grapheme cluster count does not change. Backspace, for instance, often deletes code points.

If you look at Swift's String type: appending is relatively efficient, but indexing requires calculating an index, i.e. iterating. I think it stores the cluster count, so that getting the "length", and comparing strings with unequal cluster counts under Unicode normalization, is O(1).

Generally, if a string is an owning data structure, the data needs to be copied anyway, so the bytes need to travel through the CPU; hence it should be fine to count the user-perceived chars, or the code points, while you are copying. This is btw what I did wrong in my POC: I do a memcopy and then I do the indexing/counting, where combining the two should be more efficient. In my POC I allow users to not create an index if they don't want to pay for the memory overhead. But generally the overhead is also string-capacity dependent. For n bytes, I reserve (floor(n / 32) + 1) * ceil(log2(n) / 8) bytes, which is about 6% of overhead for strings shorter than 64K bytes.
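Checking that figure: for n < 64 KiB, log2(n) <= 16, so each entry is ceil(16/8) = 2 bytes, and with one entry per 32 bytes the overhead is roughly 2/32 = 6.25%. In Python:

```python
import math

def index_overhead(n: int) -> int:
    # (floor(n / 32) + 1) entries, each ceil(log2(n) / 8) bytes wide
    return (n // 32 + 1) * math.ceil(math.log2(n) / 8)

for n in (1_000, 65_535, 1_000_000):
    print(n, index_overhead(n), f"{index_overhead(n) / n:.2%}")
# beyond 64 KiB each entry grows to 3 bytes and the ratio creeps up
```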

@martinvuyk (Author)

Grapheme clusters can be arbitrarily long, e.g. if there are Extend characters. For UTF-8, every code point is 1 to 4 bytes long. @martinvuyk, advancing code points can be done efficiently, but for graphemes it is a different story.

Uhm, I know that? I implemented and optimized our current iterator. And yes graphemes complicate things.

Generally, if a string is an owning data structure, the data needs to be copied anyway, so the bytes need to travel through the CPU; hence it should be fine to count the user-perceived chars, or the code points, while you are copying.

@mzaks That is the case for String, but not for StringSlice, where you don't copy anything. We might be able to implement your stateful approach for String (I sure hope so, it's awesome), but StringSlice will need to count every time, since the underlying state of the owning String could've changed.

@mzaks (Contributor) commented Feb 7, 2025

For StringSlice there are 2 options IMHO:

  1. The StringSlice is parametrized with the iteration type of the origin string and does not allow indexing, just iteration. If you want to index, slice, etc., you need to materialize it into a String.
  2. The StringSlice has three 8-byte fields: pointer to string, start, and length (thinking about it, it could potentially need a fourth for stride). This way we index and iterate over the string itself; the String needs to be flipped to immutable, though.

Speaking of immutable, it is actually interesting to investigate how much we gain from having a mutable string as we do now. It's great for in-place mutation where we know that the capacity will not be exhausted and we do not need to perform a copy anyway. But it comes with a price! Is the cost-benefit equation well balanced, though?

@martinvuyk (Author)

For StringSlice there are 2 options IMHO:

  1. The StringSlice is parametrized with the iteration type of the origin string and does not allow indexing, just iteration. If you want to index, slice, etc., you need to materialize it into a String.

I like this idea in the sense of prioritizing StringSliceIter. But we can also make all slicing return an iterator instead of the owning type itself (see #3653). This way we wouldn't need to materialize it.

  2. The StringSlice has three 8-byte fields: pointer to string, start, and length (thinking about it, it could potentially need a fourth for stride). This way we index and iterate over the string itself; the String needs to be flipped to immutable, though.

Speaking of immutable, it is actually interesting to investigate how much we gain from having a mutable string as we do now. It's great for in-place mutation where we know that the capacity will not be exhausted and we do not need to perform a copy anyway. But it comes with a price! Is the cost-benefit equation well balanced, though?

I've also been curious to see whether we will evolve String into immutability. But based on what I've seen with the writer API, or the ability to do var a = String(capacity=10); for i in range(9): a += "a";* while avoiding allocations, I'm not sure.

*: this constructor directly reserves the specified number of bytes; it might not make much sense if something like #3988 is implemented. Or maybe the interpretation would depend on the encoding, and the null terminator would get automatically added 🤷‍♂️.

But if that is the only way in which we truly make good use of String's mutability, then I'd say we can provide alternatives. I tried to build a String concatenation tool that I'll revisit some time in the future. The basic idea is this: it is a struct with a List[StringSlice[ImmutableAnyOrigin]] of all the strings to be appended. Then, once str(accum) is called, it reserves the whole buffer according to the length of each item and appends them all at once. This is much faster than reserving and using the public append API, since that does a lot of checks and might actually resize at some point.
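The same accumulate-then-materialize idea in Python terms, where "".join already does exactly this (one pass to sum the part lengths, one allocation, one copy of each part); the class is a hypothetical sketch, not the Mojo struct:

```python
class StringAccumulator:
    """Collect parts without copying bytes, materialize once at the end."""
    def __init__(self) -> None:
        self._parts: list[str] = []

    def append(self, s: str) -> None:
        self._parts.append(s)   # stores a reference, no byte copying yet

    def __str__(self) -> str:
        # One allocation sized from the parts' total length, one pass.
        return "".join(self._parts)

acc = StringAccumulator()
for part in ("abc", "def", "ghi"):
    acc.append(part)
print(str(acc))  # abcdefghi
```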
