-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
utf-8 support in strings #749
Conversation
This looks awesome and something definitely needed, thank you for this!! Sorry I've been so slow on this, been a bit hectic for me as of late!
I would expect this to return the character length not the byte length. I like the We will need to make sure this also works with the stdlib modules too, quite a big change!! |
Hey, ive continued on this. I think the changes are needed, but i want to make clear it introduces a overhead into indexing/slicing and allocation of strings since simple indexing does not work anymore and is iterative now, further when allocating the character length is computed, here we could also compute it only when required but since the length is commonly used i opted to compute it upfront Ive went with approaching that most string functions will throw errors when provided with invalid utf-8, i do not know how the library would behave when given invalid utf-8. Exceptions are indexing and slicing which will fallback to the byte handling. |
The computing for indexing etc is a tradeoff we will just have to accept unfortunately for support of UTF 8 (and also the route I would have went down personally) so I'm happy with that! Yeah it's a valid concern and something we could potentially add to the stdlib, I have a feeling the chances of invalid utf 8 actually being used would be pretty low though so something we can potentially think about in the future if needs be. Again, appreciate all the work you've done on this! |
I am not entirely sure why the windows tests fail, there does not seam to be a clear test which is failing |
Trying to waitpid before reading the piped content will lead to a blocked pipe and get the child process "stuck". So it should be read BEFORE waitpid is called. Further there was a bug within the reallocation logic leading to heap corruption because the comparison size did not include the just read chunk, leading to heap corruption, which itself led to reallocate failiing.
This reverts commit 4f856d5.
@Jason2605 I believe this is good for review / merge now. I have fixed the issue with the windows builds, which turned out to be a error on my side but it took a good while to find. I see that theres a macos failure but i have run the tests on my mb and they all pass. |
Nice one thank you! I don't have an ARM mac myself to exactly test unfortunately but will have a see if I can track someone down that does. @briandowns be good to get your thoughts on this PR too! |
i have a arm mac, infact i only have apple sillicon machines |
The way runners fail also make it hard to see the error. The log does not really give you a hint or so to start debugging, but i ran all the tests on m2 air last night |
Oh interesting I see. Yeah the logs are no help at all really 😂 Usually when it's just a dump on code 1 it's a segfault somewhere, so will need to see if we can track it |
Aight maybe then a adresssanitizer can help to find it if its a heap corruption, Il try that |
@Jason2605 was able to repro with a debug build and fixed it: f8d4818 |
Incredible stuff, this is so so so nice!!!!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments, most are nitpick style ones. Again thank you so much for this, I appreciate the effort that has gone into this :)
src/vm/datatypes/strings.c
Outdated
@@ -239,7 +270,13 @@ static Value findString(DictuVM *vm, int argCount, Value *args) { | |||
runtimeError(vm, "find() takes either 1 or 2 arguments (%d given)", argCount); | |||
return EMPTY_VAL; | |||
} | |||
|
|||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the extra scope here? It seems to happen throughout quite a few times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have refactored them to not do that anymore, it was for name shadowing but the fix was pretty easy
src/vm/datatypes/strings.c
Outdated
if (argCount != 1) { | ||
runtimeError(vm, "contains() takes 1 argument (%d given)", argCount); | ||
return EMPTY_VAL; | ||
} | ||
|
||
{ | ||
ObjString *string = AS_STRING(args[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're checking utf8 legality would we also need to check the needle (args[1])?
Co-authored-by: Jason_000 <[email protected]>
Co-authored-by: Jason_000 <[email protected]>
Co-authored-by: Jason_000 <[email protected]>
Co-authored-by: Jason_000 <[email protected]>
Co-authored-by: Jason_000 <[email protected]>
@Jason2605 i ran clang-format over the file in order to fix indentation |
Noice yeah I should probably add something to the GH actions for that! Very much appreciated, once again thank you for this! |
Add Unicode support to strings
Resolves: #317
What's Changed:
This updates the implementation of the string API to support unicode using the
utf8.h library.
Theres still a few incomplete things, some fuctions do not need unicode versions(strip functions). But on the other side there are things where the library does not provide full functionality which is needed for
isUpper
/isLower
for example which do not wrong, forcing the usage of the c std lib apis.Should
String.len()
return byte length or character length, what if the string has invalid utf8? calculating the length upfront for large strings, always might be expensive.And not enough tests are added.
Type of Change:
Housekeeping:
Screenshots (If Applicable):