Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utf-8 support in strings #749

Merged
merged 33 commits into from
Nov 8, 2024
Merged

Conversation

liz3
Copy link
Contributor

@liz3 liz3 commented Sep 18, 2024

Add Unicode support to strings

Resolves: #317

What's Changed:

This updates the implementation of the string API to support unicode using the
utf8.h
library.

Theres still a few incomplete things, some fuctions do not need unicode versions(strip functions). But on the other side there are things where the library does not provide full functionality which is needed for isUpper/isLower for example which do not wrong, forcing the usage of the c std lib apis.

Should String.len() return byte length or character length, what if the string has invalid utf8? calculating the length upfront for large strings, always might be expensive.

And not enough tests are added.

Type of Change:

  • Bug fix
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Housekeeping:

  • Tests have been updated to reflect the changes done within this PR (if applicable).
  • Documentation has been updated to reflect the changes done within this PR (if applicable).

Screenshots (If Applicable):

@Jason2605
Copy link
Member

Jason2605 commented Oct 4, 2024

This looks awesome and something definitely needed, thank you for this!! Sorry I've been so slow on this, been a bit hectic for me as of late!

Should String.len() return byte length or character length

I would expect this to return the character length not the byte length. I like the byteLen addition! I'm more than happy with adding some extra test cases!

We will need to make sure this also works with the stdlib modules too, quite a big change!!

@liz3
Copy link
Contributor Author

liz3 commented Oct 26, 2024

This looks awesome and something definitely needed, thank you for this!! Sorry I've been so slow on this, been a bit hectic for me as of late!

Should String.len() return byte length or character length

I would expect this to return the character length not the byte length. I like the byteLen addition! I'm more than happy with adding some extra test cases!

We will need to make sure this also works with the stdlib modules too, quite a big change!!

Hey, ive continued on this. I think the changes are needed, but i want to make clear it introduces a overhead into indexing/slicing and allocation of strings since simple indexing does not work anymore and is iterative now, further when allocating the character length is computed, here we could also compute it only when required but since the length is commonly used i opted to compute it upfront
One entirely opposite approach is to store the strings as unicode in memory, this would mean a massive increase in memory usage but retain O(1) indexing capabilities and so on, im not entirely sure what the better tradeoff is here but the more computational expensive seamed more reasonable since a lot of strings are ascii.

Ive went with approaching that most string functions will throw errors when provided with invalid utf-8, i do not know how the library would behave when given invalid utf-8. Exceptions are indexing and slicing which will fallback to the byte handling.
It might also be worth adding a function which lets the user know if the string is valid utf-8 maybe?

@Jason2605
Copy link
Member

The computing for indexing etc is a tradeoff we will just have to accept unfortunately for support of UTF 8 (and also the route I would have went down personally) so I'm happy with that!

Yeah it's a valid concern and something we could potentially add to the stdlib, I have a feeling the chances of invalid utf 8 actually being used would be pretty low though so something we can potentially think about in the future if needs be.

Again, appreciate all the work you've done on this!

@liz3 liz3 marked this pull request as ready for review October 29, 2024 21:05
@liz3 liz3 changed the title wip: utf-8 support in strings utf-8 support in strings Oct 29, 2024
@liz3
Copy link
Contributor Author

liz3 commented Oct 29, 2024

I am not entirely sure why the windows tests fail, there does not seam to be a clear test which is failing
Sadly at the moment i don't have access to a Window system.

liz3 added 10 commits October 30, 2024 22:15
Trying to waitpid before reading the piped content will lead to a blocked pipe and
get the child process "stuck". So it should be read BEFORE waitpid is called.
Further there was a bug within the reallocation logic leading to heap corruption
because the comparison size did not include the just read chunk, leading to heap corruption,
which itself led to reallocate failiing.
@liz3
Copy link
Contributor Author

liz3 commented Nov 6, 2024

@Jason2605 I believe this is good for review / merge now. I have fixed the issue with the windows builds, which turned out to be a error on my side but it took a good while to find. I see that theres a macos failure but i have run the tests on my mb and they all pass.

@Jason2605
Copy link
Member

Nice one thank you! I don't have an ARM mac myself to exactly test unfortunately but will have a see if I can track someone down that does.

@briandowns be good to get your thoughts on this PR too!

@liz3
Copy link
Contributor Author

liz3 commented Nov 6, 2024

Nice one thank you! I don't have an ARM mac myself to exactly test unfortunately but will have a see if I can track someone down that does.

@briandowns be good to get your thoughts on this PR too!

i have a arm mac, infact i only have apple sillicon machines

@liz3
Copy link
Contributor Author

liz3 commented Nov 6, 2024

The way runners fail also make it hard to see the error. The log does not really give you a hint or so to start debugging, but i ran all the tests on m2 air last night

@Jason2605
Copy link
Member

The way runners fail also make it hard to see the error. The log does not really give you a hint or so to start debugging, but i ran all the tests on m2 air last night

Oh interesting I see. Yeah the logs are no help at all really 😂 Usually when it's just a dump on code 1 it's a segfault somewhere, so will need to see if we can track it

@liz3
Copy link
Contributor Author

liz3 commented Nov 6, 2024

The way runners fail also make it hard to see the error. The log does not really give you a hint or so to start debugging, but i ran all the tests on m2 air last night

Oh interesting I see. Yeah the logs are no help at all really 😂 Usually when it's just a dump on code 1 it's a segfault somewhere, so will need to see if we can track it

Aight maybe then a adresssanitizer can help to find it if its a heap corruption, Il try that

@liz3
Copy link
Contributor Author

liz3 commented Nov 7, 2024

@Jason2605 was able to repro with a debug build and fixed it: f8d4818

@liz3
Copy link
Contributor Author

liz3 commented Nov 7, 2024

image
okay that's satisfying to be entirely honest!

@Jason2605
Copy link
Member

Incredible stuff, this is so so so nice!!!!!

Copy link
Member

@Jason2605 Jason2605 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments, most are nitpick style ones. Again thank you so much for this, I appreciate the effort that has gone into this :)

src/vm/datatypes/strings.c Outdated Show resolved Hide resolved
src/vm/datatypes/strings.c Outdated Show resolved Hide resolved
@@ -239,7 +270,13 @@ static Value findString(DictuVM *vm, int argCount, Value *args) {
runtimeError(vm, "find() takes either 1 or 2 arguments (%d given)", argCount);
return EMPTY_VAL;
}

{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the extra scope here? It seems to happen throughout quite a few times

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refactored them to not do that anymore, it was for name shadowing but the fix was pretty easy

if (argCount != 1) {
runtimeError(vm, "contains() takes 1 argument (%d given)", argCount);
return EMPTY_VAL;
}

{
ObjString *string = AS_STRING(args[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're checking utf8 legality would we also need to check the needle (args[1])?

src/vm/datatypes/strings.c Outdated Show resolved Hide resolved
src/vm/datatypes/strings.c Outdated Show resolved Hide resolved
src/vm/datatypes/strings.c Outdated Show resolved Hide resolved
@liz3
Copy link
Contributor Author

liz3 commented Nov 8, 2024

@Jason2605 i ran clang-format over the file in order to fix indentation

@Jason2605
Copy link
Member

Noice yeah I should probably add something to the GH actions for that!

Very much appreciated, once again thank you for this!

@Jason2605 Jason2605 merged commit 643f2c9 into dictu-lang:develop Nov 8, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add unicode support
2 participants