utf-8 support in strings #749

liz3 · 2024-09-18T01:17:44Z

Add Unicode support to strings

Resolves: #317

What's Changed:

This updates the implementation of the string API to support Unicode using the
utf8.h library.

There are still a few incomplete things. Some functions (the strip functions, for example) do not need Unicode versions. On the other hand, there are cases where the library does not provide the full functionality that is needed: isUpper/isLower, for example, do not work, which forces use of the C standard library APIs.

Should String.len() return the byte length or the character length? What if the string contains invalid UTF-8? Always calculating the length upfront might be expensive for large strings.
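To make the byte length versus character length question concrete, here is a minimal sketch in plain C. The helper names `utf8_char_len` and `utf8_byte_len` are hypothetical and are not taken from this PR or from utf8.h; the counting function assumes its input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the PR's code): counts code points by skipping
 * UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx.
 * Assumes the input is valid, NUL-terminated UTF-8. */
size_t utf8_char_len(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++; /* lead byte or ASCII byte starts a new code point */
        }
    }
    return count;
}

/* The byte length is what strlen already reports. */
size_t utf8_byte_len(const char *s) {
    return strlen(s);
}
```

For the string "héllo", where "é" is encoded as the two bytes 0xC3 0xA9, the byte length is 6 while the character length is 5. Both are O(n) scans, which is why computing the character length for every large string has a cost.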

Also, not enough tests have been added yet.

Type of Change:

Bug fix
New feature
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Housekeeping:

Tests have been updated to reflect the changes done within this PR (if applicable).
Documentation has been updated to reflect the changes done within this PR (if applicable).

Screenshots (If Applicable):

Jason2605 · 2024-10-04T21:03:35Z

This looks awesome and something definitely needed, thank you for this!! Sorry I've been so slow on this, been a bit hectic for me as of late!

Should String.len() return byte length or character length

I would expect this to return the character length not the byte length. I like the byteLen addition! I'm more than happy with adding some extra test cases!

We will need to make sure this also works with the stdlib modules too, quite a big change!!

liz3 · 2024-10-26T22:49:55Z

Hey, I've continued on this. I think the changes are needed, but I want to make clear that they introduce overhead into indexing, slicing, and allocation of strings. Simple byte indexing no longer works, so indexing is now iterative. In addition, the character length is computed when a string is allocated. We could compute it only when required, but since the length is commonly used I opted to compute it upfront.
One entirely opposite approach is to store the strings as Unicode (fixed-width code points) in memory. This would mean a massive increase in memory usage but would retain O(1) indexing and so on. I'm not entirely sure which tradeoff is better, but the more computationally expensive option seemed more reasonable since a lot of strings are ASCII.
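The tradeoff described above can be sketched as follows. This is an illustration only, not the code in this PR: the `Utf8Str` struct, its field names, and `utf8_offset_of` are all hypothetical. It assumes the character length is stored at allocation time and that the bytes are valid UTF-8, and it shows how an all-ASCII string can keep O(1) indexing while a multibyte string needs a linear scan.

```c
#include <stddef.h>

/* Hypothetical string record; the field names are illustrative only. */
typedef struct {
    const char *bytes; /* UTF-8 encoded data */
    size_t byte_len;   /* number of bytes */
    size_t char_len;   /* number of code points, computed once at allocation */
} Utf8Str;

/* Returns the byte offset of the code point at `index`,
 * or (size_t)-1 if the index is out of range. */
size_t utf8_offset_of(const Utf8Str *s, size_t index) {
    if (index >= s->char_len) {
        return (size_t)-1;
    }
    /* Fast path: equal lengths mean every character is one byte (ASCII),
     * so the character index is also the byte offset. */
    if (s->byte_len == s->char_len) {
        return index;
    }
    /* Slow path: walk the bytes and count lead bytes until we reach `index`. */
    size_t seen = 0;
    for (size_t i = 0; i < s->byte_len; i++) {
        if (((unsigned char)s->bytes[i] & 0xC0) != 0x80) {
            if (seen == index) {
                return i;
            }
            seen++;
        }
    }
    return (size_t)-1;
}
```

Comparing the two stored lengths is one cheap way to detect the ASCII-only case. The commit list for this PR mentions running the UTF-8 versions of index and slice only when multibyte characters are present, which is the same idea, though the actual check used there may differ.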

I've gone with the approach that most string functions throw an error when given invalid UTF-8, since I do not know how the library would behave on invalid input. The exceptions are indexing and slicing, which fall back to byte handling.
It might also be worth adding a function that lets the user check whether a string is valid UTF-8.
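As a rough idea of what such a validity check involves, here is a standalone sketch in plain C. The function name `utf8_is_valid` is hypothetical, and this is not the check used in the PR or provided by utf8.h. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates, and code points above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical validator (not the PR's code): returns true if the first
 * `len` bytes of `s` form well-formed UTF-8. */
bool utf8_is_valid(const char *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = (unsigned char)s[i];
        size_t need;           /* number of continuation bytes expected */
        unsigned long cp, min; /* decoded code point and smallest legal value */

        if (b < 0x80) { i++; continue; }                     /* ASCII */
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return false; /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < need) return false; /* sequence is truncated */

        for (size_t k = 1; k <= need; k++) {
            unsigned char c = (unsigned char)s[i + k];
            if ((c & 0xC0) != 0x80) return false; /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                     /* overlong encoding */
        if (cp > 0x10FFFF) return false;                /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* UTF-16 surrogate */

        i += need + 1;
    }
    return true;
}
```

A check like this is a single linear pass, so it could run once per string and let functions decide between raising an error and falling back to byte handling, as described above.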

Jason2605 · 2024-10-28T10:20:05Z

The computing cost for indexing etc. is a tradeoff we will unfortunately just have to accept for UTF-8 support (and it's also the route I would have gone down personally), so I'm happy with that!

Yeah, it's a valid concern and something we could potentially add to the stdlib. I have a feeling the chances of invalid UTF-8 actually being used would be pretty low though, so it's something we can think about in the future if needs be.

Again, appreciate all the work you've done on this!

liz3 · 2024-10-29T22:52:36Z

I am not entirely sure why the Windows tests fail; there does not seem to be a clear test which is failing.
Sadly, at the moment I don't have access to a Windows system.

Calling waitpid before reading the piped content can fill the pipe and leave the child process blocked ("stuck"), so the content should be read BEFORE waitpid is called. There was also a bug in the reallocation logic: the size comparison did not include the chunk that had just been read, which led to heap corruption and in turn caused the reallocation to fail.
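The two fixes described in that commit message can be sketched with standard POSIX calls. This is a minimal illustration, not the code from the commit: the function name `run_and_capture`, the buffer sizes, and the error handling are assumptions. It drains the pipe to EOF before calling waitpid, and it includes the size of the chunk just read when deciding whether the buffer must grow.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs argv[0] and captures its stdout.
 * Returns a malloc'd, NUL-terminated buffer (caller frees) or NULL on failure.
 * *status receives the status reported by waitpid. */
char *run_and_capture(char *const argv[], int *status) {
    int fds[2];
    if (pipe(fds) == -1) return NULL;

    pid_t pid = fork();
    if (pid == -1) { close(fds[0]); close(fds[1]); return NULL; }

    if (pid == 0) {                   /* child: send stdout into the pipe */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127);                   /* only reached if exec failed */
    }

    close(fds[1]);                    /* parent keeps only the read end */
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);
    char chunk[256];
    ssize_t n;

    /* Read until EOF first; a child that fills the pipe would otherwise
     * block forever while the parent sits in waitpid. */
    while (buf != NULL && (n = read(fds[0], chunk, sizeof chunk)) > 0) {
        /* The comparison includes the chunk just read plus the NUL byte. */
        if (len + (size_t)n + 1 > cap) {
            while (len + (size_t)n + 1 > cap) cap *= 2;
            char *grown = realloc(buf, cap);
            if (grown == NULL) { free(buf); buf = NULL; break; }
            buf = grown;
        }
        memcpy(buf + len, chunk, (size_t)n);
        len += (size_t)n;
    }
    close(fds[0]);

    waitpid(pid, status, 0);          /* reap the child after draining */
    if (buf != NULL) buf[len] = '\0';
    return buf;
}
```

Calling it with an argument vector such as `{"echo", "hello", NULL}` should return the child's output as a string. Comparing only `len` against `cap`, without the new chunk, is the kind of off-by-a-chunk mistake that lets `memcpy` write past the end of the buffer.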

This reverts commit 4f856d5.

liz3 · 2024-11-06T01:32:41Z

@Jason2605 I believe this is ready for review / merge now. I have fixed the issue with the Windows builds, which turned out to be an error on my side, but it took a good while to find. I see that there's a macOS failure, but I have run the tests on my MacBook and they all pass.

Jason2605 · 2024-11-06T18:59:09Z

Nice one thank you! I don't have an ARM mac myself to exactly test unfortunately but will have a see if I can track someone down that does.

@briandowns be good to get your thoughts on this PR too!

liz3 · 2024-11-06T20:06:18Z

I have an ARM Mac; in fact, I only have Apple silicon machines.

liz3 · 2024-11-06T20:11:01Z

The way the runners fail also makes it hard to see the error. The log does not really give you a hint to start debugging, but I ran all the tests on an M2 Air last night.

Jason2605 · 2024-11-06T20:28:22Z

Oh interesting I see. Yeah the logs are no help at all really 😂 Usually when it's just a dump on code 1 it's a segfault somewhere, so will need to see if we can track it

liz3 · 2024-11-06T20:29:45Z

Alright, then maybe AddressSanitizer can help find it if it's heap corruption. I'll try that.

liz3 · 2024-11-07T01:02:46Z

@Jason2605 was able to repro with a debug build and fixed it: f8d4818

liz3 · 2024-11-07T01:05:34Z

okay that's satisfying to be entirely honest!

Jason2605 · 2024-11-07T20:46:33Z

Incredible stuff, this is so so so nice!!!!!

Jason2605

Few comments, most are nitpick style ones. Again thank you so much for this, I appreciate the effort that has gone into this :)

src/vm/datatypes/strings.c

Jason2605 · 2024-11-07T20:51:22Z

src/vm/datatypes/strings.c

@@ -239,7 +270,13 @@ static Value findString(DictuVM *vm, int argCount, Value *args) {
        runtimeError(vm, "find() takes either 1 or 2 arguments (%d given)", argCount);
        return EMPTY_VAL;
    }
-
+    {


Do we need the extra scope here? It seems to happen throughout quite a few times

liz3 (Nov 8, 2024): I have refactored them so they no longer do that. The extra scopes were there to avoid name shadowing, but the fix was pretty easy.

Jason2605 · 2024-11-07T20:52:24Z

src/vm/datatypes/strings.c

    if (argCount != 1) {
        runtimeError(vm, "contains() takes 1 argument (%d given)", argCount);
        return EMPTY_VAL;
    }
-
+    {
+        ObjString *string = AS_STRING(args[0]);


If we're checking utf8 legality would we also need to check the needle (args[1])?

liz3 · 2024-11-08T00:18:16Z

@Jason2605 I ran clang-format over the file to fix the indentation.

Jason2605 · 2024-11-08T21:35:20Z

Noice yeah I should probably add something to the GH actions for that!

Very much appreciated, once again thank you for this!

liz3 added 3 commits September 18, 2024 03:09

wip: utf-8 support in strings

224dff4

fix compile error

28d47a7

validity check

9b3f32a

liz3 added 4 commits October 26, 2024 23:02

fix: need this fix for it to compile

57d601a

implement general utf-8 check, implement the indexing, slicing

67ad5fe

use the length version of the take/copyString functions

cb1198d

need to cast this

e6ee09f

liz3 added 3 commits October 27, 2024 01:51

oh my god it took me forever to figure this out

8269774

add isValidUtf8

ce27a06

tests: add unicode slicing and indexing tests

571c73e

liz3 added 4 commits October 28, 2024 17:33

fix: only run index and slice utf8 version if there are multibyte cha…

a98d2d0

…racters

fix: correctly reimplement findLast

e8a3d35

tests: implement unicode tests for other string functions

742c89f

fix errors

c253114

liz3 marked this pull request as ready for review October 29, 2024 21:05

liz3 changed the title ~~wip: utf-8 support in strings~~ utf-8 support in strings Oct 29, 2024

liz3 added 10 commits October 30, 2024 22:15

length of this is 36

218579f

Merge branch 'liz3/symbolic-links' into liz3/utf8-experiments

3d68c67

reenable this

f85654b

UUIDS are length 36

a676173

Merge remote-tracking branch 'origin/develop' into liz3/utf8-experiments

a9d86db

fix: refactor scanner to use utf8, windows is confusing me

4f856d5

Revert "fix: refactor scanner to use utf8, windows is confusing me"

2aed1ec

This reverts commit 4f856d5.

fix: need to set the null byte here

24cd958

remove unused var

e8ceb82

fix issue with uuid module

f8d4818

chore: some reformatting

9d89a15

Jason2605 reviewed Nov 7, 2024

liz3 and others added 7 commits November 8, 2024 00:45

Update src/vm/datatypes/strings.c

44bc1ec

Co-authored-by: Jason_000 <[email protected]>

Update src/vm/datatypes/strings.c

aaac64e

Co-authored-by: Jason_000 <[email protected]>

Update src/vm/datatypes/strings.c

21e3d17

Co-authored-by: Jason_000 <[email protected]>

refactor scopes

522d432

Update src/vm/datatypes/strings.c

d0bf48b

Co-authored-by: Jason_000 <[email protected]>

Update src/vm/datatypes/strings.c

c0f3533

Co-authored-by: Jason_000 <[email protected]>

style: format file

9fa46e3

Jason2605 merged commit 643f2c9 into dictu-lang:develop Nov 8, 2024
9 checks passed

Jason2605 mentioned this pull request Nov 8, 2024

[BUG] Test for String.find falsely passes #748

Closed

1 task
