Improving spawn and exec syscalls #4785

mohanson · 2025-01-17T02:00:41Z

What problem does this PR solve?

Problem Summary:

The spawn and exec system calls have some usage issues. The overall operation is too complicated and requires additional checks when reading args from the vm. We found that these checks are unnecessary and can be avoided by other means.

Note: This PR is based on the ckb-vm develop branch. It needs to wait for nervosnetwork/ckb-vm#450 to be merged and for ckb-vm v0.24.13 to be released.

What is changed and how it works?

What's Changed:

Use ckb-vm's new FlattenedArgsReader to simplify reading args in spawn and exec syscalls.
Add exec v2 implementation.

Related changes

PR to update owner/repo:
Need to cherry-pick to the release branch

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code ci-runs-only: [ quick_checks,linters ]

Side effects

Performance regression
Breaking backward compatibility

Release note

Note: Add a note under the PR title in the release note.

eval-exec

Please rebase onto develop.

script/src/types.rs

xxuejie · 2025-01-17T03:23:50Z

Just a note that a significant portion of the changes here is due to the fact that the PR is now applied directly to the develop branch. Per CKB-VM's release schedule, CKB will for now stick to the v0.24.x series of CKB-VM. So when the PR at CKB-VM is merged, cherry-picked, and released as a proper v0.24.x version, many of the changes here will be reverted.

eval-exec · 2025-01-17T11:20:28Z

script/src/syscalls/exec_v2.rs

+                machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND));
+                return Ok(true);
+            }
+        }


What should ExecV2#ecall do when length == 0?

It will try to load an empty elf file, which will result in elf parsing errors.

Should we handle the check length == 0 here?

Please forgive me, I made a mistake earlier: when length == 0, data[offset..] will be read instead of empty bytes.

The current code behavior is correct.

eval-exec · 2025-01-17T11:23:15Z

script/src/syscalls/exec_v2.rs

+            return Ok(true);
+        }
+        if length > 0 {
+            let end = offset.checked_add(length).ok_or(VMError::MemOutOfBound)?;


Both offset and length are <= u32::MAX, so offset.checked_add(length) will always be Some.
So I think L92->L96 can be omitted.

Both offset and length are <= u32::MAX, so offset.checked_add(length) will always be Some.

This description is correct, and we can probably remove this redundant check.

L92->L96 is used to determine whether the data offset exceeds the total length of the data. Why can it be omitted?

I think

    if offset >= full_length {
        machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND));
        return Ok(true);
    }
    if length > 0 {
        let end = offset.checked_add(length).ok_or(VMError::MemOutOfBound)?;
        if end > full_length {
            machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND));
            return Ok(true);
        }
    }

is equal to:

    let end = offset.checked_add(length).ok_or(VMError::MemOutOfBound)?;
    if end >= full_length {
        machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND));
        return Ok(true);
    }

Is the above replacement correct?
And should ExecV2#ecall check length == 0 here?

~~You are right, we can combine the two checks and don't need to check the case where length == 0.~~

We can't combine them. Consider offset == 0 and length == data.len(): we should expect it to succeed, but the code you gave will return a SLICE_OUT_OF_BOUND error.
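
To make the counterexample concrete, the two forms can be modeled as pure functions. This is a minimal sketch, not the PR's code: `two_step_check` and `combined_check` are hypothetical names, and a `bool` result stands in for the real behavior (setting `A0` to `SLICE_OUT_OF_BOUND`, or returning a VM error on overflow).

```rust
/// Models the original two-step check. `true` means the slice is accepted.
/// Assumption: rejection and overflow are both collapsed into `false` here,
/// whereas the real code distinguishes SLICE_OUT_OF_BOUND from a VM error.
fn two_step_check(offset: u64, length: u64, full_length: u64) -> bool {
    if offset >= full_length {
        return false;
    }
    if length > 0 {
        match offset.checked_add(length) {
            // The end of the slice may sit exactly at the end of the data.
            Some(end) if end <= full_length => {}
            _ => return false,
        }
    }
    true
}

/// Models the proposed single check, which rejects when `end >= full_length`.
fn combined_check(offset: u64, length: u64, full_length: u64) -> bool {
    match offset.checked_add(length) {
        Some(end) => end < full_length,
        None => false,
    }
}
```

With `offset == 0` and `length == full_length`, the two-step form accepts the request while the combined form rejects it, which is the difference described above.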

eval-exec · 2025-01-17T11:31:22Z

script/src/syscalls/exec_v2.rs

+        let argc = machine.registers()[A4].clone();
+        let argv = machine.registers()[A5].clone();


L53->L55 is:

    let index = machine.registers()[A0].to_u64();
    let mut source = machine.registers()[A1].to_u64();
    let place = machine.registers()[A2].to_u64();

So how about using .to_u64() here too?

Suggested change

-        let argc = machine.registers()[A4].clone();
-        let argv = machine.registers()[A5].clone();
+        let argc = machine.registers()[A4].to_u64();
+        let argv = machine.registers()[A5].to_u64();

Yes, we don't have to clone the data here.

eval-exec · 2025-01-17T11:44:44Z

script/src/scheduler.rs

+        data_piece_id: &DataPieceId,
+        offset: u64,
+        length: u64,
+        args: Option<(u64, u64, u64)>,


How about creating a new struct:

    struct VmArgs {
        vm_id: u64,
        argc: u64,
        argv: u64,
    }

to replace the anonymous tuple (u64, u64, u64)? This would improve code readability and make the parameters more self-explanatory.

In the previous internal review, Xuejie suggested that I use Option<(u64, u64, u64)>. I am not sure whether I should make a struct for them now.

I'm not sure either.

I re-checked the offline communication logs. Previously this was represented using the following structure:

    pub enum BootArgsType {
        Static,
        Stream { vm_id: u64, argc: u64, argp: u64 },
    }

Since the enum had only 2 variants, one with 3 arguments and the other empty, I believed that this 2-value enum could simply be an Option.

I have no opinion on whether it is Option<(u64, u64, u64)> or Option<VmArgs>; both feel the same to me. I never said (u64, u64, u64) is more readable than VmArgs, ArgvPointer, or another name for the structure, but I do understand that others may have a different opinion.
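
For comparison, here is a small sketch of how the two shapes look at a use site. The `describe_*` functions are hypothetical and exist only for illustration; the "static args" label borrows the name of the `Static` variant mentioned above.

```rust
/// Named alternative to the `(u64, u64, u64)` tuple, as proposed in the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VmArgs {
    pub vm_id: u64,
    pub argc: u64,
    pub argv: u64,
}

/// Tuple form: the reader has to know the positional order of the fields.
fn describe_tuple(args: Option<(u64, u64, u64)>) -> String {
    match args {
        Some((vm_id, argc, argv)) => format!("vm {vm_id}: argc={argc}, argv={argv:#x}"),
        None => String::from("static args"),
    }
}

/// Struct form: the field names document themselves at the pattern.
fn describe_struct(args: Option<VmArgs>) -> String {
    match args {
        Some(VmArgs { vm_id, argc, argv }) => {
            format!("vm {vm_id}: argc={argc}, argv={argv:#x}")
        }
        None => String::from("static args"),
    }
}
```

Both forms carry the same information, so the choice is mainly about how much the call sites benefit from named fields.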

xxuejie · 2025-01-20T05:42:18Z

script/src/scheduler.rs

+                        args.length,
+                        Some((vm_id, args.argc, args.argv)),
+                    )?;
+                    self.instantiated.insert(vm_id, (context, new_machine));


Please cap the maximum number of instantiated virtual machines here.

Exec always removes an old running VM instance and then adds a new one, so the maximum number of instantiated VMs will not be exceeded.

Ah I see, in that sense this code is correct, but can we change the code to the following flow to better illustrate this fact:

    let machine_n_context = self.instantiated.get_mut(&vm_id)...
    {
        let old_machine = &machine_n_context.1;
        ...
    }
    ...
    *machine_n_context = (context, new_machine);

The later insert line can be quite confusing, and one day we may forget about this underlying logic.

Written this way, machine_n_context borrows self as mutable, so self.create_dummy_vm and self.load_vm_program can no longer be used.

I see, you are right here. In this sense I recommend the following lines as a hint:

    debug_assert!(self.instantiated.contains_key(&vm_id));
    self.instantiated.insert(vm_id, (context, new_machine));

OK. I'll add a comment as well.
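
The pattern agreed on above can be sketched in isolation as follows. This is a simplified model under stated assumptions: `Context`, `Machine`, and the signature of `create_dummy_vm` are hypothetical stand-ins, not the scheduler's real types.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the scheduler's real context and machine types.
struct Context;
struct Machine {
    program: Vec<u8>,
}

struct Scheduler {
    instantiated: HashMap<u64, (Context, Machine)>,
}

impl Scheduler {
    // Takes `&mut self`, so it cannot be called while an entry obtained from
    // `self.instantiated.get_mut(..)` is still borrowed.
    fn create_dummy_vm(&mut self, program: &[u8]) -> (Context, Machine) {
        (Context, Machine { program: program.to_vec() })
    }

    /// Replaces the machine stored under `vm_id` with a freshly built one.
    fn exec(&mut self, vm_id: u64, program: &[u8]) {
        let (context, new_machine) = self.create_dummy_vm(program);
        // Exec only replaces an existing VM, so the number of instantiated
        // VMs does not grow and the key must already be present.
        debug_assert!(self.instantiated.contains_key(&vm_id));
        self.instantiated.insert(vm_id, (context, new_machine));
    }
}
```

The `debug_assert!` documents the invariant without holding a mutable borrow across the helper call.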

xxuejie · 2025-01-20T05:48:20Z

script/src/types.rs

+pub struct ExecV2Args {
+    pub data_piece_id: DataPieceId,
+    pub offset: u64,
+    pub length: u64,


Notice that both ExecV2Args and SpawnArgs contain data_piece_id, offset and length. All 3 are required to locate a program. We should group them into a struct to distinguish them from the other parameters.

I will add a new structure:

    pub struct DataLocation {
        pub data_piece_id: DataPieceId,
        pub offset: u64,
        pub length: u64,
    }
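
As a rough illustration of the grouping, the sketch below embeds `DataLocation` in an exec-style argument struct. The `DataPieceId` enum here is a cut-down stand-in with two illustrative variants, and `describe` is a hypothetical helper; neither is taken from the PR.

```rust
/// Stand-in for the real `DataPieceId`; only two illustrative variants.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DataPieceId {
    Input(u32),
    CellDep(u32),
}

/// The three values needed to locate a program, grouped together.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DataLocation {
    pub data_piece_id: DataPieceId,
    pub offset: u64,
    pub length: u64,
}

/// Sketch of exec v2 arguments once the location is its own field.
pub struct ExecV2Args {
    pub location: DataLocation,
    pub argc: u64,
    pub argv: u64,
}

/// Hypothetical helper: code that only needs the location can take
/// `&DataLocation`, whichever syscall the arguments came from.
pub fn describe(location: &DataLocation) -> String {
    format!(
        "{:?}[{}..+{}]",
        location.data_piece_id, location.offset, location.length
    )
}
```

A spawn-style argument struct could hold the same `location` field, so helpers like `describe` would not need to be duplicated.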

xxuejie · 2025-01-20T05:50:39Z

script/src/syscalls/exec_v2.rs

+        let offset = bounds >> 32;
+        let length = bounds as u32 as u64;
+
+        // We are fetching the actual cell here for some in-place validation
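
For reference, the unpacking in the hunk above can be exercised on its own. This is a minimal sketch: `unpack_bounds` mirrors the two quoted lines, while `pack_bounds` is a hypothetical inverse added only to show how a caller might build the value.

```rust
/// Splits a packed bounds value: the high 32 bits are the offset and the
/// low 32 bits are the length.
fn unpack_bounds(bounds: u64) -> (u64, u64) {
    let offset = bounds >> 32;
    // The cast to u32 truncates to the low 32 bits before widening again.
    let length = bounds as u32 as u64;
    (offset, length)
}

/// Hypothetical inverse of `unpack_bounds`, for illustration only.
fn pack_bounds(offset: u32, length: u32) -> u64 {
    ((offset as u64) << 32) | length as u64
}
```

Because each half is limited to 32 bits, both values are at most `u32::MAX`, which is why adding them as `u64` cannot overflow.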


Is there a chance we can move those validations to the scheduler part, so we only load the actual program once?

That should be possible, let me simplify it.

In the current implementation, it is not possible to load data only once.

Because in fact, they are two different calls. The first one is sc.load_data(&data_piece_id, 0, 0) called in exec_v2, the purpose is to obtain the total length of cell_data or witness, and the second one is sc.load_data(&location.data_piece_id, location.offset, location.length), the purpose is to obtain the actual program to be executed from cell_data or witness.

What if we move the validation code till later part when the length and the data are fetched in the scheduler part? Total length is only there for verification purposes, and loading per length, should already do many of the verification work.

We might still hit other roadblocks; I just wonder if the code can be simplified in v2.

I tried to do this, but encountered some compatibility issues, and fixing these compatibility issues may involve modules that this PR does not care about.

For example, in DataSource, sc.load_data always returns a Bytes array (possibly an empty array) regardless of offset and length. Therefore, when the developer incorrectly passes in offset and length, in your modification plan, the scheduler will incorrectly execute the elf parsing step and return VMError.

In the old exec implementation, we would first do a bounds check and then try to parse elf.

But the DataSource does no bounds checking -- if you go out of bounds, it returns empty bytes.

Both spawn and exec_v2 contain this validation code. I created a function to avoid duplication of code. Do you think this is OK? 74fd84b

I believe we now have a second chance to re-design exec in v2. The old exec implementation can indeed provide inspiration, but I don't believe we should follow the old rules exactly. It might be worthwhile to rethink what we really need here.

I've been toying with the code here, and this commit fully captures my opinions. Of course it changes current exec's behavior: out-of-bound offset/length values no longer result in errors, they only chop the loaded binary as much as possible (e.g., when offset is bigger than the data length, an empty buffer will be returned).

In a v2 design, we are allowed to change exec's behavior.

If we really think about this piece of code, I'm not sure it makes sense anymore. When loading any other type of data, CKB never generates out-of-bound errors; it simply returns as much data as possible. I personally do not have a convincing reason why loading binaries should be any different.

So in my personal opinion, I would remove those redundant checks on offset/length in exec's V2 design. We might not be able to do the same thing this time for spawn (since it is already activated on testnet), but I do believe it makes sense to alter exec's design now, and also change spawn's design when we have a chance.
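
The "return as much data as possible" behavior described above can be modeled with a small slice helper. This is a sketch under assumptions, not the real DataSource: `load_clamped` is a hypothetical name, and treating `length == 0` as "read to the end" follows the earlier comment in this thread.

```rust
/// Returns as much of `data[offset..offset + length]` as exists.
/// `length == 0` is treated as "read from `offset` to the end".
/// Out-of-range requests are clamped instead of producing an error.
fn load_clamped(data: &[u8], offset: u64, length: u64) -> &[u8] {
    let total = data.len() as u64;
    // An offset past the end collapses to an empty slice at the end.
    let start = offset.min(total);
    let end = if length == 0 {
        total
    } else {
        // saturating_add avoids overflow for very large requests.
        start.saturating_add(length).min(total)
    };
    &data[start as usize..end as usize]
}
```

Under this model an out-of-range request yields a short or empty buffer, and any failure would surface later when the buffer is parsed as a program.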

I have made the changes according to your opinions.

This reverts commit 74fd84b.

xxuejie

Just some minor comments on the necessity of following old code behaviors

xxuejie · 2025-01-21T05:01:40Z

script/src/scheduler.rs

+                            args.location.length,
+                        ) {
+                            Ok(val) => val,
+                            Err(Error::SnapshotDataLoadError) => {


Sorry I missed this, for a V2 clean design, can we just return the snapshot data load error here? I think there is no need to mimic v1 behavior.

Yes, we should return a VMError

xxuejie · 2025-01-21T05:03:12Z

script/src/syscalls/exec_v2.rs

+        let index = machine.registers()[A0].to_u64();
+        let mut source = machine.registers()[A1].to_u64();
+        let place = machine.registers()[A2].to_u64();
+        // To keep compatible with the old behavior. When Source is wrong, a


I do want to ask here: if old behavior is not a concern, what should a proper solution here be?

Same question goes for Place parsing below

I suggest removing this part of the compatibility code.

mohanson · 2025-01-22T01:51:28Z

@eval-exec I have no idea why CI fails with "pthread lock: Invalid argument"; can you take a look? https://github.com/nervosnetwork/ckb/actions/runs/12885771772/job/35965802195?pr=4785

eval-exec · 2025-01-22T03:07:25Z

OK, that error occurred before, and it seems to be related to ckb-rocksdb. Investigating...

eval-exec · 2025-01-22T06:15:45Z

Latest unit test failed:

──── STDERR:             ckb-script verify::tests::ckb_2023::features_since_v2021::check_typical_secp256k1_blake160_2_in_2_out_tx_with_state
thread 'verify::tests::ckb_2023::features_since_v2021::check_typical_secp256k1_blake160_2_in_2_out_tx_with_state' panicked at script/src/verify/tests/ckb_latest/features_since_v2021.rs:878:9:
step_cycles 3242342

https://github.com/nervosnetwork/ckb/actions/runs/12885771772/job/35970732805?pr=4785

mohanson · 2025-01-22T06:29:13Z

The incorrect test case has been fixed; please re-run CI.

mohanson requested a review from a team as a code owner January 17, 2025 02:00

mohanson requested review from quake and removed request for a team January 17, 2025 02:00

eval-exec reviewed Jan 17, 2025

eval-exec added the m:vm label Jan 17, 2025

eval-exec requested review from doitian, xxuejie, zhangsoledad and driftluo January 17, 2025 02:02

mohanson marked this pull request as draft January 17, 2025 02:08

xxuejie reviewed Jan 17, 2025

mohanson force-pushed the args_reader branch from 672649a to 8e0553f Compare January 17, 2025 02:28

mohanson force-pushed the args_reader branch from 1690ab6 to 0d6496b Compare January 17, 2025 08:56

mohanson marked this pull request as ready for review January 17, 2025 09:01

mohanson added 4 commits January 17, 2025 17:02

Implement FlattenedArgsReader

87f41c9

Add exec v2

ed787ce

Rename load_c_string => load_c_string_byte_by_byte

34676ba

Use ckb-vm's release-0.24 branch

872809a

mohanson force-pushed the args_reader branch from 37fdbb5 to 872809a Compare January 17, 2025 09:02

Update ckb-vm to v0.24.13

98c41ab

mohanson requested review from xxuejie and eval-exec January 17, 2025 09:22

eval-exec reviewed Jan 17, 2025

eval-exec previously approved these changes Jan 17, 2025

Follow eval-exec's comments

007704d

mohanson dismissed eval-exec’s stale review via 007704d January 17, 2025 12:38

Fix ci lint

41a154e

xxuejie reviewed Jan 20, 2025

mohanson added 5 commits January 20, 2025 15:28

Follow xuejie's comments

6429889

Use add_cycles_no_checking instead of add_cycles in exec_v2

ac0c2b0

Create a function to avoid duplication of code

74fd84b

Revert "Create a function to avoid duplication of code"

cf2d2ef

This reverts commit 74fd84b.

Remove out of bounds check in exec_v2

5136613

xxuejie reviewed Jan 21, 2025

Follow xuejie's comments

9136fbc

xxuejie previously approved these changes Jan 21, 2025

Fix failed test case

b8a6620

mohanson dismissed xxuejie’s stale review via b8a6620 January 21, 2025 11:14

Fix ci

4672878

eval-exec approved these changes Jan 22, 2025

eval-exec added the t:enhancement Type: Feature, refactoring. label Jan 22, 2025

xxuejie approved these changes Jan 23, 2025
