Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving spawn and exec syscalls #4785

Open
wants to merge 15 commits into
base: develop
Choose a base branch
from

Conversation

mohanson
Copy link
Contributor

@mohanson mohanson commented Jan 17, 2025

What problem does this PR solve?

Problem Summary:

The spawn and exec system calls have some usage issues. The overall operation is too complicated and requires additional checks when reading args from the vm. We found that these checks are unnecessary and can be avoided by other means.

Note: This PR is based on ckb-vm develop branch; need to wait for this PR to be merged: nervosnetwork/ckb-vm#450 and then release ckb-vm v0.24.13

What is changed and how it works?

What's Changed:

  • Use ckb-vm's new FlattenedArgsReader to simplify reading args in spawn and exec syscalls.
  • Add exec v2 implementation.

Related changes

  • PR to update owner/repo:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code ci-runs-only: [ quick_checks,linters ]

Side effects

  • Performance regression
  • Breaking backward compatibility

Release note

Note: Add a note under the PR title in the release note.

@mohanson mohanson requested a review from a team as a code owner January 17, 2025 02:00
@mohanson mohanson requested review from quake and removed request for a team January 17, 2025 02:00
Copy link
Collaborator

@eval-exec eval-exec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase onto develop.

@eval-exec eval-exec added the m:vm label Jan 17, 2025
@mohanson mohanson marked this pull request as draft January 17, 2025 02:08
script/src/types.rs Outdated Show resolved Hide resolved
@xxuejie
Copy link
Collaborator

xxuejie commented Jan 17, 2025

Just a note that a significant portion of the changes here, is due to the fact that the PR is now applied directly to the develop branch. Per CKB-VM's release schedule, CKB will for now stick to v0.24.x series of CKB-VM. So when the PR at CKB-VM is merged, cherry-picked, and release to a proper v0.24.x series, many of changes here will be reverted.

@mohanson mohanson marked this pull request as ready for review January 17, 2025 09:01
@mohanson mohanson requested review from xxuejie and eval-exec January 17, 2025 09:22
machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND));
return Ok(true);
}
}
Copy link
Collaborator

@eval-exec eval-exec Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should ExecV2#ecall do when length == 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will try to load an empty elf file, which will result in elf parsing errors.

Copy link
Collaborator

@eval-exec eval-exec Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we handle the check length == 0 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please forgive me, I made a mistake earlier, when length == 0, data[offset...] will be read instead of empty bytes;

The current code behavior is correct.

return Ok(true);
}
if length > 0 {
let end = offset.checked_add(length).ok_or(VMError::MemOutOfBound)?;
Copy link
Collaborator

@eval-exec eval-exec Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both offset and length are <= u32::MAX, so offset.checked_add(length) will be always a Some
So I think L92->96 can be omitted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both offset and length are <= u32::MAX, so offset.checked_add(length) will be always a Some.

This description is correct, and we can probably remove this redundant check.

L92->L96 is used to determine whether the data offset exceeds the total length of the data. Why can it be omitted?

Copy link
Collaborator

@eval-exec eval-exec Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L92->L96 is used to determine whether the data offset exceeds the total length of the data. Why can it be omitted?

I think

        if offset >= full_length {
            machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND));
            return Ok(true);
        }
        if length > 0 {
            let end = offset.checked_add(length).ok_or(VMError::MemOutOfBound)?;
            if end > full_length {
                machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND));
                return Ok(true);
            }
        }

is equal to:

        let end = offset.checked_add(length).ok_or(VMError::MemOutOfBound)?;
        if end >= full_length {
            machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND));
            return Ok(true);
        }

Is above replacement correct?
And should ExecV2#ecall check length == 0 here?

Copy link
Contributor Author

@mohanson mohanson Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, we can combine the two checks and don't need to check the case where length == 0.

We can't combine it. Considering offset == 0 and length == data.len(), we should expect it to succeed, but in the code you gave, it will return SLICE_OUT_OF_BOUND error,

Comment on lines 104 to 105
let argc = machine.registers()[A4].clone();
let argv = machine.registers()[A5].clone();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L53->L55 is:

        let index = machine.registers()[A0].to_u64();
        let mut source = machine.registers()[A1].to_u64();
        let place = machine.registers()[A2].to_u64();

So how about use .to_u64() too

Suggested change
let argc = machine.registers()[A4].clone();
let argv = machine.registers()[A5].clone();
let argc = machine.registers()[A4].to_u64();
let argv = machine.registers()[A5].to_u64();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we don't have to clone the data here.

data_piece_id: &DataPieceId,
offset: u64,
length: u64,
args: Option<(u64, u64, u64)>,
Copy link
Collaborator

@eval-exec eval-exec Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about creating a new struct:

struct VmArgs {
    vm_id: u64,
    argc: u64,
    argv: u64,
}

to replace the anonymouse turple (u64, u64, u64)? This would improve code readability and make the parameters more self-explanatory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the previous internal review, Xuejie suggested that I use Option<(u64, u64, u64)>. I am not sure whether I should make a struct for them now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure either.

Copy link
Collaborator

@xxuejie xxuejie Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did re-check offline communication logs, previously this is represented using the following structure:

pub enum BootArgsType {
    Static,
    Stream { vm_id: u64, argc: u64, argp: u64 },
}

Since the enum only had 2 variant: one that has 3 arguments, while the other is empty. I just believe that this 2-value enum, could in fact simply be an Option.

I have no opinion whether it is Option<(u64, u64, u64)> or Option<VmArgs>, both feel the same to me. I never say (u64, u64, u64) is better in readability to VmArgs or ArgvPointer or another name for the structure, both feel the same to me but I do understand a different opinion might occur.

eval-exec
eval-exec previously approved these changes Jan 17, 2025
args.length,
Some((vm_id, args.argc, args.argv)),
)?;
self.instantiated.insert(vm_id, (context, new_machine));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please cap the number of maximum instantiated virtual machines here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exec will always remove an old running vm instance and then join a new one, so the maximum number of instantiations will not be triggered.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, in that sense this code is correct, but can we change the code to the following flow to better illustrate this fact:

let machine_n_context = self.instantiated.get_mut(&vm_id)...
{
  let old_machine = &machine_n_context.1;
  ...
}
...
*machine_n_context = (context, new_machine);

The latter insert line can be quite confusing and maybe one day we forgot about this underlying logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this way of writing, machine_n_context borrows self as mutable, This will cause self.create_dummy_vm and self.load_vm_program to no longer be usable

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you are right here. In this sence I recommend the following line as a hint:

debug_assert!(self.instantiated.contains_key(&vm_id));
self.instantiated.insert(vm_id, (context, new_machine));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll add a comment as well.

pub struct ExecV2Args {
pub data_piece_id: DataPieceId,
pub offset: u64,
pub length: u64,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice both ExecV2Args and SpawnArgs contain data_piece_id, offset and length. All 3 are required to locate a program. We should make them a struct to distinguish them from other parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a new structure

pub struct DataLocation {
    pub data_piece_id: DataPieceId,
    pub offset: u64,
    pub length: u64,
}

let offset = bounds >> 32;
let length = bounds as u32 as u64;

// We are fetching the actual cell here for some in-place validation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a chance we can move those validations to the scheduler part, so we only loads the actual program once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be possible, let me simplify it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation, it is not possible to load data only once.

Because in fact, they are two different calls. The first one is sc.load_data(&data_piece_id, 0, 0) called in exec_v2, the purpose is to obtain the total length of cell_data or witness, and the second one is sc.load_data(&location.data_piece_id, location.offset, location.length), the purpose is to obtain the actual program to be executed from cell_data or witness.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we move the validation code till later part when the length and the data are fetched in the scheduler part? Total length is only there for verification purposes, and loading per length, should already do many of the verification work.

Still we might hit other roadblockers, I just wonder if the code can be simplified in v 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to do this, but encountered some compatibility issues, and fixing these compatibility issues may involve modules that this PR does not care about.

For example, in DataSource, sc.load_data always returns a Bytes array (possibly an empty array) regardless of offset and length. Therefore, when the developer incorrectly passes in offset and length, in your modification plan, the scheduler will incorrectly execute the elf parsing step and return VMError.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old exec implementation, we would first do a bounds check and then try to parse elf.

But the DataSource does no bounds checking -- if you go out of bounds, it returns empty bytes.

Copy link
Contributor Author

@mohanson mohanson Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both spawn and exec_v2 contain this validation code. I created a function to avoid duplication of code. Do you think this is OK? 74fd84b

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we now have a second chance to re-design exec in v2, the old exec implementation can indeed provide inspiration but I don't believe we should strictly follow the old rules exactly. Might be worthwhile to rethink what we really need here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been toying with the code here, and this commit fully captures my opinions. Of course it changes current exec's behavior, so out-of-bound offset/length does not result in errors anymore, they only chop the loaded binary as much as possible(e.g., when offset is bigger than data length, an empty buffer will be returned).

  1. In a v2 design, we are allowed to change exec's behavior
  2. If we really think about this piece of code, I'm not sure if they make sense anymore. when loading any other type of data, CKB never generates out-of-bound errors, it simply returns as much data as possible. I personally do not have a convincing reason why loading binaries should be any different.

So on a personal opinion, I would remove those redundant checks on offset/length in exec's V2 design. We might not be able to do the same thing this time for spawn(since it already is activated on testnet), but I do believe it makes sense to alter exec's design now, and also change spawn's design when we have a chance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the changes according to your opinions.

Copy link
Collaborator

@xxuejie xxuejie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor comments on the necessity of following old code behaviors

args.location.length,
) {
Ok(val) => val,
Err(Error::SnapshotDataLoadError) => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this, for a V2 clean design, can we just return the snapshot data load error here? I think there is no need to mimic v1 behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should return a VMError

let index = machine.registers()[A0].to_u64();
let mut source = machine.registers()[A1].to_u64();
let place = machine.registers()[A2].to_u64();
// To keep compatible with the old behavior. When Source is wrong, a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do want to ask here: if old behavior is not a concern, what a proper solution here should be?

Same question goes for Place parsing below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest removing this part of the compatibility code.

xxuejie
xxuejie previously approved these changes Jan 21, 2025
@mohanson
Copy link
Contributor Author

mohanson commented Jan 22, 2025

@eval-exec I have no idea why pthread lock: Invalid argument, can you take a look? https://github.com/nervosnetwork/ckb/actions/runs/12885771772/job/35965802195?pr=4785

@eval-exec
Copy link
Collaborator

eval-exec commented Jan 22, 2025

@eval-exec I have no idea why pthread lock: Invalid argument, can you take a look? https://github.com/nervosnetwork/ckb/actions/runs/12885771772/job/35965802195?pr=4785

Ok, That error occurred before, and it seems to be related to ckb-rocksdb. Investigating...

@eval-exec
Copy link
Collaborator

@eval-exec I have no idea why pthread lock: Invalid argument, can you take a look? https://github.com/nervosnetwork/ckb/actions/runs/12885771772/job/35965802195?pr=4785

Latest unit test failed:

──── STDERR:             ckb-script verify::tests::ckb_2023::features_since_v2021::check_typical_secp256k1_blake160_2_in_2_out_tx_with_state
thread 'verify::tests::ckb_2023::features_since_v2021::check_typical_secp256k1_blake160_2_in_2_out_tx_with_state' panicked at script/src/verify/tests/ckb_latest/features_since_v2021.rs:878:9:
step_cycles 3242342

https://github.com/nervosnetwork/ckb/actions/runs/12885771772/job/35970732805?pr=4785

@mohanson
Copy link
Contributor Author

@eval-exec I have no idea why pthread lock: Invalid argument, can you take a look? https://github.com/nervosnetwork/ckb/actions/runs/12885771772/job/35965802195?pr=4785

Latest unit test failed:

──── STDERR:             ckb-script verify::tests::ckb_2023::features_since_v2021::check_typical_secp256k1_blake160_2_in_2_out_tx_with_state
thread 'verify::tests::ckb_2023::features_since_v2021::check_typical_secp256k1_blake160_2_in_2_out_tx_with_state' panicked at script/src/verify/tests/ckb_latest/features_since_v2021.rs:878:9:
step_cycles 3242342

https://github.com/nervosnetwork/ckb/actions/runs/12885771772/job/35970732805?pr=4785

The incorrect test case has been fixed, please re-execute ci

@eval-exec eval-exec added the t:enhancement Type: Feature, refactoring. label Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
m:vm t:enhancement Type: Feature, refactoring.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants