-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving spawn and exec syscalls #4785
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase onto develop
.
Just a note that a significant portion of the changes here, is due to the fact that the PR is now applied directly to the develop branch. Per CKB-VM's release schedule, CKB will for now stick to |
script/src/syscalls/exec_v2.rs
Outdated
machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND)); | ||
return Ok(true); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should ExecV2#ecall
do when length == 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will try to load an empty elf file, which will result in elf parsing errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we handle the check length == 0
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please forgive me, I made a mistake earlier, when length == 0, data[offset...] will be read instead of empty bytes;
The current code behavior is correct.
script/src/syscalls/exec_v2.rs
Outdated
return Ok(true); | ||
} | ||
if length > 0 { | ||
let end = offset.checked_add(length).ok_or(VMError::MemOutOfBound)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both offset
and length
are <= u32::MAX
, so offset.checked_add(length)
will be always a Some
So I think L92->96 can be omitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both offset and length are <= u32::MAX, so offset.checked_add(length) will be always a Some.
This description is correct, and we can probably remove this redundant check.
L92->L96 is used to determine whether the data offset exceeds the total length of the data. Why can it be omitted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L92->L96 is used to determine whether the data offset exceeds the total length of the data. Why can it be omitted?
I think
if offset >= full_length {
machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND));
return Ok(true);
}
if length > 0 {
let end = offset.checked_add(length).ok_or(VMError::MemOutOfBound)?;
if end > full_length {
machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND));
return Ok(true);
}
}
is equal to:
let end = offset.checked_add(length).ok_or(VMError::MemOutOfBound)?;
if end >= full_length {
machine.set_register(A0, Mac::REG::from_u8(SLICE_OUT_OF_BOUND));
return Ok(true);
}
Is above replacement correct?
And should ExecV2#ecall
check length == 0
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, we can combine the two checks and don't need to check the case where length == 0.
We can't combine it. Considering offset == 0 and length == data.len(), we should expect it to succeed, but in the code you gave, it will return SLICE_OUT_OF_BOUND error,
script/src/syscalls/exec_v2.rs
Outdated
let argc = machine.registers()[A4].clone(); | ||
let argv = machine.registers()[A5].clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L53->L55 is:
let index = machine.registers()[A0].to_u64();
let mut source = machine.registers()[A1].to_u64();
let place = machine.registers()[A2].to_u64();
So how about use .to_u64()
too
let argc = machine.registers()[A4].clone(); | |
let argv = machine.registers()[A5].clone(); | |
let argc = machine.registers()[A4].to_u64(); | |
let argv = machine.registers()[A5].to_u64(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we don't have to clone the data here.
data_piece_id: &DataPieceId, | ||
offset: u64, | ||
length: u64, | ||
args: Option<(u64, u64, u64)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about creating a new struct:
struct VmArgs {
vm_id: u64,
argc: u64,
argv: u64,
}
to replace the anonymouse turple (u64, u64, u64)
? This would improve code readability and make the parameters more self-explanatory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the previous internal review, Xuejie suggested that I use Option<(u64, u64, u64)>
. I am not sure whether I should make a struct for them now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did re-check offline communication logs, previously this is represented using the following structure:
pub enum BootArgsType {
Static,
Stream { vm_id: u64, argc: u64, argp: u64 },
}
Since the enum only had 2 variant: one that has 3 arguments, while the other is empty. I just believe that this 2-value enum, could in fact simply be an Option
.
I have no opinion whether it is Option<(u64, u64, u64)>
or Option<VmArgs>
, both feel the same to me. I never say (u64, u64, u64)
is better in readability to VmArgs
or ArgvPointer
or another name for the structure, both feel the same to me but I do understand a different opinion might occur.
args.length, | ||
Some((vm_id, args.argc, args.argv)), | ||
)?; | ||
self.instantiated.insert(vm_id, (context, new_machine)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please cap the number of maximum instantiated virtual machines here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exec will always remove an old running vm instance and then join a new one, so the maximum number of instantiations will not be triggered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, in that sense this code is correct, but can we change the code to the following flow to better illustrate this fact:
let machine_n_context = self.instantiated.get_mut(&vm_id)...
{
let old_machine = &machine_n_context.1;
...
}
...
*machine_n_context = (context, new_machine);
The latter insert line can be quite confusing and maybe one day we forgot about this underlying logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this way of writing, machine_n_context
borrows self
as mutable, This will cause self.create_dummy_vm
and self.load_vm_program
to no longer be usable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, you are right here. In this sence I recommend the following line as a hint:
debug_assert!(self.instantiated.contains_key(&vm_id));
self.instantiated.insert(vm_id, (context, new_machine));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll add a comment as well.
pub struct ExecV2Args { | ||
pub data_piece_id: DataPieceId, | ||
pub offset: u64, | ||
pub length: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice both ExecV2Args
and SpawnArgs
contain data_piece_id
, offset
and length
. All 3 are required to locate a program. We should make them a struct to distinguish them from other parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a new structure
pub struct DataLocation {
pub data_piece_id: DataPieceId,
pub offset: u64,
pub length: u64,
}
script/src/syscalls/exec_v2.rs
Outdated
let offset = bounds >> 32; | ||
let length = bounds as u32 as u64; | ||
|
||
// We are fetching the actual cell here for some in-place validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a chance we can move those validations to the scheduler part, so we only loads the actual program once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should be possible, let me simplify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current implementation, it is not possible to load data only once.
Because in fact, they are two different calls. The first one is sc.load_data(&data_piece_id, 0, 0)
called in exec_v2, the purpose is to obtain the total length of cell_data or witness, and the second one is sc.load_data(&location.data_piece_id, location.offset, location.length)
, the purpose is to obtain the actual program to be executed from cell_data or witness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we move the validation code till later part when the length and the data are fetched in the scheduler part? Total length is only there for verification purposes, and loading per length
, should already do many of the verification work.
Still we might hit other roadblockers, I just wonder if the code can be simplified in v 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to do this, but encountered some compatibility issues, and fixing these compatibility issues may involve modules that this PR does not care about.
For example, in DataSource, sc.load_data always returns a Bytes array (possibly an empty array) regardless of offset and length. Therefore, when the developer incorrectly passes in offset and length, in your modification plan, the scheduler will incorrectly execute the elf parsing step and return VMError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the old exec implementation, we would first do a bounds check and then try to parse elf.
But the DataSource does no bounds checking -- if you go out of bounds, it returns empty bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both spawn
and exec_v2
contain this validation code. I created a function to avoid duplication of code. Do you think this is OK? 74fd84b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we now have a second chance to re-design exec in v2, the old exec implementation can indeed provide inspiration but I don't believe we should strictly follow the old rules exactly. Might be worthwhile to rethink what we really need here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been toying with the code here, and this commit fully captures my opinions. Of course it changes current exec's behavior, so out-of-bound offset/length does not result in errors anymore, they only chop the loaded binary as much as possible(e.g., when offset is bigger than data length, an empty buffer will be returned).
- In a v2 design, we are allowed to change exec's behavior
- If we really think about this piece of code, I'm not sure if they make sense anymore. when loading any other type of data, CKB never generates out-of-bound errors, it simply returns as much data as possible. I personally do not have a convincing reason why loading binaries should be any different.
So on a personal opinion, I would remove those redundant checks on offset/length in exec's V2 design. We might not be able to do the same thing this time for spawn(since it already is activated on testnet), but I do believe it makes sense to alter exec's design now, and also change spawn's design when we have a chance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made the changes according to your opinions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor comments on the necessity of following old code behaviors
script/src/scheduler.rs
Outdated
args.location.length, | ||
) { | ||
Ok(val) => val, | ||
Err(Error::SnapshotDataLoadError) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed this, for a V2 clean design, can we just return the snapshot data load error here? I think there is no need to mimic v1 behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should return a VMError
script/src/syscalls/exec_v2.rs
Outdated
let index = machine.registers()[A0].to_u64(); | ||
let mut source = machine.registers()[A1].to_u64(); | ||
let place = machine.registers()[A2].to_u64(); | ||
// To keep compatible with the old behavior. When Source is wrong, a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do want to ask here: if old behavior is not a concern, what a proper solution here should be?
Same question goes for Place
parsing below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest removing this part of the compatibility code.
@eval-exec I have no idea why |
Ok, That error occurred before, and it seems to be related to ckb-rocksdb. Investigating... |
Latest unit test failed:
https://github.com/nervosnetwork/ckb/actions/runs/12885771772/job/35970732805?pr=4785 |
The incorrect test case has been fixed, please re-execute ci |
What problem does this PR solve?
Problem Summary:
The spawn and exec system calls have some usage issues. The overall operation is too complicated and requires additional checks when reading args from the vm. We found that these checks are unnecessary and can be avoided by other means.
What is changed and how it works?
What's Changed:
FlattenedArgsReader
to simplify reading args in spawn and exec syscalls.Related changes
owner/repo
:Check List
Tests
Side effects
Release note