boot directly throws null cap #30 error (embedded zcu102) #263

Open
Vincer239 opened this issue Jan 14, 2025 · 12 comments

Comments

@Vincer239

Hi,
this might be a continuation of #253, or at least I found this bug in that context.

For testing reasons, I set up a simple microkit configuration.

<?xml version="1.0" encoding="UTF-8"?>
<system>
    <memory_region name="axi-lite" size="0x1_000" phys_addr="0x8000_0000"/>
    <memory_region name="dma-out" size="0x1_000" phys_addr="0x7001_0000"/>
    <memory_region name="dma-in" size="0x1_000" phys_addr="0x7002_0000"/>
    
    <protection_domain name="handler1" priority="210" pp="true">
        <map mr="axi-lite" vaddr="0x4000_0000" perms="rw" cached="false" setvar_vaddr="axi_lite_vaddr"/>
        <map mr="dma-out" vaddr="0x4001_0000" perms="rw" cached="false" setvar_vaddr="dma_out_vaddr"/>
        <map mr="dma-in" vaddr="0x4002_0000" perms="rw" cached="false" setvar_vaddr="dma_in_vaddr"/>
        <program_image path="handler1.elf" />   
        <!-- 121 is first ID for IRQ from PL->PS -->
        <irq irq="121" id="5"/>
    </protection_domain>
</system>

I want to write some data in a memory region dma-out and copy that content through PL DMA into another memory region dma-in while communicating with PL over AXI Lite located behind memory region axi-lite. The location of the memory regions was chosen according to Xilinx Documentation (DDR Low bank without memory mapping to hardware).

Microkit compiles without errors. But once I load it to actual hardware, I get:

## Starting application at 0x40000000 ...
LDR|INFO: altloader for seL4 starting
LDR|INFO: Flags:                0x0000000000000001
             seL4 configured as hypervisor
LDR|INFO: Kernel:      entry:   0x0000008000000000
LDR|INFO: Root server: physmem: 0x0000000000254000 -- 0x000000000025c000
LDR|INFO:              virtmem: 0x000000008a000000 -- 0x000000008a008000
LDR|INFO:              entry  : 0x000000008a000000
LDR|INFO: region: 0x00000000   addr: 0x0000000000000000   size: 0x0000000000246000   offset: 0x0000000000000000   type: 0x0000000000000001
LDR|INFO: region: 0x00000001   addr: 0x0000000000254000   size: 0x00000000000070f0   offset: 0x0000000000246000   type: 0x0000000000000001
LDR|INFO: region: 0x00000002   addr: 0x0000000000246000   size: 0x0000000000001788   offset: 0x000000000024d0f0   type: 0x0000000000000001
LDR|INFO: region: 0x00000003   addr: 0x0000000000248000   size: 0x0000000000004ea0   offset: 0x000000000024e878   type: 0x0000000000000001
LDR|INFO: region: 0x00000004   addr: 0x000000000024d000   size: 0x0000000000000060   offset: 0x0000000000253718   type: 0x0000000000000001
LDR|INFO: region: 0x00000005   addr: 0x000000000024e000   size: 0x0000000000004900   offset: 0x0000000000253778   type: 0x0000000000000001
LDR|INFO: region: 0x00000006   addr: 0x0000000000253000   size: 0x0000000000000040   offset: 0x0000000000258078   type: 0x0000000000000001
LDR|INFO: copying region 0x00000000
LDR|INFO: copying region 0x00000001
LDR|INFO: copying region 0x00000002
LDR|INFO: copying region 0x00000003
LDR|INFO: copying region 0x00000004
LDR|INFO: copying region 0x00000005
LDR|INFO: copying region 0x00000006
LDR|INFO: Setting all interrupts to Group 1
LDR|INFO: GICv2 ITLinesNumber: 0x00000005
LDR|INFO: CurrentEL=EL2
LDR|INFO: Resetting CNTVOFF
LDR|INFO: enabling MMU
LDR|INFO: jumping to kernel
Warning:  gpt_cntfrq 33333332, expected 100000000
Bootstrapping kernel
available phys memory regions: 2
  [0..80000000]
  [800000000..880000000]
reserved virt address space regions: 3
  [8000000000..8000246000]
  [8000246000..8000254000]
  [8000254000..800025c000]
Booting all finished, dropped to user space
MON|INFO: Microkit Bootstrap
MON|INFO: bootinfo untyped list matches expected list
MON|INFO: Number of bootstrap invocations: 0x00000009
MON|INFO: Number of system invocations:    0x00000054
MON|INFO: completed bootstrap invocations
<<seL4(CPU 0) [decodeInvocation/646 T0x887ffe7400 "" @8a000198]: Attempted to invoke a null cap #30.>>
ERROR: 0x0000000000000002 seL4_InvalidCapability  invocation idx: 0x00000016.0x00000000
FAIL: invocation error

These regions don't collide with any other memory allocated in report.txt, and on the hardware side it's an open memory region intended for exactly these use cases.
I really dislike that microkit tells me it compiled correctly without errors and then gives me a null cap.
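
For readers who want to sanity-check the placement by hand, one option is to compare each phys_addr in the system description against the RAM ranges the kernel prints at boot ([0..80000000] and [800000000..880000000]). The Python sketch below is only an illustration: the helper names parse_regions and classify are made up for this note, it is not part of the Microkit tool, and it cannot tell whether firmware occupies part of those ranges.

```python
import xml.etree.ElementTree as ET

# Memory regions copied from the system description in this issue.
SYSTEM_XML = """<system>
    <memory_region name="axi-lite" size="0x1_000" phys_addr="0x8000_0000"/>
    <memory_region name="dma-out" size="0x1_000" phys_addr="0x7001_0000"/>
    <memory_region name="dma-in" size="0x1_000" phys_addr="0x7002_0000"/>
</system>"""

# RAM ranges reported by the kernel in the boot log (half-open: start <= a < end).
KERNEL_RAM = [(0x0, 0x8000_0000), (0x8_0000_0000, 0x8_8000_0000)]


def parse_regions(xml_text):
    """Return (name, start, end) for every memory_region with a phys_addr."""
    regions = []
    for mr in ET.fromstring(xml_text).iter("memory_region"):
        if mr.get("phys_addr") is None:
            continue
        # int(..., 0) accepts the 0x prefix and the underscore separators.
        start = int(mr.get("phys_addr"), 0)
        size = int(mr.get("size"), 0)
        regions.append((mr.get("name"), start, start + size))
    return regions


def classify(start, end, ram=KERNEL_RAM):
    """Say whether [start, end) is inside, outside, or straddling a RAM range."""
    for lo, hi in ram:
        if lo <= start and end <= hi:
            return "inside RAM"
        if start < hi and lo < end:
            return "straddles RAM boundary"
    return "outside RAM"


for name, start, end in parse_regions(SYSTEM_XML):
    print(f"{name:8s} {start:#x}..{end:#x} {classify(start, end)}")
```

With the values from this issue, dma-out and dma-in land inside the low RAM bank and axi-lite starts exactly at the end of it, which matches the intent described above.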

@Ivan-Velickovic
Collaborator

Could you please send the report?

I really dislike that microkit tells me it compiled correctly without errors and then gives me a null cap.

It’s not supposed to, so there’s definitely a bug somewhere.

@Ivan-Velickovic
Collaborator

Hmm. I unfortunately cannot reproduce the issue on 1.4.1 or the latest commit on our ZCU102. However, my setup is slightly different since I don't have the same ELF as you.

I modified the 'hello' example in the repository to have this system description:

<?xml version="1.0" encoding="UTF-8"?>
<system>
    <memory_region name="axi-lite" size="0x1_000" phys_addr="0x8000_0000"/>
    <memory_region name="dma-out" size="0x1_000" phys_addr="0x7001_0000"/>
    <memory_region name="dma-in" size="0x1_000" phys_addr="0x7002_0000"/>

    <protection_domain name="hello" priority="210">
        <program_image path="hello.elf" />
        <map mr="axi-lite" vaddr="0x4000_0000" perms="rw" cached="false"/>
        <map mr="dma-out" vaddr="0x4001_0000" perms="rw" cached="false" />
        <map mr="dma-in" vaddr="0x4002_0000" perms="rw" cached="false" />
        <!-- 121 is first ID for IRQ from PL->PS -->
        <irq irq="121" id="5"/>
    </protection_domain>
</system>

but I still get hello world just fine.

@Ivan-Velickovic
Collaborator

Ivan-Velickovic commented Jan 15, 2025

I've tried inflating hello.elf to 16MB in case it's the larger ELF size that's causing the error but I still can't trigger it.

@Vincer239
Author

Vincer239 commented Jan 15, 2025

This is the report generated. It produces the above listed error.
A note on the hardware: on the PL, there is an AXI Lite interface configured which is mapped to 0x8000_0000. There is also a DMA engine (this one) configured, which can potentially access the address space from 0x0 to 0x7fff_ffff. However, it is only configured and otherwise unused on the PL.
We also use microkit-1.4.1, I don't know what specifically you mean with latest commits on zcu102 though.

Edit: I just saw that the report above has another PD involved. I just tried it with just the described PD from above and the same error occurs.
Second PD just for completeness:

<protection_domain name="handler2" priority="200" pp="true">
   <program_image path="handler2.elf" />   
</protection_domain> 

@Ivan-Velickovic
Collaborator

Thanks for sending the report.

I've made a system that produces almost the exact same report; the only differences are in the ELFs. My report contains the exact same 0x00000016 invocation that is failing at run-time for you. Unfortunately, I can't reproduce the error when running on a physical ZCU102 or via QEMU's emulation of the ZCU102.
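
For anyone matching a serial log against their own report, the failing invocation can be pulled out of the monitor's error line with a few lines of Python. This is a rough sketch: the regular expression is fitted to the single error line shown in this issue, and reading the two dot-separated numbers as an invocation index followed by a sub-index is an assumption, not something confirmed in this thread.

```python
import re

# Error line copied from the boot log in this issue.
ERROR_LINE = (
    "ERROR: 0x0000000000000002 seL4_InvalidCapability  "
    "invocation idx: 0x00000016.0x00000000"
)

# Pattern fitted to the line above; other monitor versions may print differently.
PATTERN = re.compile(
    r"ERROR:\s+(0x[0-9a-fA-F]+)\s+(\w+)\s+"
    r"invocation idx:\s+(0x[0-9a-fA-F]+)\.(0x[0-9a-fA-F]+)"
)


def parse_monitor_error(line):
    """Return the error fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    code, name, index, sub_index = match.groups()
    return {
        "code": int(code, 16),
        "name": name,
        "index": int(index, 16),          # assumed: position in the invocation list
        "sub_index": int(sub_index, 16),  # assumed: index within a repeated invocation
    }


info = parse_monitor_error(ERROR_LINE)
print(info)
```

The index value (0x16) is the number to look for among the invocations listed in report.txt.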

There's a couple things that would be useful to know:

  • Is it certain that there is no DMA happening while the system is initialising? Given that the DMA engine can access the region of memory where the kernel and the Microkit loader and monitor live, it could be corrupting their data.
  • Is this the smallest possible reproducible example? If you remove these DMA memory regions, does everything work?
  • Are you able to send me the exact image that is failing, so I can see whether I get the error on the ZCU102 I'm using to try to reproduce it?

We also use microkit-1.4.1, I don't know what specifically you mean with latest commits on zcu102 though.

I mean that I tested the latest commit of Microkit and version 1.4.1. I wasn't sure what version you were using.

@Vincer239
Author

Thank you for testing it out and your patience.

To answer your questions:

  • I am certain that DMA is not running or doing anything during system initialization. All DMA controls are mapped to the AXI Lite interface and will only be written after initialization. All PL parameters are initialized with 0s, disabling the DMA.
  • I have not tested the system without these DMA memory regions, I can certainly try it. However, I am not in the office for the rest of the week so I can't test with hardware... I can give you more feedback concerning this on Monday.
  • I can append both elfs and the image here, no problem. As said, I can do that on Monday.

We on our side have been thinking that it might be an issue with the boot process of our hardware?

  • We do use slightly different hardware (your ZCU102, which is a ZU9EG under the hood, and our ZU4EV)
  • We use a patched U-Boot which fixes the EL2 problem (this was an issue in the docs some time ago). We build and patch it with our Nix flake, so I guess we are using the wrong U-Boot for the board.
  • FSBL and bitstream generation is done via Vivado and Vitis for our board.
  • We load the microkit image via tftp into the flash and let uboot extract and boot it

I do not know if this hardware info is of any use to you. I figured it might make a difference.
In any case, I will update you on Monday once I have done some more hardware tests regarding your questions.

@Vincer239
Author

As promised, I did some further tests.
First off, here is a zip containing the devboard configuration I used today. It contains the bif, the corresponding binary and the hardware hand-off description file generated by Vivado. There is no DMA or AXI Lite core on the PL, as it is just a blank Vivado project compiled and flashed.

Concerning Microkit:

  • I rechecked that we have 1.4.1 (which we do)
  • I simplified the code as far as possible and the error persisted.

This is the image and the elf, as well as the report.txt and the debug console printout.

@Ivan-Velickovic
Collaborator

Great, thanks for sending all of that.

It is late in Sydney, so I won't be able to do any proper debugging until tomorrow, but I quickly ran the provided image. Attached is the serial output (no error, unfortunately).

ivan_zcu102_issue263_log.txt

My next step will be to try to reproduce your setup as closely as possible. This might take me a couple of days; I will update you when I have more info.

Regarding the ZU9EG vs ZU4EV, I believe they both use Cortex-A53s, but do they have the exact same amount of main memory, and is the location of the main memory the same?

On the ZCU102 main memory is at 0x0..0x80000000 and 0x800000000..0x880000000.

@Vincer239
Author

Yes, they both have Cortex-A53s and as far as I can see, both have main memory from 0x0..0x8000_0000 and 0x8_0000_0000..0x8_8000_0000 (or at least if you believe the xparameters.h compiled by Vitis)

I was wondering.. You are loading both sel4-image and local-kernel to 0x4000_0000. What does that do, other than overwriting the previous code?

@Ivan-Velickovic
Collaborator

Ivan-Velickovic commented Jan 24, 2025

I was wondering.. You are loading both sel4-image and local-kernel to 0x4000_0000. What does that do, other than overwriting the previous code?

Just a quirk of our boot scripts, it's not intentional. Those names point to the same image.

Right now I only have some more questions; I haven't had a chance to try further to reproduce it, sorry:

  • If you remove the dma-out region from the simplified system that you sent me I assume everything works? Just want to double check.
  • Are there any reserved regions within main memory, something like TrustZone or anything else that's running at a higher-privilege level than seL4? EDIT: looks like there is in boot_image.bif. Do you know where in memory it might be placed?
  • Through the process of configuring the hardware, is there any artifact that shows me the memory map of the platform (I haven't done FPGA stuff before so my knowledge is lacking in this area)? Would it be possible for you to send me this file ../bootloader_export/u-boot/system_devboard.dtb as well which is mentioned in boot_image.bif?

@Vincer239
Author

Vincer239 commented Jan 24, 2025

If you remove the dma-out region from the simplified system that you sent me I assume everything works? Just want to double check.

That is correct. And somehow this is related to the addresses themselves. 0x7000_0000 and 0x4000_0000 have this problem, whereas 0x2000_0000 works (it is the address I am currently working with). I didn't try to find any further addresses.

Do you know where in memory it might be placed?

I've been looking, but I could not find the exact location.
With xsct, we run targets -set -nocase -filter {name =~ "*A53 #0*"} followed by dow $atf_file, so it is downloaded directly to the core itself. I have no idea how that translates to addresses.
This raises a counter question: Do you not use an Arm Trusted Firmware and if you do, at what exception level do you run it?

Through the process of configuring the hardware, is there any artifact that shows me the memory map of the platform

Yes, the device tree should tell you, where you find what interface (and memory). I was told by my colleague that the device tree generated by Vitis is basically blank and he patches all the information into it. Anyway, here it is.

@Ivan-Velickovic
Collaborator

Do you not use an Arm Trusted Firmware and if you do, at what exception level do you run it?

No, I don't think we have it running on our ZCU102. Unless there's something specific needed from it, I don't think there's any real benefit to running it in addition to seL4.

Yes, the device tree should tell you, where you find what interface (and memory). I was told by my colleague that the device tree generated by Vitis is basically blank and he patches all the information into it. Anyway, here it is.

Thanks, I think the device tree confirms my suspicions. The memory node looks like this:

        memory@0 {
                device_type = "memory";
                reg = <0x00 0x00 0x00 0x7ff00000 0x08 0x00 0x00 0x80000000>;
        };

whereas seL4 gets the main memory regions from a memory node that looks like this:

	memory@0 {
		device_type = "memory";
		reg = < 0x00 0x00 0x00 0x80000000 0x08 0x00 0x00 0x80000000 >;
	};
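
To make the difference between the two nodes concrete, the reg cells can be decoded into address ranges. The Python sketch below assumes #address-cells = <2> and #size-cells = <2> for the parent node; the thread does not show those properties, but the assumption is consistent with the two ranges the kernel prints at boot. decode_reg is a helper name invented for this note.

```python
def decode_reg(cells, address_cells=2, size_cells=2):
    """Decode a flat list of 32-bit reg cells into (start, end) ranges."""
    stride = address_cells + size_cells
    if len(cells) % stride != 0:
        raise ValueError("reg length is not a multiple of address+size cells")

    def join(words):
        # Cells are big-endian 32-bit words: the first cell is the high word.
        value = 0
        for word in words:
            value = (value << 32) | word
        return value

    ranges = []
    for i in range(0, len(cells), stride):
        base = join(cells[i:i + address_cells])
        size = join(cells[i + address_cells:i + stride])
        ranges.append((base, base + size))
    return ranges


# reg values copied from the two memory nodes quoted in this comment.
BOARD_REG = [0x00, 0x00, 0x00, 0x7FF00000, 0x08, 0x00, 0x00, 0x80000000]
SEL4_REG = [0x00, 0x00, 0x00, 0x80000000, 0x08, 0x00, 0x00, 0x80000000]

board = decode_reg(BOARD_REG)
sel4 = decode_reg(SEL4_REG)

# Both nodes use the same base addresses here, so comparing range ends is enough.
for (b_lo, b_hi), (s_lo, s_hi) in zip(board, sel4):
    if s_hi > b_hi:
        extra_kib = (s_hi - b_hi) // 1024
        print(f"seL4 additionally treats {b_hi:#x}..{s_hi:#x} as RAM ({extra_kib} KiB)")
```

Under those assumptions, the only difference is the top 1 MiB of the low bank (0x7ff00000..0x80000000), which the board's device tree leaves out and the seL4 device tree includes.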

0x7000_0000 and 0x4000_0000 have this problem,

I would expect 0x7000_0000 but I'm not sure why 0x4000_0000 would have problems. In boot_image.bif there seems to be some PMU firmware being loaded but I can't tell if it's loaded into main memory or not.

For now, do you want to try changing the memory node to match the device tree you sent me and check whether the original system you opened the issue with works now? Here's the patch to apply to the kernel:

diff --git a/tools/dts/zynqmp.dts b/tools/dts/zynqmp.dts
index 2e9cc89ab..75d3a30bc 100644
--- a/tools/dts/zynqmp.dts
+++ b/tools/dts/zynqmp.dts
@@ -1124,7 +1124,7 @@

        memory@0 {
                device_type = "memory";
-               reg = < 0x00 0x00 0x00 0x80000000 0x08 0x00 0x00 0x80000000 >;
+               reg = <0x00 0x00 0x00 0x7ff00000 0x08 0x00 0x00 0x80000000>;
        };

        gpio-keys {
