Added docs

ros-realtime · Mar 26, 2022 · c43bcda
1 parent c71d228

Showing 3 changed files with 190 additions and 111 deletions.
diff --git a/README.md b/README.md
@@ -12,7 +12,6 @@ This is a custom image builder for the Raspberry Pi 4. Some features:
     Raspberry Pi, so I had to resort to this method.
   - The default customization in this repo is made for ROS2 with `PREEMPT_RT`
     applied.
-  - More information about how this works is below.
 - With two stages of setup scripts, executing in lock step both inside and
   outside the chroot, we can cross compile code (via something like CMake
   toolchain) on the host and copy it into the chroot for making the final
@@ -63,6 +62,8 @@ Thus, you'll need a Linux machine with root and the following tools installed:
 `cut`, `grep`, `parted`, `pv`, `rsync`, `truncate`, `wget`, `systemd-nspawn`,
 and `qemu-aarch64-static`.
 
+You will also need `python3`.
+
 To build the `focal-rt-ros2` image, you'll also need: `zip`.
 
 For Ubuntu, you can simply run:
@@ -82,114 +83,7 @@ You can then `dd` this to a SD card.
 
 You can see a demo of this in [CI](https://github.com/shuhaowu/ros-realtime-rpi4-image/actions). CI builds quite slowly. On my computer this whole process only takes a few minutes.
 
-### Debugging build failures if you changed stuff
-
-_This is covered in [How it works](#how-it-works)._
-
-How it works
-------------
-
-_Note: the best way to understand the details is to start reading from
-`builder/main.sh` and tracing every function call. Most functions are defined
-in the order they are called in `builder/core.sh` and thus the code should read
-somewhat linearly._
-
-The builder defined in `builder/` is relatively generic and can theoretically
-be used for a number of situations where a vendor provides you with a `.img`
-file and you need to customize it. It follows these major steps:
-
-1. Reads a custom `vars.sh` to populate all the variables needed to build.
-2. Download the (ubuntu) image from the vendor.
-3. Extract the image via a custom bash function defined in `vars.sh` to handle
-   different compression algorithms.
-4. Increase the size of the image and resize the file system, as the vendor
-   image may be too small to install a large number of images.
-5. Setup a loop device.
-6. Mount the file systems in the loop device so they're accessible by the host.
-7. Copy things like resolv.conf and qemu-user-static into the mounted FS.
-8. Copy the rootfs overlay defined via `ROOTFS_OVERLAY` into the mounted FS.
-  - If you have setup scripts running inside the chroot, or other files
-    supporting the setup, put it here and copy it in.
-9. Run the _host-side phase1 setup script_ (`setup_script_phase1_outside_chroot`).
-  - Variables `export`ed by `main.sh` and `vars.sh` are usable in this scripts,
-    as well as the other user-defined scripts below.
-  - For the RT setup, the RT kernel is downloaded from Github in this step.
-10. Run the _chroot-side phase1 setup script_ (`setup_script_phase1_inside_chroot`).
-    This runs the script inside the mounted FS via a chroot in the target
-    architecture via qemu-user-static.
-  - This is where you install things like the ROS2 packages.
-11. Run the _host-side phase2 setup script_ (`setup_script_phase2_outside_chroot`).
-  - If you want to cross compile, this is your chance, as the rootfs is readily
-    available and mounted. In CMake, you can set your `CMAKE_FIND_ROOT_PATH` and
-    `CMAKE_SYSROOT` to the path to the mounted FS.
-12. Run the _chroot-side phase2 setup script_ (`setup_script_phase2_inside_chroot`).
-    This also runs the script inside the mounted FS.
-  - Usually this is when you remove any setup files you copied in the chroot
-    and perform some cleanup. resolv.conf and other files setup by the builder
-    will be cleaned by the builder, tho
-13. Cleanup the chroot by removing resolv.conf and qemu-user-static.
-14. Unmount everything and get rid of the loop back device.
-
-To customize this process, look at the `vars.sh` file(s) in this repo. There
-should be extensive comments there.
-
-In the future, maybe it's better to figure out how Canonical generate their
-official Ubuntu images. However, I can't find how they built their images when
-I looked. It is likely that a number of steps here will be needed anyway, so
-the existing structure is not that bad of an idea.
-
-### How interrupt and resume works
-
-Each step defined in the code is resumable (see `main.sh`). If they fail, the
-builder won't clean everything up to give the developer a chance to debug and
-possibly fix things manually before continuing. The workflow I take is as
-follows:
-
-1. Edit the build scripts. Run them.
-2. Encounter an error.
-3. Manually go into the chroot via systemd-nspawn.
-4. Figure out the right commands to run and change the script.
-5. Change `cache/session.txt` to make sure I can resume from the right spot (by removing and adding steps into the file, see below).
-6. Run the builder again (just `make`, or `builder/main.sh ...`).
-
-While this is not perfect, it still saves a lot of time, as you don't always
-have to restart from the beginning when encountering an error. The way this
-works is via two files saved in `cache`:
-
-- `cache/session.txt`: contains a list of steps executed, one per line. You can
-  freely change this file if you know what you're doing to selectively
-  execute/skip steps when working with this system. See `builder/main.sh` for
-  the steps (`run_step <step_name>`).
-  - If this file is removed, then the builder will restart from scratch.
-- `cache/session-loop-device.txt`: This saves the loop device the chroot is
-  mounted to, since the loop devices may be different each time we run this.
-
-There's also a `PAUSE_AFTER` variable that can be set in `vars.sh` to instruct
-the builder to stop after a particular step. This allow you to do some
-interactive experimentation, which also speeds things up.
-
-To get into the chroot to experiment, run the command:
-
-```
-$ sudo systemd-nspawn -D /tmp/rpi4-image-build/ bash # the path is whatever CHROOT_PATH is
-```
-
-### Interrupt and resume if you change the setup scripts for inside the chroot
-
-Sometimes you will change the scripts running inside the chroot and then resume.
-You'll find this doesn't work, because the script you're changing is not copied
-into the chroot. To get around this problem and resume, simply delete the
-step `copy_files_to_chroot` in `cache/session.txt` and rerun the builder.  The
-builder will then copy all the files into the chroot again and continue from
-where it failed.
-
-### How to reset your host system if something horribly goes wrong
-
-- Try running `./scripts/cleanup.sh`
-- Try running the commands in `umount_everything` manually (see
-  `builder/core.sh`).
-  - Can do this by adding every step in `builder/main.sh` into
-  `cache/session.txt` except the umount everything step.
-  - TODO: I should create simpler command to run this step only.
-- If that doesn't work, try restarting your computer :(.
+Customization guide
+-------------------
 
+See [`docs/BuilderDesignAndUsageGuide.md`](docs/BuilderDesignAndUsageGuide.md).
diff --git a/docs/BuilderDesignAndUsageGuide.md b/docs/BuilderDesignAndUsageGuide.md
@@ -0,0 +1,178 @@
+Image builder design and usage guide
+====================================
+
+Before reading this, please make sure you have read the project [README](../README.md).
+
+Overview
+--------
+
+The image builder's main workflow consists of the following steps (see
+`Builder.build()` in the code for the most up-to-date details):
+
+- Download and extract the vendor image.
+- Mount the image via loop devices locally.
+- Set up a "container" via `systemd-nspawn`. For images with a foreign
+  architecture (such as aarch64), qemu-user-static is copied into the mounted
+  image.
+- Copy configuration files into the image.
+- Run any setup scripts both outside and inside the container to customize
+  the image (such as installing packages or cross compiling).
+- Clean up the temporary files copied into the container and unmount
+  everything.
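As an illustration of the mounting step, here is a minimal sketch that only builds the commands and does not run them. It is not the builder's actual code. It assumes the image is attached with `losetup --find --show --partscan`, so that partition N appears as `<loop device>pN`. It also assumes that `image_mounts` (see [`config.ini`](#configini)) lists mount points in partition order. The function names and the `/boot/firmware,/` value are hypothetical.

```python
from typing import List


def attach_command(image_path: str) -> List[str]:
    # --find picks a free loop device, --show prints its name, and --partscan
    # makes the kernel create a /dev/loopNpM node for each partition.
    return ["losetup", "--find", "--show", "--partscan", image_path]


def mount_commands(loop_device: str, image_mounts: str, chroot_path: str) -> List[List[str]]:
    """Return one `mount` command per partition, in the order they should run."""
    mount_points = [m.strip() for m in image_mounts.split(",")]
    # Number the partitions (1-based) before reversing so the numbers still
    # match the partition table; "/" is usually last but must be mounted first.
    numbered = list(enumerate(mount_points, start=1))
    commands = []
    for number, mount_point in reversed(numbered):
        target = chroot_path.rstrip("/") + mount_point
        commands.append(["mount", f"{loop_device}p{number}", target])
    return commands


if __name__ == "__main__":
    for cmd in mount_commands("/dev/loop0", "/boot/firmware,/", "/tmp/rpi4-image-build"):
        print(" ".join(cmd))
```

With the example values, `/dev/loop0p2` is mounted at the root of `CHROOT_PATH` before `/dev/loop0p1` is mounted at `boot/firmware` inside it. That order is why the builder reverses the list.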
+
+This process is generic for most single-board computers (SBCs) for which the
+vendor provides a flashable image. The image builder lets you configure this
+process by writing a set of well-defined [build profile
+files](#build-profile-format) without having to worry about bookkeeping
+tasks like mounting the image. In principle, this should allow the builder to
+build images not just for the Raspberry Pi, but for other SBCs as well.
+Further, the design allows build profiles to be overlaid on top of each
+other, so a common base profile can be reused by multiple image types. For
+example, we can use a common real-time Ubuntu base to build images with
+different ROS2 distros without duplicating much code.
+
+The builder also has time-saving features such as the ability to [pause and
+resume](#pause-and-resume) at each step. For example, if one of the custom
+setup scripts fails, the builder will not clean up right away. This allows the
+developer to go into the container and experiment. In my experience, this
+drastically cuts down the development time for this kind of image-building
+work. See [Tips and tricks](#tips-and-tricks) for how to debug and work on
+this setup efficiently.
+
+Build profile format
+--------------------
+
+A build profile is a directory with the following structure:
+
+- `config.ini`: Specifies the [build config](#configini).
+- `scripts/`
+  - `extract-image`: This script extracts the downloaded image. There is no
+    easy, generic way to infer how an image should be extracted, so this step
+    is parameterized as a script.
+  - `loop-device-setup`: This script runs on the host against the loop device
+    after the builder sets it up. Typical operations here are to `fsck` the
+    root file system and grow it with `resize2fs` to the full size set by
+    `image_size` in `config.ini`.
+  - `phase1-host`: Optional. This script runs on the host machine immediately
+    after the image is mounted and ready to go.
+  - `phase1-target`: Optional. This script runs in the image via
+    `systemd-nspawn` after `phase1-host`. For example, you can install
+    packages in the image by calling `apt`.
+  - `phase2-host`: Optional. This script runs on the host after
+    `phase1-target`. An example use would be to perform cross
+    compilation.
+  - `phase2-target`: Optional. This script runs in the image via
+    `systemd-nspawn` after `phase2-host`.
+- `rootfs/`: All files and directories within this directory are copied to
+  the root of the image being built. The copy happens before any of the
+  phase1/phase2 scripts are executed.
+
+A full example of this directory can be seen in [`focal-rt`](../focal-rt).
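Layout mistakes can be caught early by checking a profile directory before building. The sketch below is hypothetical: `inspect_profile` is not part of the builder. It treats `config.ini`, `extract-image` and `loop-device-setup` as required because the list above does not mark them optional. That may be too strict for a profile meant only to be overlaid on top of a base profile.

```python
from pathlib import Path
from typing import List

# Files the layout above does not mark as optional (relative to the profile).
REQUIRED = ["config.ini", "scripts/extract-image", "scripts/loop-device-setup"]
# Optional phase scripts, in the order the builder runs them.
PHASES = ["phase1-host", "phase1-target", "phase2-host", "phase2-target"]


def inspect_profile(profile_dir: str) -> List[str]:
    """Raise if required files are missing; return the phase scripts present."""
    root = Path(profile_dir)
    missing = [rel for rel in REQUIRED if not (root / rel).is_file()]
    if missing:
        raise FileNotFoundError(f"{root}: missing {', '.join(missing)}")
    return [name for name in PHASES if (root / "scripts" / name).is_file()]
```

For example, `inspect_profile("focal-rt")` would list which phase scripts the base profile provides.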
+
+### `config.ini`
+
+This file specifies the build configuration in the `[build]` section, which
+contains the following variables:
+
+- `debug`: boolean. If true, verbose logging will be turned on.
+- `image_url`: string. The URL of the image to be downloaded by the builder.
+- `image_mounts`: string. A comma-separated list of mount points, one per
+  partition in partition order. The builder mounts them in reverse order,
+  because the `/` mount point is usually the last partition and needs to be
+  mounted first so the other mount points can be mounted inside it.
+- `image_size`: string. This is passed to `truncate --size=<image_size>
+  <path to .img file>` and is needed because the vendor image is sometimes
+  too small.
+- `output_filename`: string. The output image file name (not path, just the
+  file name).
+
+The file can also contain an `[env]` section, which lists environment
+variables that are passed to [all scripts](#phase-1-and-phase-2-scripts)
+called during the build process.
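A minimal sketch of reading such a file with Python's standard `configparser` follows. It is an illustration, not the builder's own parser. `BuildConfig`, `parse_config` and every value in `SAMPLE` (including the `ROS_DISTRO` variable) are hypothetical.

```python
import configparser
from dataclasses import dataclass, field
from typing import Dict, List

SAMPLE = """
[build]
debug = true
image_url = https://example.com/vendor-image.img.xz
image_mounts = /boot/firmware,/
image_size = 4G
output_filename = custom.img

[env]
ROS_DISTRO = galactic
"""


@dataclass
class BuildConfig:
    debug: bool
    image_url: str
    image_mounts: List[str]  # in the order written in config.ini
    image_size: str          # passed verbatim to `truncate --size=...`
    output_filename: str
    env: Dict[str, str] = field(default_factory=dict)


def parse_config(text: str) -> BuildConfig:
    parser = configparser.ConfigParser()
    # configparser lower-cases option names by default; keep them as written so
    # [env] variable names such as ROS_DISTRO stay upper-case.
    parser.optionxform = str
    parser.read_string(text)
    build = parser["build"]
    return BuildConfig(
        debug=build.getboolean("debug", fallback=False),
        image_url=build["image_url"],
        image_mounts=[m.strip() for m in build["image_mounts"].split(",")],
        image_size=build["image_size"],
        output_filename=build["output_filename"],
        env=dict(parser["env"]) if parser.has_section("env") else {},
    )


if __name__ == "__main__":
    cfg = parse_config(SAMPLE)
    print(cfg.image_mounts[::-1])  # mount order: ['/', '/boot/firmware']
```

Setting `optionxform` matters for `[env]`, because environment variable names are case-sensitive.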
+
+### Phase 1 and Phase 2 scripts
+
+As noted above, there are four phase scripts. They are called with the
+environment variables from the `[env]` section of `config.ini` plus a few
+additional variables:
+
+- `CHROOT_PATH`: the path of the mounted image, as seen from the host.
+- `OUT_DIR`: the directory of the output file.
+- `CACHE_DIR`: a place to put cached data.
+
+Typically, the scripts may be structured as follows:
+
+1. `phase1-host`: downloads additional dependencies from the internet and
+   copies them into the image via `CHROOT_PATH`.
+2. `phase1-target`: installs dependencies within the image.
+3. `phase2-host`: performs cross compilation with `$CHROOT_PATH` as the sysroot.
+4. `phase2-target`: performs any additional setup needed after the
+   cross-compiled code has been installed.
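The sketch below shows how the variables above could reach host-side and target-side scripts. It builds commands without running them and is not the builder's actual code. It uses `systemd-nspawn`'s `--setenv=NAME=VALUE` option for target scripts. The function names and the in-image script path `/setup/phase1-target` are hypothetical, since this guide does not say where target scripts live inside the image.

```python
import os
from typing import Dict, List, Mapping, Tuple


def builder_env(profile_env: Mapping[str, str], chroot_path: str,
                out_dir: str, cache_dir: str) -> Dict[str, str]:
    """Variables every phase script receives: the merged [env] section plus the
    three builder-provided paths, which take precedence."""
    env = dict(profile_env)
    env.update(CHROOT_PATH=chroot_path, OUT_DIR=out_dir, CACHE_DIR=cache_dir)
    return env


def phase_invocation(script_name: str, script_path: str, chroot_path: str,
                     env: Mapping[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Return (command, process environment) for one phase script.

    For "*-host" scripts, script_path is a host path. For "*-target" scripts it
    is the script's path inside the image (a hypothetical location).
    """
    if script_name.endswith("-host"):
        # Host scripts inherit the host environment plus the builder variables.
        return [script_path], {**os.environ, **env}
    # Target scripts run inside the image; --setenv forwards only the builder
    # variables, not the host environment.
    setenv = [f"--setenv={key}={value}" for key, value in sorted(env.items())]
    return ["systemd-nspawn", "-D", chroot_path, *setenv, script_path], dict(os.environ)
```

A caller would then run `subprocess.run(cmd, env=run_env, check=True)` as root. If a script fails, the builder stops there, and you can [resume](#pause-and-resume) from that step.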
+
+### Overlaying multiple build profiles
+
+The builder is designed to layer build profiles: you give it a list of
+profile directories, and it will:
+
+- Merge the variables inside `config.ini`, where variables from later profiles
+  override the ones from earlier profiles.
+- Copy `rootfs` files from each build profile in sequential order. Later
+  profiles can override files from earlier ones.
+- Run the phase1/phase2 scripts in sequential order. For example, when running
+  the phase1 host scripts, the builder will run the script from the first
+  profile, then the second, then the third, and so on.
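The three merge rules above can be sketched with the Python standard library. This is an illustration only; the function names are hypothetical. `ConfigParser.read()` processes files in the given order, so a key set again by a later profile replaces the earlier value. The rootfs copy ignores details a real root file system needs, such as ownership and symlink handling. The README lists `rsync` among the required tools, which the builder may use instead.

```python
import configparser
import shutil
from pathlib import Path
from typing import List, Sequence


def merged_config(profiles: Sequence[str]) -> configparser.ConfigParser:
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep [env] variable names case-sensitive
    # read() processes the files in order; a later profile's value for the same
    # section and key replaces the earlier one. Missing files are skipped.
    parser.read([str(Path(p) / "config.ini") for p in profiles])
    return parser


def copy_rootfs(profiles: Sequence[str], chroot_path: str) -> None:
    for p in profiles:
        src = Path(p) / "rootfs"
        if src.is_dir():
            # dirs_exist_ok (Python 3.8+) merges into the existing tree, so files
            # from later profiles overwrite files from earlier ones.
            shutil.copytree(src, chroot_path, dirs_exist_ok=True)


def scripts_for_phase(profiles: Sequence[str], phase: str) -> List[str]:
    """Collect one phase's scripts across profiles, first profile first."""
    return [str(Path(p) / "scripts" / phase) for p in profiles
            if (Path(p) / "scripts" / phase).is_file()]
```

For example, `scripts_for_phase(["focal-rt", "my-ros2-profile"], "phase1-host")` would return the base profile's script before the hypothetical `my-ros2-profile` one.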
+
+Pause and resume
+----------------
+
+The builder can pause after each step defined in the code (see
+`Builder.build()`); to stop after a particular step on purpose, set
+`pause_after` in `config.ini` to that step's name (see the commented-out
+example in [`focal-rt/config.ini`](../focal-rt/config.ini)). If a step fails,
+the builder won't clean anything up, which gives the developer a chance to
+debug and fix things manually before continuing. This works with two files:
+
+- `cache/session.txt`: contains a list of steps executed, one per line. You can
+  freely change this file if you know what you're doing to selectively
+  execute/skip steps when working with this system.
+  - If this file is removed, the builder will restart from scratch.
+- `cache/session-loop-device.txt`: saves the loop device the image is attached
+  to, since the loop device may be different on each run.
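The core of this mechanism fits in a short loop. The sketch below is a simplified illustration, not `Builder.build()` itself. A step is appended to the session file only after it succeeds, completed steps are skipped on the next run, and `pause_after` stops the run after the named step. Apart from `cleanup_chroot` (which appears in `focal-rt/config.ini`), the step names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import Callable, List, Optional, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def run_steps(steps: Sequence[Step], session_file: str,
              pause_after: Optional[str] = None) -> List[str]:
    """Run the steps not yet recorded in session_file; return the ones run now."""
    session = Path(session_file)
    done = set()
    if session.exists():
        done = {line.strip() for line in session.read_text().splitlines() if line.strip()}
    ran = []
    for name, func in steps:
        if name in done:
            continue  # completed in an earlier run
        # If func() raises, nothing is recorded or cleaned up, so the next run
        # retries this step after the developer has fixed the problem.
        func()
        with session.open("a") as f:
            f.write(name + "\n")
        ran.append(name)
        if name == pause_after:
            break  # stop here and leave everything mounted for experimentation
    return ran
```

For example, with steps `download_image`, `setup_loop_device`, `cleanup_chroot` and `umount_everything`, a run with `pause_after="cleanup_chroot"` stops after the third step. The next run executes only `umount_everything`.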
+
+Tips and tricks
+---------------
+
+If you have ever used tools like Ansible or Packer, you know that fixing a
+simple typo often means waiting for the entire script to run again from the
+beginning, taking up valuable time. With the ability to pause and resume, the
+workflow I use with this repo is as follows:
+
+1. Edit the build scripts. Run them.
+2. Encounter an error.
+3. Manually go into the chroot via systemd-nspawn.
+4. Figure out the right commands to run and change the script.
+5. Change `cache/session.txt` to make sure I can resume from the right spot (by
+   removing and adding steps in the file, see [Pause and
+   resume](#pause-and-resume)).
+6. Run the builder again.
+7. When all the problems are worked out, run the builder from the beginning to
+   ensure it works for a fresh build.
+
+While this is not a perfect process, it can save a lot of time.
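Step 5 can also be scripted instead of done by hand in an editor. This sketch assumes only the one-step-per-line format of `cache/session.txt` described in [Pause and resume](#pause-and-resume). The helper names `forget_steps` and `mark_done` are hypothetical and not part of the builder.

```python
from pathlib import Path
from typing import Iterable, List


def _load(path: Path) -> List[str]:
    if not path.exists():
        return []
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


def _save(path: Path, steps: List[str]) -> None:
    path.write_text("".join(f"{step}\n" for step in steps))


def forget_steps(session_file: str, steps: Iterable[str]) -> None:
    """Remove steps from session.txt so the next run executes them again."""
    path = Path(session_file)
    drop = set(steps)
    _save(path, [s for s in _load(path) if s not in drop])


def mark_done(session_file: str, steps: Iterable[str]) -> None:
    """Append steps to session.txt (once each) so the next run skips them."""
    path = Path(session_file)
    current = _load(path)
    _save(path, current + [s for s in dict.fromkeys(steps) if s not in current])
```

For example, after editing files under `rootfs/`, you could forget only the (hypothetically named) copy step. The builder then copies the files again and continues from where it failed.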
+
+To get into the chroot to experiment, run the command:
+
+```
+$ sudo systemd-nspawn -D /tmp/rpi4-image-build/ bash # the path is whatever CHROOT_PATH is
+```
+
+At some point, we can refactor the command above into the builder directly.
+
+### How to reset your host system if something horribly goes wrong
+
+THIS IS OUTDATED and the functionality needs to be restored.
+
+- Try running `./scripts/cleanup.sh`
+- Try running the commands in `umount_everything` manually (see
+  `builder/core.sh`).
+  - You can do this by adding every step in `builder/main.sh` into
+    `cache/session.txt` except the unmount-everything step.
+  - TODO: I should create a simpler command to run this step only.
+- If that doesn't work, try restarting your computer :(.
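Until that functionality is restored, a manual teardown roughly means unmounting everything under `CHROOT_PATH` and then detaching the loop device. The sketch below only builds the commands. It assumes `cache/session-loop-device.txt` holds just the device path (for example `/dev/loop0`), and the function name `teardown_commands` is hypothetical. `umount --recursive` and `losetup --detach` are standard util-linux options, and both commands need root privileges.

```python
from pathlib import Path
from typing import List


def teardown_commands(chroot_path: str, loop_device_file: str) -> List[List[str]]:
    """Commands to unmount the image and release its loop device."""
    # Unmount the root mount and everything mounted beneath it.
    commands = [["umount", "--recursive", chroot_path]]
    loop_file = Path(loop_device_file)
    if loop_file.exists():
        device = loop_file.read_text().strip()
        if device:
            commands.append(["losetup", "--detach", device])
    return commands
```

Running each command with `sudo` and `check=False` lets the second command still run if the first one reports that nothing was mounted.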
diff --git a/focal-rt/config.ini b/focal-rt/config.ini
@@ -17,9 +17,16 @@ image_size = 4G
 # The filename of the output image
 output_filename = ubuntu-20.04.4-rt-arm64+raspi.img
 
+# TODO: this shouldn't really be a part of the build configuration, because it
+# is more like a host-level configuration. Instead of putting it here, it should
+# be passed in as an argument to the builder. This refactor should be done when
+# build.py becomes a real command-line utility.
+#
 # The host path to the qemu-user-static binary required for the above image
 qemu_user_static_path = /usr/bin/qemu-aarch64-static
 
+# TODO: same as qemu_user_static_path. This should be specified via
+# command-line arguments as opposed to build configuration.
 # Uncomment this if you want to pause the builder after a particular stage to
 # debug/experiment.
 # pause_after = cleanup_chroot
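The TODOs above suggest moving these two host-level settings to the command line of `build.py`. Here is a minimal `argparse` sketch of that idea. It assumes the profile directories are passed as positional arguments. The flag names are hypothetical and simply mirror the config keys; `argparse` maps `--qemu-user-static-path` to `args.qemu_user_static_path`.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Build a customized SBC image.")
    parser.add_argument("profiles", nargs="+",
                        help="build profile directories, base profile first")
    # Host-level settings that currently live in config.ini.
    parser.add_argument("--qemu-user-static-path",
                        default="/usr/bin/qemu-aarch64-static")
    parser.add_argument("--pause-after", default=None, metavar="STEP",
                        help="stop after this step (e.g. cleanup_chroot)")
    return parser.parse_args(argv)
```

Keeping these out of the profile means the same profile builds unchanged on hosts where the qemu binary lives elsewhere.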