🚀 3.5 sec to Linux userspace code
Motivation
A while ago, the SolarCamPi project, an off-grid solar-powered WiFi camera, was built.
In this project, a Raspberry Pi Zero 2 W boots into Linux, takes a picture, establishes WiFi connectivity and is shut down again (to save power). This repeats every couple of minutes to always deliver a fresh image to a cloud service.
Each second the Pi Zero is powered up uses valuable electricity, which is a scarce resource in a solar-powered device (at least in Western European winters…).
The user space application (server connection, picture upload, etc.) was already optimized as far as possible.
The electronics setup was also specifically designed to use as little power as possible while asleep.
There are two possible ways to reduce total energy consumption further:
- decrease power consumption / current
- decrease time spent running
However, in some situations a balance needs to be found between the two. For example, disabling CPU turbo just to reduce the current draw is a bad choice, because the resulting extra run time uses more energy than getting the job done quickly and shutting off. We want the smallest possible area under the graph (of current vs. time).
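The "area under the graph" can be estimated from sampled current readings. The following is a minimal sketch, assuming the samples are available as two whitespace-separated columns (time in seconds, current in amperes). The column layout, the function name `charge_as` and the two example profiles are illustrative assumptions, not measurements from this project:

```shell
#!/bin/sh
# charge_as: integrate current over time with the trapezoidal rule.
# Reads "time_s current_A" pairs on stdin and prints the charge in As.
charge_as() {
    awk 'NR > 1 { q += (prev_i + $2) / 2 * ($1 - prev_t) }
         { prev_t = $1; prev_i = $2 }
         END { printf "%.3f\n", q }'
}

# Hypothetical profiles: 10 s at 100 mA versus 6 s at 130 mA.
slow=$(printf '0 0.10\n10 0.10\n' | charge_as)
fast=$(printf '0 0.13\n6 0.13\n' | charge_as)
echo "low current, long run:   $slow As"
echo "high current, short run: $fast As"
```

With these made-up numbers the faster run wins despite its higher current, which is the trade-off described above.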
Hardware setup
Having a short cycle time between making a change and actually seeing it run is critical when optimizing embedded boot processes. Swapping SD cards, messing with card readers and power supplies while working is distracting and annoying.
In order to avoid this, a number of useful tools exist:
Power Profiler Kit
The Power Profiler Kit II (now called PPK) can supply power to a device-under-test (DUT) and accurately measures its current draw over time. You can enable/disable the DUT, see the power consumption at any point and also see the status of 8 digital inputs! We’ll connect one of the digital inputs to a GPIO pin on the Raspberry Pi.
This way, the first action of “our application” (aka the finish line) will be to toggle the GPIO pin. We then just have to measure the time between power-up and GPIO toggle.
USB-SD-Mux
The USB-SD-Mux is a very useful tool for hardware hackers - it’s an interposer between a microSD card and a DUT with a USB-C interface. A computer can “steal” the microSD card from the DUT, rewrite its contents and then plug the microSD card back into the DUT, without ever having to touch the device.
This makes the workflow of testing changes much easier and faster by avoiding unplugging the card, plugging it into a microSD reader, flashing it, plugging the card back into the DUT, etc. It can even be used to automate the reset or power of the DUT with on-board GPIOs.
USB-UART converter
Some form of UART interface is pretty much required. These changes will break system boot, WiFi connectivity, etc. at some point and without a UART console we would be flying blind. A standard CP2102, FTDI, etc. will work well.
Measurement / Test setup
On a clean Debian 12 (bookworm) arm64 Lite image, the /boot/firmware/cmdline.txt file was modified to include init=/init.sh.
This means that the kernel will execute the script at /init.sh as the very first thing in userspace (before running systemd or anything else).
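The cmdline.txt change can also be scripted. This is a sketch only: the helper name `add_init_param` is made up for illustration, and it assumes the init path contains no `|` or `&` characters. The kernel command line has to stay on a single line, so the parameter is appended to the end of line 1:

```shell
#!/bin/sh
# add_init_param FILE INIT_PATH
# Appends "init=INIT_PATH" to the single-line kernel command line in FILE,
# unless an init= parameter is already present.
add_init_param() {
    file=$1
    init_path=$2
    if grep -Eq '(^|[[:space:]])init=' "$file"; then
        echo "init= already set in $file, leaving it unchanged" >&2
        return 0
    fi
    # Replace any trailing whitespace on line 1 with the new parameter.
    sed "1s|[[:space:]]*\$| init=$init_path|" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Example (as root on the Pi, after making a backup):
#   cp /boot/firmware/cmdline.txt /boot/firmware/cmdline.txt.bak
#   add_init_param /boot/firmware/cmdline.txt /init.sh
```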
Such an init.sh script might look like this:
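#!/bin/bash
# Pulse GPIO4 so the power profiler can see when userspace starts.
gpioset 0 4=0
sleep 1
gpioset 0 4=1
sleep 1
gpioset 0 4=0
# Hand over to the real init (systemd) as PID 1.
exec /sbin/init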
which will toggle GPIO4 and then resume normal boot by replacing itself with /sbin/init (aka systemd).
In this screenshot from Nordic’s Power Profiler software, you can see the current consumption of the Raspberry Pi (at 5V) while booting.
After about 12 seconds, digital input 0 goes low, showing that our init.sh was executed.
In doing so, a total charge of 1.90 coulombs (coulombs and ampere-seconds are equivalent) was used.
Calculating 1.9As * 5.0V comes out to 9.5Ws of energy usage for this boot process.
For reference, a single AA alkaline battery can deliver about 13500 Ws of energy.
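To put these two numbers in relation, the arithmetic can be sketched as below. The helper names are illustrative, and the boot count is a rough upper bound that ignores sleep current, conversion losses and everything the camera does after reaching userspace:

```shell
#!/bin/sh
# energy_ws CHARGE_AS VOLTAGE_V: energy in watt-seconds (E = Q * U).
energy_ws() {
    awk -v q="$1" -v u="$2" 'BEGIN { printf "%.2f\n", q * u }'
}

# boots_per_budget BUDGET_WS ENERGY_PER_BOOT_WS: whole boots that fit the budget.
boots_per_budget() {
    awk -v b="$1" -v e="$2" 'BEGIN { printf "%d\n", int(b / e) }'
}

per_boot=$(energy_ws 1.90 5.0)   # measured charge and supply voltage
echo "energy per boot: $per_boot Ws"
echo "boots per AA cell: $(boots_per_budget 13500 "$per_boot")"
```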
Reducing current
Let’s get the easy part out of the way first and reduce the operating current as much as possible.
Disabling HDMI
We can disable the HDMI encoder entirely. Disabling the GPU is not possible, because we need it to encode our camera data. If your application doesn’t require camera/GPU support, try disabling the GPU entirely.
This reduces the current consumption from 136.7mA down to 122.6mA (over 10%!).
Relevant config.txt parameters:
# disable HDMI (saves power)
dtoverlay=vc4-kms-v3d,nohdmi
max_framebuffers=1
disable_fw_kms_setup=1
disable_overscan=1
# disable composite video output
enable_tvout=0
Disabling Activity LED
Just by disabling the activity LED, we can save 2mA (122.6mA down to 120.6mA).
dtparam=act_led_trigger=none
dtparam=act_led_activelow=on
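The relative savings of the two steps so far can be checked with a small helper (the name `percent_saved` is just for illustration), using the currents quoted above:

```shell
#!/bin/sh
# percent_saved BEFORE AFTER: relative reduction in percent, one decimal.
percent_saved() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

echo "HDMI off:         $(percent_saved 136.7 122.6) %"
echo "activity LED off: $(percent_saved 122.6 120.6) %"
```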
Disabling Camera LED
Repeat the same for the camera LED (if present). It will also reduce the chance of the LED reflecting back into the image.
Turbo tweaking
As mentioned before, saving current while wasting time might not be ideal.
With the changes so far, the Pi boots using 1.62As.
force_turbo=0
initial_turbo=10
arm_boost=0
Without forced turbo mode, 1.58As were used.
For some unknown reason, disabling the turbo/boost mode also inverts the default state of GPIO4 (thus I’ve switched the polarity in init.sh).
Reducing time
The ~13% reduction in current is helpful, but there’s still a long way to go.
The Pi takes 8s (while consuming ~1As) before the first line of Linux output appears on the console.
Luckily, there are a number of ways to get more info about those 8 seconds.
Debug boot
In the boot process of the Raspberry Pi family, the GPU initializes first.
It talks to the SD card and looks for a bootcode.bin file (Pi 4 and newer use an EEPROM instead).
We can modify this bootcode.bin to enable detailed UART logging:
sed -i -e "s/BOOT_UART=0/BOOT_UART=1/" /boot/firmware/bootcode.bin
Back up the original bootcode.bin first; this process is potentially destructive.
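A slightly more defensive variant of the same patch is sketched below. The function name and the `.orig` suffix are arbitrary choices for illustration, and the sketch relies on sed passing binary data through unchanged (which GNU sed does); the size check is there to catch the case where it does not:

```shell
#!/bin/sh
# enable_boot_uart FILE
# Keeps a pristine copy at FILE.orig, flips BOOT_UART=0 to BOOT_UART=1 and
# verifies that the patched file has exactly the same size as the original.
enable_boot_uart() {
    file=$1
    # Only create the backup once, so re-running never overwrites it.
    [ -e "$file.orig" ] || cp -p "$file" "$file.orig" || return 1
    # LC_ALL=C makes sed treat the binary as plain bytes.
    LC_ALL=C sed -e 's/BOOT_UART=0/BOOT_UART=1/' "$file.orig" > "$file" || return 1
    if [ "$(wc -c < "$file")" -ne "$(wc -c < "$file.orig")" ]; then
        echo "size changed, restoring backup" >&2
        cp -p "$file.orig" "$file"
        return 1
    fi
}

# Example (as root on the Pi):
#   enable_boot_uart /boot/firmware/bootcode.bin
```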
Rebooting with BOOT_UART enabled gives us loads of nice information:
Raspberry Pi Bootcode
Found SD card, config.txt = 1, start.elf = 1, recovery.elf = 0, timeout = 0
Read File: config.txt, 1322 (bytes)
Raspberry Pi Bootcode
Read File: config.txt, 1322
Read File: start.elf, 2981376 (bytes)
Read File: fixup.dat, 7303 (bytes)
MESS:00:00:01.295242:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:01.300131:0: brfs: File read: 1322 bytes
MESS:00:00:01.335680:0: HDMI0:EDID error reading EDID block 0 attempt 0
[..]
MESS:00:00:01.392537:0: HDMI0:EDID error reading EDID block 0 attempt 9
MESS:00:00:01.398632:0: HDMI0:EDID giving up on reading EDID block 0
MESS:00:00:01.406335:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:01.411272:0: gpioman: gpioman_get_pin_num: pin LEDS_PWR_OK not defined
MESS:00:00:01.918176:0: gpioman: gpioman_get_pin_num: pin LEDS_PWR_OK not defined
MESS:00:00:01.923999:0: *** Restart logging
MESS:00:00:01.927872:0: brfs: File read: 1322 bytes
MESS:00:00:01.933328:0: hdmi: HDMI0:EDID error reading EDID block 0 attempt 0
[..]
MESS:00:00:01.995436:0: hdmi: HDMI0:EDID error reading EDID block 0 attempt 9
MESS:00:00:02.002052:0: hdmi: HDMI0:EDID giving up on reading EDID block 0
MESS:00:00:02.007955:0: hdmi: HDMI0:EDID error reading EDID block 0 attempt 0
[..]
MESS:00:00:02.070610:0: hdmi: HDMI0:EDID error reading EDID block 0 attempt 9
MESS:00:00:02.077225:0: hdmi: HDMI0:EDID giving up on reading EDID block 0
MESS:00:00:02.082840:0: hdmi: HDMI:hdmi_get_state is deprecated, use hdmi_get_display_state instead
MESS:00:00:02.091586:0: HDMI0: hdmi_pixel_encoding: 162000000
MESS:00:00:02.799203:0: brfs: File read: /mfs/sd/initramfs8
MESS:00:00:02.803082:0: Loaded 'initramfs8' to 0x0 size 0xb0898e
MESS:00:00:02.821799:0: initramfs loaded to 0x1b4e7000 (size 0xb0898e)
MESS:00:00:02.836318:0: dtb_file 'bcm2710-rpi-zero-2-w.dtb'
MESS:00:00:02.840194:0: brfs: File read: 11569550 bytes
MESS:00:00:02.849171:0: brfs: File read: /mfs/sd/bcm2710-rpi-zero-2-w.dtb
MESS:00:00:02.854262:0: Loaded 'bcm2710-rpi-zero-2-w.dtb' to 0x100 size 0x8258
MESS:00:00:02.876038:0: brfs: File read: 33368 bytes
MESS:00:00:02.892755:0: brfs: File read: /mfs/sd/overlays/overlay_map.dtb
MESS:00:00:02.927145:0: brfs: File read: 5255 bytes
MESS:00:00:02.933541:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:02.937568:0: dtparam: audio=on
MESS:00:00:02.948005:0: brfs: File read: 1322 bytes
MESS:00:00:02.971952:0: brfs: File read: /mfs/sd/overlays/vc4-kms-v3d.dtbo
MESS:00:00:03.023016:0: Loaded overlay 'vc4-kms-v3d'
MESS:00:00:03.026278:0: dtparam: nohdmi=true
MESS:00:00:03.031105:0: dtparam: act_led_trigger=none
MESS:00:00:03.048180:0: dtparam: act_led_activelow=on
MESS:00:00:03.149316:0: brfs: File read: 2760 bytes
MESS:00:00:03.154502:0: brfs: File read: /mfs/sd/cmdline.txt
MESS:00:00:03.158504:0: Read command line from file 'cmdline.txt':
MESS:00:00:03.164369:0: 'console=serial0,115200 console=tty1 root=PARTUUID=26bbce6b-02 rootfstype=ext4 fsck.repair=yes rootwait cfg80211.ieee80211_regdom=DE init=/init.sh'
MESS:00:00:03.195926:0: gpioman: gpioman_get_pin_num: pin EMMC_ENABLE not defined
MESS:00:00:03.269361:0: brfs: File read: 146 bytes
MESS:00:00:03.812401:0: brfs: File read: /mfs/sd/kernel8.img
MESS:00:00:03.816343:0: Loaded 'kernel8.img' to 0x200000 size 0x8d8bd7
MESS:00:00:05.364579:0: Device tree loaded to 0x1b4de900 (size 0x8605)
MESS:00:00:05.370571:0: uart: Set PL011 baud rate to 103448.300000 Hz
MESS:00:00:05.377080:0: uart: Baud rate change done...
MESS:00:00:05.380495:0: uart: Baud rate[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
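Rather than scanning the timestamps by eye, the slow steps can be pulled out of such a log with a few lines of awk. This is a sketch; `boot_gaps` is a made-up helper name, and it assumes the `MESS:hh:mm:ss.uuuuuu:` prefix format seen in the log above:

```shell
#!/bin/sh
# boot_gaps MIN_SECONDS
# Reads a BOOT_UART log on stdin and prints every gap of at least MIN_SECONDS
# between two consecutive "MESS:hh:mm:ss.uuuuuu" timestamps, together with
# the message that ended the gap.
boot_gaps() {
    awk -F: -v min="$1" '
        $1 == "MESS" {
            t = $2 * 3600 + $3 * 60 + $4
            if (seen && t - prev >= min)
                printf "%.3f s before: %s\n", t - prev, $0
            prev = t
            seen = 1
        }'
}

# Three lines taken from the log above.
boot_gaps 0.5 <<'EOF'
MESS:00:00:03.812401:0: brfs: File read: /mfs/sd/kernel8.img
MESS:00:00:03.816343:0: Loaded 'kernel8.img' to 0x200000 size 0x8d8bd7
MESS:00:00:05.364579:0: Device tree loaded to 0x1b4de900 (size 0x8605)
EOF
```

For these three lines it reports a single gap of 1.548 s ending at the "Device tree loaded" message, i.e. the time spent loading the kernel image.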
Disabling HDMI probing
The bootloader spends a lot of time trying to auto-detect video parameters for a possibly attached HDMI monitor.
We don’t have HDMI (it’s disabled anyway, remember?), so it doesn’t make much sense to wait for an I2C response with EDID (resolution, frame rate, etc.) information.
By telling the firmware to ignore any EDID data, we can disable the probing:
# don't try to read HDMI eeprom
hdmi_blanking=2
hdmi_ignore_edid=0xa5000080
hdmi_ignore_cec_init=1
hdmi_ignore_cec=1
Disable HAT, PoE and LCD probing
The boot process will additionally try to detect I2C EEPROMs on HATs, a PoE HAT (which needs a fan) and some other things. We can safely disable those:
# all these options cause a wait for an I2C bus response, we don't need any of them, so let's disable them.
force_eeprom_read=0
disable_poe_fan=1
ignore_lcd=1
disable_touchscreen=1
disable_fw_kms_setup=1
Disable camera & display probing
Probing for an attached MIPI camera or display will also take some time. We know which camera is attached (HQ Camera, IMX477 in this case), so let’s hardcode this:
# no autodetection for anything (will wait for I2C answers)
camera_auto_detect=0
display_auto_detect=0
# load HQ camera IMX477 sensor manually
dtoverlay=imx477
Disabling initramfs
The above changes brought the (self-reported) boot time from 5.38s down to 4.75s.
We can disable the initramfs entirely by removing auto_initramfs=1.
Savings depend on the size of the initramfs of course, but this brings us down to 4.47s.
Tested, with no significant difference
Overclocking the SD peripheral to 100 MHz is often recommended online but did not create a measurable difference in boot performance.
# not recommended! data corruption risk!
dtoverlay=sdtweak,overclock_50=100
Operating the SD peripheral at such high speeds also risks data corruption (on write accesses), which is very undesirable in remote IoT devices.
Kernel load
At this point, loading the kernel is one of the slowest operations:
MESS:00:00:03.816343:0: Loaded 'kernel8.img' to 0x200000 size 0x8d8bd7
MESS:00:00:05.364579:0: Device tree loaded to 0x1b4de900 (size 0x8605)
Loading 9276375 bytes takes about 1.55s, which works out to a transfer speed of about 5.7 MiB/s.
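The transfer rate follows directly from the size and the two timestamps in the log lines above. A small sketch (the helper name `load_rate_mib` is illustrative):

```shell
#!/bin/sh
# load_rate_mib BYTES T_START T_END: average throughput in MiB/s.
load_rate_mib() {
    awk -v b="$1" -v t0="$2" -v t1="$3" \
        'BEGIN { printf "%.1f\n", b / (t1 - t0) / 1048576 }'
}

size=$((0x8d8bd7))   # kernel8.img size as reported in the log
echo "kernel size: $size bytes"
echo "load rate:   $(load_rate_mib "$size" 3.816343 5.364579) MiB/s"
```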
This load is being done by the GPU (!) with the internal, proprietary VideoCore IV processor.
It’s possible that the loader code is just inefficient and slow; it’s also possible that it is using very conservative settings.
Sadly, it’s a black box and we can’t touch registers or mess with the parameters in any other useful way.
I haven’t found a good way to optimize this yet, so a smaller kernel is needed.
Overclocking the GPU processor core is theoretically possible with
# Overclock GPU VideoCore IV processor (not recommended!)
core_freq_min=500
core_freq=550
which does lead to a 20% reduction in kernel load time. The side effects (reliability, etc.) of this are unknown.
Buildroot / Custom kernel
It’s time to migrate the system from Raspbian/Debian to a custom-built Buildroot distro (especially to get the custom kernel).
Using Buildroot 2024.02.1, a very stripped-down system was configured.
It uses a native aarch64 toolchain, still with full glibc and the Raspberry Pi userland tools (like camera utilities).
The kernel was configured:
- without sound support
- without most of the block device & filesystem drivers (except SD/MMC and ext4)
- without RAID support
- without USB support
- without HID support
- without DVB support
- without video & framebuffer support (HDMI is disabled anyway)
- without advanced networking features (tunnels, bridging, firewalling, etc.)
- uncompressed (not Gzip)
- modules uncompressed (not Gzip)
In testing, leaving both the kernel and the modules uncompressed gave a net energy saving (even though the GPU then spends more time loading the larger kernel). Decompressing Gzip takes a lot of energy (and effectively involves another relocation step).
A security feature called KASLR was also disabled.
KASLR randomizes the load address of the kernel in memory, making it harder to write exploit code (because the memory location of the kernel is unknown).
This requires the kernel to be re-located after it has been loaded by the GPU.
In our use case, the network attack surface is very limited, so KASLR can be disabled (all application software runs as root anyway). Mitigations for speculative execution vulnerabilities like Spectre were also disabled.
The resulting kernel is 8.5MiB (uncompressed) in size, 4.1MiB compressed as Gzip (which isn’t used here, just for comparison).
The original Raspbian kernel was 25 MiB (uncompressed), 8.9 MiB compressed as Gzip.
Final result
We can now boot into a Linux user space program in less than 3.5s!
~400ms is spent in the Linux kernel (difference between pin 0 and pin 1)
Total energy consumption: 0.364 As * 5.0 V = 1.82 Ws
We reduced the energy required by a factor of 5 (compared to stock Debian at 9.5 Ws until user space).
Reducing input voltage
After publishing this blog post, Graham Sutherland / Polynomial pointed out that the regulators in the Pi Zero aren’t very efficient at 5.0V input.
This might not be applicable in all situations, but in our test scenario and also in the finished product, we can just reduce the input voltage down to 4.0V.
Running at 5.0V:
Take good note of the units at play here. The charge in mC (millicoulombs, equivalent to milliampere-seconds) increases when switching to 4.0V (because of the higher current), but the total energy decreases significantly!
350.94mAs * 5.0V = 1.754 Ws
Running at 4.0V:
390.77mAs * 4.0V = 1.563 Ws
We can go down even further:
Running at 3.6V:
399.60mAs * 3.6V = 1.438 Ws
We just decreased the energy consumption by roughly another 18%, just by operating the switch-mode regulators at a more ideal operating point! This requires further testing for stability/reliability of course (as it’s technically out-of-spec), but this is a very impressive result.
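The three measurements can be put side by side with a short script. This is a sketch with an illustrative helper name; it takes the supply voltage and the measured charge per boot as input and compares each row against the first one:

```shell
#!/bin/sh
# compare_voltages: reads "voltage_V charge_mAs" rows on stdin and prints the
# energy per boot plus the saving relative to the first row.
compare_voltages() {
    awk 'NR == 1 { ref = $1 * $2 }
         { e = $1 * $2
           printf "%.1f V: %.2f Ws (%.1f %% less than first row)\n",
                  $1, e / 1000, (1 - e / ref) * 100 }'
}

compare_voltages <<'EOF'
5.0 350.94
4.0 390.77
3.6 399.60
EOF
```

The charge goes up from row to row while the energy goes down, which is why the comparison has to be made in Ws rather than mAs.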