In embedded software programming there is often a need to use assembly-level instructions to reach all the functionality of the processor core. But when development is done almost exclusively in C, writing separate assembly files for the needed functions is a burden and adds complexity. Moreover, when the C code must call an assembly function located in another module, the call overhead can slow execution down to an unacceptable level. The solution I prefer is to use inline assembly inside small C functions and let the compiler optimize it into my code.
As an example, I wanted to use the “rotate right” (ROR) instruction on an ARM core, both with an immediate operand and with a register operand. Some details on the ROR instructions are in this Quick Reference from ARM Infocenter: ARM and Thumb-2 Instruction Set Quick Reference Card. Here is some test code that uses inline assembly quite efficiently:
#include <stdio.h>
#include <stdint.h>

static inline __attribute__((always_inline))
uint32_t arm_ror_imm(uint32_t v, uint32_t sh) {
    uint32_t d;
    asm ("ROR %[Rd], %[Rm], %[Is]" : [Rd] "=r" (d) : [Rm] "r" (v), [Is] "i" (sh));
    return d;
}

static inline __attribute__((always_inline))
uint32_t arm_ror(uint32_t v, uint32_t sh) {
    uint32_t d;
    asm ("ROR %[Rd], %[Rm], %[Rs]" : [Rd] "=r" (d) : [Rm] "r" (v), [Rs] "r" (sh));
    return d;
}

int main() {
    uint32_t val;
    val = 0x22;
    printf("val = 0x%08X\n", val);
    printf("arm_ror(val, 1) = 0x%08X\n", arm_ror(val, 1));
    printf("arm_ror(val, 2) = 0x%08X\n", arm_ror(val, 2));
    printf("arm_ror_imm(val, 3) = 0x%08X\n", arm_ror_imm(val, 3));
    printf("arm_ror_imm(val, 4) = 0x%08X\n", arm_ror_imm(val, 4));
    return 0;
}
I wanted the ROR instruction to be used seamlessly and efficiently inside my C code, so I created a couple of C functions (one for ROR with an immediate, one for ROR with a register). I declared them with the C inline keyword and the GCC always_inline attribute, to make sure that they are always inlined. Inside the functions I used GCC's inline assembly (be aware that “inline” for assembly does not mean the same thing as the “inline” keyword for functions). In particular I used the “extended ASM” syntax, which describes in detail how the assembly interacts with the C code. In extended ASM format I declared the output and the inputs of each instruction: the destination register Rd, the source operand Rm, and the shift amount, which is Is for the immediate form and Rs for the register form. I also bound each of them to the corresponding C variable.
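As a minimal sketch of the same operand mapping, the wrapper below uses the register form of ROR only when compiling for ARM state with a GCC-compatible compiler, and falls back to plain C everywhere else. The names ror32 and ror32_selftest are hypothetical and not part of the test program above, and the preprocessor guard is my assumption about a reasonable way to detect ARM state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rotate right with an inline-asm fast path. */
static inline __attribute__((always_inline))
uint32_t ror32(uint32_t v, uint32_t sh) {
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
    /* ARM state: same operand mapping as arm_ror() above. */
    uint32_t d;
    __asm__ ("ROR %[Rd], %[Rm], %[Rs]"
             : [Rd] "=r" (d)                  /* output: d receives Rd */
             : [Rm] "r" (v), [Rs] "r" (sh));  /* inputs: both in registers */
    return d;
#else
    /* Portable fallback: masking keeps both shift counts in 0..31. */
    sh &= 31u;
    return (v >> sh) | (v << ((32u - sh) & 31u));
#endif
}

/* Usage: rotating 0x22 by 1 drops a zero bit, rotating by 2 wraps a set bit. */
void ror32_selftest(void) {
    assert(ror32(0x22u, 1u) == 0x00000011u);
    assert(ror32(0x22u, 2u) == 0x80000008u);
}
```

Keeping the fallback in the same function means callers never see the difference, at the cost of one preprocessor conditional inside the wrapper.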
To compile this program into an ELF executable I used the CodeSourcery arm-linux-gnueabi toolchain, and to run it I used the QEMU ARM user-level emulator, which avoids the complexity of setting up an operating system (or a bare-metal environment) underneath. This emulator can be found in the qemu package in Debian and in the qemu-extras package in Ubuntu.
Here is the command line:
$ arm-linux-gnueabi-gcc -O2 -marm -mcpu=cortex-a8 -static asm_test.c -o asm_test
$ qemu-arm -cpu cortex-a8 asm_test
val = 0x00000022
arm_ror(val, 1) = 0x00000011
arm_ror(val, 2) = 0x80000008
arm_ror_imm(val, 3) = 0x40000004
arm_ror_imm(val, 4) = 0x20000002
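To check these values by hand, it helps to split a rotate into the bits that slide down and the bits that wrap around to the top. The sketch below does that in plain C; ror_parts and ror_split are hypothetical names introduced only for this illustration, and the helper assumes a shift between 1 and 31.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two halves of a 32-bit rotate right. */
typedef struct {
    uint32_t kept;     /* high bits of v that slide down by sh */
    uint32_t wrapped;  /* low sh bits of v re-entering at the top */
} ror_parts;

/* Valid for 1 <= sh <= 31; the rotated value is kept | wrapped. */
ror_parts ror_split(uint32_t v, uint32_t sh) {
    assert(sh >= 1u && sh <= 31u);
    ror_parts p = { v >> sh, v << (32u - sh) };
    return p;
}
```

For 0x22 rotated by 2, kept is 0x00000008 and wrapped is 0x80000000, which combine into the 0x80000008 printed above.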
The result shows that the value is rotated correctly. But what is the machine code that has been generated by this compilation? Is it what I want? Is it efficient? To investigate, I disassembled the ELF executable with the following command:
$ arm-linux-gnueabi-objdump -d asm_test > asm_test.dis
Then I searched for the “main” function in the resulting file, and here is the machine code:
000081e0 <main>:
81e0: e92d4010 push {r4, lr}
81e4: e3a01022 mov r1, #34 ; 0x22
81e8: e30505e4 movw r0, #21988 ; 0x55e4
81ec: e3400006 movt r0, #6
81f0: e1a04001 mov r4, r1
81f4: eb000383 bl 9008 <_IO_printf>
81f8: e3a00001 mov r0, #1
81fc: e1a01074 ror r1, r4, r0
8200: e30505f4 movw r0, #22004 ; 0x55f4
8204: e3400006 movt r0, #6
8208: eb00037e bl 9008 <_IO_printf>
820c: e3a01002 mov r1, #2
8210: e3050610 movw r0, #22032 ; 0x5610
8214: e3400006 movt r0, #6
8218: e1a01174 ror r1, r4, r1
821c: eb000379 bl 9008 <_IO_printf>
8220: e305062c movw r0, #22060 ; 0x562c
8224: e1a011e4 ror r1, r4, #3
8228: e3400006 movt r0, #6
822c: eb000375 bl 9008 <_IO_printf>
8230: e305064c movw r0, #22092 ; 0x564c
8234: e1a01264 ror r1, r4, #4
8238: e3400006 movt r0, #6
823c: eb000371 bl 9008 <_IO_printf>
8240: e3a00000 mov r0, #0
8244: e8bd8010 pop {r4, pc}
The ROR instructions have been inlined into the main function, and they are correctly using the register operand (addresses 81fc and 8218) or the immediate operand (addresses 8224 and 8234).
Note that the compiler is also smart enough to tell you when it can’t satisfy your constraints, for example when you call arm_ror_imm with a shift operand that can’t be translated into an immediate because it’s not a compile-time constant.
In this example I used a single simple instruction. The same method can be applied effectively to bigger portions of assembly code, although the extended ASM operand definitions become more complex. For some single instructions the compiler already provides built-in functions, for example those in the list of GCC ARM built-in functions.
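As a small illustration of the built-in route, the sketch below wraps two generic GCC built-ins, __builtin_bswap32 and __builtin_clz. These are target-independent built-ins rather than entries from the ARM-specific list, and the wrapper names swap_bytes and leading_zeros are hypothetical. On ARM cores that provide REV and CLZ I would expect GCC to emit those instructions, but that depends on the selected CPU and optimization level.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper: reverse the byte order of a 32-bit word. */
static inline uint32_t swap_bytes(uint32_t x) {
    return __builtin_bswap32(x);
}

/* Hypothetical wrapper: count leading zero bits.
   __builtin_clz is undefined for 0, so the precondition is checked. */
static inline int leading_zeros(uint32_t x) {
    assert(x != 0u);
    return __builtin_clz(x);
}
```

Compared with inline assembly, built-ins need no operand constraints and stay portable across targets, but they only cover the instructions the compiler authors chose to expose.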
I hope I managed to show why I like this method of using assembly inside C programs; to summarize, in my opinion the advantages are:
- Very fast run-time execution
- Lowest code size impact
- No need for assembly files
- Easy separation of non-C code (inline assembly) from pure C code
- The compiler can optimize out the function calls if they are not needed
- No use of the preprocessor (which can lead to unclear code)
- Type checking and automatic casts
- Can be applied to any CPU architecture, as long as GCC (or a compiler with the same functionality) is used