Inline assembly instructions in GCC

Posted on 2011/05/17


In embedded software programming there is often a need to use assembly-level instructions to reach all the functionality of the processor core. But when development is done almost exclusively in C, writing separate assembly files for the needed functions is sometimes a burden, often an added complexity and possibly a nuisance. In addition, calling an assembly function located in another module from C code carries a performance cost that can slow execution down to an unacceptable level. The solution that I prefer is to use inline assembly inside small C functions and let the compiler optimize it into my code.

As an example, I wanted to use the “rotate right” (ROR) instruction on an ARM core, both with an immediate operand and with a register operand. Details on the ROR instruction are in the ARM and Thumb-2 Instruction Set Quick Reference Card from the ARM Infocenter. Here is some test code that uses inline assembly quite efficiently:

#include <stdio.h>
#include <stdint.h>

static inline __attribute__((always_inline))
uint32_t arm_ror_imm(uint32_t v, uint32_t sh) {
  uint32_t d;
  asm ("ROR %[Rd], %[Rm], %[Is]" : [Rd] "=r" (d) : [Rm] "r" (v), [Is] "i" (sh));
  return d;
}

static inline __attribute__((always_inline))
uint32_t arm_ror(uint32_t v, uint32_t sh) {
  uint32_t d;
  asm ("ROR %[Rd], %[Rm], %[Rs]" : [Rd] "=r" (d) : [Rm] "r" (v), [Rs] "r" (sh));
  return d;
}

int main() {
  uint32_t val;

  val = 0x22;
  printf("val = 0x%08X\n", val);
  printf("arm_ror(val, 1) = 0x%08X\n", arm_ror(val, 1));
  printf("arm_ror(val, 2) = 0x%08X\n", arm_ror(val, 2));
  printf("arm_ror_imm(val, 3) = 0x%08X\n", arm_ror_imm(val, 3));
  printf("arm_ror_imm(val, 4) = 0x%08X\n", arm_ror_imm(val, 4));
  return 0;
}

I wanted the ROR instruction to be used seamlessly and efficiently inside my C code, so I created a couple of C functions (one for ROR with an immediate operand, one for ROR with a register operand), to which I added the inline C keyword and the always_inline GCC attribute to make sure that they are always inlined. Inside the functions I used the inline assembly feature of GCC (be aware that “inline” for assembly does not have the same meaning as the “inline” keyword for functions). In particular I made use of the “extended ASM” syntax, which allows the interaction between the assembly and the C code to be described in detail. In extended ASM format I defined the output and the inputs of the assembly instruction: the destination register Rd, and the operands Rm, Is and Rs; I also connected each of them to the corresponding C variable.
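
The same operand layout carries over to other instructions. As a separate illustration, here is a minimal sketch that wraps CLZ (count leading zeros) in the same way, with each part of the extended ASM statement commented. The name clz32 and the non-ARM fallback branch are my additions for illustration, so that the snippet also builds with a host compiler; only the __arm__ branch uses inline assembly.

```c
#include <stdint.h>

/* Count leading zeros, wrapped like arm_ror() but with a single input. */
static inline __attribute__((always_inline))
uint32_t clz32(uint32_t v) {
#if defined(__arm__)
  uint32_t d;
  asm ("CLZ %[Rd], %[Rm]"   /* template: each %[name] becomes a register */
       : [Rd] "=r" (d)      /* outputs: "=r" is write-only, any core register */
       : [Rm] "r" (v));     /* inputs: "r" passes v in any core register */
  return d;
#else
  /* Portable fallback (assumption: only here so the sketch builds off-target). */
  uint32_t n = 0;
  if (v == 0) return 32;
  while ((v & 0x80000000u) == 0) { v <<= 1; n++; }
  return n;
#endif
}
```

For instance, clz32(0x22) should return 26, since 0x22 only occupies the six lowest bits.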

To compile the test program above into an ELF executable I used the CodeSourcery arm-linux-gnueabi toolchain, and to emulate its execution I used the QEMU ARM user-level emulator, which hides the complexity of having an operating system (or bare metal) underneath. This emulator can be found in the qemu package in Debian and in the qemu-extras package in Ubuntu.

Here is the command line:

$ arm-linux-gnueabi-gcc -mcpu=cortex-a8 -O2 -marm -static asm_test.c -o asm_test
$ qemu-arm -cpu cortex-a8 asm_test
val = 0x00000022
arm_ror(val, 1) = 0x00000011
arm_ror(val, 2) = 0x80000008
arm_ror_imm(val, 3) = 0x40000004
arm_ror_imm(val, 4) = 0x20000002
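
These values can be cross-checked against a plain C rotate with no assembly at all. A minimal sketch follows; ror32_ref is a hypothetical helper name and is not part of the test program.

```c
#include <stdint.h>

/* Portable reference rotate-right; masking keeps both shift counts in 0..31. */
static inline uint32_t ror32_ref(uint32_t v, uint32_t sh) {
  sh &= 31u;
  return (v >> sh) | (v << ((32u - sh) & 31u));
}
```

With this helper, ror32_ref(0x22, 2) evaluates to 0x80000008, matching the arm_ror(val, 2) line of the output.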

The result shows that the value is rotated correctly. But what is the machine code that has been generated by this compilation? Is it what I want? Is it efficient? To investigate, I disassembled the ELF executable with the following command:

$ arm-linux-gnueabi-objdump -d asm_test > asm_test.dis

Then I searched for the “main” function in the resulting file, and here is the machine code:

000081e0 <main>:
81e0:    e92d4010     push    {r4, lr}
81e4:    e3a01022     mov    r1, #34    ; 0x22
81e8:    e30505e4     movw    r0, #21988    ; 0x55e4
81ec:    e3400006     movt    r0, #6
81f0:    e1a04001     mov    r4, r1
81f4:    eb000383     bl    9008 <_IO_printf>
81f8:    e3a00001     mov    r0, #1
81fc:    e1a01074     ror    r1, r4, r0
8200:    e30505f4     movw    r0, #22004    ; 0x55f4
8204:    e3400006     movt    r0, #6
8208:    eb00037e     bl    9008 <_IO_printf>
820c:    e3a01002     mov    r1, #2
8210:    e3050610     movw    r0, #22032    ; 0x5610
8214:    e3400006     movt    r0, #6
8218:    e1a01174     ror    r1, r4, r1
821c:    eb000379     bl    9008 <_IO_printf>
8220:    e305062c     movw    r0, #22060    ; 0x562c
8224:    e1a011e4     ror    r1, r4, #3
8228:    e3400006     movt    r0, #6
822c:    eb000375     bl    9008 <_IO_printf>
8230:    e305064c     movw    r0, #22092    ; 0x564c
8234:    e1a01264     ror    r1, r4, #4
8238:    e3400006     movt    r0, #6
823c:    eb000371     bl    9008 <_IO_printf>
8240:    e3a00000     mov    r0, #0
8244:    e8bd8010     pop    {r4, pc}

The ROR instructions have been inlined into the main function, and they are correctly using the register operand (addresses 81fc and 8218) or the immediate operand (addresses 8224 and 8234).

Note that the compiler is also smart enough to tell you when it can’t satisfy your constraints, for example when you try to call arm_ror_imm with a shift operand that can’t be translated into an immediate because it’s not constant.

In this example I used a single simple instruction, but the same method can be applied just as effectively to bigger portions of assembly code; the extended ASM operand definitions simply become more complex. For single instructions the compiler sometimes already provides built-in functions; see, for example, the list of GCC ARM built-in functions.
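
To give an idea of how the operand definitions grow, here is a minimal two-instruction sketch: a 64-bit addition built from ADDS and ADC. It assumes ARM mode (as with -marm above); the helper name add64_halves and the portable fallback branch are illustrative additions. Compared with the ROR wrappers, it needs an early-clobber output and a clobber list for the condition flags.

```c
#include <stdint.h>

/* 64-bit addition from 32-bit halves (hypothetical helper for illustration). */
static inline __attribute__((always_inline))
uint64_t add64_halves(uint32_t alo, uint32_t ahi, uint32_t blo, uint32_t bhi) {
#if defined(__arm__)
  uint32_t lo, hi;
  asm ("ADDS %[lo], %[alo], %[blo]\n\t"  /* add low words, set the carry flag */
       "ADC  %[hi], %[ahi], %[bhi]"      /* add high words plus the carry */
       : [lo] "=&r" (lo),                /* "&": written before all inputs are read */
         [hi] "=r"  (hi)
       : [alo] "r" (alo), [blo] "r" (blo),
         [ahi] "r" (ahi), [bhi] "r" (bhi)
       : "cc");                          /* the condition flags are modified */
  return ((uint64_t)hi << 32) | lo;
#else
  /* Portable fallback (assumption: only here so the sketch builds off-target). */
  return (((uint64_t)ahi << 32) | alo) + (((uint64_t)bhi << 32) | blo);
#endif
}
```

Without the early-clobber marker the compiler would be free to place lo in the same register as ahi or bhi, and the first instruction would then overwrite an input that the second one still needs.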

I hope I managed to show why I like this method of using assembly inside C programs; to summarize, in my opinion the advantages are:

  • Very fast run-time execution
  • Lowest code size impact
  • No need for assembly files
  • Easy separation of non-C code (inline assembly) from pure C code
  • The compiler can optimize out the function calls if they are not needed
  • No use of the preprocessor (which can lead to unclear code)
  • Type checking and automatic type conversion
  • Can be applied to any CPU architecture as long as GCC (or a compiler with the same features) is used

Posted in: Embedded