Analyzing C source files dependencies in a program

Posted on 2013/11/24

13


Introduction

Software projects can easily become complicated, with many source files depending on one another in ways that are not obvious. Moreover, developing for embedded software and micro-controllers has specific complexities such as optimizing and chiselling a program in order to fit it into a small device with little memory. For these reasons, I wanted a tool to visualize the interdependency of the modules that compose a C program, with the ability to analyse it and point to potential design problems and improvements. I developed a small proof of concept that I am going to show in this post.

The idea is to use a diagnostic output of the linker, called the “cross reference table”, to graph and plot the dependencies of functions and variables across the linked modules. The table will be parsed by a python script and visualized with Graphviz.

Implementation

What is the cross reference table? The “GNU ld” linker is able to output a map of the program that is being linked, with information on the different objects, their sections and their placement in memory. This map file will also optionally contain a table containing the references and definitions of symbols (which are functions and variables with external linkage) indicating what module is defining them and what module is referencing to them. This is the option that enables it (taken from the help pages):

--cref
Output a cross reference table. If a linker map file is being generated, the cross reference table is printed to the map file. Otherwise, it is printed on the standard output.
The format of the table is intentionally simple, so that it may be easily processed by a script if necessary. The symbols are printed out, sorted by name. For each symbol, a list of file names is given. If the symbol is defined, the first file listed is the location of the definition. The remaining files contain references to the symbol.

With this information it is possible to create a graph, with nodes consisting of object files (often related 1:1 to C source files) and libraries, and arrows between them representing symbols that a module uses but is defined in another module. For example, if module “module1.c” defines the function “module1_func1()” and the module “program.c” uses it, the linker will link together “program.o” and “module1.o” object files and the graph will have two nodes, called “program.o” and “module1.o” with an arrow from the “program.o” node to the “module1.o” node representing the reference to the function “module1_func1()“.

Sample dependency graph

Sample dependency graph

The tool that I chose to create this graph is NetworkX, a python library that creates, analyses and draws many types of graphs. It is able to output a Graphvizdot” file, so that I can use Graphviz tools to create images and show the results visually.

The following is a python script, that I called “read_cref.py“, that takes two parameters: the input map file containing the cross reference table, and the output dot file that will contain the depencency graph.

#!/usr/bin/env python

from sys import argv
from io import open
from networkx import MultiDiGraph, write_dot

def find_cref(inmap):
    while True:
        l = inmap.readline()
        if l.strip() == 'Cross Reference Table':
            break;
        if len(l) == 0: return False
    while True:
        l = inmap.readline()
        if len(l) == 0: return False
        words = l.split()
        if len(words) == 2:
            if words[0] == 'Symbol' and words[1] == 'File':
                return True

def read_cref(inmap):
    modules = MultiDiGraph()
    while True:
        l = inmap.readline()
        words = l.split()
        if len(words) == 2:
            last_symbol = words[0]
            last_module = words[1]
        elif len(words) == 1:
            modules.add_edge(words[0], last_module, label=last_symbol);
        elif len(l) == 0:
            break
    return modules

inmap = open(argv[1], 'r')
if find_cref(inmap):
    modules = read_cref(inmap)
    write_dot(modules, argv[2])
else:
    print 'error: cross reference table not found.'

Note that the tool has very little syntax checking, so it may misbehave on usage errors or different versions of the linker than the one I use (GNU ld (GNU Binutils for Debian) 2.23.90.20131017), and it will surely misbehave when the directories or filenames contain spaces.

Example

As an example, I put together some C source files that depend on one another. Be aware of the distinction between definition and declaration in C. This is “program.c“, that references the “printf” function and some other functions and variables defined in other modules:

#include <stdio.h>
#include "program.h"

int main()
{
    printf("hello %d\n", module1_func1(0));
    module1_func1(module2_var2);
    module2_func1(module2_var1);
    module2_func2(0);
    return 0;
}

This is program.h, containing the declarations of the functions and variables defined in the modules.

extern int module1_func1(int);
extern int module1_func2(int);
extern int module1_var1;
extern int module1_var2;

extern int module2_func1(int);
extern int module2_func2(int);
extern int module2_var1;
extern int module2_var2;

This is “module1.c“, that defines the “module1_*” symbols (functions and variables):

#include "program.h"

int module1_var1 = 1;
int module1_var2 = 2;

int module1_func1(int a)
{
    return module1_var1 + module2_var1 + a;
}

int module1_func2(int a)
{
    return module1_var2 + a;
}

This is “module2.c“, that defines the “module2_*” symbols:

#include "program.h"

int module2_var1 = 1;
int module2_var2 = 2;

int module2_func1(int a)
{
    return module1_var2 + module2_var1 + a;
}

int module2_func2(int a)
{
    return module2_var2 + module1_var1 + module2_func2(a);
}

Assuming that the 5 files are all in the current directory, and that gcc, python, networkx and graphviz are installed on the system, I run the following commands to output an image of the graph:

gcc -I./ -c -o program.o program.c
gcc -I./ -c -o module1.o module1.c
gcc -I./ -c -o module2.o module2.c
gcc -Xlinker -Map=program.map -Xlinker --cref  program.o module1.o module2.o -o program
./read_cref.py  program.map program.dot
dot -Tpng program.dot -o program.png

The first three commands compile the C source files into their respective object files. The fourth command is the linking step, and the “-Xlinker” options are there to pass the correct options to the underlying ld linker. Then I invoke the python script (that has execution permission) and execute the dot Graphviz tool to generate a PNG image.

Example results

One output of the linking step (the main output is the program itself) is the map file “program.map“, containing the cross reference table (many rows omitted):

...
Cross Reference Table

Symbol                                            File
GCC_3.0                                           /lib/i386-linux-gnu/libc.so.6
GLIBC_2.0                                         /lib/i386-linux-gnu/libc.so.6
                                                  /lib/i386-linux-gnu/ld-linux.so.2
...
module1_func1                                     module1.o
                                                  program.o
module1_func2                                     module1.o
module1_var1                                      module1.o
                                                  module2.o
module1_var2                                      module1.o
                                                  module2.o
module2_func1                                     module2.o
                                                  program.o
module2_func2                                     module2.o
                                                  program.o
module2_var1                                      module2.o
                                                  module1.o
                                                  program.o
module2_var2                                      module2.o
                                                  program.o
...

This table has been read by the python script and transformed into a graph, which in turn has been rendered by Graphviz “dot” command into this “program.png” image:

Dependency graph

Dependency graph

There are many things that this graph shows:

  • Our three modules are not the only things that get linked into a program: there are also system libraries and startup code.
  • The “module1.o” and “module2.o” nodes depend on one another. Even if “program.c” contained only “module1_*” references, “module2.c” would still be needed.
  • There are two “head” nodes with no arrows pointing to them (no other module depends on them). They are “ld-linux.so.2“, which is the starting point of most Linux userspace programs, and “crtbegin.o” that manages C++ constructors (or C functions with the GCC “constructor” attribute) that we don’t use.
  • It answers a question that many developers had at least once “My program starts with main, but who calls main?”. The answer in this case is somewhere in “crt1.o“.

As a real-world example, I tried it on U-Boot compilation for the Versatile PB platform (that I tackled in a past blog post), and this is the resulting graph:

U-Boot dependency graph

U-Boot dependency graph

Note that to simplify the gigantic graph that was initially produced, I retouched the python script using DiGraph instead of MultiDiGraph (to remove multiple arrows) and I removed the symbol labels.

From U-Boot graph we can see that there are many libraries that are linked but not referenced, such as “libfs.o” (file system management), “libinput.o” (keyboard management) and “libspi.o” (SPI management), so it seems that the configuration could be improved further by removing these libraries from the build.

Conclusions and improvements

We have seen a way to draw graphs representing the dependency of C source files and modules throughout a complete program. We exploited the functionalities of GNU ld to create a cross reference table, we used NetworkX to create a graph and Graphviz to visualize it. We applied the method to a simple example and a real embedded application.

This approach in visualizing programs has many benefits. New developers can familiarize with the source code by taking a look at its structure, and maintainers can check quickly if something seems “off” or something else can be designed better. It’s also a cheap way to document the module organization.

It is also possible to expand from this concept and analyse the dependency graph in useful ways. It’s easy to pick an “entry point” and from there extract the resulting sub-graph. By adding more information, such as code size, one could measure the impact of removing or adding new modules. Designers can extract dependency “cycles” and re-factor the code to remove them, improving the modularity and testability of the project.

In any case, I believe it is a useful approach, that is simple enough to be implemented by any developer using the right tools.

Posted in: Software