Analyzing C source files dependencies in a program

Posted on 2013/11/24

Introduction

Software projects can easily become complicated, with many source files depending on one another in ways that are not obvious. Moreover, developing for embedded software and micro-controllers has specific complexities such as optimizing and chiselling a program in order to fit it into a small device with little memory. For these reasons, I wanted a tool to visualize the interdependency of the modules that compose a C program, with the ability to analyse it and point to potential design problems and improvements. I developed a small proof of concept that I am going to show in this post.

The idea is to use a diagnostic output of the linker, called the “cross reference table”, to graph and plot the dependencies of functions and variables across the linked modules. The table will be parsed by a python script and visualized with Graphviz.

Implementation

What is the cross reference table? The “GNU ld” linker is able to output a map of the program that is being linked, with information on the different objects, their sections and their placement in memory. This map file will also optionally contain a table containing the references and definitions of symbols (which are functions and variables with external linkage) indicating what module is defining them and what module is referencing to them. This is the option that enables it (taken from the help pages):

--cref
Output a cross reference table. If a linker map file is being generated, the cross reference table is printed to the map file. Otherwise, it is printed on the standard output.
The format of the table is intentionally simple, so that it may be easily processed by a script if necessary. The symbols are printed out, sorted by name. For each symbol, a list of file names is given. If the symbol is defined, the first file listed is the location of the definition. The remaining files contain references to the symbol.

With this information it is possible to create a graph, with nodes consisting of object files (often related 1:1 to C source files) and libraries, and arrows between them representing symbols that a module uses but is defined in another module. For example, if module “module1.c” defines the function “module1_func1()” and the module “program.c” uses it, the linker will link together “program.o” and “module1.o” object files and the graph will have two nodes, called “program.o” and “module1.o” with an arrow from the “program.o” node to the “module1.o” node representing the reference to the function “module1_func1()“.

Sample dependency graph

The tool that I chose to create this graph is NetworkX, a python library that creates, analyses and draws many types of graphs. It is able to output a Graphviz “dot” file, so that I can use Graphviz tools to create images and show the results visually.

The following is a python script, that I called “read_cref.py“, that takes two parameters: the input map file containing the cross reference table, and the output dot file that will contain the depencency graph.

#!/usr/bin/env python

from sys import argv
from io import open
from networkx import MultiDiGraph, write_dot

def find_cref(inmap):
    while True:
        l = inmap.readline()
        if l.strip() == 'Cross Reference Table':
            break;
        if len(l) == 0: return False
    while True:
        l = inmap.readline()
        if len(l) == 0: return False
        words = l.split()
        if len(words) == 2:
            if words[0] == 'Symbol' and words[1] == 'File':
                return True

def read_cref(inmap):
    modules = MultiDiGraph()
    while True:
        l = inmap.readline()
        words = l.split()
        if len(words) == 2:
            last_symbol = words[0]
            last_module = words[1]
        elif len(words) == 1:
            modules.add_edge(words[0], last_module, label=last_symbol);
        elif len(l) == 0:
            break
    return modules

inmap = open(argv[1], 'r')
if find_cref(inmap):
    modules = read_cref(inmap)
    write_dot(modules, argv[2])
else:
    print 'error: cross reference table not found.'

Note that the tool has very little syntax checking, so it may misbehave on usage errors or different versions of the linker than the one I use (GNU ld (GNU Binutils for Debian) 2.23.90.20131017), and it will surely misbehave when the directories or filenames contain spaces.

Example

As an example, I put together some C source files that depend on one another. Be aware of the distinction between definition and declaration in C. This is “program.c“, that references the “printf” function and some other functions and variables defined in other modules:

#include <stdio.h>
#include "program.h"

int main()
{
    printf("hello %d\n", module1_func1(0));
    module1_func1(module2_var2);
    module2_func1(module2_var1);
    module2_func2(0);
    return 0;
}

This is program.h, containing the declarations of the functions and variables defined in the modules.

extern int module1_func1(int);
extern int module1_func2(int);
extern int module1_var1;
extern int module1_var2;

extern int module2_func1(int);
extern int module2_func2(int);
extern int module2_var1;
extern int module2_var2;

This is “module1.c“, that defines the “module1_*” symbols (functions and variables):

#include "program.h"

int module1_var1 = 1;
int module1_var2 = 2;

int module1_func1(int a)
{
    return module1_var1 + module2_var1 + a;
}

int module1_func2(int a)
{
    return module1_var2 + a;
}

This is “module2.c“, that defines the “module2_*” symbols:

#include "program.h"

int module2_var1 = 1;
int module2_var2 = 2;

int module2_func1(int a)
{
    return module1_var2 + module2_var1 + a;
}

int module2_func2(int a)
{
    return module2_var2 + module1_var1 + module2_func2(a);
}

Assuming that the 5 files are all in the current directory, and that gcc, python, networkx and graphviz are installed on the system, I run the following commands to output an image of the graph:

gcc -I./ -c -o program.o program.c
gcc -I./ -c -o module1.o module1.c
gcc -I./ -c -o module2.o module2.c
gcc -Xlinker -Map=program.map -Xlinker --cref  program.o module1.o module2.o -o program
./read_cref.py  program.map program.dot
dot -Tpng program.dot -o program.png

The first three commands compile the C source files into their respective object files. The fourth command is the linking step, and the “-Xlinker” options are there to pass the correct options to the underlying ld linker. Then I invoke the python script (that has execution permission) and execute the dot Graphviz tool to generate a PNG image.

Example results

One output of the linking step (the main output is the program itself) is the map file “program.map“, containing the cross reference table (many rows omitted):

...
Cross Reference Table

Symbol                                            File
GCC_3.0                                           /lib/i386-linux-gnu/libc.so.6
GLIBC_2.0                                         /lib/i386-linux-gnu/libc.so.6
                                                  /lib/i386-linux-gnu/ld-linux.so.2
...
module1_func1                                     module1.o
                                                  program.o
module1_func2                                     module1.o
module1_var1                                      module1.o
                                                  module2.o
module1_var2                                      module1.o
                                                  module2.o
module2_func1                                     module2.o
                                                  program.o
module2_func2                                     module2.o
                                                  program.o
module2_var1                                      module2.o
                                                  module1.o
                                                  program.o
module2_var2                                      module2.o
                                                  program.o
...

This table has been read by the python script and transformed into a graph, which in turn has been rendered by Graphviz “dot” command into this “program.png” image:

Dependency graph

There are many things that this graph shows:

Our three modules are not the only things that get linked into a program: there are also system libraries and startup code.
The “module1.o” and “module2.o” nodes depend on one another. Even if “program.c” contained only “module1_*” references, “module2.c” would still be needed.
There are two “head” nodes with no arrows pointing to them (no other module depends on them). They are “ld-linux.so.2“, which is the starting point of most Linux userspace programs, and “crtbegin.o” that manages C++ constructors (or C functions with the GCC “constructor” attribute) that we don’t use.
It answers a question that many developers had at least once “My program starts with main, but who calls main?”. The answer in this case is somewhere in “crt1.o“.

As a real-world example, I tried it on U-Boot compilation for the Versatile PB platform (that I tackled in a past blog post), and this is the resulting graph:

U-Boot dependency graph

Note that to simplify the gigantic graph that was initially produced, I retouched the python script using DiGraph instead of MultiDiGraph (to remove multiple arrows) and I removed the symbol labels.

From U-Boot graph we can see that there are many libraries that are linked but not referenced, such as “libfs.o” (file system management), “libinput.o” (keyboard management) and “libspi.o” (SPI management), so it seems that the configuration could be improved further by removing these libraries from the build.

Conclusions and improvements

We have seen a way to draw graphs representing the dependency of C source files and modules throughout a complete program. We exploited the functionalities of GNU ld to create a cross reference table, we used NetworkX to create a graph and Graphviz to visualize it. We applied the method to a simple example and a real embedded application.

This approach in visualizing programs has many benefits. New developers can familiarize with the source code by taking a look at its structure, and maintainers can check quickly if something seems “off” or something else can be designed better. It’s also a cheap way to document the module organization.

It is also possible to expand from this concept and analyse the dependency graph in useful ways. It’s easy to pick an “entry point” and from there extract the resulting sub-graph. By adding more information, such as code size, one could measure the impact of removing or adding new modules. Designers can extract dependency “cycles” and re-factor the code to remove them, improving the modularity and testability of the project.

In any case, I believe it is a useful approach, that is simple enough to be implemented by any developer using the right tools.

Tagged: ansi c, c, cross reference table, dependency graph, dot, gcc, gnu, graphviz, linker, map file, networkx, parsing, python, u-boot

Posted in: Software

13 Responses “Analyzing C source files dependencies in a program” →

Mahavir Jain

2014/04/30

Very useful!
Martin Chaplet

2014/06/03

Great tip !!!
Just a comment, if your project is quite huge : add a filtering with ‘tred’ graphviz tool and all is clear !
In example :
tred program.dot > program2.dot
dot -Tsvg program2.dot -o program2.svg
Sainath

2014/08/04

Hi ,
Thanks for the wonderful tip and the code, I have a query, I have tried using it for the cross compiled ARM programs and it does not work. Do you have any suggestions about how to make this code suitable for bare metal code in ARM.

Regards
Sainath
Balau

2014/08/04

Actually the U-Boot example I have shown is cross-compiled for ARM with a GCC toolchain. My script is not guaranteed to work with all versions of GCC because the map file format is not something stable, but I think modern GCC should be able to output the information needed for those graphs. If you are using toolchains different than GCC then I don’t know, it depends on whether that toolchain gives you that kind of information.
feryllt

2015/03/09

Hi,
why using:

gcc -Xlinker -Map=program.map -Xlinker –cref *.obj

instead of directly

gcc -Map=program.map –cref *.obj

Thanks.
feryllt

2015/03/09

Sorry, I made a mistake in the previous comment:

instead of directly:

–> ld -Map=program.map –cref *.obj
Balau

2015/03/09

Generally it’s better to run gcc for compiling, assembly and linking, instead of executing directly the linker or the assembler. In this particular case it helps to link the standard and system libraries that are not specified in the command line. If you want to know what gcc is doing underneath, you can run it with “--verbose” option.
Duncan White

2016/05/05
Great article. Note that if you want to exclude standard library stuff from the graph, i.e. just display your function call dependencies, you can modify the python script’s ‘len(words) == 1‘ case to read:
```
p = re.compile( '^/' );
if (p.match( last_module ) is None and
    p.match( words[0] ) is None):
  modules.add_edge(words[0], last_module, label=last_symbol);
```
Furthermore, assuming that each X.o file corresponds to source X.c, you can make the graph display X.c names instead of X.o names by adding:
```
last_module = re.sub('\.o$','.c', last_module )
words[0] = re.sub('\.o$','.c', words[0] )
```
into the add_edge if immediately before the add_edge call. To do both, simply make the ‘len(words) == 1‘ case read:
```
p = re.compile( '^/' );
if (p.match( last_module ) is None and
    p.match( words[0] ) is None):
  last_module = re.sub('\.o$','.c', last_module )
  words[0] = re.sub('\.o$','.c', words[0] )
  modules.add_edge(words[0], last_module, label=last_symbol);
```
This makes the graphs much smaller and (IMHO) prettier.

– edited by balau82
Balau

2016/05/05

Thanks for the comment! I agree that my implementation is quite rough, and there’s no limit to make it clearer or more verbose. I think it’s not only a matter of preference, it depends especially on what the developer wants to investigate about the program.
Ravi Kumar

2016/06/29

I have installed networkx by running “pip install networkx” command and installed setuptool(https://pypi.python.org/pypi/setuptools) using “curl https://bootstrap.pypa.io/ez_setup.py -o – | python” command.

While running “read_cref.py” script, I got following error. I am using ubuntu 12.04 64-bit. Can you help to debug this issue.

Traceback (most recent call last):
File “./read_cref.py”, line 5, in
from networkx import MultiDiGraph, write_dot
ImportError: cannot import name write_dot
Balau

2016/06/29

Yeah, in most recent versions of networkx it seems they changed the way to use write_dot, see this: https://github.com/networkx/networkx/issues/1984
So you can try instead of line 5 to separate the two imports:
from networkx import MultiDiGraph from networkx.drawing.nx_agraph import write_dot
Ravi Kumar

2016/07/06

Thanks, issue got fixed after replacing line 5 with the above two imports.

1 Trackback For This Post

SciPy error: object has no attribute ‘decorate’ | Matthias Bock's Blog →
April 7th, 2017 → 18:03
[…] using the linker cross reference table to graphviz dot file conversion script from https://balau82.wordpress.com/2013/11/24/analyzing-c-source-files-dependencies-in-a-program/, I get the […]

Analyzing C source files dependencies in a program

Introduction

Implementation

Example

Example results

Conclusions and improvements

Leave a comment

Subscribe

Author

Sections

Categories

Top Posts & Pages

Tags

Blog Stats

License

Analyzing C source files dependencies in a program

Introduction

Implementation

Example

Example results

Conclusions and improvements

Related

Leave a comment

Subscribe

Author

Sections

Categories

Top Posts & Pages

Tags

Blog Stats

License