Introduction
Software projects can easily become complicated, with many source files depending on one another in ways that are not obvious. Moreover, developing for embedded software and micro-controllers has specific complexities such as optimizing and chiselling a program in order to fit it into a small device with little memory. For these reasons, I wanted a tool to visualize the interdependency of the modules that compose a C program, with the ability to analyse it and point to potential design problems and improvements. I developed a small proof of concept that I am going to show in this post.
The idea is to use a diagnostic output of the linker, called the “cross reference table”, to graph and plot the dependencies of functions and variables across the linked modules. The table will be parsed by a python script and visualized with Graphviz.
Implementation
What is the cross reference table? The “GNU ld
” linker is able to output a map of the program that is being linked, with information on the different objects, their sections and their placement in memory. This map file will also optionally contain a table containing the references and definitions of symbols (which are functions and variables with external linkage) indicating what module is defining them and what module is referencing to them. This is the option that enables it (taken from the help pages):
--cref
Output a cross reference table. If a linker map file is being generated, the cross reference table is printed to the map file. Otherwise, it is printed on the standard output.
The format of the table is intentionally simple, so that it may be easily processed by a script if necessary. The symbols are printed out, sorted by name. For each symbol, a list of file names is given. If the symbol is defined, the first file listed is the location of the definition. The remaining files contain references to the symbol.
With this information it is possible to create a graph, with nodes consisting of object files (often related 1:1 to C source files) and libraries, and arrows between them representing symbols that a module uses but is defined in another module. For example, if module “module1.c
” defines the function “module1_func1()
” and the module “program.c
” uses it, the linker will link together “program.o
” and “module1.o
” object files and the graph will have two nodes, called “program.o
” and “module1.o
” with an arrow from the “program.o
” node to the “module1.o
” node representing the reference to the function “module1_func1()
“.
The tool that I chose to create this graph is NetworkX, a python library that creates, analyses and draws many types of graphs. It is able to output a Graphviz “dot
” file, so that I can use Graphviz tools to create images and show the results visually.
The following is a python script, that I called “read_cref.py
“, that takes two parameters: the input map file containing the cross reference table, and the output dot file that will contain the depencency graph.
#!/usr/bin/env python from sys import argv from io import open from networkx import MultiDiGraph, write_dot def find_cref(inmap): while True: l = inmap.readline() if l.strip() == 'Cross Reference Table': break; if len(l) == 0: return False while True: l = inmap.readline() if len(l) == 0: return False words = l.split() if len(words) == 2: if words[0] == 'Symbol' and words[1] == 'File': return True def read_cref(inmap): modules = MultiDiGraph() while True: l = inmap.readline() words = l.split() if len(words) == 2: last_symbol = words[0] last_module = words[1] elif len(words) == 1: modules.add_edge(words[0], last_module, label=last_symbol); elif len(l) == 0: break return modules inmap = open(argv[1], 'r') if find_cref(inmap): modules = read_cref(inmap) write_dot(modules, argv[2]) else: print 'error: cross reference table not found.'
Note that the tool has very little syntax checking, so it may misbehave on usage errors or different versions of the linker than the one I use (GNU ld (GNU Binutils for Debian) 2.23.90.20131017
), and it will surely misbehave when the directories or filenames contain spaces.
Example
As an example, I put together some C source files that depend on one another. Be aware of the distinction between definition and declaration in C. This is “program.c
“, that references the “printf
” function and some other functions and variables defined in other modules:
#include <stdio.h> #include "program.h" int main() { printf("hello %d\n", module1_func1(0)); module1_func1(module2_var2); module2_func1(module2_var1); module2_func2(0); return 0; }
This is program.h
, containing the declarations of the functions and variables defined in the modules.
extern int module1_func1(int); extern int module1_func2(int); extern int module1_var1; extern int module1_var2; extern int module2_func1(int); extern int module2_func2(int); extern int module2_var1; extern int module2_var2;
This is “module1.c
“, that defines the “module1_*
” symbols (functions and variables):
#include "program.h" int module1_var1 = 1; int module1_var2 = 2; int module1_func1(int a) { return module1_var1 + module2_var1 + a; } int module1_func2(int a) { return module1_var2 + a; }
This is “module2.c
“, that defines the “module2_*
” symbols:
#include "program.h" int module2_var1 = 1; int module2_var2 = 2; int module2_func1(int a) { return module1_var2 + module2_var1 + a; } int module2_func2(int a) { return module2_var2 + module1_var1 + module2_func2(a); }
Assuming that the 5 files are all in the current directory, and that gcc
, python
, networkx
and graphviz
are installed on the system, I run the following commands to output an image of the graph:
gcc -I./ -c -o program.o program.c gcc -I./ -c -o module1.o module1.c gcc -I./ -c -o module2.o module2.c gcc -Xlinker -Map=program.map -Xlinker --cref program.o module1.o module2.o -o program ./read_cref.py program.map program.dot dot -Tpng program.dot -o program.png
The first three commands compile the C source files into their respective object files. The fourth command is the linking step, and the “-Xlinker
” options are there to pass the correct options to the underlying ld
linker. Then I invoke the python script (that has execution permission) and execute the dot
Graphviz tool to generate a PNG image.
Example results
One output of the linking step (the main output is the program itself) is the map file “program.map
“, containing the cross reference table (many rows omitted):
... Cross Reference Table Symbol File GCC_3.0 /lib/i386-linux-gnu/libc.so.6 GLIBC_2.0 /lib/i386-linux-gnu/libc.so.6 /lib/i386-linux-gnu/ld-linux.so.2 ... module1_func1 module1.o program.o module1_func2 module1.o module1_var1 module1.o module2.o module1_var2 module1.o module2.o module2_func1 module2.o program.o module2_func2 module2.o program.o module2_var1 module2.o module1.o program.o module2_var2 module2.o program.o ...
This table has been read by the python script and transformed into a graph, which in turn has been rendered by Graphviz “dot
” command into this “program.png
” image:
There are many things that this graph shows:
- Our three modules are not the only things that get linked into a program: there are also system libraries and startup code.
- The “
module1.o
” and “module2.o
” nodes depend on one another. Even if “program.c
” contained only “module1_*
” references, “module2.c
” would still be needed. - There are two “head” nodes with no arrows pointing to them (no other module depends on them). They are “
ld-linux.so.2
“, which is the starting point of most Linux userspace programs, and “crtbegin.o
” that manages C++ constructors (or C functions with the GCC “constructor
” attribute) that we don’t use. - It answers a question that many developers had at least once “My program starts with
main
, but who callsmain
?”. The answer in this case is somewhere in “crt1.o
“.
As a real-world example, I tried it on U-Boot compilation for the Versatile PB platform (that I tackled in a past blog post), and this is the resulting graph:
Note that to simplify the gigantic graph that was initially produced, I retouched the python
script using DiGraph
instead of MultiDiGraph
(to remove multiple arrows) and I removed the symbol labels.
From U-Boot graph we can see that there are many libraries that are linked but not referenced, such as “libfs.o
” (file system management), “libinput.o
” (keyboard management) and “libspi.o
” (SPI management), so it seems that the configuration could be improved further by removing these libraries from the build.
Conclusions and improvements
We have seen a way to draw graphs representing the dependency of C source files and modules throughout a complete program. We exploited the functionalities of GNU ld
to create a cross reference table, we used NetworkX to create a graph and Graphviz to visualize it. We applied the method to a simple example and a real embedded application.
This approach in visualizing programs has many benefits. New developers can familiarize with the source code by taking a look at its structure, and maintainers can check quickly if something seems “off” or something else can be designed better. It’s also a cheap way to document the module organization.
It is also possible to expand from this concept and analyse the dependency graph in useful ways. It’s easy to pick an “entry point” and from there extract the resulting sub-graph. By adding more information, such as code size, one could measure the impact of removing or adding new modules. Designers can extract dependency “cycles” and re-factor the code to remove them, improving the modularity and testability of the project.
In any case, I believe it is a useful approach, that is simple enough to be implemented by any developer using the right tools.
Mahavir Jain
2014/04/30
Very useful!
Martin Chaplet
2014/06/03
Great tip !!!
Just a comment, if your project is quite huge : add a filtering with ‘tred’ graphviz tool and all is clear !
In example :
tred program.dot > program2.dot
dot -Tsvg program2.dot -o program2.svg
Sainath
2014/08/04
Hi ,
Thanks for the wonderful tip and the code, I have a query, I have tried using it for the cross compiled ARM programs and it does not work. Do you have any suggestions about how to make this code suitable for bare metal code in ARM.
Regards
Sainath
Balau
2014/08/04
Actually the U-Boot example I have shown is cross-compiled for ARM with a GCC toolchain. My script is not guaranteed to work with all versions of GCC because the map file format is not something stable, but I think modern GCC should be able to output the information needed for those graphs. If you are using toolchains different than GCC then I don’t know, it depends on whether that toolchain gives you that kind of information.
feryllt
2015/03/09
Hi,
why using:
gcc -Xlinker -Map=program.map -Xlinker –cref *.obj
instead of directly
gcc -Map=program.map –cref *.obj
Thanks.
feryllt
2015/03/09
Sorry, I made a mistake in the previous comment:
instead of directly:
–> ld -Map=program.map –cref *.obj
Balau
2015/03/09
Generally it’s better to run
gcc
for compiling, assembly and linking, instead of executing directly the linker or the assembler. In this particular case it helps to link the standard and system libraries that are not specified in the command line. If you want to know whatgcc
is doing underneath, you can run it with “--verbose
” option.Duncan White
2016/05/05
Great article. Note that if you want to exclude standard library stuff from the graph, i.e. just display your function call dependencies, you can modify the python script’s ‘
len(words) == 1
‘ case to read:Furthermore, assuming that each
X.o
file corresponds to sourceX.c
, you can make the graph displayX.c
names instead ofX.o
names by adding:into the
add_edge
if immediately before theadd_edge
call. To do both, simply make the ‘len(words) == 1
‘ case read:This makes the graphs much smaller and (IMHO) prettier.
Balau
2016/05/05
Thanks for the comment! I agree that my implementation is quite rough, and there’s no limit to make it clearer or more verbose. I think it’s not only a matter of preference, it depends especially on what the developer wants to investigate about the program.
Ravi Kumar
2016/06/29
I have installed networkx by running “pip install networkx” command and installed setuptool(https://pypi.python.org/pypi/setuptools) using “curl https://bootstrap.pypa.io/ez_setup.py -o – | python” command.
While running “read_cref.py” script, I got following error. I am using ubuntu 12.04 64-bit. Can you help to debug this issue.
Traceback (most recent call last):
File “./read_cref.py”, line 5, in
from networkx import MultiDiGraph, write_dot
ImportError: cannot import name write_dot
Balau
2016/06/29
Yeah, in most recent versions of networkx it seems they changed the way to use write_dot, see this: https://github.com/networkx/networkx/issues/1984
So you can try instead of line 5 to separate the two imports:
from networkx import MultiDiGraph
from networkx.drawing.nx_agraph import write_dot
Ravi Kumar
2016/07/06
Thanks, issue got fixed after replacing line 5 with the above two imports.