A recent incident motivated me to go in with software forensics. To be precise, the question is to detect plagiarism. In this post I'm going to describe the case. Also I will discuss a simple indicator for similarities in executables.
This investigation deals with the implementation of a certain algorithm. Vendor "M" invented the algorithm and implemented it in a larger program. However, there was no documentation available to the public.
Vendor "S" reverse-engineered an executable made by vendor "M". The resulting source code and executable were made available to the public under version 3 of the GNU Public License.
Vendor "X" finally was the last one who implemented the same algorithm. Again only an executable is available.
Now the question is, if the implementation of vendor "X" is based on the code published by "S" or on own research of the original code.
In order to get a first impression I opened the three binaries in IDA Pro, a powerful disassembler. IDA translates the machine code into human readable assembler. However, a line of C source code could result in 10 or more lines of assembler statements. So it may be hard to get the big picture while reading an assembler listing. Therefore I prefer to let IDA draw a graph and zoom out until I can overlook the whole function. The following figure shows the graphs of the same function in the three implementations: in chronological order, from left to right, vendors "M", "S" and "X".

The graphs are read from top to bottom. Boxes ("nodes") indicate continuous sections of code. Green and red lines ("edges") symbolize the true/false legs of a branch instruction.
There seems to be not much in common between the implementations done by "M" and "S". On the other hand there's a striking similarity between code of "S" and "X". However, one should not jump to conclusions here.
This method doesn't consider the code in a node. So, in theory, the same graph may result from two programs that do completely different things.
On the other hand different compiler or compiler options make the same source code look differently in a graph. Let's have a look at the following piece of code: for (i=0; i<3; i++;) { x[i] = i; } A graph would show a small node and a branch.
The compiler optimization option "unroll loops" transforms the for loop into a series of assignment statements:
x[0] = 0; x[1] = 1; x[2] = 2;
Now the graph shows a larger node and the branch is gone. But the code still does the same, though in a slightly different manner.
