Introduction:- :-
Reverse
engineering is the process of analyzing a subject system to identify the
system’s components and their interrelationships and create representations of
the system in another form or at a higher level of abstraction. The purpose of
reverse engineering is to understand a software system in order to facilitate
enhancement, correction, documentation, redesign, or reprogramming in a
different programming language.
Decompilers
Reverse engineering is done with the help of decompilers. A decompiler is a
program that reads a program written in a machine language (the source
language) and translates it into an equivalent program in a high level language
(the target language) (see Figure 1.2). A decompiler, or reverse compiler, thus
attempts to reverse the process of a compiler, which translates a high-level
language program into a binary or executable program. Basic decompiler
techniques are used to decompile binary programs from a wide variety of machine
languages to a diversity of high-level languages. The structure of decompilers
is based on the structure of compilers; similar principles and techniques are
used to perform the analysis of programs.
Code Improving
Optimizations
This
section presents the code-improving transformations used by a decompiler. The
aim of these optimizations is to eliminate the low-level language concepts of
condition codes, registers, and intermediate instructions, and introduce the
high-level concept of expressions of more than two operands. For this purpose,
it is noted that push instructions are used in a variety of ways by today's
compilers. Parameter passing is the most common use of this instruction, by
pushing them before the subroutine call, in the order specified by the calling convention
in use. Register spilling is used whenever the compiler runs out of registers
to compute an expression.
Structuring
Algorithms
In
Decompilation, the aim of a structuring algorithm is to determine the
underlying control structures of an arbitrary graph, thus converting it into a
functional and semantic equivalent graph. Arbitrary graph stands for any
control flow graph, reducible or irreducible, from a structured or unstructured
language. Since it is not known what language the initial program was written
in, and what compiler was used (e.g. what optimizations were turned on), the
use of goto jumps must be allowed in case the graph cannot be structured into a
set of generic high-level structures.
Structuring Loops
In
order to structure loops, a loop in terms of a graph representation needs to be
defined. This representation must be able to not only determine the extent of a
loop, but also provide a nesting order for the loops. The representation of a
loop by means of cycles is too fine a representation since loops are not
necessarily properly nested or disjoint. The use of strongly connected
components as loops is too coarse a representation as there is no nesting
order. The use of strongly connected regions does not provide a unique cover of
the graph, and does not cover the entire graph. Once a loop has been found, the
type of loop is determined.
Finding The Type
Of Loop
The
type of a loop is determined by the header and latching nodes of the loop. In a
pre-tested loop, the 2-way header node determines whether the loop is executed
or not, and the 1-way latching node transfers control back to the header node.
A post-tested loop is characterized by a 2-way latching node that branches back
to the header of the loop or out of the loop, and any type of header node.
Finally, an endless loop has a 1-way latching node that transfers control back
to the header node, and any type of header node.
Finding The Loop
Follow Node
The
loop follow node is the first node that is reached after the loop is
terminated. In a pre-tested loop, the follow node is the successor of the loop
header that does not belong to the loop. In a similar way, the follow node of a
post-tested loop is the successor of the loop-latching node that does not
belong to the loop. Initially, no follow nodes are taken for endless loops as
neither the header nor the latching node jump out of the loop. But since an
endless loop can have a jump out of the loop in the middle of the loop, it can too have a follow node. Since the follow node is the first
node that is reached after the loop is ended, it is desirable to find the
closest node that is reached from the loop after an exit is performed. The closest
node is the one with the smallest reverse post order numbering; i.e. the one
that is closest to the loop (in numbering order). Any other node that is also
reached from the loop can be reached from the closest node (because it must
have a greater reverse post order numbering), thus, the closest node is
considered the follow node of an endless loop.
Generating Code
for a Basic Block
After
data flow analysis, the intermediate instructions in a basic block are all
high-level instructions; pseudo high-level instructions must have been
eliminated from the code before this point. For each basic block, the
instructions in the basic block are mapped to an equivalent instruction of the
target language. Transfer of control instructions (i.e. jcond and jmp instructions)
is dependent on the structure of the graph (i.e. they belong to a loop or a
conditional jump (2 and n ways), or be equivalent to a goto), and hence, code
is generated for them according to the control flow information. This section
illustrates how code is generated for all other instructions of a basic block.
Challenges
1.
The main difficulty of a decompiler parser is the separation of code from data,
that is, the determination of which bytes in memory represent code and which
ones represent data.
2.
Less mature. Skilled staff sparse.
3.
The model can be imperfect. Salvaging partial information is still useful.
Conclusion
This
seminar report has presented techniques for the reverse compilation or
Decompilation of binary programs. Decompilers use similar principles and
techniques used in compilers. The data flow analysis phase analyzes the
low-level intermediate code and converts it into a high-level intermediate
representation available in any high-level language. Finally, the code
generation phase generates high-level code based on the high-level intermediate
representation and the structured graph of each subroutine. In this way, a
decompiler for a different machine can be built by writing a new front-end for
that machine, and a decompiler for a different target high-level language can
be built by writing a new back-end for the target language.
0 comments:
Post a Comment