Dis86: A decompiler for x86 16-bit real-mode binaries

Today I'm publicly releasing a decompiler I've been building. Source code is (here).

I plan to document the internals here over a series of posts, but today's announcement will remain fairly high-level.

Backstory

Over the last year, I've been casually working on a reverse-engineering and reimplementation project of an old MS-DOS video-game that got me curious about science and engineering as a child. Along with this project has come all kinds of fascinating discoveries about this original IBM PC that I was too young to appreciate at the time. I've been intending to write publicly about all my discoveries for far too long.

The most traditional tool used for this task is IDA Pro. Sadly they dropped x86 16-bit real mode support some time ago. And to complicate matters more, I use a Mac M1 Pro Aarch64 machine so I can't simply use some older version without resorting to a complicated emulation stack (argh). I tried Ghidra also at some point and found issues getting it to work for my needs.

Being a very experienced systems engineer, it doesn't take much annoyance until you're seriously considering taking up the task of just building the thing you want from scratch. It's a curse. A curse I love.

Prehistoric version 1

The first version was a very simple 1-to-1 assembly instruction to C statement convertor.

For example:

x86-16 Asmn C Code
inc ax AX += 1;
push bx PUSH(BX);
leave SP = BP; BP = POP();

The generated C-code then could execute instead of the orginal machine-code using a hybrid runtime I developed that switches between executing inside DosBox or on native Mac M1 Aarch64 depending on which functions have been decompiled or are still only available in the original machine code. I call this runtime Hydra. Hydra itself is an interesting systems engineering project and I plan to also release it someday (TBD).

The typical reverse-engineering flow looked like:

  1. Identify an interesting function and where it is in the binary (e.g. from 0452:012a to 0452:0151)
  2. Run dis86 to generate eqivalent C-code
  3. Configure the Hydra Runtime to use this new code instead of the old machine code
  4. Verify it works identically to the original
  5. Begin refactoring the function: simplifying, naming variables, introducing higher-level constructs, etc
  6. Goto 4 ... until satisfied

You spend a lot of time in the 4-6 loop here. And there are obvious and not-so-obvious transformations that appear repeatedly that you quickly get sick of doing manually. You really want to focus on the less trivial parts such as understanding a data-structure or algorithm.

Here are some examples:

Assembly pattern Semantic operation
xor si,si SI = 0
or bx,bx
jne 0x234
if (BX != 0) goto addr_234
push ax
call 0x356:0x78
pop cx
call function 0x356:0x78 with one argument (AX)

You also want to recognize function parameters and local variables. Instead of accessing them with raw load/store instructions, you want to symbolize them and use ordinary assignment.

Some examples of variables:

Assembly pattern Sematic operation
mov di,WORD PTR ss:[bp+0x6] DI = <first function parameter>
les bx,DWORD PTR ss:[bp-0x6] ES:BX = <local variable at offset -0x6>
mov WORD PTR ss:[bp-0x8],dx <local variable at offset -0x8> = DX

And, you want to symbolize global varibales and fucntion calls.

Prehistoric version 2

So version 2 is born which does a handful of "peephole-style" transformations and symbolization.

Example output:

#define _param_0006 ARG_16(0x6)
#define _local_0002 LOCAL_16(0x2)
#define _local_0008 LOCAL_16(0x8)
#define _local_0006 LOCAL_32(0x6)
#define _local_000a LOCAL_16(0xa)
void func_00006b42__0622_0922(void)
{
  PUSH(BP);                                                // push   bp
  BP = SP;                                                 // mov    bp,sp
  SP -= 0xa;                                               // sub    sp,0xa
  PUSH(SI);                                                // push   si
  PUSH(DI);                                                // push   di
  DI = _param_0006;                                        // mov    di,WORD PTR ss:[bp+0x6]
  BX = DI;                                                 // mov    bx,di
  BX <<= 0x2;                                              // shl    bx,0x2
  AX = *PTR_16(DS, BX+0x946);                              // mov    ax,WORD PTR ds:[bx+0x946]
  DX = *PTR_16(DS, BX+0x944);                              // mov    dx,WORD PTR ds:[bx+0x944]
  *(u16*)((u8*)&_local_0006 + 2) = AX;                     // mov    WORD PTR ss:[bp-0x4],ax
  *(u16*)((u8*)&_local_0006 + 0) = DX;                     // mov    WORD PTR ss:[bp-0x6],dx
  SI = 0;                                                  // xor    si,si
  goto label_00006b93;                                     // jmp    0x6b93

 label_00006b64:
  LOAD_SEG_OFF(ES, BX, _local_0006);                       // les    bx,DWORD PTR ss:[bp-0x6]
                                                           // test   WORD PTR es:[bx],0x40
  if ((*PTR_16(ES, BX) & 0x40) == 0) goto label_00006b72;  // je     0x6b72
  *PTR_16(ES, BX) ^= 0x50;                                 // xor    WORD PTR es:[bx],0x50

 label_00006b72:
  LOAD_SEG_OFF(ES, BX, _local_0006);                       // les    bx,DWORD PTR ss:[bp-0x6]
  AX = *PTR_16(ES, BX);                                    // mov    ax,WORD PTR es:[bx]
  _local_0002 = AX;                                        // mov    WORD PTR ss:[bp-0x2],ax
                                                           // test   WORD PTR ss:[bp-0x2],0x14
  if ((_local_0002 & 0x14) != 0) goto label_00006b8e;      // jne    0x6b8e
                                                           // push   WORD PTR ss:[bp-0x4]
                                                           // push   bx
                                                           // callf  0x581:0x496
  F_vga_dyn_append(m, BX, (u16)(_local_0006>>16));         // add    sp,0x4

 label_00006b8e:
  // ... SNIP ...

The refactor iteration loop improves, but naturally we want more! In particular we want control-flow synthesis.

For example:

Assembly pattern Pseudo C Code
00006b80: jne 0x6b85
00006b82: mov ax,1
00006b85: ...
if (<something> != 0) { AX = 1 }
00006bcb: jmp 0x6bfc
00006bcd: ...
00006bfc: cmp si,0x42
00006bff: jl 0x6bcd
00006c01: ...
while (SI < 0x42) { ... }
0000754c: jmp WORD PTR cs:[bx+0x1c] switch (<val>) { ... }

These are not easy to infer in a decompiler architecture that lacks proper intermediate representations, so ultimately the decision was made to redesign from the bottom up. Which leads directly to the current form of dis86.

The inference problem

In many ways dis86 is just like any ordinary (forward) compiler. All compilers are tasked with translation from some source language to some destination language. In this case we have source=x86-16 and target=c-source-code. This reality means that many established compiler design techniques can be used.

However, on top of all the normal challenges of building a high-quality compiler, decompilers also have a complicated "inference problem" of trying to invert a function that is NOT a bijection! Control flow synthesis is a simple example of this:

Consider:

void my_function_1() {
  if (cond) {
    // .... do thing
    return;
  }
  // ... do another thing
}

And:

void my_function_1() {
  if (cond) {
    // .... do thing
  } else {
    // ... do another thing
  }
}

These can both be compiled to the same machine code. But which ONE would that machine-code decompile to?

The above is a very trivial example, but it demonstrates the principle of trying to invert a function that is not invertible. In reality, these problems can get very complicated quickly.

Here's a more practical example you find in much x86-16 code:

000079c0 (074b:0510) |  55                    push   bp;
000079c1 (074b:0511) |  8b ec                 mov    bp,sp;
000079cb (074b:051b) |  9a fc 04 81 05        call   0x581:0x4fc;
000079d0 (074b:0520) |  5d                    pop    bp;
000079d1 (074b:0521) |  cb                    retf;

Does this function return a value? If so, where does it come from? Well, the calling conventions on this system return u16 in AX and u32 in DX:AX. But neither is set here. So is it a void function?

Maybe.

It could be:

void func_074b_0510() {
  func_0581_04fc();
}
void func_0581_04fc() {
  // ... body ...
}

Or it could be:

u16 func_074b_0510() {
  return func_0581_04fc();
}
u16 func_0581_04fc() {
  // ... body ...
}

Or it could be:

u32 func_074b_0510() {
  return func_0581_04fc();
}
u32 func_0581_04fc() {
  // ... body ...
}

How do we decide? And this is just one case. A solution for this (e.g. whole program analysis) may not be sufficient for all such ambiguities. In the limit, complexity dominates.

Many decompilers have to depend on various heuristics and assumptions to deal with the issue. In this sense, dis86 is no different. But, as a principle, we value producing code that executes correctly over code that maximizes readability. For the above case we'll depend on manual annotation unless a distinction can be made with certainty. Over time, we hope to improve this continually and/or make inference good at suggesting solutions to the operator. But, ultimately the operator is the ultimate driver.

This is a trade-off that depends a lot on your application. If you're using a decompiler to simply read code and perform a security audit, then you may wish to trade some correctness for readability. If you find a security hole, you'll go read the machine-code directly to verify it anyways. But, if you're tring to semantically lift an entire binary into C code, then you instead worry about breaking some corner case in some infrequently executed code path.

Dis86 Design Overview

Finally we come to the overall internal structure of dis86.

Dis86 is a series of transform steps over well-defined data-structures:

  1. Disassembly: Raw binary is decoded into an in-memory representation of the instructions
  2. Basic blocks inference: Detect basic block boundaries based on jmp instructions and targets
  3. Build SSA IR [1]: Convert each instruction into one or more IR operations, inserting them into their respective basic-blocks
  4. Optimize IR: xor reduction, or reduction, phi reduction, branch simplifiction, stack ptr accumulation, common-subexpression-elimination, value-propagation, deadcode-elimination, etc
  5. Symbolize and forward memory: name parameters and locals, forward non-escaping memory, fuse upper/lower split addresses, etc
  6. Infer control flow: Detect loops, if-stmt, and switch-stmts. Schedule basic-block ordering/placement and jump fallthrough.
  7. Build an AST [2]: Build an AST representing C code using the final IR and the Control Flow
  8. Generate C Code: Walk the AST and emit the final C code

Each of these intermediate steps can be inspected, which is very useful for debugging and development.

Next Steps

In future posts, we'll drill down into specific components in gory detail.

If you'd like to learn more about this project, you can stay tuned to this blog via the rss feed (here). And, feel free to review out the source code yourself on github (here). If you find this interesting, you can also buy me a coffee (here).

Appendix