Reversing a Mystery Function

One of my hobbies is reverse-engineering old binaries. I even wrote a custom decompiler to help me do it (here). Reverse-engineering is something like doing a crossword puzzle. You start with only a few small hints and each hint you solve gives you more information about the other hints. Eventually, the entire picture comes together. Here we have an excellent example of the process. And the result makes for a fun discovery.

The machine code

We start with this weird-looking x86-16 machine code.

Any bets on what it does?

0000:03b2:  59                      pop    cx
0000:03b3:  0e                      push   cs
0000:03b4:  51                      push   cx
0000:03b5:  33 c9                   xor    cx,cx
0000:03b7:  eb 16                   jmp    0000:03cf
0000:03b9:  59                      pop    cx
0000:03ba:  0e                      push   cs
0000:03bb:  51                      push   cx
0000:03bc:  b9 01 00                mov    cx,0x1
0000:03bf:  eb 0e                   jmp    0000:03cf
0000:03c1:  59                      pop    cx
0000:03c2:  0e                      push   cs
0000:03c3:  51                      push   cx
0000:03c4:  b9 02 00                mov    cx,0x2
0000:03c7:  eb 06                   jmp    0000:03cf
0000:03c9:  59                      pop    cx
0000:03ca:  0e                      push   cs
0000:03cb:  51                      push   cx
0000:03cc:  b9 03 00                mov    cx,0x3
0000:03cf:  55                      push   bp
0000:03d0:  56                      push   si
0000:03d1:  57                      push   di
0000:03d2:  8b ec                   mov    bp,sp
0000:03d4:  8b f9                   mov    di,cx
0000:03d6:  8b 46 0a                mov    ax,WORD PTR ss:[bp+0xa]
0000:03d9:  8b 56 0c                mov    dx,WORD PTR ss:[bp+0xc]
0000:03dc:  8b 5e 0e                mov    bx,WORD PTR ss:[bp+0xe]
0000:03df:  8b 4e 10                mov    cx,WORD PTR ss:[bp+0x10]
0000:03e2:  0b c9                   or     cx,cx
0000:03e4:  75 08                   jne    0000:03ee
0000:03e6:  0b d2                   or     dx,dx
0000:03e8:  74 69                   je     0000:0453
0000:03ea:  0b db                   or     bx,bx
0000:03ec:  74 65                   je     0000:0453
0000:03ee:  f7 c7 01 00             test   di,0x1
0000:03f2:  75 1c                   jne    0000:0410
0000:03f4:  0b d2                   or     dx,dx
0000:03f6:  79 0a                   jns    0000:0402
0000:03f8:  f7 da                   neg    dx
0000:03fa:  f7 d8                   neg    ax
0000:03fc:  83 da 00                sbb    dx,0x0
0000:03ff:  83 cf 0c                or     di,0xc
0000:0402:  0b c9                   or     cx,cx
0000:0404:  79 0a                   jns    0000:0410
0000:0406:  f7 d9                   neg    cx
0000:0408:  f7 db                   neg    bx
0000:040a:  83 d9 00                sbb    cx,0x0
0000:040d:  83 f7 04                xor    di,0x4
0000:0410:  8b e9                   mov    bp,cx
0000:0412:  b9 20 00                mov    cx,0x20
0000:0415:  57                      push   di
0000:0416:  33 ff                   xor    di,di
0000:0418:  33 f6                   xor    si,si
0000:041a:  d1 e0                   shl    ax,0x1
0000:041c:  d1 d2                   rcl    dx,0x1
0000:041e:  d1 d6                   rcl    si,0x1
0000:0420:  d1 d7                   rcl    di,0x1
0000:0422:  3b fd                   cmp    di,bp
0000:0424:  72 0b                   jb     0000:0431
0000:0426:  77 04                   ja     0000:042c
0000:0428:  3b f3                   cmp    si,bx
0000:042a:  72 05                   jb     0000:0431
0000:042c:  2b f3                   sub    si,bx
0000:042e:  1b fd                   sbb    di,bp
0000:0430:  40                      inc    ax
0000:0431:  e2 e7                   loop   0000:041a
0000:0433:  5b                      pop    bx
0000:0434:  f7 c3 02 00             test   bx,0x2
0000:0438:  74 06                   je     0000:0440
0000:043a:  8b c6                   mov    ax,si
0000:043c:  8b d7                   mov    dx,di
0000:043e:  d1 eb                   shr    bx,0x1
0000:0440:  f7 c3 04 00             test   bx,0x4
0000:0444:  74 07                   je     0000:044d
0000:0446:  f7 da                   neg    dx
0000:0448:  f7 d8                   neg    ax
0000:044a:  83 da 00                sbb    dx,0x0
0000:044d:  5f                      pop    di
0000:044e:  5e                      pop    si
0000:044f:  5d                      pop    bp
0000:0450:  ca 08 00                retf   0x8
0000:0453:  f7 f3                   div    dx,ax,bx
0000:0455:  f7 c7 02 00             test   di,0x2
0000:0459:  74 01                   je     0000:045c
0000:045b:  92                      xchg   dx,ax
0000:045c:  33 d2                   xor    dx,dx
0000:045e:  eb ed                   jmp    0000:044d

Entry stubs

A first observation is that we have a series of "entry stubs" of the form:

0000:03b2:  59                      pop    cx
0000:03b3:  0e                      push   cs
0000:03b4:  51                      push   cx
0000:03b5:  <....>                  <set cx to some value>
0000:03b7:  eb 16                   jmp    0000:03cf

In addition, elsewhere in the binary we have all the following calls:

call  0000:03b2
callf 0000:03b5
call  0000:03b9
callf 0000:03bc
call  0000:03c1
callf 0000:03c4
call  0000:03c9
callf 0000:03cc

But no calls to 0000:03cf directly. Instead these 8 stubs are different entry points for the core routine implemented from 0000:03cf.

This part of each stub is a "near call entry" for a "far call function":

0000:03b2:  59                      pop    cx
0000:03b3:  0e                      push   cs
0000:03b4:  51                      push   cx

Basically: (1) save near-call return offset to CX, (2) push the code-segment CS, (3) push offset back to stack

This sequence allows the caller to perform a near call (call) and the implementation function to return with a far return (retf).

If this is confusing, just ignore it. Near and far calls are a special beast specific only to x86-16.

So, we have 4 actual "versions" of the call:

Mode (CX) Near Call Far Call
0 0000:03b2 0000:03b5
1 0000:03b9 0000:03bc
2 0000:03c1 0000:03c4
3 0000:03c9 0000:03cc

Core Implementation Inspection

The core routine is implemented from 0000:03cf to 0000:0460. We turn our attention to that section now.

After further inspection, it seems that this is probably a hand-coded or hand-optimized assembly routine:

  1. Passing an argument in CX is not part of the normal calling-conventions
  2. Typical function prologue is push BP; mov SP,BP with no other stack operations between. This causes the first parameter to be loaded from SS:BP+0xa rather than the more typical SS:BP+0x6
  3. Almost halfway through the function, it starts using the frame-pointer BP as an ordinary register with mov BP,CX. Emitting a frame-pointer is a well-known optimization, but clobbering the frame-pointer halfway through is.. umm.. lesser known.

Here's an interesting sequence:

0000:041a:  d1 e0                   shl    ax,0x1
0000:041c:  d1 d2                   rcl    dx,0x1
0000:041e:  d1 d6                   rcl    si,0x1
0000:0420:  d1 d7                   rcl    di,0x1

The rcl instruction rotates left including the Carry Flag (CF). So this is shifting most-significant bits out of one register and into the least-significant register. Curious.

Finally, we see these checks early in the function:

0000:03e6:  0b d2                   or     dx,dx
0000:03e8:  74 69                   je     0000:0453
0000:03ea:  0b db                   or     bx,bx
0000:03ec:  74 65                   je     0000:0453

Jumping to:

0000:0453:  f7 f3                   div    dx,ax,bx
0000:0455:  f7 c7 02 00             test   di,0x2
0000:0459:  74 01                   je     0000:045c
0000:045b:  92                      xchg   dx,ax
0000:045c:  33 d2                   xor    dx,dx
0000:045e:  eb ed                   jmp    0000:044d

Which just returns:

0000:044d:  5f                      pop    di
0000:044e:  5e                      pop    si
0000:044f:  5d                      pop    bp
0000:0450:  ca 08 00                retf   0x8

So, sometimes with just do a simple division and return. Curious.

Decompiling

Let's decompile it with dis86 into C code. After doing a little manual cleanup, we have:

HYDRA_FUNC(mystery_function)
{
  #define _param_0006 *PTR_16(SS, SP + 0x4)
  #define _param_0008 *PTR_16(SS, SP + 0x6)
  #define _param_000a *PTR_16(SS, SP + 0x8)
  #define _param_000c *PTR_16(SS, SP + 0xa)

  u16 ax_2, dx_2, bx_2, cx_2, dx_25, ax_19, di_18, bx_17, dx_24, ax_18, di_8, bp_3, cx_14, di_10, si_4, dx_10, ax_5, ax_6, dx_11, si_5, di_11, di_19, si_14, ax_20, cx_12, ax_21, dx_28, bx_12, dx_16, ax_12, ax_22;
  u16 tmp_0, tmp_1;

  u32 tmp_u32;
  u64 tmp_u64;

  ax_2 = _param_0006;
  dx_2 = _param_0008;
  bx_2 = _param_000a;
  cx_2 = _param_000c;

  if (cx_2 != 0) {
    goto addr_03ee;
  }

  if (dx_2 == 0 || bx_2 == 0) {
    tmp_0 = MAKE_32(dx_2, ax_2) / bx_2;
    tmp_1 = MAKE_32(dx_2, ax_2) % bx_2;
    if ((CX & 2) == 0) {
      ax_22 = tmp_0;
    } else {
      ax_22 = tmp_1;
    }
    dx_16 = 0;
    ax_12 = ax_22;
    goto ret;
  } else {
    goto addr_03ee;
  }

 addr_03ee:;
  if ((CX & 1) != 0) {
    bx_17 = bx_2;
    dx_24 = dx_2;
    ax_18 = ax_2;
    di_8 = CX;
    bp_3 = cx_2;
  } else {
    if (((dx_2 >> 15) != 0) == 0) {
      dx_25 = dx_2;
      ax_19 = ax_2;
      di_18 = CX;
    } else {
      /* MANUALLY IMPLEMENTED: ... 32-bit negation impl ....
         0000:03f8: f7 da                   neg    dx
         0000:03fa: f7 d8                   neg    ax
         0000:03fc: 83 da 00                sbb    dx,0x0
      */
      tmp_u32 = -MAKE_32(dx_2, ax_2);
      dx_25 = (u16)(tmp_u32 >> 16);
      ax_19 = (u16)tmp_u32;

      di_18 = CX | 12;
    }

    if (((cx_2 >> 15) != 0) == 0) {
      bx_17 = bx_2;
      dx_24 = dx_25;
      ax_18 = ax_19;
      di_8 = di_18;
      bp_3 = cx_2;
    } else {
      /* MANUALLY IMPLEMENTED: ... 32-bit negation impl ....
         0000:0406: f7 d9                   neg    cx
         0000:0408: f7 db                   neg    bx
         0000:040a: 83 d9 00                sbb    cx,0x0
         0000:0410: 8b e9                   mov    bp,cx
      */
      tmp_u32 = -MAKE_32(cx_2, bx_2);
      bp_3 = (u16)(tmp_u32 >> 16);
      bx_17 = (u16)tmp_u32;

      dx_24 = dx_25;
      ax_18 = ax_19;
      di_8 = di_18 ^ 4;
    }
  }

  cx_14 = 32;
  di_10 = 0;
  si_4 = 0;
  dx_10 = dx_24;
  ax_5 = ax_18;

  while (1) {
  addr_041a:;
    /* MANUALLY IMPLEMENTED: ... 64-bit left-shifting ...
       0000:041a:   d1 e0                   shl    ax,0x1
       0000:041c:   d1 d2                   rcl    dx,0x1
       0000:041e:   d1 d6                   rcl    si,0x1
       0000:0420:   d1 d7                   rcl    di,0x1
    */
    tmp_u64 = MAKE_64(MAKE_32(di_10, si_4), MAKE_32(dx_10, ax_5));
    tmp_u64 <<= 1;
    // unpack back to 16-bit registers
    ax_6  = (u16)(tmp_u64 >> 0);
    dx_11 = (u16)(tmp_u64 >> 16);
    si_5  = (u16)(tmp_u64 >> 32);
    di_11 = (u16)(tmp_u64 >> 48);

    if (di_11 < bp_3) {
      di_19 = di_11;
      si_14 = si_5;
      ax_20 = ax_6;
      goto addr_0431;
    }
    if (di_11 > bp_3) {
      goto addr_042c;
    }
    if (si_5 < bx_17) {
      di_19 = di_11;
      si_14 = si_5;
      ax_20 = ax_6;
      goto addr_0431;
    }
  addr_042c:;
    /* MANUALLY IMPLEMENTED: ... 32-bit subtraction ...  (DI:SI) -= (BP:BX)
       0000:042c:   2b f3                   sub    si,bx
       0000:042e:   1b fd                   sbb    di,bp
    */
    tmp_u32 = MAKE_32(di_11, si_5) - MAKE_32(bp_3, bx_17);
    di_19 = (u16)(tmp_u32 >> 8);
    si_14 = (u16)(tmp_u32 >> 0);

    ax_20 = ax_6 + 1;
  addr_0431:;
    cx_12 = cx_14 - 1;
    if (cx_12 == 0) {
      goto addr_0433;
    }
    cx_14 = cx_12;
    di_10 = di_19;
    si_4 = si_14;
    dx_10 = dx_11;
    ax_5 = ax_20;
    goto addr_041a;
  }
addr_0433:;
  if ((di_8 & 2) == 0) {
    ax_21 = ax_20;
    dx_28 = dx_11;
    bx_12 = di_8;
    goto addr_0440;
  }
  ax_21 = si_14;
  dx_28 = di_19;
  bx_12 = di_8 >> 1;

 addr_0440:;
  if ((bx_12 & 4) == 0) {
    dx_16 = dx_28;
    ax_12 = ax_21;
  } else {
    /* MANUALLY IMPLEMENTED ... 32-bit negation impl .... -(DX:AX)
       0000:0446:   f7 da                   neg    dx
       0000:0448:   f7 d8                   neg    ax
       0000:044a:   83 da 00                sbb    dx,0x0
    */
    tmp_u32 = -MAKE_32(dx_28, ax_21);
    dx_16 = (u16)(tmp_u32 >> 16);
    ax_12 = (u16)tmp_u32;
  }

 ret:;
  DX = dx_16;
  AX = ax_12;
  RETURN_FAR();

  #undef _param_0006
  #undef _param_0008
  #undef _param_000a
  #undef _param_000c
}

NOTE: Dis86 can be a little verbose at times due to it's internal SSA form that creates a lot of unique variables as a side-effect. This can be a little jarring, but it's actually very helpful because it prevents variable aliasing that can easily thwart manual refactorings. Dis86 is a relatively new decompiler (as of April 2024), so this should improve a lot with more time.

Refactoring

With more cleanup we can reach the following version. Note that I tried to retain the connection to registers and the overall machine-level state-transfers.

From here, you should be able to understand what the function does:

#define DIVMOD(quot, rem, dividend, divider) do { \
      quot = dividend / divider;                    \
      rem  = dividend % divider;                    \
  } while(0)

#define SHUFFLE_LEFT(high_32, low_32) do { \
    u64 tmp_u64 = MAKE_64(high_32, low_32); \
    tmp_u64 <<= 1; \
    high_32 = (u32)(tmp_u64 >> 32); \
    low_32 = (u32)(tmp_u64 >> 0); \
  } while (0)

HYDRA_FUNC(mystery_function)
{
  // Mode param is passed in CX
  u16 cx_mode = CX;

  // Move mode to DI (to make room for params)
  u16 di_mode = cx_mode;

  // Load parameters
  u16 ax_param_1 = *PTR_16(SS, SP + 0x4); // Typically in AX
  u16 dx_param_2 = *PTR_16(SS, SP + 0x6); // Typically in DX
  u16 bx_param_3 = *PTR_16(SS, SP + 0x8); // Typically in BX
  u16 cx_param_4 = *PTR_16(SS, SP + 0xa); // Typically in CX

  // Number to divide (dividend) in DX:AX
  u32 dx_ax = MAKE_32(dx_param_2, ax_param_1);

  // Divisor in CX:BX
  u32 cx_bx = MAKE_32(cx_param_4, bx_param_3);

  // Special case: cx_param_4 == 0 implies that the divisor fits in 16-bits.
  // If dx_param_2 == 0, then both fit in 16-bits and we can just use the normal
  // 8086 divide instruction. If bx_param_3 == 0 then we have a divisor that is 0
  // and we want to execute the divide instruction to trigger a divide by zero fault.
  // Note: In both of these cases, using the unsigned division instruction is perfectly acceptable.

  if (cx_param_4 == 0 && (dx_param_2 == 0 || bx_param_3 == 0)) {
    // The DIV instruction computes both the quotient and remainder at once
    u16 ax_quot, dx_rem;
    DIVMOD(ax_quot, dx_rem, dx_ax, bx_param_3);

    // mode & 0x2 => return the remainder
    if (di_mode & 0x2) {
      dx_ax = (u32)dx_rem;
    } else {
      dx_ax = (u32)ax_quot;
    }
    goto ret;
  }

  // !(mode & 0x1) => perform signed division
  if (!(di_mode & 0x1)) {
    // DX:AX (num) is negative?
    if ((i32)dx_ax < 0) {
      // Negate it so we can treated it as unsigned
      dx_ax = -dx_ax;
      // Record in the mode so we can fix it up at the end
      di_mode |= 0xc; // negate both a quot and rem result
    }

    // DX:AX (div) is negative?
    if ((i32)cx_bx < 0) {
      // Negate it so we can treated it as unsigned
      cx_bx = -cx_bx;
      // Record in the mode so we can fix it up at the end
      di_mode ^= 0x4; // negate a quot result only if both num & div are not negative (xor cancels the negations)
    }
  }

  // The loop need to use the CX register for the LOOP instruction, so
  // we need to shuffle the contents elsewhere. The frame-pointer (BP) was selected.
  // This an interesting choice and a strong hint the routine was hand-coded because
  // typically it's an off-limits register (especially as this function is totally using
  // it as a proper frame-pointer during param loading!)
  u32 bp_bx = cx_bx;

  // DI is also shuffled to the stack for the duration of the loop

  // DI:SI is used a scratch register that we slowly shift bits into
  u32 di_si = 0;
  for (u16 cx_loop_cnt = 32; cx_loop_cnt != 0; cx_loop_cnt--) {
    /* ORIGINAL ... 64-bit left-shifting via rcl chaining through the Carry Flag
       0000:041a:   d1 e0                   shl    ax,0x1
       0000:041c:   d1 d2                   rcl    dx,0x1
       0000:041e:   d1 d6                   rcl    si,0x1
       0000:0420:   d1 d7                   rcl    di,0x1
    */

    // Shuffle bits in DI:SI and DX:AX left
    // The MSB that shifts out of DX:AX will shift into DI:SI
    SHUFFLE_LEFT(di_si, dx_ax);

    // Value os too small to divide it out.. keep shifting
    if (di_si < bp_bx) {
      continue;
    }

    // Great! We can subtract off a divisor and update the quotient in the newly available bit in DX:AX
    di_si -= bp_bx;
    dx_ax++;
  }

  // At this point, we have the quotient in DX:AX and the remainder in DI:SI
  // Notice that we no longer need BP:BX (the divisor)

  // Before the loop, the mode is pushed from DI, afterwards its popped into BX
  u16 bx_mode = di_mode;

  // mode & 0x2 => return the remainder
  if (bx_mode & 0x2) {
    dx_ax = di_si;
    bx_mode >>= 1;
  }

  // mode & 0x8 => negate remainder result
  // mode & 0x4 => negate quotient result
  if (bx_mode & 0x4) {
    /* ORIGINAL: ... 32-bit negation impl ....
       0000:0446:   f7 da                   neg    dx
       0000:0448:   f7 d8                   neg    ax
     0000:044a: 83 da 00                sbb    dx,0x0
    */
    dx_ax = -dx_ax;
  }

 ret:;
  /* Return 32-bits in DX:AX */
  DX = UPPER_16(dx_ax);
  AX = LOWER_16(dx_ax);
  RETURN_FAR();
}

Explanation

The mystery function implements divmod for 32-bits on an x86-16 machine!

The machine has a native DIV instruction:

Opcode Mnemonic Description
F7 /6 DIV r/m16 Unsigned divide DX:AX by r/m16, with result stored in AX = Quotient, DX = Remainder.

The instruction takes a 32-bit dividend in DX:AX and a 16-bit divider as r/m16, so this alone isn't sufficient to implement division between two 32-bit values as required by ordinary C code.

The core of this routine is the loop of 32, shifting one bit at a time. This is clever way to implement divmod.

Example:

Let's demonstrate the division algorithm by example, implementing 8-bit divisions.

let
  num = 43
  div = 3

we want to compute
  num / div = 14 rem 1

Convert to 8-bit binary:

num  = 00101011 (43)
div  = 00000011 (3)

Form a 16-bit <0,num> scratch value:

scratch = 00000000_00101011

Step 1a: Shift up scratch until the upper 8-bits are greater than or equal to div:

scratch = 00000101_01100000 (5 shifts)

Step 1b: Subtract div = 00000011 off upper and add 1 to lower:

scratch = 00000010_01100000 + 1

Step 2a: Shift up scratch until the upper 8-bits are greater than or equal to div:

scratch = 00000100_11000010 (1 shift)

Step 2b: Subtract div = 00000011 off upper and add 1 to lower:

scratch = 00000001_11000010 + 1

Step 3a: Shift up scratch until the upper 8-bits are greater than or equal to div:

scratch = 00000011_10000110 (1 shift)

Step 3b: Subtract div = 00000011 off upper and add 1 to lower:

scratch = 00000000_10000110 + 1

Step 4: Shift up scratch until done

scratch = 00000001_00001110 (shift 1)

Finally we unpack the scratch as <rem,quot>:

rem  = 00000001 (1)
quot = 00001110 (14)

And we have the desired result.

To see this another way, we are forming the quotient as a decomposition of shifts/powers-of-two:

quot = (shift 5 and add 1) + (shift 1 and add 1) + (shift 1 and add 1)
     = 2**(8-5) + 2**(8-5-1) + 2**(8-5-1-1)
     = 2**3 + 2**2 + 2**1
     = 8 + 4 + 2
     = 14

Or in a more traditional lens:

num_0, quot_0 = 43,  0
num_1, quot_1 = 43,  0  (try to subtract 3*(2**7) = 3*128 = 384)
num_2, quot_2 = 43,  0  (try to subtract 3*(2**6) = 3*64  = 192)
num_3, quot_3 = 43,  0  (try to subtract 3*(2**5) = 3*32  = 96)
num_4, quot_4 = 43,  0  (try to subtract 3*(2**4) = 3*16  = 48)
num_5, quot_5 = 19,  8  (try to subtract 3*(2**3) = 3*8   = 24)
num_6, quot_6 =  7, 12  (try to subtract 3*(2**2) = 3*4   = 12)
num_7, quot_7 =  1, 14  (try to subtract 3*(2**1) = 3*2   = 6)
num_8, quot_8 =  1, 14  (try to subtract 3*(2**0) = 3*1   = 3)

Try running through some examples yourself if you want to fully understand.

Core Algorithm

Our function implements the same algorithm as the example, but over 32-bits rather than 8-bits. This means it needs a 64-bit shift on a 16-bit machine :-)

The core algorithm is here:

// DI:SI is used a scratch register that we slowly shift bits into
u32 di_si = 0;
for (u16 cx_loop_cnt = 32; cx_loop_cnt != 0; cx_loop_cnt--) {
  // The MSB that shifts out of DX:AX will shift into DI:SI
  SHUFFLE_LEFT(di_si, dx_ax);

  // Value to small to divide it out.. keep shifting
  if (di_si < bp_bx) {
    continue;
  }

  // Great! We can subtract off a divisor and update the quotient in the newly available bit in DX:AX
  di_si -= bp_bx;
  dx_ax++;
}

Special Case

Let's return to the special case with the DIV instruction:

if (cx_param_4 == 0 && (dx_param_2 == 0 || bx_param_3 == 0)) {
  // The DIV instruction computes both the quotient and remainder at once
  u16 ax_quot, dx_rem;
  DIVMOD(ax_quot, dx_rem, dx_ax, bx_param_3);
  // ... SNIP ....
  goto ret;
}

This is actually two cases:

Condition Reason for using DIV
(CX:BX) == 0 Intentionally trigger a divide-by-zero exception
(DX:AX) and (CX:BX) both fit 16-bits Quotient and Remainder will both also be 16-bits

Modes

Finally we need to discuss the "modes"

There is a prologue section:

// !(mode & 0x1) => perform signed division
if (!(di_mode & 0x1)) {
  // DX:AX (num) is negative?
  if ((i32)dx_ax < 0) {
    // Negate it so we can treated it as unsigned
    dx_ax = -dx_ax;
    // Record in the mode so we can fix it up at the end
    di_mode |= 0xc; // negate both a quot and rem result
  }

  // DX:AX (div) is negative?
  if ((i32)cx_bx < 0) {
    // Negate it so we can treated it as unsigned
    cx_bx = -cx_bx;
    // Record in the mode so we can fix it up at the end
    di_mode ^= 0x4; // negate a quot result only if both num & div are not negative (xor cancels the negations)
  }
}

And there is an epilogue section:

// mode & 0x2 => return the remainder
if (bx_mode & 0x2) {
  dx_ax = di_si;
  bx_mode >>= 1;
}

// mode & 0x8 => negate remainder result
// mode & 0x4 => negate quotient result
if (bx_mode & 0x4) {
  /* ORIGINAL: ... 32-bit negation impl ....
     0000:0446: f7 da                   neg    dx
     0000:0448: f7 d8                   neg    ax
   0000:044a:   83 da 00                sbb    dx,0x0
  */
  dx_ax = -dx_ax;
}

Here's the meaning of the Mode bits:

Bit Value Meaning
1 = 1<<0 Perform Unsigned Division
2 = 1<<1 Return Remainder instead of Quotient
4 = 1<<2 Negate Quotient after computation
8 = 1<<3 Negate Remainder after computation

With this we can now also understand the entry stubs:

Mode (CX) Near Call Far Call Meaning
0 0000:03b2 0000:03b5 Compute Signed Division
1 0000:03b9 0000:03bc Compute Signed Remainder
2 0000:03c1 0000:03c4 Compute Unsigned Division
3 0000:03c9 0000:03cc Compute Unsigned Remainder

Interestingly, there's no mode for divmod. And amusingly, scanning my binary, I found at least one function that called both "Compute Unsigned Division" and "Compute Unsigned Remainder" for the same values. I suppose the issue is that C doesn't have an operator for unified divmod and the compiler here didn't implement an optimization for it.

Conclusion

And with that, we have a full understanding of the function. Who guessed it was a division implementation from the assembly listing? Let me know!.

I had a lot of fun dissecting this function. I'm always a little impressed by the old tricks that are hanging out in these ancient binaries: long forgotten. It's fun to dredge them up and repopularize them!

If you enjoyed this kind of thing, you can stay tuned to this blog via the rss feed (here). You can also buy me a coffee (here).

Have a grand day, and try not to get lost in the Binary Mines!