Forwards compatible, sort of

Whilst fooling around with Dr. Brain last week, I stumbled across a somewhat interesting piece of forwards compatibility. According to here, the typical Windows program can be executed ‘successfully’ under MS-DOS (yeah, yeah, citation needed). Apparently, Windows programs really contain two executable formats, one wrapped in another. The outermost format is the ‘MZ’ executable format, which is the original MS-DOS relocatable executable format, successor to the fixed (not relocatable) COM format.

The newer formats (‘NE’ and ‘PE’) sneak their way into the MZ format by expanding the header (there’s a field that controls the header size) and defining a new 32-bit field. The value in this field is the offset into the file at which the newer executable format begins.

In the MZ format, the executable code occurs directly after the header. Should the software be run on MS-DOS, anything in between the end of the MZ header and the start of the new header will be executed. For Dr. Brain, this was a short program that writes “This program requires Microsoft Windows.”, and then exits. (If you’re extra curious about this short program, see P.S. For the curious below.)

Being skeptical that the original MS-DOS header was appearing on new programs, I went to the source of truth, which of course, is the Word 2007 viewer. (On a tangent, I believe there are free viewers available for every MS Office file type. They’ve come in handy a few times.) Sure enough, it starts with the MZ header and has a small program that does essentially the same thing the Dr. Brain version does.

Is there really a concern that the Word 2007 viewer will be run on some version of MS-DOS or some sort of weird DOS mode? I suppose no one is losing any sleep over 128 bytes (actually 256 bytes in the Word 2007 viewer), but still.

P.S. For the curious

So, you’re curious, eh…

Here are the first 144 (0x90) bytes of the Dr. Brain executable:

0000000: 4d5a b801 0100 0000 0400 0000 ffff 0000 MZ..............
0000010: b800 0000 0000 0000 4000 0000 0000 0000 ........@.......
0000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000030: 0000 0000 0000 0000 0000 0000 8000 0000 ................
0000040: 0e1f ba0e 00b4 09cd 21b8 014c cd21 5468 ........!..L.!Th
0000050: 6973 2070 726f 6772 616d 2072 6571 7569 is program requi
0000060: 7265 7320 4d69 6372 6f73 6f66 7420 5769 res Microsoft Wi
0000070: 6e64 6f77 732e 0d0a 2400 0000 0000 0000 ndows...$.......
0000080: 4e45 053c a504 0900 0000 0000 0203 1000 NE.<............

The file clearly starts with 'MZ'. The next interesting part is the header size, which is a 16-bit value at offset 0x08 (recall that x86 is little endian, so '0400' is really '0004'). This value (for whatever reason) is in units of 'paragraphs', which are 16 bytes long. The header ends at 0x40 (4 paragraphs x 16 bytes = 64 bytes = 0x40). The final interesting bit is the 32-bit value at 0x3c, which is the field hijacked by newer executables. In this case it's 0x80, which is the offset in bytes of the start of the new executable. Notice that 'NE' starts at 0x80, confirming that this is an NE executable.
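If you'd rather not do the little-endian arithmetic by eye, here's a rough Python sketch that pulls the same two fields out of the bytes in the dump. The function name parse_mz_header and the DUMP constant are my own inventions for illustration; the only things it relies on are the offsets described above (0x08 and 0x3c).

```python
import struct

# The first 0x90 bytes of the executable, transcribed from the dump above.
DUMP_HEX = """
    4d5a b801 0100 0000 0400 0000 ffff 0000
    b800 0000 0000 0000 4000 0000 0000 0000
    0000 0000 0000 0000 0000 0000 0000 0000
    0000 0000 0000 0000 0000 0000 8000 0000
    0e1f ba0e 00b4 09cd 21b8 014c cd21 5468
    6973 2070 726f 6772 616d 2072 6571 7569
    7265 7320 4d69 6372 6f73 6f66 7420 5769
    6e64 6f77 732e 0d0a 2400 0000 0000 0000
    4e45 053c a504 0900 0000 0000 0203 1000
"""
DUMP = bytes.fromhex("".join(DUMP_HEX.split()))


def parse_mz_header(data):
    """Return (header size in bytes, offset of the newer header).

    Hypothetical helper: it only reads the two fields discussed in the text.
    """
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # 16-bit little-endian header size at 0x08, counted in 16-byte paragraphs.
    (header_paragraphs,) = struct.unpack_from("<H", data, 0x08)
    # 32-bit little-endian offset of the newer executable header at 0x3c.
    (new_header_offset,) = struct.unpack_from("<I", data, 0x3C)
    return header_paragraphs * 16, new_header_offset


header_size, new_header_offset = parse_mz_header(DUMP)
signature = DUMP[new_header_offset:new_header_offset + 2]
print(hex(header_size), hex(new_header_offset), signature)  # 0x40 0x80 b'NE'
```

To try it on a real file, read the first few hundred bytes with open(path, "rb") and pass them in instead of DUMP.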

So, what exactly does the 64-byte program that's squeezed between 0x40 and 0x80 do? Not much. Here's the relevant disassembly:

00000000  0E                push cs
00000001  1F                pop ds
00000002  BA0E00            mov dx,0xe
00000005  B409              mov ah,0x9
00000007  CD21              int 0x21
00000009  B8014C            mov ax,0x4c01
0000000C  CD21              int 0x21

The first two instructions set up the data segment to be the same as the code segment. The next three appear to be setting up and making a call into DOS via a software interrupt. Not being intimately familiar with DOS system calls (hopefully I'm not the only one), I referred to Ralf Brown's Interrupt List. (Fans of web sites circa 1995 will appreciate the 'RBIL' link.) It is impressively comprehensive, and the desired information was there (INTERRUP.F):

AH = 09h
DS:DX -> '$'-terminated string
Return: AL = 24h (the '$' terminating the string, despite official docs which
state that nothing is returned) (at least DOS 2.1-7.0 and
Notes: ^C/^Break are checked, and INT 23 is called if either pressed
standard output is always the screen under DOS 1.x, but may be
redirected under DOS 2+

So, this call will write to the console the string starting at 0x4e in the original dump up to (but not including) the '$' at 0x78 in the dump.
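To double-check those offsets, here's a small Python sketch that mimics what the DOS call does with the string: start at DS:DX and stop at the first '$'. It assumes the stub is loaded with its first byte at offset 0 of the code segment (so stub offset 0x0e is file offset 0x4e), and the names STUB and FILE_OFFSET are mine, not anything official.

```python
# The 64-byte stub between 0x40 and 0x80, transcribed from the dump.
STUB_HEX = """
    0e1f ba0e 00b4 09cd 21b8 014c cd21 5468
    6973 2070 726f 6772 616d 2072 6571 7569
    7265 7320 4d69 6372 6f73 6f66 7420 5769
    6e64 6f77 732e 0d0a 2400 0000 0000 0000
"""
STUB = bytes.fromhex("".join(STUB_HEX.split()))
FILE_OFFSET = 0x40  # where the stub begins in the file (end of the MZ header)

# Byte 2 of the stub is opcode 0xBA (mov dx, imm16); the next two bytes
# are the little-endian operand, i.e. the value loaded into DX.
if STUB[2] != 0xBA:
    raise ValueError("expected 'mov dx, imm16' at stub offset 2")
dx = int.from_bytes(STUB[3:5], "little")

# INT 21h / AH=09h prints from DS:DX up to, but not including, the '$'.
end = STUB.index(b"$", dx)
message = STUB[dx:end].decode("ascii")

print(hex(FILE_OFFSET + dx), hex(FILE_OFFSET + end))  # 0x4e 0x78
print(repr(message))  # 'This program requires Microsoft Windows.\r\n'
```

Note that the carriage return and line feed are part of what gets printed; only the '$' is swallowed.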

The final two instructions are making another syscall. The RBIL documents this call also (INTERRUP.G):

AH = 4Ch
AL = return code
Return: never returns
Notes: unless the process is its own parent
(see #01378 [offset 16h] at AH=26h), all open files are closed and
all memory belonging to the process is freed
all network file locks should be removed before calling this function
SeeAlso: AH=00h,AH=26h,AH=4Bh,AH=4Dh,INT 15/AH=12h/BH=02h,INT 20,INT 22
SeeAlso: INT 60/DI=0601h

So, it simply quits the program with a return value of 1.
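In case the jump from 'mov ax,0x4c01' to "return value of 1" isn't obvious: AX is a 16-bit register whose high byte is AH and low byte is AL. A tiny Python sketch of the split, using the operand bytes from the disassembly (the variable names are just for illustration):

```python
# Operand bytes of "B8 01 4C" (mov ax, imm16), in the order they sit in the file.
operand = bytes.fromhex("014c")

ax = int.from_bytes(operand, "little")  # little endian, so AX = 0x4c01
ah = ax >> 8    # high byte: the DOS function number
al = ax & 0xFF  # low byte: the return code handed back to DOS

print(hex(ax), hex(ah), al)  # 0x4c01 0x4c 1
```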
