Friday, February 6, 2009

Minimal Dalvik Executables

When I was a kid, the Windows era had yet to arrive. Floor-model PCs in shops ran MS-DOS. They'd often take steps to stop little imps like me messing around with their systems. I recall one where they even deleted the standard DOS utilities. But they had left COMMAND.COM unmodified which is enough for plenty of mischief. One can create executables via:
 $ copy con: hello.com
and then enter machine code by holding Alt while typing a number on the keypad. (It doesn't work on DOSBox by the way.) Some numbers, such as 0, must be avoided, but there are workarounds; the book from which I learned the trick, Ralph Burger's "Computer Viruses: a High-tech Disease", recommended first creating a program that read numbers in ASCII and output the corresponding binary, a sort of proto-assembler, which in turn could be used to create more complex executables.

I created a HELLO.COM program by typing the following, where the numbers are typed on the numberpad with the Alt key held:
 180 9 190 9 1 205 33 205 32 hello$
which is perhaps more recognizable to DOS programmers when converted to hex:
 B409 BA0901 CD21 CD20 hello$
It is the machine code for:
MOV AH, 09
MOV DX, 109h
INT 21h
INT 20h
This instructs DOS to print the '$'-terminated string at address 0x109 and then exit. When .COM files are loaded, DOS places their contents at address 0x100 (locations 0 to 0x100 are populated with miscellaneous information such as the command-line), hence the 0x109.

Thus with 9 bytes plus the size of the message, plus a byte for the terminator, we get a valid DOS program. We can type raw machine code from the keyboard to produce a full program.

Linux on x86 systems is less straightforward. The machine code is similar, but there must also be an ELF header (not strictly true: in some circumstances, the entire machine code can be placed within a truncated ELF header!), making binaries more tedious to construct by hand. However, certain kernel configurations allow different header types, some of them brief enough to be easily memorized and hand-rolled.

Minimalist Android

How about Android? An APK file is a ZIP file containing, among other files, the Dalvik executable in a DEX file. We're interested in minimizing this DEX file, whose full specification is well-documented.

All Android code must lie in some class, and Android applications typically inherit from a library class such as Activity, so we must also record information such as the class name, access rights, and inheritance. This, in turn, requires a strings section so we can store the class names, as well as invoked library functions and function signatures. Other overheads include the header and the map section. We can cheat a little by defining the UI with XML, so it won't contribute to the size of the DEX file.

After compiling to Java class files, most or all the R*.class files can be removed before making a DEX file,since the constants they contain are hardcoded in other class files by javac. For some projects, R$styleable must be retained, though this class appears to have since been removed from the public API.

I chose to write a program that parsed the DEX file from the HelloAndroid example and output only the necessary sections rather than attempt to create a DEX file from scratch. I found I could safely elide the source filenames, debug information and annotations. The result consists of 632 bytes, though we could trivially shrink this further by using a shorter name for the class.
0000000: 6465 780a 3033 3500 2f4f 153b 3623 8747  dex.035./O.;6#.G
0000010: 6d02 4697 5b1e 959d a8b1 2f0f 9c3a a14f m.F.[...../..:.O
0000020: 7802 0000 7000 0000 7856 3412 0000 0000 x...p...xV4.....
0000030: 0000 0000 f001 0000 0a00 0000 7000 0000 ............p...
0000040: 0500 0000 9800 0000 0300 0000 ac00 0000 ................
0000050: 0000 0000 0000 0000 0500 0000 d000 0000 ................
0000060: 0100 0000 f800 0000 6001 0000 1801 0000 ........`.......
0000070: 6201 0000 6a01 0000 6d01 0000 8501 0000 b...j...m.......
0000080: 9a01 0000 bc01 0000 bf01 0000 c301 0000 ................
0000090: c701 0000 d101 0000 0100 0000 0200 0000 ................
00000a0: 0300 0000 0400 0000 0500 0000 0500 0000 ................
00000b0: 0400 0000 0000 0000 0600 0000 0400 0000 ................
00000c0: 5401 0000 0700 0000 0400 0000 5c01 0000 T...........\...
00000d0: 0100 0000 0000 0000 0100 0200 0800 0000 ................
00000e0: 0300 0000 0000 0000 0300 0200 0800 0000 ................
00000f0: 0300 0100 0900 0000 0300 0000 0100 0000 ................
0000100: 0100 0000 0000 0000 ffff ffff 0000 0000 ................
0000110: e101 0000 0000 0000 0100 0100 0100 0000 ................
0000120: 0000 0000 0400 0000 7010 0000 0000 0e00 ........p.......
0000130: 0300 0200 0200 0000 0000 0000 0900 0000 ................
0000140: 6f20 0100 2100 1500 037f 6e20 0400 0100 o ..!.....n ....
0000150: 0e00 0000 0100 0000 0000 0000 0100 0000 ................
0000160: 0200 063c 696e 6974 3e00 0149 0016 4c61 .....I..La
0000170: 6e64 726f 6964 2f61 7070 2f41 6374 6976 ndroid/app/Activ
0000180: 6974 793b 0013 4c61 6e64 726f 6964 2f6f ity;..Landroid/o
0000190: 732f 4275 6e64 6c65 3b00 204c 636f 6d2f s/Bundle;. Lcom/
00001a0: 616e 6472 6f69 642f 6865 6c6c 6f2f 4865 android/hello/He
00001b0: 6c6c 6f41 6e64 726f 6964 3b00 0156 0002 lloAndroid;..V..
00001c0: 5649 0002 564c 0008 6f6e 4372 6561 7465 VI..VL..onCreate
00001d0: 000e 7365 7443 6f6e 7465 6e74 5669 6577 ..setContentView
00001e0: 0000 0001 0102 8180 0498 0203 01b0 0200 ................
00001f0: 0b00 0000 0000 0000 0100 0000 0000 0000 ................
0000200: 0100 0000 0a00 0000 7000 0000 0200 0000 ........p.......
0000210: 0500 0000 9800 0000 0300 0000 0300 0000 ................
0000220: ac00 0000 0500 0000 0500 0000 d000 0000 ................
0000230: 0600 0000 0100 0000 f800 0000 0120 0000 ............. ..
0000240: 0200 0000 1801 0000 0110 0000 0200 0000 ................
0000250: 5401 0000 0220 0000 0a00 0000 6201 0000 T.... ......b...
0000260: 0020 0000 0100 0000 e101 0000 0010 0000 . ..............
0000270: 0100 0000 f001 0000 ........
Only the 4 16-bit words at 0x128 and the 9 16-bit words at 0x140 are actual Dalvik bytecode. They belong to the constructor and the onCreate method respectively.

1 comment:

nobu said...

that was intresting, thanks :)