Ben's technical blog

Tue, 06 Jan 2009

Searching for invalid object code

Jérémy Bobbio suggested that I should explain how I looked for packages affected by a compiler bug (Debian bug 506713; gcc bug 38287). I don't claim that this is a particularly good way to do it, but here it is:

First, I identified a pattern to search for. Unfortunately I don't really understand the cause or fix for the bug, but I did have the example which led to this bug report: Debian bug 490999. The bad code was a stack-pointer-relative load immediately after a stack allocation (SPARC save instruction) where the offset was not adjusted for the stack allocation:

save  %sp, -112, %sp
ld  [ %sp + 0x40 ], %i5

I generalised this to:

save  %sp, offset1, %sp
ld  [ %sp + offset2 ], register

where offset1 + offset2 < 0. Of course, this may be valid if the intervening instructions include a restore, branch or store to the effective stack location that the last instruction loads from. I ended up allowing up to 10 intervening instructions and examining a disassembly to work out which cases were valid.

I looked up the instruction encodings for these two instructions. Thankfully SPARC is a RISC architecture, so the encodings are simple and regular.
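As a sketch of what that lookup yields (the function name here is mine, not from the original script), the SPARC v8 format-3 field layout explains the byte pattern searched for later: with rd = rs1 = %sp (register 14) and the immediate flag set, the top two big-endian bytes of "save %sp, imm, %sp" are always 9d e3, independent of the immediate.

```python
# Hedged sketch: assemble "save %sp, simm13, %sp" from the SPARC v8
# format-3 fields to show why 9d e3 is a stable two-byte prefix.
def encode_save_sp(simm13):
    """Encode 'save %sp, simm13, %sp' (rd = rs1 = %sp = r14, i = 1)."""
    OP = 0b10            # format 3 (arithmetic group)
    OP3_SAVE = 0b111100  # op3 for save
    SP = 14              # %sp is %o6, register 14
    return ((OP << 30) | (SP << 25) | (OP3_SAVE << 19) |
            (SP << 14) | (1 << 13) | (simm13 & 0x1FFF))

word = encode_save_sp(-112)
print(hex(word))                    # 0x9de3bf90
print(word.to_bytes(4, 'big')[:2])  # b'\x9d\xe3' -- the search prefix
```

Only the low 13 bits vary with the stack-frame size, so any "save %sp, imm, %sp" starts with the same two bytes.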

I took the dumb but effective approach of scanning entire files for this pattern rather than only scanning their code sections. This seemed to work - I got no false hits for non-code - but might not work for other patterns that could match ASCII text.

I wrote the scanning program in Python, which is my default choice of language unless I know it's going to be too slow. I was hoping to be able to read the code files into arrays, but unfortunately the Python array type only supports the native byte order (SPARC is big-endian and I was intending to run the scan on an x86 machine, which is little-endian). I tried reading into a tuple using struct.unpack, which does support explicit byte-ordering, but this used so much memory for larger files that the program swapped to a crawl. So finally I resorted to reading the file into a string, doing a string search for '\x9d\xe3', rejecting matches that weren't appropriately aligned, then unpacking and comparing the code words from the point of the string match. (In Python 3.0 I would have to use the bytes type for this, as str is a Unicode string type.)
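The scan described above might look something like this minimal sketch. The names, and the exact mask used to recognise "ld [ %sp + imm ], rd", are my reconstruction rather than the original code; as in the original, it only flags candidates for manual review of the disassembly.

```python
# Hedged sketch of the scan: find aligned 9d e3 prefixes in the raw
# bytes, decode offset1 from the save word, then look up to MAX_GAP
# instructions ahead for 'ld [%sp + offset2], rd' with
# offset1 + offset2 < 0.
import struct

SAVE_PREFIX = b'\x9d\xe3'
LD_SP_MASK = 0xC1FFE000   # op, op3, rs1 and i fields of 'ld [%sp+imm], rd'
LD_SP_BITS = 0xC003A000   # op=11, op3=000000 (ld), rs1=%sp, i=1
MAX_GAP = 10              # intervening instructions to look past

def sign13(x):
    """Sign-extend a 13-bit immediate."""
    return x - 0x2000 if x & 0x1000 else x

def find_candidates(data):
    """Yield (save_offset, ld_offset) pairs that need manual review."""
    pos = data.find(SAVE_PREFIX)
    while pos != -1:
        if pos % 4 == 0 and pos + 4 <= len(data):
            (save_word,) = struct.unpack_from('>I', data, pos)
            offset1 = sign13(save_word & 0x1FFF)
            for k in range(1, MAX_GAP + 2):
                at = pos + 4 * k
                if at + 4 > len(data):
                    break
                (word,) = struct.unpack_from('>I', data, at)
                if (word & LD_SP_MASK) == LD_SP_BITS:
                    offset2 = sign13(word & 0x1FFF)
                    if offset1 + offset2 < 0:
                        yield pos, at
        pos = data.find(SAVE_PREFIX, pos + 1)
```

For example, the two bad instructions quoted at the top assemble to 9d e3 bf 90 fa 03 a0 40, and scanning those eight bytes reports one candidate.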

So that's how I scanned single files. The next step was to find, unpack and scan all the SPARC shared libraries in the archive. (This particular code generation bug is understood to affect only PIC code, and that is normally used only in shared libraries.) I wrote functions to search Contents-sparc for shared library files - assumed to match the pattern ([^\s]*/lib[^/\s]+\.so(?:\.[^/\s]*)?) - and to parse Packages to find the filenames of the packages containing those files. The latter uses the debian_bundle.deb822 module from python-debian. The last key function downloads and unpacks a package using wget and dpkg-deb. I could have used the httplib module for downloading, but I correctly anticipated that I'd need to restart the script several times, so I wanted to cache the packages, which was easier to do with wget.
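A rough sketch of that archive-walking step, under some assumptions: the function names are mine, and a tiny stdlib stanza parser stands in for debian_bundle.deb822 so the sketch has no dependencies. The library pattern is the one quoted above; Contents lines are assumed to be "path  section/package[,...]".

```python
# Hedged sketch: map shared-library paths in Contents-sparc to package
# names, then map package names to archive filenames via Packages.
import re

LIB_RE = re.compile(r'([^\s]*/lib[^/\s]+\.so(?:\.[^/\s]*)?)')

def shared_libs(contents_lines):
    """Yield (path, package) pairs for shared libraries in Contents-sparc."""
    for line in contents_lines:
        fields = line.rsplit(None, 1)
        if len(fields) != 2:
            continue
        path, pkgs = fields
        if LIB_RE.fullmatch(path):
            for pkg in pkgs.split(','):
                # Contents lists packages as 'section/package'
                yield path, pkg.rsplit('/', 1)[-1]

def package_filenames(packages_text):
    """Map package name -> Filename from a Packages file.

    A stdlib stand-in for the deb822 parse: stanzas are separated by
    blank lines, fields by 'Name: value'; continuation lines ignored.
    """
    result = {}
    for stanza in packages_text.split('\n\n'):
        fields = dict(line.split(': ', 1) for line in stanza.splitlines()
                      if ': ' in line and not line.startswith(' '))
        if 'Package' in fields and 'Filename' in fields:
            result[fields['Package']] = fields['Filename']
    return result
```

The download-and-unpack step is then a couple of subprocess calls to wget and dpkg-deb -x, with wget's cached downloads making restarts cheap.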

So, that's the explanation. If you really want to see it, here's the code.

posted at: 00:53 | path: / | permanent link to this entry