I’ll have more on The Empty a little latter. This little Gem took so much to figure out that I had to write it up.
I’ve been working on a data extraction/conversion project for a friend of mine and ran into a problem dealing with something that looked like a floating point number. C might have been a reasonable choice, but would likely have cost me time while I am figuring out how the data records are actually stored. Perl has this nifty little command called pack that makes pulling data out of binary blocks relatively easy. Besides, its a great deal easier to work with a script that doesn’t segfault every time you get things just a little off.
The particular piece of software in question was written around the mid 80s to 90s era. From what I’ve been able to determine it uses some form of flat file database that is highly reminiscent of a C style struct. The file layout is based on fixed sized blocks. A file header, that I have yet to fully decode, occupies the first few blocks of the file, depending on the block size. This means that you have to skip a certain number of blocks that are clobbered by the header record at the start of the file to find the first data block. So far, I’ve been able to identify a couple of integer formats, a fixed length string format, a single character field, and the villain of this story, a decimal number field. I call it a villain because this thing has turned out to be a bear to extract. Everything I tried failed me until tonight, when I finally figured it out and nearly kicked myself at how simple it actually was.
Using od with a carefully selected range I was able to identify a four byte chunk of data in a record that appeared to correspond to the decimal field in the front end. Since I had a working copy the front end available, finding the field by modifying a record and seeing what changed was easy enough. First I gave the ‘f’ option in pack a try and received wildly varying numbers from data displayed in the original application. The ‘f’ option should have been correct to decode a 32bit IEEE-754 floating point number. From my understanding, if I was looking at a native floating point number then IEEE754 should have been what to expect.
Unfortunately, there was no way that the bytes could be arranged to come anywhere near a IEEE754 number. Given how the front end was layed out, I knew that the field was designed to contain a value between 99999.99 and 0.00. It would always display two decimal digits and always max out with 5 digits to the left of the decimal point. If it wasn’t encoded as a floating point number, maybe they had used an integer to encoded it a as a fixed point value? I tried using ‘V’ from pack with similar results to what I had seen when treating the field as a float.
Just to be thorough, I also tried a few variations of byte order and a trick that had been used to encode block sizes, using to 16bit numbers to store the value rather than a single 32bit number. In the file header, the size of a data block was stored using two little endian 16bit values arranged in the wrong order to decode as a little endian 32bit number. To decode the value, you had to pull out the two 16bit numbers using pack’s ‘v’ option, multiply the high order word by 0×10000, and add them together. No joy there.
Ok, so maybe what this was some sort of encoded number, say BCD? If it had been encoded using BCD then there should have been just enough bits to store 8 digits in an 8bit number. Each byte would contain 2 digits, one in the high nibble (4bits) and the other in the low nibble. More over, if 0×00000000 were 0 then 0×00000001 should have been 0.01 and 0.10 would have been 0×00000010. I gave it a try and immediately concluded that BCD was not he encoding being used. 0.01 came out as 0×00000004 instead and 0.10 came out as 0x0000002D.
Somewhere in here, I think my head exploded. Having data that made absolutely no sense with anything that seemed remotely plausible, I set to get a sample of values and their encodings for comparison. That lead me to the following:
The first thing to notice here is the step column. It represents the difference between the value on that row and the one right after it. If you don’t look too closely, the encoded values look like they have an alternating step between 4 and 5. At first glance, you should be able take the encoded value for any number in this sequence and get a 4 or 5 from the that value mod 9. I pulled up dc and gave it a shot with the value for 0.56:
$ dc 16 i FE 9 % p 2
This result meant that somewhere in the list it was doubling up with either to 5′s steps or two for steps’s happening right next to each other. If you look closely at the table, you’ll see that 0.26 and 0.27 ( along with a couple of other values) do indeed have a step of 5 one right after the other.
It didn’t hit me until a latter that I had seen a sequence of numbers like this one before. A long time ago, when the only programming language I knew was BASIC, I wrote a little program to draw lines, pixel by pixel on the screen. Of course, raster displays cannot display a completely clean line by their very nature, they always have a touch of stair step on any line that is not purely vertical or horizontal. If the slope of your line is not an integer value, then the individual line segments that make up a line could easily have lengths in the longer axis like these.
On a hunch, I took the highest number that could be displayed in the front end 99999.99 and rounded up to 100000.00 to make a clean number and setup a ratio with the encoded value for that number. The rounding decision was purely arbitrary, though it seemed to work out ok in the end. The net result, after reducing by common factors, was the fraction 1000/4536. Surprisingly, this worked. I tested several conversions, calculating out what the result should going from a decimal to the encoded value and an encoded value to a decimal an every case I checked came out correct. I coded it up in perl and ran a test extraction. IT WORKED!
It looks like whoever designed the application used that ratio to map the range of 0-100000.00 to a set of integer values. Figuring this one out had me giddy for the rest of the evening. Now if only I could crack the date format . . . . .Programming, Software Archaeology