Strange Numbers

I’ll have more on The Empty a little later. This little gem took so much to figure out that I had to write it up.

I’ve been working on a data extraction/conversion project for a friend of mine and ran into a problem dealing with something that looked like a floating point number. C might have been a reasonable choice, but would likely have cost me time while I was still figuring out how the data records are actually stored. Perl has a nifty little command called pack (and its inverse, unpack) that makes pulling data out of binary blocks relatively easy. Besides, it’s a great deal easier to work with a script that doesn’t segfault every time you get things just a little off.
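To give a flavor of what that looks like, here is a minimal sketch; the file name and record layout are invented for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical 64-byte record; the real file's fields differ.
    open my $fh, '<:raw', 'data.db' or die "open: $!";
    read $fh, my $record, 64 or die "short read: $!";

    # 'v' = 16-bit little-endian int, 'A10' = space-padded string,
    # 'a' = a single raw byte
    my ($id, $name, $flag) = unpack 'v A10 a', $record;
    printf "id=%d name=%s flag=%s\n", $id, $name, $flag;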

The particular piece of software in question was written somewhere in the mid-80s to 90s era. From what I’ve been able to determine, it uses some form of flat file database that is highly reminiscent of a C-style struct. The file layout is based on fixed-size blocks. A file header, which I have yet to fully decode, occupies the first few blocks of the file, depending on the block size. This means that you have to skip a certain number of blocks that are clobbered by the header record at the start of the file to find the first data block. So far, I’ve been able to identify a couple of integer formats, a fixed-length string format, a single character field, and the villain of this story, a decimal number field. I call it a villain because this thing has turned out to be a bear to extract. Everything I tried failed me until tonight, when I finally figured it out and nearly kicked myself at how simple it actually was.
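Skipping the header is simple enough once you know the geometry; the numbers below are invented, since the real ones come out of the still partly undecoded header:

    use strict;
    use warnings;
    use POSIX qw(ceil);

    my $block_size = 512;    # example value; actually read from the header
    my $header_len = 1300;   # invented for illustration

    # The header clobbers ceil(header_len / block_size) whole blocks.
    my $first_data_offset = ceil($header_len / $block_size) * $block_size;

    open my $fh, '<:raw', 'data.db' or die "open: $!";
    seek $fh, $first_data_offset, 0;   # the first data block starts here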

Using od with a carefully selected range, I was able to identify a four byte chunk of data in a record that appeared to correspond to the decimal field in the front end. Since I had a working copy of the front end available, finding the field by modifying a record and seeing what changed was easy enough. First I gave the ‘f’ option in unpack a try and got numbers wildly different from what the original application displayed. The ‘f’ option should have been the correct way to decode a 32-bit IEEE 754 floating point number, and from my understanding, if I was looking at a native floating point number then IEEE 754 is what I should have expected.

Unfortunately, there was no way the bytes could be arranged to come anywhere near an IEEE 754 number. Given how the front end was laid out, I knew that the field was designed to contain a value between 0.00 and 99999.99. It would always display two decimal digits and always max out at 5 digits to the left of the decimal point. If it wasn’t encoded as a floating point number, maybe they had used an integer to encode it as a fixed-point value? I tried using ‘V’ from unpack, with similar results to what I had seen when treating the field as a float.
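For reference, those two failed attempts amount to one-liners; $raw here stands in for the four bytes pulled out of the record:

    my ($as_float) = unpack 'f<', $raw;   # 32-bit little-endian IEEE 754
    my ($as_int)   = unpack 'V',  $raw;   # 32-bit little-endian unsigned
    my $fixed      = $as_int / 100;       # fixed-point guess, in hundredths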

Just to be thorough, I also tried a few variations of byte order, plus a trick that had been used to encode block sizes: using two 16-bit numbers to store the value rather than a single 32-bit number. In the file header, the size of a data block was stored using two little endian 16-bit values arranged in the wrong order to decode as a little endian 32-bit number. To decode the value, you had to pull out the two 16-bit numbers using unpack’s ‘v’ option, multiply the high order word by 0x10000, and add them together. No joy there.
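In Perl, that header trick looks something like this; the offset is a stand-in, and I’m assuming the high word comes first, which is exactly what makes a straight ‘V’ read come out wrong:

    my ($hi, $lo)  = unpack 'v v', substr($header, $offset, 4);
    my $block_size = $hi * 0x10000 + $lo;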

OK, so maybe this was some other sort of encoded number, say BCD? If it had been encoded using BCD, then four bytes would hold exactly 8 digits: each byte would contain 2 digits, one in the high nibble (4 bits) and the other in the low nibble. Moreover, if 0x00000000 were 0, then 0x00000001 should have been 0.01 and 0.10 should have been 0x00000010. I gave it a try and immediately concluded that BCD was not the encoding being used: 0.01 came out as 0x00000004 instead, and 0.10 came out as 0x0000002D.
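Checking for BCD is another one-liner, since unpack’s ‘H’ format splits bytes into nibbles; if the field really were BCD (and stored most significant digit first, which is an assumption), the nibbles would read back as the decimal digits:

    my $digits = unpack 'H8', $raw;   # would be '00000010' for 0.10 in BCD
    # Instead, 0.10 came back as '0000002d' -- definitely not BCD.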

Somewhere in here, I think my head exploded. Having data that made absolutely no sense with anything that seemed remotely plausible, I set out to collect a sample of values and their encodings for comparison. That led me to the following:

Value  Encoded  Step
0.00   00       4
0.01   04       5
0.02   09       4
0.03   0d       5
0.04   12       4
0.05   16       5
0.06   1b       4
0.07   1f       5
0.08   24       4
0.09   28       5
0.10   2d       4
0.11   31       5
0.12   36       4
0.13   3a       5
0.14   3f       5
0.15   44       4
0.16   48       5
0.17   4d       4
0.18   51       5
0.19   56       4
0.20   5a       5
0.21   5f       4
0.22   63       5
0.23   68       4
0.24   6c       5
0.25   71       4
0.26   75       5
0.27   7a       5
0.28   7f       4
0.29   83       5
0.30   88       4
0.31   8c       5
0.32   91       4
0.33   95       5
0.34   9a       4
0.35   9e       5
0.36   a3       4
0.37   a7       5
0.38   ac       4
0.39   b0       5
0.40   b5       4
0.41   b9       5
0.42   be       5
0.43   c3       4
0.44   c7       5
0.45   cc       4
0.46   d0       5
0.47   d5       4
0.48   d9       5
0.49   de       4
0.50   e2       5
0.51   e7       4
0.52   eb       5
0.53   f0       4
0.54   f4       5
0.55   f9       4
0.56   fe       *

The first thing to notice here is the step column: it is the difference between the encoded value on a row and the encoded value on the next row. If you don’t look too closely, the encoded values appear to alternate between steps of 4 and 5. If that alternation were strict, each pair of consecutive steps would add up to 9, so starting from zero you should be able to take the encoded value for any number in this sequence and get either a 0 or a 4 from that value mod 9. I pulled up dc and gave it a shot with the value for 0.56:

$ dc
16 i FE 9 % p
2

This result meant that somewhere in the list the steps were doubling up, with either two 5 steps or two 4 steps happening right next to each other. If you look closely at the table, you’ll see that 0.26 and 0.27 (along with a couple of other values) do indeed have a step of 5 one right after the other.

It didn’t hit me until a little later that I had seen a sequence of numbers like this one before. A long time ago, when the only programming language I knew was BASIC, I wrote a little program to draw lines, pixel by pixel, on the screen. Of course, raster displays cannot show a perfectly clean line by their very nature; they always have a touch of stair-step on any line that is not purely vertical or horizontal. If the slope of your line is not an integer value, then the individual segments that make up the line can easily have lengths along the longer axis exactly like these.
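You can see the same stair-step arithmetic with any non-integer slope; a slope of 4.536 happens to reproduce the table above, though I didn’t know that yet at this point in the story:

    my $slope = 4.536;
    my $prev  = 0;
    for my $n (1 .. 15) {
        my $cur = int($n * $slope);
        printf "%2d: %3d (step %d)\n", $n, $cur, $cur - $prev;
        $prev = $cur;
    }
    # steps print as 4,5,4,5,... with an occasional doubled 5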

On a hunch, I took the highest number that could be displayed in the front end, 99999.99, rounded it up to 100000.00 to make a clean number, and set up a ratio with the encoded value for that number. The rounding decision was purely arbitrary, though it seemed to work out OK in the end. The net result, after reducing by common factors, was the fraction 1000/4536. Surprisingly, this worked. I tested several conversions, calculating what the result should be going from a decimal to the encoded value and from an encoded value back to a decimal, and every case I checked came out correct. I coded it up in Perl and ran a test extraction. IT WORKED!
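A minimal sketch of the conversion: the 1000/4536 ratio maps raw values to hundredths, truncating on encode is my read of the observed values, and the subroutine names are invented:

    # encoded -> decimal: 1000/4536 gives hundredths
    sub decode_field {
        my ($raw) = @_;
        return sprintf '%.2f', ($raw * 1000 / 4536) / 100;
    }

    # decimal -> encoded: the inverse ratio, truncated
    sub encode_field {
        my ($value) = @_;
        return int($value * 100 * 4536 / 1000);
    }

    printf "%s\n",   decode_field(0xfe);   # prints 0.56
    printf "%02x\n", encode_field(0.56);   # prints fe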

It looks like whoever designed the application used that ratio to map the range 0.00-100000.00 onto a set of integer values. Figuring this one out had me giddy for the rest of the evening. Now if only I could crack the date format...

Converting 2DA Files to Something More Useful – RePost

Note: This is something I wrote way back in April of 2009 on my blog at kc5vzm.com. While checking Google Analytics, I noticed that the old link was still being referenced and decided that I should probably do something to keep it live, at least for the moment. So, here you go!

A friend of mine mentioned he had a problem that I might be able to help him with. He had 575 data files in a somewhat lax format. In fact, the disparities between the various data files would seem to indicate that they were hand written rather than generated using a tool or specialized editor. The large number of files and the relative inefficiency of dealing with them in their native format bring up the question of alternative ways to store the data. Said friend happens to be able to talk to a MySQL database from within the Neverwinter Nights scripting language. To make a long story short, my friend asked for a tool to take his collection of 575 2DA files and convert them into an SQL script which could be loaded into a MySQL database. So, that’s exactly what I gave him.

Besides being a reasonably simple task to accomplish, this project gave me a chance to take a look at a very simple Maven-managed build. The request was for a Java tool specifically (my first choice for something of this nature would have been Perl) to allow for ease of distribution and portability. C/C++, while reasonably portable if you have a sane build system, a compiler, and know a touch about what you are doing, requires that the end user either be given binaries or have a similarly sane build system, compiler, and some idea of what you were attempting to do. Perl, which is significantly more portable and provides a reasonably common runtime environment, has a high barrier to entry on the Windows platform. There are some nice commercial installers and a few open source projects, such as Strawberry Perl or Vanilla Perl, but that is a little much for the average user. Java, the hulking beast that it is, runs just about anywhere with very little prodding, exists on most Windows systems, is not very difficult to get on UNIX, and will do most of what was needed.

The goals of this particular project are relatively simple:

  • Convert a valid Bioware 2DA file or directory of files into a reasonable SQL script.
  • Create appropriate tables within the SQL database to match the contents of the 2DA file.
  • Create a load script to load data into the created tables.

I had a few other things that I would have liked to add and still may find a way to pull into the project later on. Unfortunately, about four hours of the time that I meant to spend coding were occupied in a vain attempt to counteract a rather nasty headache. When my head finally did clear up, time constraints left a few of my planned features, like a nice ‘Wizard’ or ‘Druid’ style GUI, lying on the cutting room floor. Besides, command line apps are so much simpler to use in the long run, and graphical apps are such a GUI mess. ;)

For those who are terribly interested in seeing some code I will later deny having written, you can take a look at the git repository. This is barely something that I would consider an alpha. Really and truthfully, it is more of a proof of concept in dire need of refactoring/redesign to make it more than trivially useful.

The 2DA file specification is available from Bioware via the following link, towards the bottom of the page. Unfortunately, the data I was given as an example does not match the specification very well. The specification states:

  • The first row contains a string specifying the 2DA version, ‘2DA V2.0’ for the files I was given.
  • The second line should contain nothing, or an optional ‘DEFAULT:’ clause giving the value to be returned when a requested row does not exist.
  • The third line contains column names separated by white space, with the first column, an index column, ignored. While not explicitly stated, it appears that there must be at least one space to delineate that first column.
  • From the fourth line on, each line consists of white space delimited columns.
  • Values in the data set that contain spaces must be quoted with the ‘"’ character.
  • ‘****’ is treated as a special data value, similar to a database null.
  • The index column is re-indexed by the existing game tools.
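Put together, a minimal file in that shape looks something like this (the contents are invented for illustration):

    2DA V2.0

           LABEL    COST  DESC
    0      Dagger   2     "A small dagger"
    1      Sword    15    ****
    2      Spear    5     ****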

While significantly simplified, that about covers the relevant parts of the specification. In reality, the data set I was given as a test didn’t quite match this in several aspects. Here are a few of the more annoying things that I discovered:

  • The furthest right data column appears to be allowed to contain spaces without quoting.
  • Five ‘*’ characters are sometimes used rather than four.
  • A row containing no column data but an index appears to be ignored.

Since the data files I was handed don’t actually follow the specification as described, the parsing code is very lax about data validation. Really and truly, this is more of a curse than anything else: lack of data validation will lead to more problems in the long run than refusing to let your developers build crappy content files would. Real systems have a tendency to be annoying that way.

Parsing out this kind of data is not really all that difficult. Most of the parsing can be handled by the java.lang.String.split function and an amazingly short regular expression. Dealing with quoted strings is a little more difficult and required looping through a character array to parse properly. I’m pretty sure there is a better way to accomplish that part but, as stated, this code is little better than a proof of concept. Thankfully, none of the files in the data set I was dealing with had any DEFAULT clauses, so this code simply ignores that feature.
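For flavor, here is roughly the same lax tokenizing written in Perl (the language I would have reached for first); a quote-aware regex stands in for the character-array loop, and the names are invented:

    # Tokenize one 2DA data row: whitespace-separated fields,
    # where a double-quoted field may contain spaces.
    sub split_row {
        my ($line) = @_;
        my @fields;
        while ($line =~ /\G\s*(?:"([^"]*)"|(\S+))/gc) {
            push @fields, defined $1 ? $1 : $2;
        }
        return @fields;
    }

    my @cols = split_row(q{0 Dagger 2 "A small dagger"});
    # -> ('0', 'Dagger', '2', 'A small dagger')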

Once the entire dataset for a file is loaded into an internal representation, SQL generation is accomplished via simple string concatenation, with a little sanitizing for the ‘ character and strange column names (‘%’). The current implementation takes this one step further and parses a directory of 2DA files, loading them all into memory at one time. This is definitely not an ideal situation and will eventually be corrected. The data set really should be parsed one file at a time to avoid potential issues with memory usage.
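The sanitizing in question is about as basic as SQL escaping gets; sketched in Perl (subroutine name invented):

    # Quote a value for an SQL script: double any embedded single quotes.
    sub sql_quote {
        my ($value) = @_;
        (my $escaped = $value) =~ s/'/''/g;
        return "'$escaped'";
    }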

Unfortunately, I don’t have a good way to avoid loading the entirety of a single file right now. The code is attempting to guess data types based on the data it finds in a given column. If all of the data in a column is parsable by Integer.parseInt, then the column is considered to be an SQL INTEGER. If all of the data in a column is parsable by Double.parseDouble, then the column is considered to be an SQL DECIMAL. If neither of those is true, or every value in the column is the special null string, then the column is considered to be a VARCHAR with a size equal to the longest value in the column. This is definitely very crude and possibly error prone, but it does seem to produce a loadable script with the current test data set.
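The heuristic is simple enough to sketch out; here it is in Perl, with regex checks standing in for Integer.parseInt and Double.parseDouble:

    # Guess an SQL column type from every value seen in a column.
    sub guess_type {
        my @values = grep { $_ ne '****' } @_;   # skip the null marker
        return 'INTEGER' if @values && !grep { !/^-?\d+$/ } @values;
        return 'DECIMAL' if @values && !grep { !/^-?\d*\.?\d+$/ } @values;
        my ($max) = sort { $b <=> $a } map { length } @_;
        return "VARCHAR($max)";
    }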

Note: For anyone interested in actually using this monstrosity, you are perfectly welcome to do so. Just keep in mind that I don’t make any promises about the quality of this code or the resultant data set. If you do come across a bug, feel free to drop a line to emeraldd.chris@gmail.com with 2daLoader in the subject line and I’ll see what can be done. A binary version of the file can be downloaded from:

2daLoader-1.0-SNAPSHOT-jar-with-dependencies

Since JAR files are in actuality just specially crafted zip files, my CMS is currently treating them as zip files. This means that you will need to change the file extension back to .jar when you download the file. I hope to have this little issue corrected sometime soon.

To run the application, just start a command window and enter a command line similar to this:

 java -jar 2daLoader-1.0-SNAPSHOT.jar <path to directory containing 2da files>

After pressing return, you should see a large number of logging lines scroll by, telling you more than you ever wanted to know about what the application is doing. Keep in mind, this can take a long time for a large number of files. If it does not work as expected, make sure you have entered the correct file and path names. The current version will produce an output file named buildSql.sql in the current directory. One of the things I’d like to add later is a means of specifying the output file.

In any event, have fun!
