Note:This is something I wrote way back in April of 2009 on my blog at kc5vzm.com. While checking on google analytics, I noticed that the old link was still being referenced and decided that I should probably do something to keep it live, at least for the moment. So, here you go!
A friend of mine mentioned he had problem that I might be able to help him with. He had 575 data files on a somewhat lax format. In fact, the disparities between various data files would seem to indicate that the files were hand written rather than generated using a tool or specialized editor. The large number of file and relative inefficiency of dealing with them in their native formats brings up the question of alternative means to store the data. Said friend happens to be able to talk to a MySql database from within the Never Winter Nights scripting language. To make a long story short my friend asked for a tool to take his collection of 575 2DA files and convert them into an SQL script which could be loaded into a MySql database. So, that’s exactly what I gave him.
Besides being a reasonably simple task to accomplish, this project gave me a chance to take a look at a very simple Maven managed build. The request was for a java tool specifically (my first choice for something of this nature would have been PERL) to allow for ease of distribution and portability. C/C++ while reasonably portable if you have a sane build system, compiler, and know a touch about what you are doing require that the end user either be given binaries or have a similar sane build system, compiler, and know what you were attempting to do. PERL, which is significantly more portable and provides a reasonably common runtime environment, has a high barrier to entry on the Windows platform. There are some nice commercial installs and a few open source projects, such as Strawberry PERL or Vanilla PERL, but that is a little much for the average user. Java, the hulking beast that it is, runs just about anywhere with very little prodding, exists on most Windows based systems and is not very difficult to get on UNIX, and will do most of what was needed.
The goals of this particular project are relatively simple:
- Convert a valid Bioware 2DA file or directory of files into a reasonable SQL script.
- Create appropriate tables within the SQL database to match the contents of the 2DA file.
- Create a load script to load data into the created tables.
I had a few other things that I would have liked to and still may find a way to pull into the project latter on. Unfortunately, about four hours of the time that I meant to be coding was occupied in the vain attempt to counter act a rather nasty headache. When my head finally did clear up, time constraints left a few of my planed features, a nice ‘Wizard’ or ‘Druid’ like GUI, laying on the chopping room floor. Besides, command line apps are so much simpler to use in the long run and graphical apps are such a GUI mess. (:wink:)
For those who are terribly interested in seeing some code I will later deny to have written, you can take a look at the git repository. This is barely something that I would consider an alpha. Really and truthfully, it is more of a proof of concept in dire need or refactoring/design to make it more than trivially useful.
The 2DA file specification is available from Bioware via the following link, towards the bottom of the page. Unfortunately, the data I was given as an example, does not match the specification very well. The specification states:
- The first row contains a string specifying the 2DA version, ’2DA V2.0′ for the files I was given.
- The second line should contain nothing or an optional ‘DEFAULT:’ value clauses to be provided when the requested row does not exist.
- The third line contains column names separated by white space with the first column, an index column, ignored. While not explicitly stated, it appears that there must be at least one space to delineate the first column.
- From the fourth line each line consists of white space delimited columns.
- Values in the data set that contain spaces must be quoted with the ‘”‘ character.
- ’****’ is treated as a special data value similar to a database null.
- The index column is re-indexed by the existing game tools.
While significantly simplified, that about covers the relevant parts of the specification. In reality, the data set I was given as a test didn’t quite match this in several aspects. Here are a few of the more annoying things that I discovered:
- The furthest right data column appears to be allowed to contain spaces without quoting.
- Five ‘*’ characters are sometimes used rather than four.
- A row containing now column data but an index appears to be ignored.
If the data files I was handed are actually used as was described the parsing code used is very lax about data validation. Really and truly, this is more of a curse than anything else. Lack of data validation will lead to more problems in the long run than letting your developers build crappy content files. Real systems have a tendency to be annoying that way.
Parsing out this kind of data is not really all that difficult. Most of the parsing can be handled by the java.lang.String.split function and an amazingly short regular expression. Dealing with quoted strings is a little more difficult and required looping through a character array to parse properly. I’m pretty sure there is a better way to accomplish that part but, as stated, this code is little better than a proof of concept. Thankfully, none of the data set I was dealing with had any DEFAULT clauses to deal with and this code ignores that feature.
Once the entire dataset for a file is loaded into an internal representation, SQL generation is accomplished via simple string combination with a little sanitizing for the ‘ character and strange column names (‘%’). The current implementation takes this one step further and parse a directory of 2DA files, loading them all into memory at one time. This is definitely not an ideal situation and will eventually be corrected. The data set really should be parsed one file at a time to avoid potential issues with memory usage.
Unfortunately, I don’t have a good way to avoid loading the entirety of a single file right now. The code is attempting to make guesses about data types based on the data it finds in a given column. If all of the data in a column is parsable by Integer.parseInt then the column is considered to be an SQL INTEGER. If all of the data in a column is parsable by Double.paresDouble then that column is considered to be an SQL DECIMAL. If neither of those are true or every value in the column is the special null string then the column is considered to be a VARCHAR with a size equal to the longest value in the column. This is definitely very crude and possibly error prone. It does seem to produce a loadable script with the current test data set though.
Note:For anyone interested in actually using this monstrosity you are perfectly welcome to do so. Just keep in mind that I don’t make any promises about the quality of this code or the resultant data set. If you do come across a bug, feel free to drop a line to firstname.lastname@example.org with 2daLoader in the subject line and I’ll see what can be done. A binary version of the file can be downloaded from:
Since Jar files are in actuality just crafted zip files, my CMS is currently treating them as zip files. This means that you will need to change the file extension back to .jar when you download the file. I hope to have this little issue corrected sometime soon though.
To run the application, just start a command window and enter a command line similar to this:
java -jar 2daLoader-1.0-SNAPSHOT.jar <path to directory containing 2da files>
After pressing return, you should see a large number of logging lines scrolling by telling more than you ever wanted to know about what the application is doing. Keep in mind, this can take a long time for a large number of files. If it does not work as expected, make sure you have entered the correct file and path names. The current version will produce an output file name buildSql.sql in the current directory. One of the things I’d like to add latter is a means of specifying the output file.
In any event, have fun!Old Blog, Programming, Software Archaeology