Monday, October 6, 2008

Kathleen Fisher's "Programming Language Ideas Escape the Lab"

I learned a lot in Kathleen Fisher's talk "Programming Language Ideas Escape the Lab: A Declarative Data Description Language for Managing Ad hoc Data." In it, she describes PADS, a package that she and her colleagues at AT&T Labs, Princeton and Galois, Inc. have developed.

I'm an old dinosaur of a programmer who hasn't taken a real CS course in many years, so I had to run off and Google a lot of the terms she used. Most relevant was "Data Description Language" (or, as Wikipedia seems to prefer, "Data Definition Language"):
"A Data Definition Language (DDL) is a computer language for defining data structures." What a great idea!

Fisher described being faced with what she calls "ad hoc data". Many times, I have been confronted with a project that involved working with data like this. The data would consist of ASCII files that were difficult in one or more of the ways she described. Sometimes the files were so big that my preferred editor, vi, was brought to its knees. In my experience, and as she pointed out, this data is often Horrible and Ugly. It is not only ungainly, it can also be BUGGY.

I wish I had had access to PADS. I had no tools to deal with these files except grep and awk. Working with these tools could be sort of fun, since I felt like I had to be clever. But it certainly was not efficient!

On the PADS website is this:

"PADS is a system that simplifies processing ad hoc data sources. Its users can declaratively describe data sources and then use generated tools to understand, parse, translate, and format data."

There are many instances of "ad hoc data" and it comes from many different sources. Kathleen mentioned web log data, error and crash log data, records of train station data and various files full of government statistics. It seems clear that a tool that can automate the process of parsing such data should prove very useful.
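To make "declaratively describe" concrete for myself, I wrote a toy sketch in Python. This is not PADS syntax and not how PADS works internally; the field names, the regular expressions and the log format are all my own assumptions. It only shows the general idea: you write down what a record looks like, and a generic routine turns that description into a parser that also reports the lines that do not fit.

```python
import re

# Hypothetical, simplified description of one web-server log line.
# Each entry is (field name, regular expression for that field).
LOG_LINE = [
    ("host", r"\S+"),
    ("timestamp", r"\[[^\]]+\]"),
    ("request", r'"[^"]*"'),
    ("status", r"\d{3}"),
    ("size", r"\d+|-"),
]


def compile_description(fields):
    """Turn a field list into one anchored regex with named groups."""
    pattern = r"\s+".join(f"(?P<{name}>{rx})" for name, rx in fields)
    return re.compile(rf"^{pattern}$")


def parse_lines(fields, lines):
    """Return (records, errors); bad lines are collected, not silently dropped."""
    matcher = compile_description(fields)
    records, errors = [], []
    for number, line in enumerate(lines, start=1):
        match = matcher.match(line.strip())
        if match:
            records.append(match.groupdict())
        else:
            errors.append((number, line))
    return records, errors


# Made-up sample data, including one line that does not fit the description.
sample = [
    '10.0.0.1 [06/Oct/2008:10:00:00] "GET /index.html" 200 1043',
    '10.0.0.2 [06/Oct/2008:10:00:05] "GET /missing" 404 -',
    'corrupted line with no structure',
]
records, errors = parse_lines(LOG_LINE, sample)
```

The description is plain data, so the same list could in principle drive other generated tools (a formatter, a summary report) without rewriting any parsing code. That is the part grep and awk never gave me.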

Here is my understanding of what PADS does:
  • read in multiple instances of raw ad hoc data
  • generate a description:
    • chunk data
    • isolate tokens
    • do initial structure discovery
    • produce initial format refinement
    • calculate scoring function
  • iterate between the scoring function and the rewriting rules until we get a good data description.
The scoring function is a way of estimating how well a candidate data description parses the data. The package produces XML. Once you have the output from this algorithm, you can generate many tools to manipulate and understand the data.
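Here is a tiny Python sketch of how I picture the "score, rewrite, repeat" loop in the steps above. It is emphatically not the PADS algorithm: the token classes, the single rewriting rule (generalize one field to "any") and the cost numbers in the scoring function are all things I made up to illustrate the shape of the iteration.

```python
import re
from collections import Counter

# Isolate tokens: runs of digits, runs of letters, or single other characters.
TOKEN = re.compile(r"\d+|[A-Za-z]+|\S")


def token_class(token):
    if token.isdigit():
        return "int"
    if token.isalpha():
        return "word"
    return token  # punctuation stands for itself


def shape(line):
    """Chunk one line into a tuple of token classes."""
    return tuple(token_class(t) for t in TOKEN.findall(line))


def matches(description, line_shape):
    if len(description) != len(line_shape):
        return False
    return all(d == "any" or d == s for d, s in zip(description, line_shape))


def score(description, shapes):
    """Lower is better: each unparsed line costs 10, each 'any' field costs 1."""
    misses = sum(not matches(description, s) for s in shapes)
    return 10 * misses + description.count("any")


def rewrites(description):
    """The only rewriting rule in this sketch: generalize one field to 'any'."""
    for i, field in enumerate(description):
        if field != "any":
            yield description[:i] + ("any",) + description[i + 1:]


def infer(lines):
    shapes = [shape(line) for line in lines]
    # Initial structure discovery: start from the most common line shape.
    description = Counter(shapes).most_common(1)[0][0]
    best = score(description, shapes)
    while True:
        candidate = min(rewrites(description),
                        key=lambda d: score(d, shapes), default=None)
        if candidate is None or score(candidate, shapes) >= best:
            return description, best
        description, best = candidate, score(candidate, shapes)


# Made-up data: three lines end in a number, one ends in a word.
lines = ["alice,42", "bob,7", "carol,unknown", "dave,19"]
description, best = infer(lines)
```

The scoring function has to punish two things at once: lines the description cannot parse, and descriptions so loose that they would accept anything. Without the second penalty the loop would happily rewrite every field to "any".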
