Monday, October 6, 2008

Kathleen Fisher's "Programming Language Ideas Escape the Lab"

I learned a lot in Kathleen Fisher's talk "Programming Language Ideas Escape the Lab: A Declarative Data Description Language for Managing Ad hoc Data." In it, she describes PADS, a package that she and her colleagues at AT&T Labs, Princeton and Galois, Inc. have developed.

I'm an old dinosaur of a programmer who hasn't taken a real CS course in many years, so I had to run off and Google a lot of the terms she used. Most relevant was "Data Description Language" (or, as Wikipedia seems to prefer, "Data Definition Language"):
"A Data Definition Language (DDL) is a computer language for defining data structures." What a great idea!

Fisher described being faced with what she calls "ad hoc data". Many times, I have been confronted with a project that involved working with data like this. The data would consist of ASCII files that were difficult in one or more of the ways she described. Sometimes the files were so big that my preferred editor, vi, was brought to its knees. In my experience, and as she pointed out, this data is often Horrible and Ugly. It is not only ungainly, it can also be BUGGY.

I wish I had had access to PADS. I had no tools to deal with these files except grep and awk. Working with these tools could be sort of fun, since I felt like I had to be clever. But it certainly was not efficient!

On the PADS website is this:

"PADS is a system that simplifies processing ad hoc data sources. Its users can declaratively describe data sources and then use generated tools to understand, parse, translate, and format data."

There are many instances of "ad hoc data" and it comes from many different sources. Kathleen mentioned web log data, error and crash log data, records of train station data and various files full of government statistics. It seems clear that a tool that can automate the process of parsing such data should prove very useful.

Here is my understanding of what PADS does:
  • read in multiple instances of raw ad hoc data
  • generate a description
    • chunk the data
    • isolate tokens
    • do initial structure discovery
    • produce an initial format refinement
    • calculate the scoring function
  • iterate between the scoring function and rewriting rules until we get a good data description
The scoring function is a way of estimating how well your data description parses the data. The package produces XML. Once you have the output from this algorithm, you can generate many tools to manipulate and understand the data.
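
To make these steps a little more concrete, here is a toy sketch in Python. It is not PADS and does not use any PADS API; the token classes, the function names (`tokenize`, `discover_structure`, `score`) and the sample records are all made up for illustration. It assumes one record per line, treats the most common token pattern as the "description," and scores that description by the fraction of records it matches. The rewriting-rule iteration is left out.

```python
import re
from collections import Counter

# Hypothetical token classes for this sketch; a real tool would use richer ones.
TOKEN_PATTERNS = [
    ("INT", r"\d+"),
    ("WORD", r"[A-Za-z]+"),
    ("SPACE", r"\s+"),
    ("PUNCT", r"[^A-Za-z0-9\s]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize(chunk):
    """Isolate tokens: return the token class names for one chunk (one line)."""
    return tuple(m.lastgroup for m in TOKEN_RE.finditer(chunk))


def discover_structure(chunks):
    """Initial structure discovery: pick the most common token signature."""
    counts = Counter(tokenize(c) for c in chunks)
    return counts.most_common(1)[0][0]


def score(description, chunks):
    """Scoring function: fraction of chunks the description matches exactly."""
    matches = sum(1 for c in chunks if tokenize(c) == description)
    return matches / len(chunks)


# Made-up "ad hoc" records, one per line, including one buggy line.
records = [
    "200 GET /index",
    "404 GET /missing",
    "500 POST /login",
    "garbage line!!",
]

description = discover_structure(records)
print(description)                  # the inferred token pattern
print(score(description, records))  # 0.75: three of the four records match
```

A score below 1.0 is the signal that the description needs refining, or that some records are simply bad data.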

Thursday, October 2, 2008

Elizabeth Churchill's "Tools and Talk"

Well, I got to this talk a tiny bit after it had started. Hey, I had a good excuse! Like many other folks, I stayed for the end of Fran Allen's talk, not realizing it was running late.

The title of this talk was "Tools and Talk: Conversational Code and the Changing World of the Internet." Churchill is the Principal Research Scientist at Yahoo! Research. Her biography in our GHC program says she was originally a psychologist. Her presentation confirmed that she is still interested in the details of how people's minds work and how that explains their behavior on the internet. Her work is to help design tools that take advantage of this understanding so that they can do a better job of helping people get their questions answered.

She is interested in search tools of various kinds. Before describing several studies of different kinds of tools, she pointed out a couple of things which piqued my interest. The first is that when you start a new search on the internet, you are at a big disadvantage if you don't know the keywords that the community already discussing your topic uses in its conversations. I have had a glimmer of this myself a few times. I would start looking for, as Churchill said, "you know,... that thing that does this" as she made a gesture that might be showing somebody plunging a toilet or filling their bicycle tires with a hand pump. (Ahhh. I get it now! This is an example of ostension. This is a word that Churchill used a few times in her talk. I was unfamiliar with it and, at the end, had to ask her to spell it for me. I found this definition on a web site called ostension.org when I googled the word:

"What is ostension? The word 'ostension' comes from the Latin 'ostendere', to show. It was used by semiotician Umberto Eco to refer to moments in oral communication when, instead of using words, people substitute actions, such as putting a finger on your lips to indicate that someone should be quiet.")

So, of course, you can't type your gesture into the search engine. What do you do? You try a few search terms, and maybe you luck out and find a keyword that starts leading you to the right set of web sites. (I thought of this as having to know "the jargon". Actually, as a math student I found the same thing. If you know that when people say delta and epsilon, they almost always mean small quantities, it really makes reading a paper or listening to a technical conversation easier.)

Okay, enough with all these parenthetical points. I guess you can tell that I enjoyed finding out that there are folks who have analyzed the search process and discovered something that I almost knew from my own experience.

Churchill gave us some of the intriguing questions that researchers in these areas find themselves asking:

Where does a search session begin and end?
What happens when people reach dead ends?
When do people turn to social search/navigation?



Before I heard this talk, I didn't know about the idea of social search. This is from Wikipedia:

"Social search or a social search engine is a type of web search method that determines the relevance of search results by considering the interactions or contributions of users. When applied to web search this user-based approach to relevance is in contrast to established algorithmic or machine-based approaches where relevance is determined by analyzing the text of each document or the link structure of the documents."

But the definition Churchill was using seems to include things like wiki.answers.com and other sites where one can ask a question of the community, because she asked whether the people who use them are more social and pointed out that if you are new to a community, you have to learn how to ask a question. If you don't know the right keywords or the customs, your question may be ignored. If you are shy, you might just lurk and not ask your question.

Churchill then discussed some recent studies of developer networks and craft communities online. The first study she discussed was about internet tools for "seeking love", that is, dating sites. These sites are used by millions and make a lot of money. As in other searches, the person's mental model of what is going on in the search is applied. On a dating site, what people think is going on affects how they present themselves as they try to outwit what they believe are the rules of the site. The example she gave was of somebody who is 41 saying they are 39 because they believe they will be binned with older people if they are over 40. But as you go along you tweak your model and your behavior. And often, as you search, what you are searching for changes.

Okay, now I'm in trouble, because I lost the end of my notes due to an unfortunate computing accident! (I'm still experimenting with taking notes online. I think I have learned a thing or two. Maybe I'll devote a post to that later.) But I do remember that Elizabeth then talked about craft communities online. These are groups of people that share information about jewelry making or some other craft on the internet. There are web sites and search tools devoted to these different crafts.

She noted that for some very obscure crafts, there isn't much online. One problem is the idea that "If it's not on the web, it doesn't exist." This is not true, of course, and some dedicated folks try to make up for deficits by finding stuff offline and getting it online by scanning or other means.

Another problem with craft sites is that experts, when they describe the steps in their process, will not notice that they have left out the stuff that is most obvious to them. It is probably not obvious to the beginner. So some users of these sites might benefit from a tool that allowed them to make a copy of a contributor's directions and add their own annotations. They might add a note that says "I found out the hard way that this doesn't work unless ...." where ... is the missing information they discovered.

Towards the end of her talk, Churchill showed a graph of discussions in an online community. She noted how you could see at a glance that while some people were answering a lot of questions (there were a lot of connections between them and many different people), there were people "on the outside" who were not getting answers to their questions. What the researcher asks about this is "why aren't those people getting their questions answered and what can be done to improve the communication tool so that this doesn't happen?" I enjoyed seeing how this kind of graph could help the viewer understand so much about the nature of the communication that was going on.
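
For the programmers reading, here is a small Python sketch of the kind of question such a graph answers. It is only an illustration, not the analysis Churchill showed: the names, the `threads` data and the `summarize` function are all invented. It assumes each question is recorded as an (asker, answerer) pair, with `None` standing for a question nobody answered.

```python
from collections import defaultdict

# Hypothetical data: (asker, answerer) pairs; None means the question got no reply.
threads = [
    ("ann", "bob"),
    ("cat", "bob"),
    ("dan", "bob"),
    ("eve", None),
    ("fay", None),
    ("bob", "ann"),
]


def summarize(threads):
    """Return (hubs, outsiders) for a list of (asker, answerer) pairs.

    hubs: answerers ordered by how many different people they helped.
    outsiders: askers who never received any answer.
    """
    answered_by = defaultdict(set)  # answerer -> set of people they helped
    askers = set()
    got_answer = set()
    for asker, answerer in threads:
        askers.add(asker)
        if answerer is not None:
            answered_by[answerer].add(asker)
            got_answer.add(asker)
    hubs = sorted(answered_by, key=lambda p: len(answered_by[p]), reverse=True)
    outsiders = sorted(askers - got_answer)
    return hubs, outsiders


hubs, outsiders = summarize(threads)
print(hubs[0])    # bob, who answered three different people
print(outsiders)  # ['eve', 'fay'], the people "on the outside"
```

The drawn graph gives you the same two groups at a glance: the busy hubs in the middle and the unanswered askers at the edge.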

In summary, this talk was about the process of analyzing different online communication tools in order to discover ways in which they can be improved so that more participants find what they want when they use them. It explored the fascinating interaction between groups of humans using these tools. If a researcher is equipped with the right questions and concepts, these questions and concepts can be like tongs to pull apart the information and see solutions. It was clear that this is a fascinating field that can make our world a better place by giving all of us better search tools.