Analytics for Human Information: Watson? So What?

 For those of you I have not met, I’m Fernando (but please call me FER) and I’m the Chief Technology Officer for HP Autonomy.

 

As you might know, at HP Autonomy we have some of the world’s leading-edge technology to help you engage with information, specifically Human Information—the kind of stuff human beings create and can interpret, but that machines struggle with. But let me start by giving you some clues that will help us later in the blog: a quick summary of a few features of our core technology, HP IDOL.

 

When you have criteria (some information or an object that forms a question or represents a need for information) and a dataset you want to mine, HP IDOL returns the results that answer that question with the highest fidelity, sifting through unlimited volumes of data to find the specific information related to your question or area of interest. We refer to this as Inquiring.
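
To make that concrete, here is a minimal sketch (in Python) of what an Inquire call might look like against an IDOL server’s ACI interface. The host, port, and database name are placeholders, and the parameter names follow the general IDOL ACI Query convention, so treat them as assumptions to verify against your own IDOL installation.

# Minimal "Inquire" sketch: ask IDOL for the documents that best answer a
# question. Host, port, and database are hypothetical; verify the ACI
# parameter names against your IDOL version.
import requests

IDOL_ACI = "http://idol.example.com:9000/"  # hypothetical IDOL ACI endpoint

def inquire(criteria, database="News", max_results=10):
    """Return IDOL's raw response listing the best-matching documents."""
    params = {
        "action": "Query",          # conceptual query rather than plain keyword match
        "Text": criteria,           # the question or information need
        "DatabaseMatch": database,  # which IDOL database to mine
        "MaxResults": max_results,
    }
    resp = requests.get(IDOL_ACI, params=params, timeout=30)
    resp.raise_for_status()
    return resp.text                # XML listing the ranked hits

if __name__ == "__main__":
    print(inquire("reports of counterfeit parts in the aerospace supply chain"))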

 

You might then want to analyze the results of an inquiry using the information they contain. The analysis might provide insights that allow you to improve your inquiry, or it might surface more general information about your content: themes, key attributes, and so on. We refer to this as Investigating.
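
Purely as a toy illustration of the Investigate idea (client-side, not IDOL’s own analytics), the sketch below takes the titles returned by an inquiry and surfaces the recurring terms, which is often enough to spot the dominant themes and decide how to refine the inquiry.

# Toy "Investigate" illustration: surface recurring terms from the titles of
# the documents an inquiry returned. Client-side analysis for illustration
# only, not IDOL's built-in investigative features.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "on"}

def themes(titles, top_n=5):
    """Count the most common non-trivial words across result titles."""
    words = []
    for title in titles:
        words += [w for w in re.findall(r"[a-z']+", title.lower())
                  if w not in STOPWORDS]
    return Counter(words).most_common(top_n)

print(themes([
    "Supply chain attack",
    "Counterfeit electronic components",
    "Aerospace supply chain security",
]))
# e.g. [('supply', 2), ('chain', 2), ('attack', 1), ...]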

 

Finally, for this short chapter on IDOL features, data can be enhanced with more details that help with the Inquire and Investigate functions. Data can be enhanced from within (you could extract names, determine sentiment, identify categories, etc.) or externally (you could match names against databases, apply workflows, etc.). These functions let you add information to data of any type, be it audio, video, or text, which makes it easier to inquire into and investigate your information, or to identify key features of your content. We refer to this as Improving.
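
Again purely as an illustration of the Improve idea, the sketch below enriches a document record with extra fields before indexing. The two helper functions are naive stand-ins for real enrichment engines (IDOL Eduction, a sentiment analyzer, an external database lookup, and so on); the point is simply that the added fields make the content easier to inquire against and investigate later.

# Illustration of "Improve": enrich a document with extra fields before it is
# indexed. The two helpers are naive stand-ins for real enrichment engines
# (e.g. entity extraction, sentiment analysis).
import re

def extract_people(text):
    """Naive stand-in for entity extraction: grab capitalised word pairs."""
    return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)

def detect_sentiment(text):
    """Naive stand-in for sentiment analysis."""
    return "positive" if "great" in text.lower() else "neutral"

def improve(document):
    """Attach derived fields so later inquiries and investigations are richer."""
    text = document["content"]
    document["people"] = extract_people(text)
    document["sentiment"] = detect_sentiment(text)
    return document

doc = {"id": "doc-1", "content": "Jane Smith said the results were great."}
print(improve(doc))
# {'id': 'doc-1', ..., 'people': ['Jane Smith'], 'sentiment': 'positive'}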

 

Now back to the main subject of this blog.

 

Of late, in the pursuit of a new and exciting project (which I will tell you about soon), we have started to put together “hack-a-thons.”

 

On the face of it, a hack-a-thon looks like a light-hearted competition amongst developers. One sets out the tools to use, and some guidance, but beyond that the sky is the limit. In some respects, you can guide the session via the award categories you might come up with (top algorithm, innovative use of data, most reusable code, etc.). I’ll tell you what…I was in for a surprise.

 

So what about Watson?

 

For those who haven’t heard of Jeopardy, it’s an American quiz show. The deal is that the contestants are given the answer and have to figure out the question. (This makes talking about it confusing, so we say that contestants are given a ‘clue’ (the answer) and have to give a ‘response’ (the question).)

 

The nice folks at IBM decided to have a go at building a machine that could answer Jeopardy clues and compete on the show. In 2007, the IBM team was given three to five years and a staff of 15 people to solve the problem (BTW, we have Wikipedia to thank for this factual data about Watson).

 

Watson is composed of a cluster of 90 IBM Power 750 servers, each of which uses a 3.5 GHz POWER7 eight-core processor with four threads per core. In total, the system has 2,880 POWER7 processor threads (90 servers × 8 cores × 4 threads per core) and 16 terabytes of RAM. Tony Pearson (from IBM) estimated Watson's hardware cost at about $3 million.

 

Watson uses more than 100 different techniques to analyze natural language, identify sources, find and generate hypotheses, find and score evidence, and merge and rank hypotheses. Its 200 million pages of structured and unstructured content, including the full text of Wikipedia, consume four terabytes of disk storage. On Jeopardy, Watson consistently outperformed its human opponents on the game's signaling device, but had trouble responding to a few categories, notably those with short clues containing only a few words.

 

Back to the hack-a-thon

 

The ground rules were simple: use HP IDOL, and specifically a limited set of its Inquire, Investigate, and Improve functions, to create data-driven applications. Within those parameters, and working as individuals, there were no other rules except that time was limited to the hack-a-thon’s one-day duration.

 

Here’s how it went: a couple of the guys in the dev team decided [randomly] that they would create an app to automatically take example clues and responses from the Jeopardy Twitter feed. They randomly selected a clue and had the app produce some suggestions for the correct response. They sent me a nice write-up, but I’m going to do some paraphrasing here.

 

To achieve a solution, there were two challenges that needed to be tackled and one set of limitations to overcome. Limitations? The guys did not have the following things available: 3-4 years, 15 people, 750 servers, 200 million pages, 100 techniques, etc. You get the picture.

 

So what did our brave dev guys achieve in one day with a single IDOL database with only Wikipedia as a source and the power of our technology? See it through my eyes as I saw this develop:

 

For the category “movie narrators” and the clue ‘This actor narrates 2013's "The Great Gatsby"’, the app produces the answers…

 

What is Narrative mode?

What is First-person narrative?

What is Silent film?

Who is Tobey Maguire?

What is Sam Waterston?

What is Astoria, Queens?

 

And the correct response is…Who is Tobey Maguire?

 

The correct response isn’t the top suggestion, but it’s in the top six, and it’s correctly phrased. It’s also worth noting that all of the responses are at least relevant to the clue. (Sam Waterston is an actor in the 1974 film, and Astoria is a location.)

 

Sans 3-4 years, 15 people, 750 servers, 200 million pages, 100 techniques, etc., I’m pretty impressed at this point. From this example, you can see that there are a few things that could be improved. Some of the responses are not actors, or even people, so we could do some work on limiting our responses to the appropriate category. We could also benefit from some other data sources. For instance, the clue “This Majority Leader tweeted his disdain that Spotify was blocked in Congressional offices” has the response “Who is Eric Cantor?” The Spotify story isn’t on Cantor’s Wikipedia page, so it would be hard for us to select the correct response.

 

Here are a few more random general-knowledge examples that followed:

 

Category: Sports Biographies

Clue: A '70s Tigers pitcher: "The Bird"

 

The app’s answers are…

What is Mark Fidrych?

What is Detroit Tigers?

What is Major League Baseball?

What is Tasburgh?

What is History of the Detroit Tigers?

What is Lamar University?

 

And the correct answer is…Who is Mark Fidrych?

 

Category: Toon Time

Clue: In Dec. 2012 Alan Tudyk received an Annie nomination for voicing King Candy in this film

 

The app’s answers are…

What is Wreck-It Ralph?

What is Timeline of LGBT history?

What is Pinky and the Brain?

What is Adaptations of A Christmas Carol?

What is Alan Tudyk?

 

And the correct answer is…What is Wreck-It Ralph?

 

Category: Authors' Former Jobs

Clue: Neil Gaiman (@NeilHimself) was hungry like a wolf when he wrote a 1984 bio of this band

 

The app’s answers are...

What is Duran Duran?

Who is Dante Alighieri and the Divine Comedy in popular culture?

What is Ray Bradbury?

What is 2000 AD (comics)?

What is Adaptations of Little Red Riding Hood?

What is M. John Harrison?

 

And the correct answer is...What is Duran Duran?

 

Category: Feel Not-So-Good Movies

Clue: Fantine dies, Eponine dies, Gavroche dies, Jean Valjean dies

 

The app’s answers are...

 

What is Les Misérables?

What is Éponine?

What is Cosette?

What is Thénardiers?

What is Songs from Les Misérables?

 

And the correct answer is...What is Les Misérables?

 

So far so good, right? As I said before, two challenges. Challenge one: figure out what the clue is referring to. Challenge two: once you have the answer, rephrase it as a question.

 

How we do it

 

The basic algorithm is as follows: we use the clue (and the theme category it belongs to, which we are told) that comes from the Jeopardy tweet as the text for a query [inquire] against IDOL's Wikipedia database. IDOL does its clever pattern matching, returns the Wikipedia pages that best fit the criteria, and ranks them appropriately. Then we use the titles of the returned Wikipedia pages as the basis for the suggested responses.
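
Here’s a rough sketch of that first step, with the same caveats as before: the endpoint, the database name, and the exact shape of IDOL’s XML response are assumptions to check against a real installation. The idea is simply to send the clue plus its category as the query text and keep the titles of the top-ranked Wikipedia pages as candidate responses.

# Sketch of the first step of the Jeopardy app: query IDOL's Wikipedia
# database with the clue (plus its category) and treat the titles of the
# top-ranked pages as candidate responses. Endpoint, database name, and the
# exact shape of the XML response are assumptions to verify locally.
import xml.etree.ElementTree as ET
import requests

IDOL_ACI = "http://idol.example.com:9000/"  # hypothetical IDOL ACI endpoint

def candidate_titles(clue, category, max_results=6):
    params = {
        "action": "Query",
        "Text": f"{category} {clue}",   # clue and category together form the criteria
        "DatabaseMatch": "Wikipedia",   # the single Wikipedia database used in the hack-a-thon
        "MaxResults": max_results,
    }
    resp = requests.get(IDOL_ACI, params=params, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    # Collect document titles from the response; the element name may vary by
    # IDOL version, so we accept any tag that ends in "title".
    return [el.text for el in root.iter()
            if el.tag.lower().endswith("title") and el.text]

if __name__ == "__main__":
    print(candidate_titles('This actor narrates 2013\'s "The Great Gatsby"', "movie narrators"))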

 

To choose the correct form of the response, we do some text manipulation and entity extraction using the Eduction [improve] functions of IDOL on the page titles. For the hack-a-thon, the only type of entity we considered was ‘person.’ If Eduction decided that page title ‘X’ refers to a person, we would respond “Who is X?”; otherwise, we would default to “What is X?” Obviously, without too much effort we could have expanded this to cover places, dates, plurals, etc.
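
The second step can be sketched like this. In the real app the person check was done with IDOL Eduction; the is_person function below is only a placeholder standing in for that call, so the listing just illustrates the phrasing logic.

# Sketch of the second step: turn a candidate page title into a correctly
# phrased Jeopardy response. In the hack-a-thon the person test was IDOL
# Eduction; is_person() below is a placeholder standing in for that call.
def is_person(title):
    """Placeholder for an Eduction-style entity check on the page title."""
    known_people = {"Tobey Maguire", "Mark Fidrych", "Alan Tudyk"}  # illustrative only
    return title in known_people

def phrase_response(title):
    """'Who is X?' for people, 'What is X?' for everything else."""
    return f"Who is {title}?" if is_person(title) else f"What is {title}?"

for title in ["Tobey Maguire", "Wreck-It Ralph"]:
    print(phrase_response(title))
# Who is Tobey Maguire?
# What is Wreck-It Ralph?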

 

There are other clues that the algorithm just isn’t suited to. An example from the Jeopardy website, “JUST ONE RADIO ADVERTISING SONG”, with the category “RHYME TIME”, has the response “WHAT IS A SINGLE JINGLE”. This sort of word play is something we haven’t attempted to solve. But for general-knowledge clues, the results are outstanding for a day’s effort.

 

This led me to think: given our unique IP and the outstanding inventiveness, cunning, and intelligence of our people, what would have happened if I gave them, say, 3-4 years, 15 people, 750 servers, 200 million pages, 100 techniques, etc.? Anything is possible!

 

I want to tell you much more about some of the other amazing outcomes from our first and second hack-a-thons. But let me save some of that, and other IDOL features, for my next few blog posts. I’d like to leave you with the vision that the next 12 months will be full of information-driven apps using HP’s OS for Human Information.

 

Developers of applications are in for a treat!

 

FER
