IR in IBM’s Watson: An interview with Nico Schlaefer

Last month, IBM’s Watson Deep QA system took on two Jeopardy! champions and won. Several researchers at the Language Technologies Institute at Carnegie Mellon University have been involved with the project over the past few years, including Professor Eric Nyberg and his students Hideki Shima and Nico Schlaefer. Nico has been particularly involved in the IR technology behind Watson, and has answered a few questions on his role in the project. This work forms the basis of his recent thesis proposal on “Statistical Source Expansion for Question Answering”.

If you didn’t see the Jeopardy! match, check out the practice match, the interview with Professor Nyberg, Hideki and Nico, and the background on IBM’s site.

Probably Irrelevant: What role does IR play in a QA system?

Nico Schlaefer: Watson and other state-of-the-art QA systems find answers in unstructured text, which is indexed and searched with IR systems. From an architectural point of view, QA applications are often built on top of an IR system – they take a natural language question and transform it into a query that can be handled by the retrieval engine, submit the query and get the search results, and then further process these results by extracting answers and scoring them. So IR plays a key role in question answering, and the performance of a QA system depends heavily on the quality of the search results.
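
As a rough illustration of the pipeline Nico describes – and not of Watson’s actual components – a toy version might look like the sketch below, with a naive keyword-overlap ranker standing in for the retrieval engine:

    # Toy sketch of the QA-on-top-of-IR pipeline: question -> query -> retrieval
    # -> answer extraction and scoring. Retrieval and scoring are deliberately
    # naive stand-ins, not Watson's components.

    import re

    STOPWORDS = {"the", "a", "an", "of", "is", "what", "and", "with", "such", "as", "to", "in"}

    def formulate_query(question):
        # Keep only content words as query terms.
        return [t for t in re.findall(r"[a-z]+", question.lower()) if t not in STOPWORDS]

    def retrieve(query, passages, top_k=3):
        # Stand-in IR engine: rank passages by how many query terms they contain.
        ranked = sorted(passages, key=lambda p: sum(t in p.lower() for t in query), reverse=True)
        return ranked[:top_k]

    def answer(question, passages):
        query = formulate_query(question)
        results = retrieve(query, passages)
        # A real system would extract and score candidate answers here; we just
        # return the best-matching passage as a placeholder.
        return results[0] if results else None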

PI: What are good characteristics of an IR system for QA?

NS: In QA, it is quite common to generate relatively complicated queries that include term weights and proximity operators. Some systems also pre-annotate their sources with syntactic or semantic information and formulate constrained queries that leverage these annotations. In addition, QA systems often do not retrieve whole documents but shorter passages comprising just a few sentences. To be suitable for QA, an IR system should provide a rich query language, support annotations on the source documents, and allow QA systems to retrieve search results of different granularities.
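
To make that concrete, here is a small sketch of what such a query could look like in an Indri-style query language, with term weights (#weight) and an unordered proximity window (#uwN). This is only an example of the kind of query language meant, not Watson’s actual query formulation:

    # Illustration only: building an Indri-style structured query with term
    # weights and a proximity operator.

    def build_query(weighted_terms, proximity_terms=None, window=8):
        weighted = " ".join(f"{w} {t}" for t, w in weighted_terms)
        query = f"#weight( {weighted} )"
        if proximity_terms:
            # #uwN( ... ) matches the terms in any order within a window of N words.
            query = f"#combine( {query} #uw{window}( {' '.join(proximity_terms)} ) )"
        return query

    print(build_query([("neurological", 2.0), ("disease", 1.5), ("tics", 1.0)],
                      proximity_terms=["involuntary", "movements"]))
    # #combine( #weight( 2.0 neurological 1.5 disease 1.0 tics ) #uw8( involuntary movements ) )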

PI: Where does IR fail with respect to QA?

NS: IR often fails if there is little relevant information for a given question in the sources. The question and relevant documents may use different terminology, which makes it hard for the IR system to retrieve useful text. Query expansion or pseudo-relevance feedback can help to some extent, but often these techniques do not consistently improve performance, and some QA systems only use them as a fallback solution if an initial search does not return anything useful. Obviously, these methods are not going to help if the answer to a question is not in the sources. We developed a different approach – statistical source expansion – which overcomes some of these issues by augmenting existing sources with more relevant information and by increasing semantic redundancy.
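
For reference, a generic pseudo-relevance feedback loop of the kind mentioned here – not Watson’s implementation – can be sketched as: retrieve with the original query, take frequent content terms from the top-ranked results, and retrieve again with the expanded query:

    # Generic pseudo-relevance feedback sketch: add frequent content terms from
    # the top-ranked results to the original query and search again.

    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "is", "and", "to", "in", "that", "it"}

    def expand_query(query_terms, top_passages, num_expansion_terms=5):
        counts = Counter()
        for passage in top_passages:
            for term in re.findall(r"[a-z]+", passage.lower()):
                if term not in STOPWORDS and term not in query_terms:
                    counts[term] += 1
        expansion = [term for term, _ in counts.most_common(num_expansion_terms)]
        return list(query_terms) + expansion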

PI: What kinds of resources are expanded?

NS: We focused on sources we found most useful for answering Jeopardy! questions and also questions from TREC QA evaluations. These include encyclopedias (such as Wikipedia) and dictionaries (such as Wiktionary). More recently, we also experimented with the ClueWeb09 corpus, a large web crawl created at CMU which comprises about 12 TB of English web pages.

PI: What techniques and/or tools are you using to identify topics to expand?

NS: This depends on the sources. For example, when expanding an encyclopedia or a dictionary, we consider each document as a candidate topic. We can then sort the topics by some measure of popularity and focus on expanding the most popular ones. This approach is based on the assumption that Jeopardy! questions (and also questions in most other QA tasks, such as TREC) tend to ask about popular topics, so we get the largest performance gain out of expanding those topics. When expanding other sources that are not organized by topics, such as web crawls or newswire corpora, more sophisticated topic detection techniques become necessary. For example, the most popular topics can be identified using named entity recognizers, statistical methods or dictionaries of known topics.
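
A hedged sketch of that popularity-based selection for a topic-organized source (the popularity scores below are a hypothetical proxy such as page views or incoming links):

    # Sketch of popularity-based topic selection for a source organized by topics,
    # e.g. one encyclopedia article per topic. Popularity scores are hypothetical.

    def select_topics_to_expand(article_titles, popularity, budget=100_000):
        ranked = sorted(article_titles, key=lambda title: popularity.get(title, 0), reverse=True)
        return ranked[:budget]

    topics = select_topics_to_expand(
        ["Tourette syndrome", "Obscure hamlet", "Jeopardy!"],
        {"Tourette syndrome": 120_000, "Obscure hamlet": 40, "Jeopardy!": 300_000},
        budget=2,
    )
    # -> ["Jeopardy!", "Tourette syndrome"]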

PI: How do you estimate relevance to those topics?

NS: We use a statistical model that combines a variety of features to estimate the topicality and textual quality of text passages. For example, one of the topicality features is a likelihood ratio estimated with language models. A topic model is trained using the seed document we’re expanding or related web pages retrieved for that seed, and a background model is trained on a large collection of text. The ratio of the likelihoods of a text passage under the topic model and the background model is a good indicator of topicality. Textual quality can, for example, be estimated using dictionaries of known words and n-grams. We also look at simple surface features of text passages, such as the length of a passage and its offset in the source document.
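
As a rough sketch of that likelihood-ratio feature – with add-one smoothed unigram models and toy training text standing in for the actual models, which are trained on the seed document (or related web pages) and a large background collection:

    # Minimal sketch of the likelihood-ratio topicality feature using unigram
    # language models with add-one smoothing. The training texts are toy examples.

    import math
    import re
    from collections import Counter

    def unigram_lm(text):
        tokens = re.findall(r"[a-z]+", text.lower())
        counts, total = Counter(tokens), len(tokens)
        vocab_size = len(set(tokens)) + 1
        return lambda w: (counts[w] + 1) / (total + vocab_size)  # add-one smoothing

    def log_likelihood_ratio(passage, topic_lm, background_lm):
        tokens = re.findall(r"[a-z]+", passage.lower())
        # Higher values: the passage is more likely under the topic model than
        # under the background model, i.e. more topical.
        return sum(math.log(topic_lm(w)) - math.log(background_lm(w)) for w in tokens)

    topic_lm = unigram_lm("Tourette syndrome is a neurological disorder with involuntary motor and vocal tics")
    background_lm = unigram_lm("general text about weather sports politics and cooking with common words")
    print(log_likelihood_ratio("involuntary movements and vocal tics", topic_lm, background_lm))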

PI: Could you give an example of a question that is helped by source expansion?

NS: Here is a question for which source expansion helped:

What is the name of the rare neurological disease with symptoms such as: involuntary movements (tics), swearing, and incoherent vocalizations (grunts, shouts, etc.)?

This is a question from the TREC 8 evaluation [pdf], but if written as a statement (“This rare neurological disease has symptoms such as …”) I think it could also pass as a Jeopardy! question. The answer is “Tourette syndrome”.

We first tried to answer this question using Wikipedia as a source, and there is indeed an article about “Tourette syndrome” in our copy of Wikipedia, but unfortunately it doesn’t mention most of the keywords in the question and Watson wasn’t able to get the answer. We then expanded Wikipedia, and “Tourette syndrome” was one of the topics that was automatically selected. The expanded article contains the following text passages which, by the way, all come from different websites:

  • Rare neurological disease that causes repetitive motor and vocal tics
  • The first symptoms usually are involuntary movements (tics) of the face, arms, limbs or trunk.
  • Tourette’s syndrome (TS) is a neurological disorder characterized by repetitive, stereotyped, involuntary movements and vocalizations called tics.
  • The person afflicted may also swear or shout strange words, grunt, bark or make other loud sounds.

These passages jointly almost perfectly cover the question keywords. I think the only content word that is not in there is “incoherent”. This made it very easy for Watson to find the answer.
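
To illustrate that coverage claim, a quick check over those four passages (with a hand-picked keyword set and a crude prefix match so that, e.g., “swearing” matches “swear” and “grunts” matches “grunt”) would look like:

    # Toy check of joint keyword coverage over the expanded passages.
    # The keyword list and stemming heuristic are illustrative only.

    import re

    passages = [
        "Rare neurological disease that causes repetitive motor and vocal tics",
        "The first symptoms usually are involuntary movements (tics) of the face, arms, limbs or trunk.",
        "Tourette's syndrome (TS) is a neurological disorder characterized by repetitive, stereotyped, involuntary movements and vocalizations called tics.",
        "The person afflicted may also swear or shout strange words, grunt, bark or make other loud sounds.",
    ]
    keywords = {"rare", "neurological", "disease", "involuntary", "movements", "tics",
                "swearing", "incoherent", "vocalizations", "grunts", "shouts"}

    tokens = [re.findall(r"[a-z]+", p.lower()) for p in passages]

    def covered(keyword):
        stem = keyword[:5]
        return any(tok.startswith(stem) for toks in tokens for tok in toks)

    print({k for k in keywords if not covered(k)})  # -> {'incoherent'}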
