Within companies, the use of Watson can fall under the purview of several different functions. IBM Watson operates in many industry verticals, ranging from healthcare predictive analytics and condition diagnosis to smart city infrastructure planning and legal case outcome analysis.
IBM Watson runs on IBM Power servers that are not really supercomputers in the classic sense, but that act that way when they are clustered together in groups of tens and even hundreds of servers, with a price point for an internal system that only a large enterprise or a research institute can afford. The good news for small and midsize companies is that they can use Watson, too; IBM offers a cloud-based version of Watson that companies can pay for by subscription or on demand.
Companies that can afford an investment of multiple millions of dollars can purchase an in-house IBM Watson system, which consists of multiple servers tethered together into a processing cluster.
For companies without these resources, Watson can be accessed through the IBM cloud. For example, IBM offers a software developer's cloud powered by Watson, and it also provides a cloud-based global healthcare analytics platform.
In all, Watson uses millions of logic rules to determine the best answers. Today Watson is frequently being applied to new areas, which means learning new material. Researchers begin by loading Word documents, PDFs, and web pages into Watson to build up its knowledge.

Question answering (QA) has a long history (Simmons) and saw rapid advancement spurred by system building, experimentation, and government funding in the past decade (Maybury; Strzalkowski and Harabagiu). With QA in mind, we settled on a challenge to build a computer system, called Watson, which could compete at the human champion level in real time on the American TV quiz show Jeopardy.
See the Jeopardy! Quiz Show sidebar for more information on the show. It pits three human contestants against one another in a competition that requires answering rich natural language questions over a very broad domain of topics, with penalties for wrong answers. The nature of the three-person competition is such that confidence, precision, and answering speed are of critical importance, with roughly 3 seconds to answer each question.
A computer system that could compete at human champion levels at this game would need to produce exact answers to often complex natural language questions with high precision and speed and have a reliable confidence in its answers, such that it could answer roughly 70 percent of the questions asked with greater than 80 percent precision in 3 seconds or less.
Finally, the Jeopardy Challenge represents a unique and compelling AI question similar to the one underlying Deep Blue (Hsu): can a computer system be designed to compete against the best humans at a task thought to require high levels of human intelligence, and if so, what kind of technology, algorithms, and engineering is required? While we believe the Jeopardy Challenge is an extraordinarily demanding task that will greatly advance the field, we appreciate that this challenge alone does not address all aspects of QA and does not by any means close the book on the QA challenge the way that Deep Blue may have for playing chess.
Meeting the Jeopardy Challenge requires advancing and incorporating a variety of QA technologies including parsing, question classification, question decomposition, automatic source acquisition and evaluation, entity and relation detection, logical form generation, and knowledge representation and reasoning.
Winning at Jeopardy requires accurately computing confidence in your answers. The questions and content are ambiguous and noisy and none of the individual algorithms are perfect.
Therefore, each component must produce a confidence in its output, and individual component confidences must be combined to compute the overall confidence of the final answer. The final confidence is used to determine whether the computer system should risk choosing to answer at all. The confidence must be computed during the time the question is read and before the opportunity to buzz in.
This is roughly between 1 and 6 seconds with an average around 3 seconds. Confidence estimation was very critical to shaping our overall approach in DeepQA.
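As a rough illustration of this thresholding idea, here is a minimal sketch with invented feature values, weights, and function names; the actual DeepQA combiner is the hierarchical machine-learning model described next.

    import math

    def final_confidence(features, weights, bias=0.0):
        # Logistic combination of per-component confidence features.
        # A simple stand-in for DeepQA's learned, hierarchical combiner.
        z = bias + sum(w * f for w, f in zip(weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def should_buzz(features, weights, threshold=0.5):
        # Attempt an answer only if the combined confidence clears the threshold.
        return final_confidence(features, weights) >= threshold

    # Toy example: three component confidences for one candidate answer.
    component_scores = [0.82, 0.40, 0.91]   # e.g., passage support, type match, popularity
    learned_weights = [1.5, 0.7, 1.1]       # invented weights
    print(should_buzz(component_scores, learned_weights, threshold=0.6))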
There is no expectation that any component in the system does a perfect job — all components post features of the computation and associated confidences, and we use a hierarchical machine-learning method to combine all these features and decide whether or not there is enough confidence in the final answer to attempt to buzz in and risk getting the question wrong. A Jeopardy board is organized into six columns.
Each column contains five clues and is associated with a category. A recurring theme in our approach is the requirement to try many alternate hypotheses in varying contexts to see which produces the most confident answers given a broad range of loosely coupled scoring algorithms. Leveraging category information is another clear area requiring this approach.
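The following is a minimal sketch of this generate-and-score pattern; all candidate answers, scores, and function names are invented for illustration, and the real DeepQA pipeline uses many more candidate-generation strategies plus a learned model, rather than a simple average, to combine scores.

    def best_candidate(candidates, scorers):
        # candidates: list of (answer, context) pairs, one per hypothesis.
        # scorers: loosely coupled scoring functions mapping (answer, context) to [0, 1].
        def combined(candidate):
            answer, context = candidate
            scores = [score(answer, context) for score in scorers]
            return sum(scores) / len(scores)   # stand-in for a learned combination
        return max(candidates, key=combined)

    # Toy usage: two candidate answers for one clue, two naive scorers.
    candidates = [("light", {"category": "General Science"}),
                  ("photons", {"category": "General Science"})]
    scorers = [lambda answer, context: 0.9 if answer == "light" else 0.7,
               lambda answer, context: 0.6]
    print(best_candidate(candidates, scorers))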
There is a wide variety of ways one can attempt to characterize the Jeopardy clues: by topic, by difficulty, by grammatical construction, by answer type, and so on. A type of classification that turned out to be useful for us was based on the primary method deployed to solve the clue. The bulk of Jeopardy clues represent what we would consider factoid questions: questions whose answers are based on factual information about one or more individual entities.
The questions themselves present challenges in determining what exactly is being asked for and which elements of the clue are relevant in determining the answer. Here are just a few examples (note that while Jeopardy requires responses to be phrased in the form of a question (see the Jeopardy! Quiz Show sidebar), this transformation is trivial, and for purposes of this paper we will just show the answers themselves): Category: General Science Clue: When hit by electrons, a phosphor gives off electromagnetic energy in this form.
Answer: Light or Photons. Category: Lincoln Blogs Clue: Secretary Chase just submitted this to me for the third time; guess what, pal. Answer: his resignation.
Answer: Georgia and Alabama. Some more complex clues contain multiple facts about the answer, all of which are required to arrive at the correct response but are unlikely to occur together in one place.
For example: Subclue 1: This archaic term for a mischievous or annoying child. Subclue 2: This term can also mean a rogue or scamp.
Answer: Rapscallion. Another class of decomposable questions is one in which a subclue is nested in the outer clue, and the subclue can be replaced with its answer to form a new question that can more easily be answered; one such nested clue in our sample has the answer North Korea. Decomposable Jeopardy clues generated requirements that drove the design of DeepQA to generate zero or more decomposition hypotheses for each question as possible interpretations. Jeopardy also has categories of questions that require special processing defined by the category itself.
Some of them recur often enough that contestants know what they mean without instruction; for others, part of the task is to figure out what the puzzle is as the clues and answers are revealed (categories requiring explanation by the host are not part of the challenge).
Examples of well-known puzzle categories are the Before and After category, where two subclues have answers that overlap by typically one word, and the Rhyme Time category, where the two subclue answers must rhyme with one another. Clearly these cases also require question decomposition. Category: Before and After Goes to the Movies Clue: Film of a typical day in the life of the Beatles, which includes running from bloodthirsty zombie fans in a Romero classic.
Subclue 1: Film of a typical day in the life of the Beatles. Subclue 2: Running from bloodthirsty zombie fans in a Romero classic. For the Rhyme Time category, one decomposition looks like this: Subclue 1: Pele ball (soccer). Subclue 2: where store (cabinet, drawer, locker, and so on). Answer: soccer locker. (A small sketch of the overlap-merge idea appears after this paragraph.) There are many infrequent types of puzzle categories, including things like converting roman numerals, solving math word problems, sounds like, finding which word in a set has the highest Scrabble score, homonyms and heteronyms, and so on.
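Here is that small sketch of the overlap-merge idea behind Before and After clues; the phrases below are invented placeholders rather than actual Jeopardy answers, and real clues would first require solving each subclue.

    def merge_overlapping(first, second):
        # Merge two subclue answers when the last word of the first equals
        # the first word of the second (the typical one-word overlap).
        first_words, second_words = first.split(), second.split()
        if first_words and second_words and first_words[-1].lower() == second_words[0].lower():
            return " ".join(first_words + second_words[1:])
        return None

    print(merge_overlapping("apple pie", "pie chart"))   # -> "apple pie chart"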
Puzzles constitute only about 2 to 3 percent of all clues, but since they typically occur as entire categories (five at a time) they cannot be ignored for success in the Challenge, as getting them all wrong often means losing a game. Category: Picture This (contestants are shown a picture of a bomber). Clue: Alphanumeric name of the fearsome machine seen here. Answer: B. Clue: Vain. Answer: Virginia and Indiana. Both present very interesting challenges from an AI perspective but were put out of scope for this contest and evaluation.
We define a LAT (lexical answer type) to be a word in the clue that indicates the type of the answer, independent of assigning semantics to that word. Category: Oooh…. Chess. Clue: Invented to speed up the game, this maneuver involves two pieces of the same color. In other cases the type of answer must be inferred from the context; one such clue in our sample has the answer crewel. The distribution of LATs has a very long tail, as shown in figure 1.
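One way to quantify such a long-tailed distribution is to measure how much of the data the most frequent LATs cover. A toy sketch follows, with invented data and function names; it is not the analysis behind figure 1.

    from collections import Counter

    def lat_coverage(lats, top_k):
        # lats: one detected LAT string per question (empty string when none was found).
        # Returns the fraction of questions covered by the top_k most frequent LATs.
        counts = Counter(lat for lat in lats if lat)
        covered = sum(count for _, count in counts.most_common(top_k))
        return covered / len(lats)

    toy_sample = ["country", "film", "country", "he", "city", "", "film"]
    print(lat_coverage(toy_sample, top_k=2))   # coverage of the two most frequent LATs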
We found a large number of distinct and explicit LATs in our 20,000-question sample, and even the most frequent explicit LATs cover less than 50 percent of the data. Figure 1 shows the relative frequency of the LATs. Our clear technical bias for both business and scientific motivations is to create general-purpose, reusable natural language processing (NLP) and knowledge representation and reasoning (KRR) technology that can exploit as-is natural language resources and as-is structured knowledge rather than to curate task-specific knowledge resources.
Ultimately the outcome of the public contest will be decided based on whether or not Watson can win one or two games against top-ranked humans in real time. The highest amount of money earned by the end of a one- or two-game match determines the winner, and money matters as much as raw accuracy because a player may decide to bet big on Daily Double or Final Jeopardy questions. There are three hidden Daily Double questions in a game that can affect only the player lucky enough to find them, and one Final Jeopardy question at the end that all players must gamble on.
Daily Double and Final Jeopardy questions represent significant events where players may risk all their current earnings. While Watson is equipped with betting strategies necessary for playing full Jeopardy, from a core QA perspective we want to measure correctness, confidence, and speed, without considering clue selection, luck of the draw, and betting strategies.
We measure correctness and confidence using precision and percent answered. Precision measures the percentage of questions the system gets right out of those it chooses to answer. Percent answered is the percentage of questions it chooses to answer (correctly or incorrectly). The system chooses which questions to answer based on an estimated confidence score: for a given threshold, the system will answer all questions with confidence scores above that threshold.
The threshold controls the trade-off between precision and percent answered, assuming reasonable confidence estimation. For higher thresholds the system will be more conservative, answering fewer questions with higher precision. For lower thresholds, it will be more aggressive, answering more questions with lower precision. Accuracy refers to the precision if all questions are answered.
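The evaluation just described can be sketched in a few lines; the data below is invented, and this illustrates the metric rather than the project's actual evaluation harness.

    def precision_vs_percent_answered(results):
        # results: (confidence, is_correct) pairs, one per question.
        # Sort by confidence and sweep the threshold from high to low,
        # recording (percent answered, precision over the answered subset).
        ranked = sorted(results, key=lambda pair: pair[0], reverse=True)
        points, correct = [], 0
        for answered, (_, is_correct) in enumerate(ranked, start=1):
            correct += int(is_correct)
            points.append((answered / len(ranked), correct / answered))
        return points

    toy = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
    for pct, prec in precision_vs_percent_answered(toy):
        print(f"{pct:.0%} answered -> {prec:.0%} precision")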
Figure 2 shows a plot of precision versus percent attempted curves for two theoretical systems. It is obtained by evaluating the two systems over a range of confidence thresholds. Both systems have 40 percent accuracy, meaning they get 40 percent of all questions correct. They differ only in their confidence estimation. The upper line represents an ideal system with perfect confidence estimation.
Such a system would identify exactly which questions it gets right and wrong and give higher confidence to those it got right. As can be seen in the graph, if such a system were to answer the 50 percent of questions it had highest confidence for, it would get 80 percent of those correct.
We refer to this level of performance as 80 percent precision at 50 percent answered. The lower line represents a system without meaningful confidence estimation. Since it cannot distinguish between which questions it is more or less likely to get correct, its precision is constant for all percent attempted.
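One way to see where these numbers come from (a back-of-the-envelope reading of the two idealized curves, not a formula taken from the paper): if a system has overall accuracy $a$ and, with perfect confidence estimation, answers only the fraction $p$ of questions it is most confident about, then

    \mathrm{precision}(p) = \frac{\min(a,\, p)}{p}, \qquad \text{so } a = 0.4,\ p = 0.5 \;\Rightarrow\; \mathrm{precision} = \frac{0.4}{0.5} = 0.8,

while a system with no meaningful confidence estimation stays at $\mathrm{precision}(p) = a = 0.4$ for every value of $p$.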
Developing more accurate confidence estimation means a system can deliver far higher precision even with the same overall accuracy.

Figure 2. Precision versus percentage attempted, for perfect confidence estimation (upper line) and no confidence estimation (lower line).
A compelling and scientifically appealing aspect of the Jeopardy Challenge is the human reference point. Figure 3 contains a graph that illustrates expert human performance on Jeopardy. It is based on our analysis of a large set of historical Jeopardy games.
Each point on the graph represents the performance of the winner in one Jeopardy game. In contrast to the system evaluation shown in figure 2, which can display a curve over a range of confidence thresholds, the human performance shows only a single point per game based on the observed precision and percent answered the winner demonstrated in the game.
A further distinction is that in these historical games the human contestants did not have the liberty to answer all questions they wished.

The diagnostic tool, for example, wasn't brought to market because the business case wasn't there, says Ajay Royyuru, IBM's vice president of health care and life sciences research. It's a hard task, he says, and no matter how well you do it with AI, it's not going to displace the expert practitioner.
In an attempt to find the business case for medical AI, IBM pursued a dizzying number of projects targeted to all the different players in the health care system: physicians, administrative staff, insurers, and patients. In medical text documents, says the AI researcher Yoshua Bengio, AI systems can't understand ambiguity and don't pick up on subtle clues that a human doctor would notice.
But no AI built so far can match a human doctor's comprehension and insight. IBM's work on cancer serves as the prime example of the challenges the company encountered. The effort to improve cancer care had two main tracks. Mark Kris and other preeminent physicians at Sloan Kettering trained an AI system that became the product Watson for Oncology. In a separate effort, physicians at MD Anderson worked with IBM on a Watson-powered tool of their own and got as far as testing it in the leukemia department, but it never became a commercial product. Both efforts have received strong criticism.
One excoriating article about Watson for Oncology alleged that it provided useless and sometimes dangerous recommendations (IBM contests these allegations). Watson for Oncology was supposed to learn by ingesting the vast medical literature on cancer and the health records of real cancer patients. The hope was that Watson, with its mighty computing power, would examine hundreds of variables in these records—including demographics, tumor characteristics, treatments, and outcomes—and discover patterns invisible to humans.
It would also keep up to date with the bevy of journal articles about cancer treatments being published every day. To Sloan Kettering's oncologists, it sounded like a potential breakthrough in cancer care. To IBM, it sounded like a great product. Watson learned fairly quickly how to scan articles about clinical studies and determine the basic outcomes. But it proved impossible to teach Watson to read the articles the way a doctor would.
Watson's thinking is based on statistics, so all it can do is gather statistics about main outcomes, explains Kris. In one example, a drug was fast-tracked based on dramatic results in just 55 patients, of whom four had lung cancer. Several studies have compared Watson for Oncology's cancer treatment recommendations to those of hospital oncologists.
The concordance percentages indicate how often Watson's advice matched the experts' treatment plans. The realization that Watson couldn't independently extract insights from breaking news in the medical literature was just the first strike.
Researchers also found that it couldn't mine information from patients' electronic health records as they'd expected. At MD Anderson, researchers put Watson to work on leukemia patients' health records—and quickly discovered how tough those records were to work with.
Yes, Watson had phenomenal NLP skills. But in these records, data might be missing, written down in an ambiguous way, or out of chronological order. In a paper published in The Oncologist, the team reported that its Watson-powered Oncology Expert Advisor had variable success in extracting information from text documents in medical records.
It had accuracy scores ranging from 90 to 96 percent when dealing with clear concepts like diagnosis, but scores of only 63 to 65 percent for time-dependent information like therapy timelines. In a final blow to the dream of an AI superdoctor, researchers realized that Watson can't compare a new patient with the universe of cancer patients who have come before to discover hidden patterns.
Both Sloan Kettering and MD Anderson hoped that the AI would mimic the abilities of their expert oncologists, who draw on their experience of patients, treatments, and outcomes when they devise a strategy for a new patient.
A machine that could do the same type of population analysis—more rigorously, and using thousands more patients—would be hugely powerful.
But the health care system's current standards don't encourage such real-world learning. If an AI system were to base its advice on patterns it discovered in medical records—for example, that a certain type of patient does better on a certain drug—its recommendations wouldn't be considered evidence based, the gold standard in medicine. Without the strict controls of a scientific study, such a finding would be considered only correlation, not causation.
Kohn, formerly of IBM, and many others think the standards of health care must change in order for AI to realize its full potential and transform medicine. Infrastructure must change too: Health care institutions must agree to share their proprietary and privacy-controlled data so AI systems can learn from millions of patients followed over many years.
According to anecdotal reports, IBM has had trouble finding buyers for its Watson oncology product in the United States.
Some oncologists say they trust their own judgment and don't need Watson telling them what to do. Others say it suggests only standard treatments that they're well aware of. But Kris says some physicians are finding it useful as an instant second opinion that they can share with nervous patients. Many of these hospitals proudly use the IBM Watson brand in their marketing, telling patients that they'll be getting AI-powered cancer care.
Illustration: Eddie Guy. In the past few years, these hospitals have begun publishing studies about their experiences with Watson for Oncology. In India, physicians at the Manipal Comprehensive Cancer Center evaluated Watson on breast cancer cases and found a 73 percent concordance rate in treatment recommendations; its score was brought down by poor performance on metastatic breast cancer. Watson fared worse at Gachon University Gil Medical Center, in South Korea, where its top recommendations for colon cancer patients matched those of the experts only 49 percent of the time.
Doctors reported that Watson did poorly with older patients, didn't suggest certain standard drugs, and had a bug that caused it to recommend surveillance instead of aggressive treatment for certain patients with metastatic cancer. These studies aimed to determine whether Watson for Oncology's technology performs as expected. But no study has yet shown that it benefits patients.
But they needed to show, fairly quickly, an impact on hard outcomes. Sloan Kettering's Kris isn't discouraged; he says the technology will only get better. It's a long haul, but it's worth it. Some success stories are emerging from Watson Health—in certain narrow and controlled applications, Watson seems to be adding value. Take, for example, the Watson for Genomics product, which was developed in partnership with the University of North Carolina, Yale University, and other institutions.
The tool is used by genetics labs that generate reports for practicing oncologists: Watson takes in the file that lists a patient's genetic mutations, and in just a few minutes it can generate a report that describes all the relevant drugs and clinical trials. Watson has a relatively easy time with genetic information, which is presented in structured files and has no ambiguity—either a mutation is there, or it's not.
The tool doesn't employ NLP to mine medical records, instead using it only to search textbooks, journal articles, drug approvals, and clinical trial announcements, where it looks for very specific statements.
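As a purely illustrative sketch of that kind of structured matching, consider the following; the knowledge base, gene names, drug names, and trial identifiers are invented placeholders, not IBM's data or interfaces.

    # Hypothetical knowledge base keyed by (gene, variant); all entries are invented.
    KNOWLEDGE_BASE = {
        ("GENE_A", "VARIANT_1"): {"drugs": ["drug_x"], "trials": ["trial_123"]},
        ("GENE_B", "VARIANT_2"): {"drugs": ["drug_y"], "trials": []},
    }

    def genomics_report(mutations):
        # mutations: (gene, variant) tuples parsed from a structured file.
        # No NLP is needed here because the input is unambiguous structured data.
        report = []
        for gene, variant in mutations:
            match = KNOWLEDGE_BASE.get((gene, variant), {"drugs": [], "trials": []})
            report.append({"mutation": f"{gene} {variant}", **match})
        return report

    print(genomics_report([("GENE_A", "VARIANT_1"), ("GENE_C", "VARIANT_9")]))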
In one study, Watson spotted potentially important mutations not identified by a human review for 32 percent of enrolled cancer patients, which made these patients good candidates for a new drug or a just-opened clinical trial. But there's no indication, as of yet, that Watson for Genomics leads to better outcomes.
The U.S. Department of Veterans Affairs uses Watson for Genomics reports in more than 70 hospitals nationwide, says Michael Kelley, the VA's national program director for oncology.
The VA first tried the system on lung cancer and now uses it for all solid tumors. But Kelley says he doesn't think of Watson as a robot doctor. Most doctors would probably be delighted to have an AI librarian at their beck and call—and if that's what IBM had originally promised them, they might not be so disappointed today.
The Watson Health story is a cautionary tale of hubris and hype. Everyone likes ambition, everyone likes moon shots, but nobody wants to climb into a rocket that doesn't work.