Microsoft and Alibaba have independently developed AI models that outscored humans on a Stanford University reading comprehension test.
This AI milestone was reached on the Stanford Question Answering Dataset (SQuAD), which consists of more than 100,000 question-and-answer pairs drawn from more than 500 Wikipedia articles. Alibaba’s model achieved a score of 82.44, while the submission from Microsoft Research Asia edged past it with a mark of 82.65. The human score on the SQuAD test is 82.304.
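For readers curious how a score like these is produced: the figures appear to be exact-match percentages, meaning the share of questions where the model's answer matches one of the human-written reference answers after light text normalization. The Python sketch below illustrates that idea under stated assumptions. The function names and the sample questions are hypothetical, and the normalization steps (lowercasing, dropping punctuation and English articles) are a common convention for this kind of scoring rather than a copy of the official evaluation script.

```python
import re
import string
from typing import Dict, List


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> bool:
    """True if the prediction equals any reference answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


def exact_match_score(predictions: Dict[str, str],
                      references: Dict[str, List[str]]) -> float:
    """Percentage of questions answered with an exact match (0 to 100)."""
    hits = sum(
        exact_match(predictions.get(qid, ""), refs)
        for qid, refs in references.items()
    )
    return 100.0 * hits / len(references)


# Hypothetical sample data, not taken from SQuAD itself.
sample_references = {
    "q1": ["Eiffel Tower"],
    "q2": ["in 1889", "1889"],
    "q3": ["Paris"],
}
sample_predictions = {
    "q1": "The Eiffel Tower",   # matches once the article "the" is dropped
    "q2": "1889",               # matches the second reference
    "q3": "Paris, France",      # extra word, so no exact match
}

sample_score = exact_match_score(sample_predictions, sample_references)
print(f"Exact match: {sample_score:.2f}")
```

Because each question is scored as simply right or wrong, a gap as small as the one between 82.304 and 82.65 corresponds to only a handful of additional correct answers per thousand questions.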
Although that’s a slim margin on which to claim superior performance, it represents the first time any natural language processing (NLP) software has been able to eclipse humans on this particular benchmark. Google, IBM, Facebook, Tencent, Samsung, Salesforce, and others have submitted their own models for this test, but to date, none of them has reached the human level of reading comprehension.
The achievement by Alibaba and Microsoft suggests that NLP technology is getting closer to playing a much larger role in commercial applications such as customer service, travel, and healthcare. When paired with search engines, the technology can answer questions directly rather than simply returning links, which is useful to both businesses and consumers. Microsoft noted that it has already integrated earlier versions of its SQuAD model into its Bing search engine.
According to Microsoft, its developers are working on ways to use the technology to carry context from one question to the next in these interactions. From Microsoft’s AI blog:
For example, let’s say you asked a system, “What year was the prime minister of Germany born?” You might want it to also understand you were still talking about the same thing when you asked the follow-up question, “What city was she born in?”
Despite this latest progress in NLP, Ming Zhou, assistant managing director of Microsoft Research Asia, concedes that overall, humans are still better than software at understanding the complexities of language. “Natural language processing is still an area with lots of challenges that we all need to keep investing in and pushing forward,” said Zhou. “This milestone is just a start.”
That view is echoed by Ernest Davis, a New York University professor of computer science and longtime AI researcher, who is quoted in a Washington Post article on this topic. Davis acknowledges that the Alibaba and Microsoft efforts are impressive, but he points out that much of reading comprehension depends on what you have already learned before reading any particular passage, and these models don’t capture that kind of background knowledge.
“We really need to deal much more deeply with the problem of extracting the meaning of a text in a rich sense,” says Davis. “That problem is still not solved.”