I am often asked to suggest problems that people can start working on immediately to gain real-world industry experience in artificial intelligence. The hard part is that the problem, or at least a part of it, should be meaningful to people at every level of technical sophistication. So here is an attempt to articulate such a problem for Natural Language Processing.
Consider this Wikipedia-based dataset (40 MB), built by the SQuAD team at Stanford. Try to solve the following problems for the text contained in these articles. Each program should be self-learning and must have clearly defined accuracy metrics.
- Write a program to accurately split paragraphs in the dataset into sentences.
- Write a program to accurately split sentences into individual words and punctuation marks.
- Write a program to accurately predict the part of speech of each of the words/tokens.
- Write a program to split each word into its lemma and modification, e.g. for the word “rooting” the lemma would be “root” and the modification would be “ing”.
- Write a program to identify all the a) noun phrases and b) verb phrases.
- Write a program to a) identify all the anaphora; b) categorize them and c) resolve them.
- Write a program to categorize all noun and verb phrases, including those represented as anaphora, to a suitable level of hypernym, e.g. a “cat” is an animal, “India” is a country, etc.
- Write a program to identify the missing information in each question and its type/hypernym, e.g. for “Which is the largest continent?” the program should identify that the answer should be a hyponym of “continent.”
- Write a program to answer each of the questions in the dataset by identifying the right sentence or phrase, e.g. for “Which is the largest continent?” answer “Asia is the largest continent.”
- Write a program to answer each of the questions in the dataset by identifying the exact answer, e.g. for “Which is the largest continent?” answer “Asia.”
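To make "clearly defined accuracy metrics" concrete for the first two problems, here is a minimal sketch in Python. It is a hand-written regex baseline, not a self-learning program, and it is only meant as something a learned model could be measured against. The function names (`split_sentences`, `tokenize`, `boundary_f1`) and the splitting rules are my own assumptions for illustration, and the metric assumes you have hand-labelled gold sentences for a sample of paragraphs.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace and then
# an uppercase letter or an opening quote. Abbreviations such as "Dr." will
# break this rule, which is exactly what the metric below should expose.
_SENT_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"“])')

# A token is either a run of word characters (allowing inner apostrophes,
# as in "don't") or a single non-space, non-word character (punctuation).
_TOKEN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def split_sentences(paragraph):
    """Split a paragraph into sentences with the naive boundary rule."""
    return [s for s in _SENT_BOUNDARY.split(paragraph.strip()) if s]


def tokenize(sentence):
    """Split a sentence into words and punctuation marks."""
    return _TOKEN.findall(sentence)


def _end_offsets(sentences):
    """Sentence-end positions, counted in non-whitespace characters."""
    offsets, total = set(), 0
    for sentence in sentences:
        total += len("".join(sentence.split()))
        offsets.add(total)
    return offsets


def boundary_f1(predicted, gold):
    """F1 score over sentence-end positions, predicted vs. gold."""
    pred_ends, gold_ends = _end_offsets(predicted), _end_offsets(gold)
    true_positives = len(pred_ends & gold_ends)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_ends)
    recall = true_positives / len(gold_ends)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    text = "Asia is the largest continent. Dr. Smith disagrees! Why?"
    gold = ["Asia is the largest continent.", "Dr. Smith disagrees!", "Why?"]
    predicted = split_sentences(text)
    print(predicted)
    print(tokenize("Which is the largest continent?"))
    print(round(boundary_f1(predicted, gold), 3))
```

The baseline wrongly splits after "Dr.", so its score on this paragraph is below 1.0. That gap is the number a self-learning splitter would need to close.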
At this point, you are ready to compete in the worldwide SQuAD competition as well.
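For the question-answering problems, one way to define accuracy is exact match plus token-level F1 between the predicted answer and a reference answer. To my knowledge this is close to how SQuAD answers are scored, but treat the sketch below as an approximation under that assumption and not as the official evaluation script. The function names and the normalization rules (lowercase, strip ASCII punctuation, drop the articles "a", "an" and "the") are my own choices for illustration.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation and articles, squeeze whitespace."""
    text = text.lower()
    # Note: string.punctuation is ASCII only, so curly quotes are kept.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """True if the two answers are identical after normalization."""
    return normalize(prediction) == normalize(truth)


def token_f1(prediction, truth):
    """F1 over the bag of normalized tokens shared by both answers."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Asia.", "asia"))
    print(token_f1("Asia is the largest continent.", "Asia"))
```

A full-sentence answer such as "Asia is the largest continent." gets partial credit against the reference "Asia" under token F1 but fails exact match, which is a useful way to separate the last two problems in the list.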
I am curious about the approaches people are likely to take for this. Please suggest some in the comments, and link the resources you would use to solve these problems.