• Entity Information Extraction using Structured and Semi-structured resources

      Yates, Alexander; Obradovic, Zoran; Guo, Yuhong; Cucerzan, Silviu-Petru (Temple University. Libraries, 2014)
      Among all the tasks that exist in Information Extraction, Entity Linking, also referred to as entity disambiguation or entity resolution, is a new and important problem which has recently caught the attention of a lot of researchers in the Natural Language Processing (NLP) community. The task involves linking/matching a textual mention of a named-entity (like a person or a movie-name) to an appropriate entry in a database (e.g. Wikipedia or IMDB). If the database does not contain the entity it should return NIL (out-of-database) value. Existing techniques for linking named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. In this dissertation, we introduce a new framework, called Open-Database Entity Linking (Open-DB EL), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. In experiments on two domains, our Open-DB EL strategies outperform a state-of-the-art Wikipedia EL system by over 25% in accuracy. Existing approaches typically perform EL using a pipeline architecture: they use a Named-Entity Recognition (NER) system to find the boundaries of mentions in text, and an EL system to connect the mentions to entries in structured or semi-structured repositories like Wikipedia. However, the two tasks are tightly coupled, and each type of system can benefit significantly from the kind of information provided by the other. We propose and develop a joint model for NER and EL, called NEREL, that takes a large set of candidate mentions from typical NER systems and a large set of candidate entity links from EL systems, and ranks the candidate mention-entity pairs together to make joint predictions. In NER and EL experiments across three datasets, NEREL significantly outperforms or comes close to the performance of two state-of-the-art NER systems, and it outperforms 6 competing EL systems. On the benchmark MSNBC dataset, NEREL, provides a 60% reduction in error over the next best NER system and a 68% reduction in error over the next-best EL system. We also extend the idea of using semi-structured resources to a relatively less explored area of entity information extraction. Most previous work on information extraction from text has focused on named-entity recognition, entity linking, and relation extraction. Much less attention has been paid to extracting the temporal scope for relations between named-entities; for example, the relation president-Of (John F. Kennedy, USA) is true only in the time-frame (January 20, 1961 - November 22, 1963). In this dissertation we present a system for temporal scoping of relational facts, called TSRF which is trained on distant supervision based on the largest semi-structured resource available: Wikipedia. TSRF employs language models consisting of patterns automatically bootstrapped from sentences collected from Wikipedia pages that contain the main entity of a page and slot-fillers extracted from the infobox tuples. This proposed system achieves state-of-the-art results on 6 out of 7 relations on the benchmark Text Analysis Conference (TAC) 2013 dataset for the task of temporal slot filling (TSF). Overall, the system outperforms the next best system that participated in the TAC evaluation by 10 points on the TAC-TSF evaluation metric.