ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition

April 13-16, 2003
Tokyo Institute of Technology, Tokyo, Japan

Discovery Methods for Information Extraction

Ralph Grishman

New York University, New York, NY, USA

Information extraction (IE) involves automatically identifying instances of a specified type of relation or event in text, and collecting the arguments and modifiers of the relation or event. High-quality, easily adaptable IE systems would have a major effect on the ways in which we can make use of information in text (and ultimately, in speech as well).

At the present state of the art, however, performance varies widely depending on the nature of the language being processed and the complexity of the relation being extracted. For restricted sublanguages and simple relations, levels of accuracy comparable to human coders are possible. This has been achieved, for example, for some types of medical records, where both physicians and an extraction system identified diseases with 70-80% accuracy (Friedman et al. 1995). High performance has also been achieved for semi-structured Web documents, that is, documents with some explicit mark-up (Cohen and Jensen 2001). In contrast, for more complex relations and more general texts, accuracies of 50-60% are more typical. Even at these levels IE can be of significant value in situations where the text is too voluminous to be reviewed manually; for example, to provide a document search tool much richer than current keyword systems (Grishman et al. 2002). IE is also being used in other applications where perfect recall is not required, such as data mining from text collections and the generation of time lines for texts. To make IE a more widely usable technology, we face a two-fold challenge: improving its performance and improving its portability to new domains. Our group, and other research groups, are exploring how corpus-based training methods can address these challenges.

The difficulty of IE lies in part in the wide variety of ways in which a given relation may be expressed. Automated tools for corpus analysis can help in analyzing large corpora to find these varied expressions, and may be able to find a wider range of expressions with less human effort than current methods.
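
As a rough illustration of what corpus analysis for varied expressions might look like, the sketch below counts the word sequences that connect known argument pairs in a small corpus. It is not the method described in the paper: the function name discover_patterns, the toy sentences, and the seed pairs are all hypothetical, and a real system would work over parsed text and a much larger corpus rather than raw string matching.

```python
import re
from collections import Counter


def discover_patterns(sentences, seed_pairs):
    """Count the word sequences that link each seed pair in the corpus."""
    counts = Counter()
    for sentence in sentences:
        for first, second in seed_pairs:
            # Non-greedy match of the text between the two seed arguments.
            match = re.search(
                re.escape(first) + r"\s+(.+?)\s+" + re.escape(second),
                sentence,
            )
            if match:
                counts[match.group(1).lower()] += 1
    return counts


# Hypothetical toy corpus and seed pairs for a "hiring" relation.
corpus = [
    "Acme hired Smith last week.",
    "Globex appointed Jones as president.",
    "Acme named Smith chief executive.",
    "Initech hired Brown in March.",
]
seeds = [("Acme", "Smith"), ("Globex", "Jones"), ("Initech", "Brown")]

patterns = discover_patterns(corpus, seeds)
for expression, count in patterns.most_common():
    print(expression, count)
```

Even on this toy input, three different verbs surface for the same relation, which hints at why collecting such expressions by hand is laborious.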

Bibliographic reference.  Grishman, Ralph (2003): "Discovery methods for information extraction", in SSPR-2003, paper WMO1.