Acquiring information extraction patterns from unannotated corpora

Català Roig, Neus

dc.contributor

Universitat Politècnica de Catalunya. Departament de Llenguatges i Sistemes Informàtics

dc.contributor.author

Català Roig, Neus

dc.date.accessioned

2011-04-12T15:20:56Z

dc.date.available

2010-06-02

dc.date.issued

2003-07-14

dc.date.submitted

2010-02-26

dc.identifier.isbn

9788469347034

dc.identifier.uri

http://www.tdx.cat/TDX-0226110-110911

dc.identifier.uri

http://hdl.handle.net/10803/6671

dc.description.abstract

Information Extraction (IE) can be defined as the task of automatically extracting prespecified kinds of information from a text document. The extracted information is encoded in the required format and can then be used, for example, for text summarization or as an accurate index for retrieving new documents. The main issue when building IE systems is how to obtain the knowledge needed to identify relevant information in a document. Today, IE systems are commonly based on extraction rules, or IE patterns, that represent the kind of information to be extracted. Most approaches to IE pattern acquisition require expert human intervention at many steps of the acquisition process. This dissertation presents Essence, a novel method for acquiring IE patterns that significantly reduces the need for human intervention. The method is based on ELA, a learning algorithm specifically designed for acquiring IE patterns from unannotated corpora. The distinctive features of Essence and ELA are that 1) they permit the automatic acquisition of IE patterns from unrestricted and untagged text representative of the domain, thanks to 2) their ability to identify regularities around concept-words that are semantically relevant to the IE task, by 3) using non-domain-specific lexical knowledge tools such as WordNet and 4) restricting human intervention to defining the task and to validating and typifying the set of IE patterns obtained. Since Essence does not require a corpus annotated with the type of information to be extracted, and since it makes use of a general-purpose ontology and widely applied syntactic tools, it reduces the expert effort required to build an IE system and therefore also the effort of porting the method to a new domain. To validate Essence, we conducted a set of experiments to test the performance of the method. We used Essence to generate IE patterns for a MUC-like task. However, the evaluation procedure used in MUC competitions does not provide a sound evaluation of IE systems, especially of learning systems. For this reason, we conducted an exhaustive set of experiments to further test the abilities of Essence. The results of these experiments indicate that the proposed method is able to learn effective IE patterns.

dc.format.mimetype

application/pdf

dc.language.iso

eng

dc.publisher

Universitat Politècnica de Catalunya

dc.rights.license

WARNING. Access to the contents of this doctoral thesis and its use must respect the rights of the author. It may be used for consultation or personal study, as well as in research and teaching activities or materials, under the terms established in Article 32 of the Consolidated Text of the Intellectual Property Law (RDL 1/1996). Any other use requires the prior and express authorization of the author. In all cases, any use of its contents must clearly state the author's full name and the title of the doctoral thesis. Reproduction or other forms of exploitation for profit are not authorized, nor is public communication from a site outside the TDX service. Presenting its content in a window or frame outside TDX (framing) is not authorized either. This reservation of rights applies to the contents of the thesis as well as to its abstracts and indexes.

dc.source

TDX (Tesis Doctorals en Xarxa)

dc.subject

syntactic and semantic pattern recognition

dc.subject

machine learning

dc.subject

natural language processing

dc.subject

artificial intelligence

dc.title

Acquiring information extraction patterns from unannotated corpora

dc.type

info:eu-repo/semantics/doctoralThesis

dc.type

info:eu-repo/semantics/publishedVersion

dc.subject.udc

004

dc.contributor.director

Castell Ariño, Núria

dc.rights.accessLevel

info:eu-repo/semantics/openAccess

dc.identifier.dl

B.34046-2010

Documents

TNCR1de1.pdf

8.033Mb PDF

This item appears in the following Collection(s)

Departament de Llenguatges i Sistemes Informàtics (until July 2014) [83]