Please use this identifier to cite or link to this record: http://hdl.handle.net/10451/13908
Title: Developing reliability metrics and validation tools for datasets with deep linguistic information
Author: Castro, Sérgio Ricardo de
Advisor: Branco, António
Keywords: Natural language processing
corpora annotation with deep linguistic information
inter-annotator agreement
Defense Date: 2011
Abstract: The purpose of this dissertation is to propose a reliability metric and corresponding validation tools for corpora annotated with deep linguistic information. Annotating a corpus with deep linguistic information is a complex task and is therefore aided by a computational grammar, which generates all possible grammatical representations for each sentence. Human annotators select the most correct analysis for each sentence, or reject the sentence if no suitable representation is available. This task is performed by two human annotators under a double-blind annotation scheme, and the resulting annotations are adjudicated by a third annotator. The process should yield reliable datasets, since their main purpose is to serve as training and validation data for other natural language processing tools. It is therefore necessary to have a metric that assures such reliability and quality. In most cases, metrics designed for shallow annotation or parser evaluation have been used for this task. However, the increased complexity demands finer granularity in order to properly measure the reliability of the dataset. With that in mind, I suggest a metric based on Cohen's Kappa that, instead of considering the assignment of tags to parts of the sentence, considers decisions at the level of the semantic discriminants, the most granular unit available for this task. By comparing each annotator's choices, it is possible to evaluate with a high degree of granularity how close their analyses were for any given sentence. An application was developed to apply this model to the data resulting from the annotation process, which was aided by the LOGON framework. The output of this application includes not only the metric for the annotated dataset but also information about divergent decisions, intended to aid the adjudication process.
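The idea of measuring agreement at the level of discriminant decisions can be illustrated with a minimal sketch. It applies the standard two-annotator Cohen's Kappa to one decision per discriminant and lists the discriminants on which the annotators diverge. This is not the dissertation's actual implementation: the function names, the "yes"/"no" decision labels, and the discriminant identifiers are all hypothetical, and the exact formulation used in the thesis may differ.

```python
from collections import Counter


def cohen_kappa(decisions_a, decisions_b):
    """Standard Cohen's Kappa for two annotators.

    Each list holds one decision label per discriminant, in the same order.
    The labels themselves (e.g. "yes"/"no") are an assumption of this sketch.
    """
    if len(decisions_a) != len(decisions_b):
        raise ValueError("both annotators must decide on the same discriminants")
    n = len(decisions_a)
    if n == 0:
        raise ValueError("at least one discriminant is required")

    # Observed agreement: share of discriminants with identical decisions.
    observed = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n

    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a = Counter(decisions_a)
    freq_b = Counter(decisions_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def divergent_discriminants(ids, decisions_a, decisions_b):
    """Return (id, decision_a, decision_b) for every disagreement."""
    return [
        (i, a, b)
        for i, a, b in zip(ids, decisions_a, decisions_b)
        if a != b
    ]


# Hypothetical data: four discriminants for one sentence.
ids = ["d1", "d2", "d3", "d4"]
annotator_a = ["yes", "yes", "no", "no"]
annotator_b = ["yes", "no", "no", "no"]

kappa = cohen_kappa(annotator_a, annotator_b)
disagreements = divergent_discriminants(ids, annotator_a, annotator_b)
```

In this toy case the annotators agree on three of four discriminants, and the list of disagreements is the kind of information an adjudicator could use to focus on the contested decisions.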
URI: http://hdl.handle.net/10451/13908
http://repositorio.ul.pt/handle/10455/6753
Appears in Collections: FC-DI - Master Thesis (dissertation)

Files in This Record:
File              Description  Size     Format     Access
1011rf_29479.pdf               2.71 MB  Adobe PDF  Restricted



All records in the repository are protected by copyright law, with all rights reserved.