Use this identifier to cite or link to this record: http://hdl.handle.net/10451/14056
Title: Improving semantic similarity for proteins based on the gene ontology
Author: Pesquita, Cátia
Advisor: Couto, Francisco José Moreira
Keywords: Semantic similarity
BioOntologies
Gene ontology
Genome annotation
Master's theses - 2007
Defense date: 2007
Series report no.: di-fcul-tr-08-6
Abstract: One of the current challenges in the Life Sciences is to extract the knowledge contained in the vast amount of data produced by genomic and post-genomic techniques. A major effort in this area was the development of the Gene Ontology (GO), a BioOntology whose terms describe gene products and are organized in a graph structure. Gene products annotated with GO terms can be compared on the basis of those annotations. This comparison is called semantic similarity; it relies on the structure of the BioOntology and the relations between its terms, focusing either on a structural comparison or, more frequently, on the semantic similarity between the terms themselves. In this work, I developed two novel hybrid measures of semantic similarity for proteins based on the Gene Ontology: simGIC (Graph-Information Content similarity) and simGED (Graph-Edit-Distance similarity). These measures were designed to take into account both graph attributes and the terms' information content, thus capturing more information than previously existing measures, which focused mostly on a single aspect (graph structure or term similarity). The two measures were evaluated against several previously proposed measures using two strategies: relationship with sequence similarity and correlation with family similarity. Because most measures showed the same behaviour and similar correlation values in the sequence similarity studies, the evaluation metric there was the resolution of the measures, i.e. the range of semantic similarity values they cover. Overall, simGIC was the best performer, with both the highest resolutions in the sequence similarity evaluation and the highest correlation with family similarity, while simGED obtained above-average results. The influence of electronic annotations was also investigated, but I found no conclusive evidence to support the general view that these are unreliable for use in semantic similarity studies.
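The abstract names simGIC but does not give its formula. As the measure is commonly described in the literature, it is a Jaccard index weighted by information content (IC), computed over the GO terms annotating each protein together with all of their ancestors. The Python sketch below illustrates that idea under stated assumptions: the toy ontology (`PARENTS`), the annotation frequencies (`FREQ`), and all function names are hypothetical and are not taken from the thesis.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (not real GO terms).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A"},
    "D": {"A", "B"},
}

# Hypothetical annotation frequencies: fraction of proteins annotated with
# each term or any of its descendants (the root covers every protein).
FREQ = {"root": 1.0, "A": 0.5, "B": 0.4, "C": 0.1, "D": 0.2}


def ancestors_inclusive(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def information_content(term):
    """IC(t) = -log2 p(t); the root term therefore has IC 0."""
    return -math.log2(FREQ[term])


def extended_terms(annotations):
    """Union of each annotated term with its ancestors."""
    terms = set()
    for term in annotations:
        terms |= ancestors_inclusive(term)
    return terms


def sim_gic(annotations_a, annotations_b):
    """IC-weighted Jaccard index between two extended annotation sets."""
    a = extended_terms(annotations_a)
    b = extended_terms(annotations_b)
    union_ic = sum(information_content(t) for t in a | b)
    if union_ic == 0:
        return 0.0  # only zero-IC terms (e.g. the root) are involved
    shared_ic = sum(information_content(t) for t in a & b)
    return shared_ic / union_ic


if __name__ == "__main__":
    print(sim_gic({"C"}, {"D"}))
```

In this sketch, two proteins that share only general (low-IC) ancestors score close to 0, while identical annotation sets score 1. This is one way a measure can combine graph structure with term information content, as the abstract describes.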
Description: Master's thesis in Bioinformatics, presented to the Universidade de Lisboa, through the Faculdade de Ciências, 2007
URI: http://hdl.handle.net/10451/14056
http://repositorio.ul.pt/handle/10455/3075
Appears in collections: FC-DI - Master Thesis (dissertation)

Files in this record:
File: 08-6.pdf; Size: 1.37 MB; Format: Adobe PDF; Access: restricted (request a copy from the author)



All records in the repository are protected by copyright law, with all rights reserved.