Use this identifier to cite this record: http://hdl.handle.net/10451/22424
Title: Positional amino acid frequency patterns for automatic protein annotation
Author: Silva, Andreia Carina Pereira da
Advisor: Falcão, André Osório e Cruz de Azerêdo, 1969-
Keywords: Automatic protein annotation
K-means clustering
Association rule learning
Gene Ontology
Master's theses - 2015
Defense date: 2015
Abstract: Today, most proteins in protein databases have been annotated through electronic inference. Given the amount of data generated by high-throughput methods, electronic inference remains the only viable way to understand proteins’ biochemical function(s), cellular location(s), and participation in cellular processes, as well as their structure and interactions. The feature learning model proposed here aims to introduce a new perspective on the protein function annotation problem at the positional amino acid level. First, the probabilistic scores for each amino acid at each protein position are acquired via a traditional PSI-BLAST search, which generates a position-specific scoring matrix (PSSM) with that information. Each protein’s positional amino acid frequency patterns (PAFPs) are sieved through a threshold to decrease the number of PAFPs irrelevant to the protein’s function. Afterwards, these are clustered with their closest relatives in Euclidean distance via the k-means algorithm, identifying in this manner a sort of fingerprint of amino acid score patterns. These are then associated with Gene Ontology (GO) terms retrieved for the training proteins, using the arules package from R; that is, association rules are established between the resulting k-means clusters of PAFPs and the GO terms. The threshold of 300 for the sum of PAFPs generated 280 GO terms, with a support of 0.0005 (about 30 proteins) and a confidence of 40%. These terms were used to describe 516591 proteins out of 549008 in the July 2015 release of Swiss-Prot. Most GO terms were not at leaf level but higher. The model infers far more proteins for each GO term than the ones annotated to it; however, it also fails to allocate some proteins annotated with the GO term, resulting in high recall levels but not equivalently high precision. Note, however, that these results do not mean the inference is incorrect, only that there is no evidence to support it one way or the other.
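The filtering and clustering steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the toy 4-column rows (a real PSSM row has 20 amino acid columns), the toy threshold, and the rule "keep a row when its summed scores reach the threshold" are all assumptions, since the abstract does not state how the threshold is applied. The k-means routine is a plain Lloyd-style loop written with the standard library only.

```python
import math
import random


def filter_pafps(rows, threshold):
    """Keep position rows whose summed scores reach the threshold.

    Assumption: the abstract does not specify the exact filtering rule.
    """
    return [row for row in rows if sum(row) >= threshold]


def assign(points, centroids):
    """Label each point with the index of its nearest centroid (Euclidean)."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style k-means: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids), centroids


# Hypothetical toy rows standing in for per-position PSSM score vectors.
toy_rows = [
    [9, 1, 0, 0],
    [8, 2, 0, 0],
    [0, 0, 9, 1],
    [0, 1, 8, 2],
    [1, 0, 1, 0],  # low summed score, removed by the filter
]

kept = filter_pafps(toy_rows, threshold=5)
labels, centroids = kmeans(kept, k=2)
print(labels)
```

In this toy run the first two kept rows end up in one cluster and the next two in the other; each cluster index then plays the role of a pattern "fingerprint" that can be attached to a protein.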
Also, the training set contains 7271 GO terms with a support of at least 30 proteins, so the model would be expected to return a similar number of identified GO terms. Despite falling short of what was expected, the results strongly suggest that the existence of certain PAFPs within proteins may be important for their function. It is also interesting that the strongest signal was found for terms whose positive ratio is very low, which are typically very difficult classification problems. The results strongly suggest that it may be possible to find annotation clues by looking at amino acid substitution patterns alone. The results, however, were not perfect, and more work will certainly be required to further validate the initial findings.
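To make the support and confidence figures in the abstract concrete, the following Python sketch computes both measures for single-cluster rules of the form "cluster → GO term". The thesis used the arules package from R; this is a hand-written stand-in, and the function name, cluster labels, GO placeholders, and thresholds are hypothetical. Support is the fraction of all proteins that carry both the cluster and the term; confidence is the fraction of proteins carrying the cluster that also carry the term.

```python
from collections import Counter


def cluster_to_go_rules(proteins, min_support, min_confidence):
    """Return (cluster, go_term, support, confidence) for rules passing both cutoffs.

    proteins: list of (cluster_ids, go_terms) pairs, each a set.
    """
    n = len(proteins)
    cluster_counts = Counter()
    pair_counts = Counter()
    for clusters, go_terms in proteins:
        for c in clusters:
            cluster_counts[c] += 1
            for g in go_terms:
                pair_counts[(c, g)] += 1

    rules = []
    for (c, g), both in pair_counts.items():
        support = both / n                    # share of all proteins with c and g
        confidence = both / cluster_counts[c]  # share of c-proteins that have g
        if support >= min_support and confidence >= min_confidence:
            rules.append((c, g, support, confidence))
    return sorted(rules)


# Hypothetical toy training set: (PAFP cluster ids, annotated GO terms).
toy_proteins = [
    ({"c1", "c2"}, {"GO:A"}),
    ({"c1"}, {"GO:A", "GO:B"}),
    ({"c1", "c3"}, {"GO:B"}),
    ({"c2"}, {"GO:A"}),
    ({"c3"}, set()),
]

rules = cluster_to_go_rules(toy_proteins, min_support=0.3, min_confidence=0.6)
for cluster, term, support, confidence in rules:
    print(f"{cluster} -> {term}: support={support:.2f}, confidence={confidence:.2f}")
```

With these toy inputs, three rules pass, and the rule from c3 to GO:B is dropped because only one of five proteins supports it. On the same reading, the abstract's support of 0.0005 is a cutoff on the share of training proteins matching a rule, and the 40% confidence is a cutoff on how often a cluster's proteins carry the term.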
Description: Master's thesis, Bioinformatics and Computational Biology (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2015
URI: http://hdl.handle.net/10451/22424
Degree: Master's thesis in Bioinformatics and Computational Biology (Bioinformatics)
Appears in collections: FC-DI - Master Thesis (dissertation)

Files in this record:
File: ulfc116073_tm_Andreia_Silva.pdf
Size: 17.95 MB
Format: Adobe PDF

All records in the repository are protected by copyright law, with all rights reserved.