Примена правила придруживања и метода подржавајућих вектора за предвиђање Т - ћелијских епитопа
Application of association rule and support vector machine technique for T - cell epitope prediction
Jandrlić, Davorka R.
Faculty: University of Belgrade, Faculty of Mathematics
Data mining (Serbian: Истраживање података) is an interdisciplinary field of computer science that deals with the automatic or semi-automatic discovery of knowledge in data. The main task of data mining is to extract nontrivial, previously unknown and potentially useful patterns, relationships and connections in data, as well as statistically significant structures, from large data collections. It is imperative that the results obtained be new, valid, useful and understandable. Data mining techniques include statistical models, mathematical algorithms and machine learning methods...
Data mining is an interdisciplinary subfield of computer science that draws on various scientific disciplines, such as database systems, statistics, machine learning and artificial intelligence. The main task of data mining is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, nontrivial and interesting patterns. Rapid development in immunology, genomics, proteomics, molecular biology and related areas has caused a large increase in biological data. Drawing conclusions from these data requires sophisticated computational analyses. Without automatic extraction methods it is almost impossible to investigate and analyze these data. Currently, one of the most active problems in immunoinformatics is T-cell epitope identification. Identification of T-cell epitopes, especially dominant T-cell epitopes widely represented in the population, is of immense relevance in vaccine development and in detecting immunological
patterns characteristic of autoimmune diseases. Epitope-based vaccines are of great importance in combating infectious and chronic diseases and various types of cancer. Experimental methods for identifying T-cell epitopes are expensive and time-consuming, and they are not applicable to large-scale research (especially not to choosing the optimal group of epitopes for a vaccine that covers the whole population, or for personalized vaccines). Computational and mathematical models for T-cell epitope prediction, based on MHC-peptide binding, are crucial for the systematic investigation and identification of T-cell epitopes in large datasets and for complementing expensive and time-consuming experimentation. T cells (T lymphocytes) recognize protein antigens only when they are degraded to peptide fragments and complexed with Major Histocompatibility Complex (MHC) molecules on the surface of antigen-presenting cells. The binding of these peptides (potential epitopes) to MHC molecules and their presentation to T cells is a crucial (and the most selective) step in both cellular and humoral adaptive immunity. Numerous methodologies for identifying these epitopes currently exist. The methods discussed in this PhD thesis are based exclusively on peptide sequence binding to MHC molecules. The thesis describes existing methodologies for T-cell epitope prediction, their shortcomings, and some of the available databases of experimentally determined linear T-cell epitopes. New models for T-cell epitope prediction are developed using data mining techniques, and an extensive analysis is carried out of whether disorder and hydropathy prediction methods could help in understanding epitope processing and presentation. Accurate computational prediction of T-cell epitopes, which is the aim of this thesis, can greatly expedite epitope screening by reducing costs and experimental effort.
This thesis deals with predictive data mining tasks (classification and regression) and descriptive data mining tasks (clustering, association rules and sequence analysis). The newly developed models, which are the main contribution of the dissertation, are comparable in performance with the best existing methods, and in some cases better. The developed models are based on the support vector machine technique for classification and regression problems. A new approach to extracting the most important physicochemical properties that influence the classification of MHC-binding ligands is also presented. For that purpose, new clustering-based classification models were developed, based on the k-means clustering technique. The second part of the thesis concerns establishing rules and associations for T-cell epitopes that belong to different protein structures. The task of this part of the research was to find out whether disorder and hydropathy prediction methods could help in understanding epitope processing and presentation. The results of applying an association rule technique, together with a thorough analysis of a large protein dataset in which T-cell epitopes, protein structure and hydropathy were determined computationally using publicly available tools, are presented. In the course of the research, a new extensible open-source software system was developed that supports bioinformatics research and has wide application in predicting various protein characteristics. Part of this thesis is described in papers that have been published or submitted for publication in several journals. The dissertation is organized as follows. Section 1 introduces the problem of identifying T-cell epitopes, the importance of mathematical and computational methods in this area, the importance of T-cell epitopes to the immune system, and the basis of the immune system's functioning.
Section 2 describes in detail the data mining techniques used in the thesis to develop the new models. Section 3 provides an overview of existing methods for predicting T-cell epitopes and explains how they work. It points out the shortcomings of existing methods, which motivated the development of new models for T-cell epitope prediction, and describes some of the publicly available databases of experimentally determined MHC-binding peptides and T-cell epitopes. Section 4 presents the newly developed models for epitope prediction. These models include three new encoding schemes that represent peptide sequences as vectors, a form more suitable as input to models based on data mining techniques. Section 5 reports the results of the new classification and regression models. The new models are compared with each other as well as with existing methods for T-cell epitope prediction. Section 6 presents the res...
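The idea of representing a peptide sequence as a vector, mentioned in the description of Section 4, can be illustrated with a small sketch. The abstract does not specify the three encoding schemes developed in the thesis, so the code below is not one of them: it assumes a generic one-hot (sparse) encoding, in which every residue becomes a block of 20 zeros and ones. The names `AMINO_ACIDS` and `one_hot_encode` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a generic one-hot ("sparse") encoding.
# This is an assumption for demonstration, NOT one of the thesis's schemes.

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(peptide: str) -> list[int]:
    """Map a peptide to a flat 0/1 vector with 20 entries per residue."""
    vector: list[int] = []
    for residue in peptide.upper():
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unknown residue: {residue!r}")
        # One block per residue: a single 1 at the residue's index.
        block = [0] * len(AMINO_ACIDS)
        block[AMINO_ACIDS.index(residue)] = 1
        vector.extend(block)
    return vector


if __name__ == "__main__":
    encoded = one_hot_encode("SIINFEKL")  # example 8-residue peptide
    print(len(encoded), sum(encoded))
```

Under this assumed scheme, a peptide of length n yields a vector of length 20n containing exactly n ones, so peptides of equal length map to vectors of equal dimension, which is the kind of fixed-size input that a support vector machine or k-means clustering expects.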
Keywords: подржавајући вектори, класификација, регресија, груписање к-срединама, правила придруживања, Т-ћелијски епитопи; support vector machine, classification, regression, k-means clustering, association rules, T-cell epitopes