The International Journal of Multimedia & Its Applications (IJMA), Vol. 3, No. 1, February 2011
DOI: 10.5121/ijma.2011.3102

A ROBUST MODEL FOR GENE ANALYSIS AND CLASSIFICATION

Fatemeh Aminzadeh, Bita Shadgar¹, Alireza Osareh
Department of Computer Engineering, Shahid Chamran University, Ahvaz, Iran
{f.aminzadeh, bita.shadgar, a.osareh}@scu.ac.ir
¹ Corresponding author

ABSTRACT

The development of microarray gene technology has provided a large volume of data to many fields. Microarray data analysis and classification has proved to be an effective methodology for the diagnosis of diseases and cancers. Although much research has been performed on applying machine learning techniques to microarray data classification over the past years, it has been shown that conventional machine learning techniques have intrinsic drawbacks in achieving accurate and robust classifications. It is therefore more desirable to make a decision by combining the results of various expert classifiers rather than by depending on the result of only one classifier. We address microarray-based cancer classification using a recently proposed ensemble classifier generation technique referred to as RotBoost, which is constructed by combining Rotation Forest and AdaBoost. The experiments, conducted on 8 microarray datasets with a classification tree adopted as the base learning algorithm, demonstrate that RotBoost can generate ensemble classifiers with significantly lower prediction error than either Rotation Forest or AdaBoost more often than the reverse.

KEYWORDS

Microarray Analysis, Ensembles, Cancer Classification, Rotation Forest

1. INTRODUCTION

Microarray technology has provided the ability to measure the expression levels of thousands of genes simultaneously in a single experiment. This scheme provides an effective experimental protocol for gaining insight into the cellular mechanism and the nature of complex biological processes. Microarray data analysis has been developing at a fast pace in recent years and has become a popular and standard approach in most current genomics research [1].

Each spot on a microarray chip contains the clone of a gene from a tissue sample. The mRNA samples are labelled with two different kinds of dyes, for example, Cy5 (red) and Cy3 (blue). After each mRNA interacts with the genes, i.e., hybridization, the color of each spot on the chip changes, and the resulting image reflects the characteristics of the tissue at the molecular level [2-3].

However, the amount of data in each microarray is too overwhelming for manual analysis, since a single sample often contains measurements for around 10,000 genes. Due to this excessive amount of information, producing results efficiently requires automatic, computer-controlled analysis. Many computational tools have been applied to mine this huge amount of gene expression data for biologically meaningful knowledge.

One of the most important computational approaches to the analysis of microarray data is machine learning. Machine learning techniques have been successfully applied to the cancer classification problem using gene microarray data [4].

As more and more gene microarray datasets become publicly available, developing technologies for the analysis of such data becomes an essential task [5, 6].
To this end, various machine learning and pattern recognition methods have been increasingly utilized, for instance discriminant analysis [7], neural networks [8], and support vector machines [9-11]. A considerable amount of the research involving microarray data analysis is focused on cancer classification, aiming to classify test cancer samples into known classes with the help of a training set containing samples whose classes are known [12]. To tackle this issue, several methods based on gene expression data have been suggested. Some of these are applicable only to binary classification, such as the weighted voting scheme of Golub et al. [13], whereas others can handle multi-class problems. These approaches range from traditional methods, such as Fisher's linear and quadratic discriminant analysis, to more modern machine learning techniques, such as classification trees or aggregation of classifiers by bagging or boosting (for a review see [14]) [12]. There are also approaches which are able to identify test samples that do not belong to any of the known classes by imposing thresholds on the prediction strength [13, 15].

Despite this progress, gene microarray cancer classification remains a great challenge for computer scientists. The main challenges lie in the nature of microarray data, which are mostly high-dimensional and noisy. Natural biological instabilities are likely to introduce measurement variations and complicate microarray analysis [4]. This makes learning from microarray data a difficult task, especially under the curse of dimensionality. Indeed, gene expression data often contain many irrelevant and redundant features, which in turn can affect the efficiency of most machine learning techniques.

There is therefore a great need for robust methods that are able to overcome the limitation of the small number of microarray input instances and reduce the influence of uncertainties, so as to produce reliable classification (cancerous/non-cancerous) results. In most cases, a single classification model may not lead to high classification accuracy. Instead, multiple classifier systems (ensemble learning methods) have proved to be an effective way to increase the prediction accuracy and robustness of a learning system [16].

Although the application of multiple classifier systems (MCS) to microarray dataset classification is still a new field, several different MCSs have recently been proposed to deal with the gene microarray data classification problem. For example, Dettling et al. [17] used a revised boosting algorithm for tumor classification, Ramon et al. [18] applied Random Forest to tackle the gene selection and classification problems simultaneously, and Peng [19] designed an SVM ensemble system for microarray dataset prediction.

These techniques generally work by first generating an ensemble of base classifiers, applying a given base learning algorithm to different alternative training sets, and then combining the outputs of the ensemble members in a suitable way to form the prediction of the ensemble classifier. The combination is often performed by voting for the most popular class, as in the sketch below. Examples of these techniques include Bagging, AdaBoost, Rotation Forest and Random Forest [20].
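As a concrete illustration of this generate-and-vote scheme, the following minimal Python sketch (our own illustration, not code from any of the cited works) trains decision trees, used here purely as an example base learner, on bootstrap-resampled training sets and combines their predictions by plurality voting, which amounts to a small Bagging-style ensemble.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_voting_ensemble(X, y, n_members=10, random_state=0):
        """Train base classifiers on bootstrap replicates of the original training set."""
        rng = np.random.default_rng(random_state)
        n = X.shape[0]
        members = []
        for _ in range(n_members):
            idx = rng.integers(0, n, size=n)          # alternative training set (bootstrap sample)
            members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return members

    def predict_by_voting(members, X):
        """Combine the members' outputs by voting for the most popular class."""
        votes = np.stack([clf.predict(X) for clf in members])   # shape: (n_members, n_samples)
        # assumes class labels are encoded as non-negative integers 0..k-1
        return np.array([np.bincount(col).argmax() for col in votes.T])

AdaBoost and Rotation Forest differ from this sketch mainly in how the alternative training sets are produced, as discussed next.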
The AdaBoost technique creates a mixture of classifiers by applying a given base learning algorithm to successively derived training sets, which are formed either by resampling from the original training set or by reweighting the original training set according to a set of weights maintained over it [20]. Initially, the weights assigned to the training instances are equal; in subsequent iterations, these weights are adjusted so that the weight of the instances misclassified by the previously trained classifiers is increased, whereas that of the correctly classified ones is decreased. Thus, AdaBoost attempts to produce new classifiers that are better able to predict the "hard" instances for the previous ensemble members.

The main idea of Rotation Forest is to provide both diversity and accuracy within an ensemble classifier. Diversity is promoted by a principal component analysis (PCA) based feature extraction performed separately for each base classifier, while accuracy is sought by keeping all principal components and using the whole data set to train each base classifier.
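To make this rotation step concrete, the following minimal Python sketch (our own simplified illustration, not the authors' implementation, and omitting the per-subset class and sample subsampling of the full Rotation Forest algorithm) partitions the attributes into small random disjoint subsets, fits a PCA on each subset while keeping all of its components, assembles the loadings into a block rotation matrix, and trains a decision tree on the rotated data. scikit-learn's PCA and DecisionTreeClassifier stand in for the feature extractor and base learner, and the input is assumed to be already centred.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.tree import DecisionTreeClassifier

    def train_rotation_tree(X, y, subset_size=3, random_state=0):
        """Build one Rotation Forest member: PCA-rotate random attribute subsets, then fit a tree."""
        rng = np.random.default_rng(random_state)
        n, d = X.shape
        perm = rng.permutation(d)
        subsets = [perm[i:i + subset_size] for i in range(0, d, subset_size)]   # disjoint subsets
        R = np.zeros((d, d))                          # block "rotation" matrix, filled subset by subset
        for cols in subsets:
            # keep ALL principal components of this subset (this preserves accuracy;
            # diversity comes from the random partition); assumes subset_size <= n
            pca = PCA(n_components=len(cols)).fit(X[:, cols])
            R[np.ix_(cols, cols)] = pca.components_.T
        tree = DecisionTreeClassifier(random_state=random_state).fit(X @ R, y)
        return tree, R                                # rotate test data with the same R before predicting

At prediction time, a test sample x is classified as tree.predict(x @ R); each ensemble member uses its own random partition and hence its own rotation matrix.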
In view of the fact that both AdaBoost and Rotation Forest are successful ensemble classifier generation techniques, and that both apply a given base learning algorithm to perturbed training sets to construct their ensemble members, with the only difference lying in how the original training set is perturbed, it is plausible that a combination of the two may achieve even lower prediction error than either of them.

In [21], an ensemble-based technique called RotBoost, constructed by integrating the ideas of Rotation Forest and AdaBoost, is proposed. According to that study, RotBoost performs much better than Bagging and MultiBoost on the utilized benchmark UCI datasets. Here, we are inspired by the RotBoost technique and apply it for the first time to 8 publicly available gene microarray benchmark data sets. We present a comparative study of the RotBoost results against several ensemble and single classifier systems, including AdaBoost, Rotation Forest, Bagging and a single tree. Experimental results reveal that the RotBoost ensemble method (with several base classifiers) performs best among the considered classification procedures and thus produces the highest recognition rate on the benchmark data sets.

The rest of this paper is organized as follows. Section 2 describes our proposed algorithm for an efficient classification of gene microarray data. Section 3 presents experimental results on 8 publicly available benchmark gene microarray datasets. Finally, Section 4 concludes this study.

2. MATERIAL AND METHOD

This paper proposes an approach for the construction of accurate and diverse ensemble members by means of learning from the best subsets of the initial microarray genes. The proposed method comprises three main stages: feature selection based on a fast correlation-based filter, ensemble classifier generation using a combination of the Rotation Forest and AdaBoost algorithms, and evaluation of the generalisation ability of various ensemble/non-ensemble classifier systems. The details of these stages are discussed in the following sections.

2.1. Datasets

In this work, we utilized 8 publicly available benchmark datasets [22]. A brief overview of these datasets is summarized in Table 1. Data pre-processing is an important step for handling gene expression data and includes two steps: filling missing values and normalization. For both the training and test datasets, missing values are filled using the average value of that gene. Normalization is then carried out so that every observed gene expression has mean equal to 0 and variance equal to 1 (a minimal code sketch of these two steps follows Table 1). In summary, the 8 datasets had between 2–5 distinct diagnostic categories, 60–253 instances (samples) and 2000–24481 genes after the data preparatory steps outlined above.

Table 1. Description of the 8 gene microarray datasets.

Dataset                        # Total Genes (T)   # Instances (n)   # Classes (C)
Colon Tumor                    2000                62                2
Central Nervous System (CNS)   7129                60                2
Leukaemia                      6817                72                2
Breast Cancer                  24481               97                2
Lung Cancer                    12533               181               5
Ovarian Cancer                 15154               253               2
MLL                            12582               72                3
SRBCT                          2308                83                4
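The preprocessing described above amounts to gene-wise mean imputation followed by gene-wise standardization. The numpy-only sketch below (our illustration, not the authors' code) shows one way to implement it, assuming missing values are stored as NaN.

    import numpy as np

    def preprocess(X):
        """X: (n_samples, n_genes) expression matrix; missing entries are NaN."""
        X = np.asarray(X, dtype=float).copy()
        # 1) fill each missing value with the average expression of that gene
        gene_means = np.nanmean(X, axis=0)
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = gene_means[cols]
        # 2) standardize so every gene has mean 0 and variance 1
        X -= X.mean(axis=0)
        X /= X.std(axis=0) + 1e-12        # small epsilon guards genes with zero variance
        return X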
2.2. RotBoost Ensemble Technique

As stated before, RotBoost is constructed by integrating the ideas of the Rotation Forest and AdaBoost ensemble classifier generation techniques, with the aim of achieving an even lower prediction error than either of these individual techniques. Consider a training set L defined as

    L = \{(x_i, y_i)\}_{i=1}^{N}    (1)

consisting of N independent instances, in which each instance (x_i, y_i) is described by an input attribute vector

    x_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^{d}    (2)

and a class label y_i which takes its value from the label space φ = {1, 2, ..., k}. In a typical classification problem, the goal is to use the information only from L to construct classifiers that have good generalization ability, namely, that perform well on previously unseen test data which were not used for learning the classifiers.

For simplicity of notation, let X be an N x d matrix composed of the values of the d input attributes for each training instance, and let Y be an N-dimensional column vector containing the outputs of the training instances in L. Alternatively, L can be expressed by concatenating X and Y horizontally, that is, L = [X Y]. The base classifiers included in an ensemble classifier C* are denoted by C_1, C_2, ..., C_T [20], and E = (X_1, X_2, ..., X_d)^T denotes the attribute set composed of the d input attributes. Before presenting the RotBoost algorithm, we briefly review the ensemble methods AdaBoost and Rotation Forest.

AdaBoost [23] is a sequential algorithm in which each new classifier is built by taking into account the performance of the previously generated classifiers. In this ensemble method, a set of weights w_t(i) (i = 1, 2, ..., N) is maintained over the original training set L. The weights are initially set to be equal (namely, all training instances have the same importance). In subsequent iterations, these weights are adjusted so that the weight of the instances misclassified by the previously trained classifiers is increased, whereas that of the correctly classified ones is decreased. In this way, the difficult input samples can be better predicted by the next trained classifiers [20].

In AdaBoost, the training set L_t utilized for learning each base classifier C_t is acquired either by resampling from the original training set L or by reweighting L according to the updated probability distribution w_t maintained over L. Here, the resampling scheme is applied, as it is simpler to implement. Each base classifier C_t is thus trained on a set L_t drawn from L with replacement according to w_t.
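The resampling variant just described can be sketched as follows. This is our own simplified illustration (a binary-AdaBoost-style weight update with a depth-limited scikit-learn tree as the base learner), not the exact procedure of [21] or [23]: in each round a training set L_t is drawn from L with replacement according to the current weights w_t, a classifier C_t is fitted to it, and the weights of the misclassified instances are increased before the next round.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_resampling(X, y, T=10, random_state=0):
        """AdaBoost with the resampling scheme; y holds integer class labels."""
        rng = np.random.default_rng(random_state)
        n = X.shape[0]
        w = np.full(n, 1.0 / n)                       # equal weights at the start
        members, alphas = [], []
        for _ in range(T):
            idx = rng.choice(n, size=n, replace=True, p=w)   # L_t: resample L according to w_t
            clf = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
            miss = clf.predict(X) != y
            err = float(np.dot(w, miss))
            if err == 0.0 or err >= 0.5:              # degenerate round: reset the weights and retry
                w = np.full(n, 1.0 / n)
                continue
            alpha = 0.5 * np.log((1.0 - err) / err)   # classifier weight (binary AdaBoost form)
            # increase the weight of misclassified instances, decrease the others, renormalize
            w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))
            w /= w.sum()
            members.append(clf)
            alphas.append(alpha)
        return members, alphas

At prediction time the members vote as in the earlier sketch, except that each member's vote is weighted by its alpha_t rather than counted uniformly.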