A Novel Framework for Imputation of Missing Values in Databases

Description
Abstract Many of the industrial and research databases are plagued by the problem of missing values. Some evident examples include databases associated with instrument maintenance, medical applications, and surveys. One of the common ways to cope
Categories
Published
of 18
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Related Documents
Share
Transcript
Alireza Farhangfar, Lukasz A. Kurgan, Member, IEEE, and Witold Pedrycz, Fellow, IEEE

Manuscript received October 1, 2004; revised May 25, 2005. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada. The work of W. Pedrycz was supported by the Canada Research Chairs Program. This paper was recommended by Associate Editor D. Zhang. The authors are with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2V4, Canada (e-mail: farhang@ece.ualberta.ca; lkurgan@ece.ualberta.ca; pedrycz@ece.ualberta.ca). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCA.2007.902631.

Abstract—Many of the industrial and research databases are plagued by the problem of missing values. Some evident examples include databases associated with instrument maintenance, medical applications, and surveys. One of the common ways to cope with missing values is to complete their imputation (filling in). Given the rapid growth of the sizes of databases, it becomes imperative to come up with a new imputation methodology along with efficient algorithms. The main objective of this paper is to develop a unified framework supporting a host of imputation methods. In the development of this framework, we require that its usage should (on average) lead to a significant improvement of the accuracy of imputation while maintaining the same asymptotic computational complexity of the individual methods. Our intent is to provide a comprehensive review of the representative imputation techniques. It is noticeable that the use of the framework in the case of a low-quality single-imputation method has resulted in imputation accuracy that is comparable to the one achieved when dealing with some other advanced imputation techniques. We also demonstrate, both theoretically and experimentally, that the application of the proposed framework leads to a linear computational complexity and, therefore, does not affect the asymptotic complexity of the associated imputation method.

Index Terms—Accuracy, databases, missing values, multiple imputation (MI), single imputation.

I. INTRODUCTION

Many industrial and research databases are plagued by an unavoidable problem of data incompleteness (missing values). Behind this serious deficiency, there are a number of evident reasons, including imperfect procedures of manual data entry, incorrect measurements, and equipment errors. In many areas of application, it is not uncommon to encounter databases that have up to or even more than 50% of their entries missing. For example, an industrial instrumentation maintenance and test database maintained by Honeywell [31] has more than 50% of missing data, despite the strict regulatory requirements for data collection. Another application domain overwhelmed by missing values arises in medicine; here, almost every patient record lacks some values, and almost every attribute used to describe patient records is lacking values for some records [17]. For example, a medical database of patients with cystic fibrosis with more than 60% of its entries missing was analyzed in [30]. One of the reasons why medical databases are so heavily exposed is that most medical data are collected as a by-product of patient care activities, rather than for an organized research protocol [17]. At the same time, the majority of prior studies related to missing data concern relatively low amounts of missing data, usually below 20% [1], [4], [41]. In contrast, in this paper, we are concerned with databases with up to 50% of missing data.
Missing values make it difficult for analysts to perform data analysis. Three types of problems are usually associated with missing values: 1) loss of efficiency; 2) complications in handling and analyzing the data; and 3) bias resulting from differences between missing and complete data [3]. Although some methods of data analysis can cope with missing values on their own, many others require complete databases. Standard statistical software works only with complete data or uses very generic methods for filling in missing values [31]. Other data processing packages that are used for visualization and modeling often use and display only the complete records or map missing values to an arbitrary fixed value, e.g., -1 or 999999, thus leading to distortion of the presented results. Hence, in all such cases, imputation plays an important role. It could also be invaluable in cases when the data need to be shared and the individual users may not have the resources to deal with their incompleteness [33], [44].

There are two general approaches to dealing with the problem of missing values: they can be ignored (removed) or imputed (filled in) with new values. The first solution is applicable only when a small amount of data is missing. Since in many cases databases contain relatively large amounts of missing data, it is more constructive and practically viable to consider imputation. A number of different imputation methods have been reported in the literature. Traditional imputation methods rely on statistics, ranging from simple algorithms such as mean and hot-deck imputation to complex methods including regression-based imputation and the expectation–maximization (EM) algorithm. In recent years, a new family of imputation methods, which uses machine learning (ML) algorithms, was proposed. Another major development comes in the form of multiple imputations (MIs), first described by Rubin in the 1970s [43]. In this case, each missing value is imputed m times (usually, m is between 3 and 5) by the same imputation algorithm, which uses a model that incorporates some randomness. As a result, m "complete" databases are generated, and usually, the average of the estimates across the samples is used to generate the final imputed value. The development of such methods was mainly driven by a need to improve the accuracy of imputation. Early methods were very simple and computationally inexpensive. Newer methods use more complex procedures, which could improve the quality of imputation but come at a higher computational effort. At the same time, we have witnessed a rapid growth of the size of databases.
Recently published results of a 2003 survey on the largest and most heavily used commercial databases show that the average size of Unix databases experienced a sixfold increase when compared to year 2001; for Windows databases, this growth was 14-fold. Large commercial databases now average ten billion data points [57].

The main objective of this paper is to build a new framework aimed at improving the quality of existing imputation methods (we will refer to them as base imputation methods). The framework should meet three requirements.

1) It should improve the accuracy of imputation when compared to the accuracy resulting from the use of a single base imputation method.
2) An application of the framework to a base imputation method should not worsen its asymptotic computational complexity.
3) It should be applicable to a wide range of generic (base) imputation methods, including both statistical and ML-based imputation techniques.

To meet these requirements, in the proposed framework, we impute some of the missing values several times. Furthermore, the overall environment is characterized by several important features that clearly distinguish it from other MI methods.

• It imputes only a subset of the missing values multiple times. The imputation is executed in an iterative manner. At each iteration, some high-quality imputed values are accepted, and the remaining lower quality missing values are imputed again (multi-imputed). Assuming that at each iteration half of the values are imputed (the framework uses a mean-based parameter to select imputed values, which for data with a normal distribution approximates half of the values) and that ten iterations are executed, the number of imputations becomes equal to

$$\sum_{i=0}^{9} \frac{1}{2^i}\,k = k + \frac{1}{2}k + \frac{1}{4}k + \cdots + \frac{1}{512}k < 2k$$

where k is the number of missing values. In contrast, in the case of MIs, every missing value is imputed several times, and for the typical values of m, the number of imputations is not less than 3k, but it can be as large as 10k [52]. Therefore, the framework is more efficient (see the sketch after this list).

• It uses the high-quality accepted imputed values from the previous iterations to impute the remaining missing values. In contrast, MIs use the original database containing all the missing values and, thus, do not take advantage of already imputed values. Therefore, the imputation procedure of the proposed framework is possibly more accurate since, in each iteration, more data are used to infer the imputation model for imputing the remaining missing values. This hypothesis was confirmed experimentally in this paper.
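As a quick check of the bound above, the per-iteration imputation counts form a geometric series; the short Python sketch below (our illustration, not from the paper) sums them for an arbitrary k and compares the total against the typical MI cost.

```python
# Illustrative check of the imputation-count bound from the text:
# with half of the remaining imputed values accepted per iteration,
# the total number of imputations over ten iterations stays below 2k.
k = 1000  # number of missing values (arbitrary example)

total = sum(k / 2**i for i in range(10))  # k + k/2 + ... + k/512
print(total)          # 1998.046875
assert total < 2 * k  # the bound claimed in the text
print(3 * k, 10 * k)  # typical MI cost range for comparison: 3k to 10k
```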
Extensive experimental results presented in this paper show that the proposed imputation framework results in a substantial improvement of imputation accuracy. We show that using the proposed framework with very simple imputation methods, such as hot deck, produces accuracy of imputation that surpasses the quality of results generated by advanced statistical and MI methods while preserving low computational overhead. This advantage is clearly demonstrated by applying the proposed framework to an imputation technique of linear complexity (i.e., an ML-based imputation using Naïve Bayes). The resulting imputation method was also linear, and its accuracy is higher than that of several other imputation methods, including complex statistical and MI techniques.

This paper is organized in the following manner. We first review a number of representative imputation methods (Section II). Section III elaborates on the structure of the proposed framework. In Section IV, we report on experimental results and offer an extensive comparative analysis. Conclusions and recommendations are covered in Section V. Throughout the text, the term database pertains to a relational data set.

II. BACKGROUND AND RELATED WORK

A. Background

In what follows, we are concerned with databases consisting of one or multiple tables, where columns describe attributes (features), and rows denote records (examples or data points). Fig. 1 shows a typical database involving five attributes; note that some of them have missing values denoted by "?".

Fig. 1. Database containing missing values.

In general, the attributes can be numerical discrete, numerical continuous, and nominal. In this paper, we are dealing with imputation procedures for discrete attributes, i.e., discrete numerical and nominal. We note that the two main application areas of missing data imputation procedures are concerned with equipment maintenance databases [31] and survey data [23], [29], [43], [45], both of which use discrete data.

Some of the missing data imputation algorithms are supervised; that is, they require some class attribute. They impute missing values one attribute at a time by setting it to be the class attribute and using data from the remaining attributes to generate a classification model, which is used to perform imputation.

The three different modes that lead to the introduction of missing values are: 1) missing completely at random (MCAR); 2) missing at random (MAR); and 3) not missing at random (NMAR) [31], [33]. The MCAR mode applies when the probability of a record having a missing value for an attribute does not depend on either the complete data or the missing data. This mode usually does not hold for nonartificial databases. Its relaxed version, i.e., the MAR mode, where the distribution depends on the data but does not depend on the missing data itself, is assumed by most of the existing methods for imputation of missing data [51], and therefore, it is also assumed in this paper. In the case of the MCAR mode, the assumption is that the distributions of missing and complete data are the same, whereas for the MAR mode, they are different, and the missing data can be predicted by using the complete data [33]. The third mode, i.e., NMAR, where the distribution depends on the missing values themselves, is rarely used in practice.
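To make the MCAR/MAR distinction concrete, the following sketch (our illustration with hypothetical attributes, not from the paper) generates both kinds of missingness masks for a two-attribute table: under MCAR the mask is independent of the data, while under MAR the chance that the second attribute is missing depends only on the observed value of the first.

```python
# Illustration of MCAR vs. MAR missingness (hypothetical example).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(20, 80, size=n)        # fully observed attribute
income = rng.normal(50.0, 10.0, size=n)   # attribute that will have gaps

# MCAR: each income entry is missing with a fixed probability,
# independent of both attributes.
mcar_mask = rng.random(n) < 0.2

# MAR: the probability that income is missing depends only on the
# *observed* age, not on the (unobserved) income value itself.
mar_mask = rng.random(n) < np.where(age > 60, 0.4, 0.1)

income_mcar = np.where(mcar_mask, np.nan, income)
income_mar = np.where(mar_mask, np.nan, income)
print(np.isnan(income_mcar).mean(), np.isnan(income_mar).mean())
```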
B. Related Work

The existing methods for dealing with missing values can be divided into two main categories: 1) missing data removal and 2) missing data imputation. The removal of missing values is concerned with discarding the records with missing values or removing attributes that have missing entries. The latter can be applied only when the removed attributes are not needed to perform data analysis. Both removal of records and removal of attributes result in decreasing the information content of the data. They are practical only when a database contains a small amount of missing data and when the ensuing analysis of the remaining complete records will not be biased by the removal [31]. They are usually performed when dealing with missing data introduced in the MCAR mode. Another method belonging to the same category proposes substituting the missing values of each attribute with an additional category. Although this method provides a simple and easy-to-implement solution, its usage results in substantial problems during the subsequent analysis of the resulting data [55].

The imputation of missing values uses a number of different algorithms, which can be further subdivided into single-imputation and MI methods. In the case of single-imputation methods, a missing value is imputed by a single value, whereas in the case of MI methods, several, usually likelihood-ordered, choices for imputing the missing value are computed [43]. Rubin defines MI as a process where several complete databases are created by imputing different values to reflect the uncertainty about the right values to impute. In the next step, each of the databases is analyzed by standard procedures specific to handling complete data. At the end, the analyses for the databases are combined into a final result [11], [44]. Fig. 2 illustrates the flow of operations in the MI procedure.

Fig. 2. Flow of operations in MI.
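A minimal sketch of Rubin's impute-analyze-combine flow follows (our illustration; the helper names and the sampling-based imputer are placeholders standing in for any randomized imputation algorithm, not the paper's method):

```python
# Skeleton of the MI flow in Fig. 2 (hypothetical helper names).
import numpy as np

def impute_once(db, rng):
    """Fill each missing entry by sampling from the attribute's
    observed values -- a stand-in for any randomized imputer."""
    out = db.copy()
    for j in range(out.shape[1]):
        col = out[:, j]
        missing = np.isnan(col)
        if missing.any():
            col[missing] = rng.choice(col[~missing], size=missing.sum())
    return out

def analyze(db):
    """Stand-in for the standard complete-data analysis step."""
    return db.mean(axis=0)

rng = np.random.default_rng(0)
db = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 6.0]])

m = 5  # typically between 3 and 5 per the text
estimates = [analyze(impute_once(db, rng)) for _ in range(m)]
final = np.mean(estimates, axis=0)  # combine the m analyses
print(final)
```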
Several approaches have been developed to perform MI. Li [32] and Rubin and Schafer [42] use Bayesian algorithms that support imputation by using the posterior predictive distribution of the missing data based on the complete data. The Rubin–Schafer method assumes the MAR mode, as well as a multivariate normal distribution for the data. Alzola and Harrell introduce a method that imputes each incomplete attribute by cubic spline regression given all other attributes, without assuming that the data can be modeled by a multivariate distribution [2]. The MI methods are computationally more expensive than the single-imputation techniques, but at the same time, they better accommodate the sample variability of the imputed value and the uncertainty associated with a particular model used for imputation [31]. Detailed descriptions of MI algorithms can be found in [45], [51], [52], and [59].

Both the single-imputation and MI methods can be divided into three categories: 1) data driven; 2) model based; and 3) ML based [31], [33], [38]. Data-driven methods use only the complete data to compute imputed values. Model-based methods use some data models to compute imputed values; they assume that the data are generated by a model governed by unknown parameters. Finally, ML-based methods use the entire available data and consider some ML algorithm to perform imputation.

The data-driven methods include simple imputation procedures such as mean, conditional mean, hot-deck, cold-deck, and substitution imputation [31], [49]. The mean and hot-deck methods are described in detail later in this paper, whereas the remaining methods are only applicable to special cases. The cold-deck imputation requires an additional database, other than the database with missing values, to perform imputation, which is usually not available to the data analyst. The substitution method is applicable specifically to survey data, which significantly narrows down its possible application domains.

Several model-based imputation algorithms are described in [33]. The leading methods include regression-based, likelihood-based, and linear discriminant analysis (LDA)-based imputation. In regression-based methods, missing values for a given record are imputed by a regression model based on the complete values of attributes for that record. The method requires multiple regression equations, each for a different set of complete attributes, which can lead to a high computational cost. Also, different regression models must be used for different types of data; that is, linear or polynomial models can be used for continuous attributes, whereas log-linear models are suitable for discrete attributes [31]. The likelihood-based methods can be considered to impute values only for discrete attributes. They assume that the data are described by a parameterized model, where the parameters are estimated by maximum-likelihood or maximum a posteriori procedures, which use different variants of the EM algorithm [18], [33].

Recently, several ML algorithms were applied to the design and implementation of imputation methods. A probabilistic imputation method that uses probability density estimates and a Bayesian approach was applied as a preprocessing step for an independent module analysis system [13]. Neural networks were used to implement missing data imputation methods [26], [55]. An association rule algorithm, which belongs to the category of algorithms encountered in data mining, was used to perform MI of discrete data [58]. Recently, algorithms of supervised ML were used to implement imputation. In this case, imputation is performed one attribute at a time, where the selected attribute is used as a class attribute. An ML algorithm is used to generate a data model from the records in which the class attribute is complete, and the generated model is used to perform classification to predict the missing values of the class attribute. Several different families of supervised ML algorithms, such as decision trees, probabilistic algorithms, and decision rules [18], can be used; however, the underlying methodology remains the same. For example, a decision tree C4.5 [39], [40] and a probabilistic algorithm Autoclass [14] were used in [31], whereas a decision rule algorithm CLIP4 [15], [16] and a probabilistic algorithm Naïve Bayes were studied in [22]. A decision tree along with an information retrieval framework was used to develop incremental conditional mean imputation in [19]. In [4], a k-nearest neighbor algorithm was used. Statistical and ML-based imputation methods are briefly compared in [23]. Also recently, ML-based imputation methods were experimentally compared with data-driven imputation, showing their superiority in terms of imputation accuracy [22].
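The one-attribute-at-a-time methodology described above can be sketched as follows (our illustration using a decision tree as the learner; the column names and the assumption that the other attributes are already complete are ours, not the paper's code):

```python
# Supervised-ML imputation: treat the incomplete attribute as the
# class attribute and predict its missing entries from the others.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def ml_impute_column(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Impute one discrete attribute from the remaining attributes.
    Assumes the other attributes are complete (or pre-imputed)."""
    out = df.copy()
    train = out[out[target].notna()]
    test = out[out[target].isna()]
    if test.empty:
        return out
    features = [c for c in out.columns if c != target]
    model = DecisionTreeClassifier().fit(train[features], train[target])
    out.loc[test.index, target] = model.predict(test[features])
    return out

df = pd.DataFrame({"a": [0, 1, 0, 1], "b": [1, 1, 0, 0],
                   "cls": [0, 1, np.nan, 1]})
print(ml_impute_column(df, "cls"))
```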
The development of new missing data imputation methods was mainly driven by the need to improve the accuracy of imputation. The simplest data-driven imputation methods were followed by model-based methods and MI procedures. As a result of this evolution, complex and computationally expensive algorithms, such as MI logistic regression, were developed. At the same time, because of the recent rapid growth of database sizes, researchers and practitioners require imputation methods to be not only accurate but also scalable. MI and ML-based imputation methods are characterized by relatively high accuracy, but at the same time, they are often complex and computationally too expensive to be used for real-time imputation or for imputing in large databases [49]. We show, both theoretically and experimentally, that the proposed framework has linear asymptotic complexity with respect to the number of records. Therefore, as long as the base imputation method has linear or worse complexity (to the best of our knowledge, there are no sublinear imputation methods), the application of the framework does not worsen the base method's complexity. The proposed framework consists of three modules, which are concerned with performing mean pre-imputation, using confidence intervals, and applying boosting, respectively. Extensive experimental tests show that the application of the proposed framework improves the accuracy of the base imputation method and, at the same time, preserves its asymptotic complexity. Applying the framework to a very simple imputation method, such as hot deck, on average, improves its accuracy to match the accuracy of complex model-based imputation methods, such as multiple polytomous logistic regression imputation, while at the same time being significantly faster and easier to implement.

This paper concerns the imputation of discrete attributes. This limitation is imposed by the considered base imputation methods; that is, in the case of ML-based imputation, only discrete attributes can be imputed. We note that the proposed framework is applicable to imputation methods that handle continuous attributes, and its extension to these methods will be the subject of future work.

III. PROPOSED FRAMEWORK

The overall architecture of the proposed framework is visualized in Fig. 3. It consists of three main functional modules: 1) mean pre-imputation; 2) application of confidence intervals; and 3) boosting. All of these modules are visualized as shadowed boxes.

Fig. 3. Structure of the proposed framework.

Let us briefly discuss the functionality of each of these modules. The missing values are first pre-imputed (module 1), i.e., temporarily filled with a value that is used to perform imputation, using a fast linear mean imputation method. Next, each pre-imputed missing value is imputed using a base imputation method, and the imputed value is filtered by using confidence intervals (module 2). Confidence intervals are used to select the most probable imputed values while rejecting possible outlier imputations. Once all the values are imputed and filtered, each of them is assigned a value that quantifies its quality; that is, it might be expressed as a probability or a distance. Based on these values, the boosting module (module 3) accepts the best high-quality imputed values, whereas the remaining imputed values are rejected, and the process repeats with the new partially imputed database. After ten iterations, all the remaining imputed values are accepted, and the imputed database is outputted. We note that any imputation method, i.e., data driven, model based, or ML based, can be used as the base method.
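The iterative control flow described above can be summarized in code as follows (a sketch of our reading of the text, not the authors' published implementation: the mode-based base imputer is a stand-in, and the confidence-interval filter of module 2 is simplified into a frequency-based quality score):

```python
# Sketch of the framework's control flow on a pandas DataFrame of
# discrete values (simplified stand-ins for modules 2 and 3).
import numpy as np
import pandas as pd

def mode_impute(df, col):
    """Stand-in base imputer: most frequent value of the column,
    computed from the current (partially imputed) data."""
    return df[col].mode().iloc[0]

def framework_impute(df, iterations=10):
    df = df.copy()
    missing = [(i, c) for c in df.columns for i in df.index
               if pd.isna(df.at[i, c])]
    # Module 1: mode pre-imputation so every record is usable.
    for i, c in missing:
        df.at[i, c] = df[c].mode().iloc[0]
    for it in range(iterations):
        if not missing:
            break
        # Module 2: impute each still-open cell with the base method and
        # score it (here: relative frequency as a crude quality proxy).
        scored = []
        for i, c in missing:
            v = mode_impute(df, c)
            q = (df[c] == v).mean()
            scored.append((q, i, c, v))
        # Module 3 (boosting): accept the better half, re-impute the rest;
        # the last iteration accepts everything that remains.
        scored.sort(reverse=True)
        cut = len(scored) if it == iterations - 1 else len(scored) // 2
        for q, i, c, v in scored[:cut]:
            df.at[i, c] = v
        accepted = {(i, c) for _, i, c, _ in scored[:cut]}
        missing = [m for m in missing if m not in accepted]
    return df

data = pd.DataFrame({"a": [1, 1, np.nan, 2], "b": [np.nan, 3, 3, 3]})
print(framework_impute(data))
```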
A. Imputation Methods

This section provides a short description of the several imputation methods used in the proposed framework or in the experimental section of this paper. A description of how the selected methods are incorporated in the proposed framework is also provided. The selection of the imputation methods was driven by the following principles. The base methods that will be tested with the proposed framework should be simple enough to show that they can be improved by the application of the framework to match or surpass the quality of complex high-quality model-based imputation methods. They should also represent both the data-driven and ML-based categories. Therefore, hot-deck imputation and ML-based imputation using the Naïve Bayes algorithm were selected.

To provide a comprehensive evaluation, the framework with the selected two base methods should be compared with advanced high-quality model-based imputation methods, as well as fast data-driven methods. Therefore, two MI methods, i.e., an LDA-based method and a multivariate imputation that combines logistic, polytomous, and linear regressions, and three data-driven methods, i.e., mean, hot deck, and MI by sampling, are used in the experimental section.

TABLE I. SUMMARY OF THE IMPUTATION METHODS USED IN THIS PAPER.

1) Single-Imputation Methods: In the mean imputation, the mean of the values of an attribute that contains missing data is used to fill in the missing values. In the case of a categorical attribute, the mode, which is the most frequent value, is used instead of the mean. The algorithm imputes missing values for each attribute separately. Mean imputation can be conditional or unconditional, i.e., not conditioned on the values of other variables in the record. The conditional mean method imputes a mean value that depends on the values of the complete attributes for the incomplete record [8]. In this paper, the unconditional mean, which is computationally faster and therefore can be efficiently used with large data sets, is used both to impute the missing values as a stand-alone method and to perform pre-imputation of the missing values in the proposed framework.
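A minimal sketch of unconditional mean/mode imputation follows (our pandas-based illustration, not the paper's implementation):

```python
# Unconditional mean (numeric) / mode (categorical) imputation,
# one attribute at a time.
import numpy as np
import pandas as pd

def mean_mode_impute(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                fill = out[col].mean()          # mean for numeric attributes
            else:
                fill = out[col].mode().iloc[0]  # mode for categorical ones
            out[col] = out[col].fillna(fill)
    return out

df = pd.DataFrame({"age": [25, np.nan, 35], "color": ["red", None, "red"]})
print(mean_mode_impute(df))
```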
In the hot deck, for each record that contains missing values, the most similar record is found, and the missing values are imputed from that record. If the most similar record also contains missing information for the same attributes as in the original record, then it is discarded, and another closest record is found. The procedure is repeated until all the missing values are successfully imputed or the entire database is searched. When no similar record with the required values filled in is found, the closest record with the minimum number of missing values is chosen to impute the missing values. Several distance functions can be used [23], [45], [48]. In this paper, a computationally fast distance function is used, which assumes a distance of 0 between two attributes if both have the same numerical or nominal value and a distance of 1 otherwise. A distance of 1 is also assumed for an attribute for which either of the two records has a missing value. In the case of supervised databases, which are used in this paper, the hot-deck method takes advantage of the class information to lower the computational time. Since, usually, certain correlations exist between records in the same class, the distance is computed only between records within the same class.
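The 0/1 distance described above is simple to state in code; the following sketch (our illustration, without the class-restricted search and with a single fallback pass rather than the full donor-discarding loop) imputes one record from its nearest neighbor:

```python
# Hot-deck imputation with the 0/1 attribute distance from the text
# (simplified: no class-based restriction, single fallback pass).
import numpy as np
import pandas as pd

def record_distance(a: pd.Series, b: pd.Series) -> int:
    # 1 per attribute that differs or is missing in either record.
    d = 0
    for x, y in zip(a, b):
        if pd.isna(x) or pd.isna(y) or x != y:
            d += 1
    return d

def hot_deck_impute_row(df: pd.DataFrame, idx) -> pd.Series:
    row = df.loc[idx]
    need = [c for c in df.columns if pd.isna(row[c])]
    donors = df.drop(index=idx)
    # Prefer donors that are complete on the needed attributes.
    usable = donors[donors[need].notna().all(axis=1)]
    pool = usable if not usable.empty else donors
    best = min(pool.index, key=lambda j: record_distance(row, pool.loc[j]))
    out = row.copy()
    for c in need:
        out[c] = pool.at[best, c]
    return out

df = pd.DataFrame({"a": [1, 1, 2], "b": [np.nan, 3, 4], "c": ["x", "x", "y"]})
print(hot_deck_impute_row(df, 0))
```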
In regression imputation, imputation is performed by regressing the missing values on the complete values for a given record [26]. Several regression models can be used, including linear, logistic, and polytomous. Logistic regression applies maximum-likelihood estimation after transforming the missing attribute into a logit variable, which shows changes in the natural log odds of the missing attribute. Usually, a logistic regression model is applied for binary attributes, polytomous regression for discrete attributes, and linear regression for numerical attributes.
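As an illustration of regression-based imputation for a binary attribute (our sketch; scikit-learn's LogisticRegression stands in for the maximum-likelihood logit fit described above, and the data are invented):

```python
# Logistic-regression imputation of a binary attribute: fit on records
# where the attribute is observed, predict where it is missing.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"x1": [0.1, 0.9, 0.8, 0.2, 0.85],
                   "x2": [1.0, 0.0, 0.1, 0.9, 0.05],
                   "y":  [0, 1, 1, 0, np.nan]})  # binary, one gap

observed = df[df["y"].notna()]
model = LogisticRegression().fit(observed[["x1", "x2"]], observed["y"])

gaps = df["y"].isna()
df.loc[gaps, "y"] = model.predict(df.loc[gaps, ["x1", "x2"]])
print(df)
```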
Naïve Bayes is an ML technique based on computing probabilities [21]. The algorithm works with discrete data and requires only one pass through the database to generate a classification model, which makes it very efficient, i.e., linear in the number of records. Imputation based on Naïve Bayes consists of two simple steps. Each attribute is treated as the class attribute, and the data are divided into two parts: 1) a training database that includes all records for which the class attribute is complete and 2) a testing database that includes the records for which it is missing. First, the prior probability of each class attribute value and the frequency of each nonclass attribute value in combination with each class attribute value are computed on the basis of the training database. The computed probabilities are then used to predict the class attribute for the testing database, and these predictions constitute the imputed values.
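A compact sketch of these two steps follows (our illustration, not the authors' code; the Laplace smoothing is our addition to avoid zero counts, and the binary-valued denominator is an assumption of the toy data):

```python
# Naive-Bayes imputation of one discrete attribute ("cls" here):
# count priors and value/class co-occurrences in one pass, then
# predict the most probable class value for each incomplete record.
from collections import Counter, defaultdict
import numpy as np
import pandas as pd

def nb_impute(df: pd.DataFrame, target: str) -> pd.DataFrame:
    out = df.copy()
    train = out[out[target].notna()]
    features = [c for c in out.columns if c != target]
    prior = Counter(train[target])    # class-value counts
    cond = defaultdict(Counter)       # (value, class) counts per feature
    for _, r in train.iterrows():     # single pass through the data
        for f in features:
            cond[f][(r[f], r[target])] += 1
    classes = list(prior)
    for i in out.index[out[target].isna()]:
        scores = {}
        for k in classes:
            p = prior[k] / len(train)
            for f in features:  # add-one (Laplace) smoothing
                p *= (cond[f][(out.at[i, f], k)] + 1) / (prior[k] + 2)
            scores[k] = p
        out.at[i, target] = max(scores, key=scores.get)
    return out

df = pd.DataFrame({"a": [1, 1, 0, 0, 1], "b": [1, 1, 0, 1, 1],
                   "cls": ["x", "x", "y", "y", np.nan]})
print(nb_impute(df, "cls"))
```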
2) MI Methods: One of the most flexible and powerful MI regression-based methods is multivariate imputation by chained equations (MICE) [9], [10]. The method provides a full spectrum of conditional distributions and related regression models. MICE incorporates logistic regression, polytomous regression, and linear regression and uses a Gibbs sampler to generate the MIs [12]. MICE is furnished with a comprehensive state-of-the-art missing data imputation software package [28], which we use in the experimental section of this paper. It provides Bayesian linear regression for continuous attributes, logistic regression for binary attributes, and polytomous logistic regression for categorical data with more than two categories. MICE also delivers a comprehensive library of nonregression imputation methods, such as predictive mean, unconditional mean, multiple random sample imputation that is suitable for attributes in the MCAR mode, and LDA for categorical data with more than two categories. LDA is a commonly used technique for data classification and dimensionality reduction [34]. At the same time, it serves as a statistical approach to classification-based missing data imputation. The LDA method is particularly suitable for data where within-class frequencies are unequal, as it maximizes the ratio of between-class variance to within-class variance to ensure the best separation.
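For readers who want to experiment with a chained-equations imputer in Python, scikit-learn's IterativeImputer is explicitly modeled on MICE; this pointer is ours and is not a tool used in the paper:

```python
# Chained-equations-style imputation in Python. IterativeImputer is
# experimental in scikit-learn and models each attribute as a function
# of the others, cycling through them in the spirit of MICE.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [2.0, np.nan, 5.0],
              [np.nan, 4.0, 7.0],
              [3.0, 6.0, np.nan]])

# sample_posterior=True draws from the predictive distribution, which is
# what one would repeat m times to build multiple imputed data sets.
imputer = IterativeImputer(sample_posterior=True, random_state=0)
print(imputer.fit_transform(X))
```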
Table I summarizes all the methods that are used in this paper: three single-imputation and four MI methods. The methods include data-driven, model-based, and ML-based approaches.