A new model for the inference of population characteristics from experimental data using uncertainties: Part II. Application to censored datasets

Analytica Chimica Acta 533 (2005) 31–39

A new model for the inference of population characteristics from experimental data using uncertainties. Part II: Application to censored datasets

Wim P. Cofino a,*, Ivo H.M. van Stokkum b, Jaap van Steenwijk c, David E. Wells d

a Wageningen University, Environmental Sciences Group, Subdepartment of Water Resources, Hydrology and Quantitative Water Management Group, Nieuwe Kanaal 11, 6709 PA Wageningen, The Netherlands
b Department of Physics Applied Computer Science, Division of Physics and Astronomy, Faculty of Sciences, Vrije Universiteit, De Boelelaan 1081, 1081 HV Amsterdam, The Netherlands
c Ministry of Transport, Public Works and Water Management, Directorate General for Public Works and Water Management, Institute for Integrated Inland Water Management and Waste Water Treatment RIZA, P.O. Box 17, 8200 AA Lelystad, The Netherlands
d Fisheries Research Services, Victoria Road, Aberdeen, UK

Received 4 April 2004; received in revised form 2 November 2004; accepted 2 November 2004. Available online 15 December 2004.

Abstract

This paper extends a recent report on a model to establish population characteristics to include censored data. The theoretical background is given. The application in this paper is limited to left-censored data, i.e. less than values, but the principles can also be adopted for other types of censored data. The model gives robust estimates of population characteristics for datasets with complicated underlying distributions, including less than values of different magnitude and less than values exceeding the values of numerical data. The extended model is illustrated with simulated datasets, data from interlaboratory studies and temporal trend data on dissolved cadmium in the Rhine river. The calculations confirm that inclusion of left-censored values in the computation of population characteristics improves assessment procedures.

© 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.aca.2004.11.008
Keywords: Censored data; Censored samples; Left-censored data; Maximum likelihood estimation (MLE); Detection limit; Robust statistics; LOD; LOQ; Less than values; Mean; Standard deviation; Interlaboratory studies; Proficiency testing

1. Introduction

Censored data, i.e. datasets that include non-numerical values, are frequently encountered in different fields of science [1–4]. The non-numerical values may be known to be below a certain limit, e.g. left-censored data such as less than values, and/or above an upper limit. Values below a limit of quantification (LOQ¹) are frequently encountered both in environmental studies and in interlaboratory studies [5–7]. Assumptions need to be made if these less than values are to be incorporated into the calculation of the population characteristics. Apart from the removal of less than values from the dataset, a common approach is to substitute the less than values with a constant such as the LOQ itself, half the LOQ or zero. The most widely accepted and recommended substitution is half the LOQ. However, several studies have shown that simple substitution methods perform poorly in comparison with other methods for summary statistics [8–10]. To improve the estimate of the summary statistics, methods have been developed that combine the numerical values with extrapolation of below-limit values, assuming a specific probability density function (pdf). The maximum likelihood method and log probability plotting are two examples [11].

* Corresponding author. Tel.: +31 317 474304; fax: +31 317 484885. E-mail address: (W.P. Cofino).
¹ This paper uses LOQ to denote the limit that is reported when less than data are encountered. Values above this limit are referred to as 'numerical data'.
In many environmental datasets, less than values occur along with potential outliers in the right-hand tail of the distribution. Robust estimation techniques have been developed to deal with such situations [3].

Recently, a new model to calculate the population characteristics for experimental data has been reported [12]. This model does not assume unimodality of the distribution and provides a robust estimation of population characteristics. In this paper, the extension of this model to include left-censored values is described. Following an outline of the theory, the approach is illustrated with calculations on simulated datasets, on data from interlaboratory studies and on data from water quality monitoring. The model can be adapted in the same manner to include other types of censored data.

2. Theoretical background

Data arise from a measurement process which, when under control, gives an output that can be described by a specific probability density function (pdf). A pdf can be attributed to a particular dataset by adding up the pdfs associated with all the individual independent measurements. The overall pdf constructed in this manner is the starting point for the model. Instead of calculating the mean of the data, the model sets out to establish the most probable value given the overall pdf. The mathematical procedure borrows the concept of wavefunctions from quantum mechanics, which enables the use of powerful matrix algebra. As an analogue of wavefunctions, observation measurement functions (OMF, φ_i)² are defined as the square root of the probability density function attributed to the individual observation in question. The set of OMFs forms a space, or basis set, in which so-called population measurement functions (PMFs²) are constructed. A PMF Ψ_i is a linear combination of OMFs, i.e. Ψ_i = Σ_j c_ij φ_j. A normalised, squared PMF is a pdf.

In the model, the coefficients c_ij are obtained by seeking the (unnormalised) PMF which has the highest probability in the basis set.
The probability of PMF_i is obtained as the integral ∫ Ψ_i² dx. Mathematically, we have to establish the set of coefficients for which the integral ∫ Ψ² dx is maximal. The procedure uses the method of Lagrange multipliers and imposes the additional constraint that the sum of the squared coefficients equals one.

The mathematical elaboration requires a solution of the eigenvector–eigenvalue equation Sc = λc, where S is the matrix of overlap integrals. For example, the matrix element S_12 is calculated as ∫ φ_1 φ_2 dx, i.e. the integral of the product of OMF_1 and OMF_2. S_12 provides a quantitative measure of how well the two observations agree, taking the respective pdfs into account. The overlap integral ranges between 0 (no overlap) and 1 (100% overlap), the latter obtained when the observations have identical pdfs.

For a basis set of n OMFs, the model yields a total of n eigenvectors with eigenvalues λ. The eigenvalue λ_i gives the probability in the basis set of the corresponding eigenfunction i. The highest possible value for λ equals the number of data n, which is obtained when all data have exactly the same pdf; in that case each OMF has a coefficient equal to 1/√n. The eigenvector with the highest eigenvalue λ is PMF_1. The remaining n − 1 linear combinations are ranked according to probability (i.e. eigenvalue) and are denoted PMF_2, ..., PMF_n. PMF_2 and higher PMFs may sometimes be additional modes, but are frequently only clusters of data ordered according to their degree of overlap.

² In this and following papers, the terminology is changed somewhat in comparison with reference [12]: laboratory measurement function is replaced by observation measurement function, and interlaboratory measurement function is now denoted population measurement function. This modification is applied as the scope of the model is much broader than interlaboratory studies.
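The eigenvalue machinery described above can be sketched numerically. The snippet below is an illustrative sketch, not the authors' Matlab toolbox: it builds the overlap matrix S for normally distributed OMFs using the closed-form overlap of two square-root Gaussians and solves Sc = λc. The function names (`overlap`, `pmf_decomposition`) are ours.

```python
import numpy as np

def overlap(m1, s1, m2, s2):
    """Overlap integral of two square-root-normal OMFs.

    For phi_i = N(x; m_i, s_i)**0.5, the integral of phi_1 * phi_2 has
    the closed form below (1 for identical pdfs, -> 0 for disjoint ones).
    """
    return np.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) * \
        np.exp(-(m1 - m2)**2 / (4 * (s1**2 + s2**2)))

def pmf_decomposition(means, sds):
    """Build the overlap matrix S and solve S c = lambda c.

    Returns eigenvalues (descending) with matching eigenvectors;
    the first pair corresponds to PMF_1.
    """
    n = len(means)
    S = np.array([[overlap(means[i], sds[i], means[j], sds[j])
                   for j in range(n)] for i in range(n)])
    lam, C = np.linalg.eigh(S)        # S is symmetric -> eigh
    order = np.argsort(lam)[::-1]     # rank PMFs by probability
    return lam[order], C[:, order]

# Four identical observations: lambda_1 = n = 4 and each coefficient
# equals 1/sqrt(4) = 0.5, as stated in the text.
lam, C = pmf_decomposition([1.0] * 4, [0.1] * 4)
print(lam[0])             # -> 4.0
print(np.abs(C[:, 0]))    # -> [0.5, 0.5, 0.5, 0.5]
```

With unequal pdfs, λ_1 drops below n and the leading eigenvector weights the mutually overlapping observations most heavily.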
Each squared PMF effectively describes a part of the pdf of the ensemble of data. When the squared PMFs are summed over the entire concentration range, the pdf of the entire dataset is reconstructed.

For each PMF Ψ_i the expectation value and variance can be calculated as follows:

m_i = ∫ x Ψ_i² dx / ∫ Ψ_i² dx
s_i² = ∫ x² Ψ_i² dx / ∫ Ψ_i² dx − m_i²

In addition to the mean and standard deviation of each mode or cluster, the eigenvalues λ enable a quantitative assessment of the degree of comparability and the character (unimodal, bimodal) of the dataset. To this end, the program converts the eigenvalue of the mode or cluster into a percentage of the overall pdf; this percentage quantifies which fraction of the dataset is accounted for by the PMF in question.

The model is extended for use with less than values by applying appropriate probability density functions. A straightforward approach can be taken when no assumptions are made regarding the pdf underlying a less than value: in a first approximation, each concentration between zero and the LOQ has equal probability. We can then use the square root of a rectangular probability density function as basis function. Explicitly, when a less than value is reported, the basis function equals √(1/LOQ) in the interval between zero and LOQ, and zero otherwise. These basis functions have an expectation value m_i = ∫ φ_i² x dx = LOQ/2 and a variance ∫ φ_i² x² dx − m_i² = LOQ²/12. When specific knowledge of the measurement process and the properties of the measured object is available, other probability density functions could be used. Montville and Voigtman derived pdfs for instrumental limits of detection [13].
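The moments quoted for the rectangular less-than basis function can be checked numerically. This is a small verification sketch of ours, not part of the paper's toolbox:

```python
import numpy as np

# Moments of the rectangular less-than basis function phi(x) = sqrt(1/LOQ)
# on [0, LOQ]: check m_i = LOQ/2 and s_i^2 = LOQ**2/12 by quadrature.
LOQ = 0.8
edges = np.linspace(0.0, LOQ, 200001)
x = 0.5 * (edges[:-1] + edges[1:])      # midpoint rule
dx = LOQ / 200000
phi2 = np.full_like(x, 1.0 / LOQ)       # phi**2 is the rectangular pdf

m = np.sum(phi2 * x) * dx               # expectation value
var = np.sum(phi2 * x**2) * dx - m**2   # variance

print(m, LOQ / 2)          # both 0.4
print(var, LOQ**2 / 12)    # both ~0.05333
```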
These pdfs can be used when the model is specifically applied to such data. The implicit assumption made with the maximum likelihood method and log probability plotting techniques is that the LOQs are cut off from the population formed by the numerical values, implying that a concentration just below the LOQ is more likely than one near zero. To mimic this assumption in a simple way, in this paper a basis function has been defined as the square root of a simple triangular pdf. This triangular pdf has the form (2/LOQ²)x for concentrations between zero and LOQ and zero otherwise, with an expectation value m_i = ∫ φ_i² x dx = 2LOQ/3 and a variance ∫ φ_i² x² dx − m_i² = LOQ²/18.

Recently, the kernel density approach has been proposed to study the features of the population [14]. In this method, each datapoint is assigned a normal distribution with a fixed standard deviation. This standard deviation is obtained using the h-estimator, which is optimised so as to obtain a meaningful appearance of the graphical representations of the population. As with the kernel density approach, our model uses pdfs as building blocks. The key difference lies in linking the pdfs to the concept of measurement functions and in using matrix algebra to calculate the features of the population as outlined above. The model has an implementation, the normal distribution approximation (NDA), which does not require the individual uncertainties of the datapoints [12]. In this implementation, each observation is attributed a normal distribution with one and the same standard deviation, estimated so as to reproduce the population characteristics of a normal distribution quantitatively.
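The triangular-pdf moments can be verified the same way as the rectangular ones. Again, this is our own check, not the paper's code:

```python
import numpy as np

# Numerical check of the triangular less-than pdf f(x) = (2/LOQ**2) * x
# on [0, LOQ]: expectation value 2*LOQ/3 and variance LOQ**2/18.
LOQ = 1.0
edges = np.linspace(0.0, LOQ, 200001)
x = 0.5 * (edges[:-1] + edges[1:])      # midpoint rule
dx = LOQ / 200000
pdf = (2.0 / LOQ**2) * x

m = np.sum(pdf * x) * dx
var = np.sum(pdf * x**2) * dx - m**2

print(m, 2 * LOQ / 3)      # both ~0.6667
print(var, LOQ**2 / 18)    # both ~0.05556
```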
The kernel density method and the normal distribution approximation of the model produce very similar graphs of the population. The kernel density approach and our model are complementary; however, our model provides additional tools for exploratory data analysis (e.g. graphical representation of the overlap matrix, see Fig. 2, and plots of the eigenvectors, see [12]) as well as quantitative results.

The model is very flexible and can be applied in various ways, both with respect to the type of probability density functions (e.g. normal distributions, Student's t-distribution, rectangular distributions) and with respect to the uncertainty characteristics (e.g. standard deviations reported by laboratories or a common standard deviation).

The program [12] has been extended to include less than values. Integrals between basis functions involving the product of the square root of a normal distribution and of a rectangular or triangular pdf as described above are obtained by numerical integration. Integrals among the rectangular or triangular functions are carried out using the analytical expressions. Integrals among basis functions based on normal pdfs are obtained as previously reported. The program is provided as a free Matlab toolbox upon request.

3. Comparison of methods on simulated datasets

The extended model is demonstrated using simulated datasets, following the approach described by Kuttatharmmakul et al. [2]. A total of 250 datasets of twelve observations each were generated from a normal distribution with mean 1.09 and standard deviation 0.20. Subsequently, observations less than one were treated as a less than value with LOQ = 1. Only datasets with at least one LOQ were included in the calculations. The means and standard deviations for each dataset were calculated with two methods: the Cohen maximum likelihood estimator [2] and the model using a rectangular pdf for the LOQs.
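The mixed integrals mentioned above (square root of a normal pdf times the square root of a rectangular pdf) can be sketched by simple quadrature. This is an illustrative sketch under our own naming, not the toolbox's numerical-integration routine:

```python
import numpy as np

def mixed_overlap(mean, sd, loq, n=200001):
    """Overlap integral between a square-root-normal OMF and the
    square-root-rectangular OMF of a less-than value, by quadrature.

    Integrand: sqrt(N(x; mean, sd)) * sqrt(1/loq) on [0, loq].
    """
    edges = np.linspace(0.0, loq, n)
    x = 0.5 * (edges[:-1] + edges[1:])   # midpoint rule
    dx = loq / (n - 1)
    sqrt_norm = (2 * np.pi * sd**2) ** -0.25 * \
        np.exp(-(x - mean)**2 / (4 * sd**2))
    return np.sum(sqrt_norm * np.sqrt(1.0 / loq)) * dx

# A numerical value far above the LOQ barely overlaps with the
# less-than basis function; one inside the interval overlaps strongly.
print(mixed_overlap(5.0, 0.1, 1.0))   # essentially zero
print(mixed_overlap(0.5, 0.2, 1.0))   # close to 1
```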
The Cohen maximum likelihood method was selected since it is regarded as an appropriate approach to incorporate left-censored data into the evaluation [2]. The main restrictions of the method are that it requires the data to be normally distributed and that it can accept only one value for the left-censored data. The results of the calculations are depicted in Fig. 1.

The Cohen maximum likelihood estimator and the model give comparable results when the number of less than values is below five. The two methods disagree when the number of less than values exceeds five. The Cohen maximum likelihood estimator requires the numerical data, also at high LOQ percentages, to estimate the characteristics of the assumed underlying distribution and thus to calculate the mean and standard deviation adjusted for LOQs. The model does not invoke any assumption about the character of the overall population. When more than five LOQs are present, the model indicates that the dataset is bimodal. The first mode consists of the six or more less than values, which all have the same pdf. In principle, the expectation value of this mode is 0.5 (i.e., the expectation value of the individual basis functions). Higher expectation values occur when numerical data with a value close to one are present. Such data have pdfs that overlap with the pdfs of the less than values; because of this overlap, the expectation value of the first mode is increased. The second mode consists of the numerical values. In a conventional interpretation, the model indicates that the numerical data are outliers when the number of less than values is greater than five. For an interlaboratory study, the interpretation might be that the higher values are attributed to false positives.

Fig. 1. Results of two methods to calculate the mean of left-censored data on 250 simulated datasets. The line y = x is drawn to facilitate the comparison.

Table 1
Calculations on polybrominated flame retardants (data from De Boer and Cofino, 2002). For each dataset, results (Nobs, expectation value, S.D., %) are given for all data, for the numerical data only, and for the numerical data plus the LOQs not exceeding the NDA mean of the numerical data (a).

Matrix and congener       All data                     Only numerical data          Numerical data and LOQs <= NDA mean (a)
Eel—BDE 209               12  0.78   2.49   35.2       4  0.078  0.083  34.6        5   0.074  0.082  27.9
Eel—BDE 119                9  0.042  0.048  51.9       4  0.038  0.023  52.6        4   0.038  0.023  52.6
Mussel—BDE 153            11  0.034  0.034  39.1       7  0.047  0.018  44.4        9   0.037  0.019  37.1
Cormorant—BDE 66           8  0.15   0.14   44.2       3  0.063  0.018  50.1        5   0.039  0.024  37.5
Porpoise liver—BDE 209    13  4.71   8.12   37.1       4  7.50   3.17   47.2       10   1.59   1.77   36.6
Sediment 7—BDE 75          6  0.036  0.045  40.0       4  0.26   0.132  44.5        6   0.036  0.045  40.0

(a) The NDA mean of the numerical data is the expectation value of PMF_1 obtained by applying the normal distribution approximation (NDA) implementation of the model to the numerical data. The NDA approach does not require the specification of the uncertainties of the laboratories [12].

When the number of less than values equals five, the level of agreement between the Cohen maximum likelihood estimator and the model varies significantly. This can be traced back to the characteristics of the dataset. Depending on the distribution of the numerical data, the first mode is made up of the numerical data, the less than data, or a combination of both. In the first case, a good correspondence with the Cohen method is obtained. In the latter two cases, the agreement with the Cohen method is less good.

The calculations indicate that the Cohen maximum likelihood estimator and the model give comparable results except when the number of less than values is high. This difference arises because the approaches are based on different principles. The Cohen method assumes a normal distribution for the numerical data and corrects for the less than values. Our model sets out to calculate the performance characteristics of the 'first mode' of the dataset, regardless of whether this mode is composed of numerical or censored data.
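The censored-normal maximum likelihood fit used as the benchmark above can be sketched generically. The snippet below is our own illustration of the textbook left-censored likelihood (numerical values contribute the pdf, each less-than value the cdf at the LOQ), equivalent in spirit to Cohen's method but not the authors' implementation; the function name `censored_normal_mle` and the enlarged sample size are our choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def censored_normal_mle(values, loq, n_censored):
    """Fit a normal distribution to left-censored data by maximum
    likelihood: uncensored values contribute logpdf terms, each
    censored value contributes log CDF(loq)."""
    def nll(p):
        mu, log_sd = p
        sd = np.exp(log_sd)            # keep sd positive
        return -(norm.logpdf(values, mu, sd).sum()
                 + n_censored * norm.logcdf(loq, mu, sd))
    res = minimize(nll, x0=[np.mean(values), np.log(np.std(values) + 1e-6)])
    return res.x[0], np.exp(res.x[1])

# Simulate data as in the paper's scheme (mean 1.09, sd 0.20, censored
# at LOQ = 1), but with a larger sample so the fit is stable.
rng = np.random.default_rng(0)
data = rng.normal(1.09, 0.20, size=2000)
numeric = data[data >= 1.0]
mu_hat, sd_hat = censored_normal_mle(numeric, 1.0, np.sum(data < 1.0))
print(mu_hat, sd_hat)   # close to 1.09 and 0.20
```

With only twelve observations per dataset, as in the paper's simulation, the estimates scatter much more, which is what Fig. 1 displays.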
The availability of statistical methods based upon different principles is an advantage. When the outcomes of the methods differ, the dataset should be inspected. It should be judged whether the assumptions underlying the statistical methods are met. The nature of the measurement should be taken into account: are measurement problems (e.g. contamination, incomplete resolution) possible? The statistical procedures thus have to be complemented by chemical expert judgement. This judgement will determine whether it is possible to make a statement about the performance characteristics of the dataset at all.

4. Case study I—interlaboratory study on polybrominated diphenylethers (PBDEs)

The results of a recent interlaboratory study on PBDEs have been reported [5]. Datasets in this study contained a small number of observations with a relatively high number of left-censored data which varied considerably in magnitude. The numerical data exhibited a wide scatter and had difficult underlying distribution profiles. A selection of the data from this study is used to illustrate the extended model.

Initially, calculations were made with the full dataset and then with the dataset without the less than values. In most cases, inclusion of the less than values had a small effect. In Table 1, results are given for some difficult datasets. For BDE 119 and 209 in eel and BDE 66 in cormorant, the calculations on the full datasets, including all less than values, give a higher expectation value than the calculations on the datasets from which all the less than values have been removed. This pattern is caused by less than values with high LOQs. The effect is illustrated for BDE 209 in eel with the graphical representation of the overlap matrix given in Fig. 2. The numerical data exhibit a poor comparability (observations 9–12 in Fig. 2). In this case, the model gives an expectation value of 0.078 ± 0.082 for the first mode, representing 34.6% of the dataset. This expectation value is determined predominantly by observation 10, which overlaps moderately with both observations 9 and 11 (overlaps 0.16 and 0.34, respectively). The expectation value for the entire dataset, including all less than values, is calculated to be 0.78 ± 2.5, which accounts for 35.2% of the dataset. The 10-fold increase in expectation value is due to the rise of a new cluster of strongly overlapping data upon the introduction of the less than values. This cluster includes the LOQs <0.32, <0.4, <0.88 and <1.5 (observations 5, 4, 6 and 3 in Fig. 2). Similar effects occur with the introduction of LOQs into the calculations for BDE 119 in eel and BDE 66 in cormorant. This observation suggests that the magnitude, or the indicative information, of a LOQ is important in any assessment. Clearly, the indicative information of LOQs which are an order of magnitude or more greater than the numerical data is virtually zero. An example is the LOQ of <50 for BDE 209 in eel, which is substantially greater than the expectation value of 0.078 based on the reported numerical data. The degree to which the calculations are affected by the high LOQs depends on the nature of the dataset.

Fig. 2. Graphical representation of the overlap matrix for BDE 209 in eel. The overlap integral can have values between 0 and 1. The bar on the right depicts the relationship between gray scale and the magnitude of the overlap integral: white represents an overlap of 1, black represents no overlap. Observations 9–12 are numerical data; observations 1–8 are less than values. The observations are ordered according to their magnitude. The figure is divided into three zones, denoted by Roman numerals I, II and III: I is a 4 × 4 matrix of the numerical data, II is an 8 × 8 matrix depicting the overlaps between the less than values, and III depicts the overlaps between the numerical data and the less than values. The row of data at the bottom of the figure gives the reported concentrations.
When there is a large number of laboratories reporting numerical data that are in good agreement amongst themselves, the presence of a limited number of high LOQs has only a small effect. The effects become greater when there is a small number of numerical data and several 'high' LOQs occur which overlap amongst themselves and/or with numerical outliers. LOQs higher than the median of the numerical data probably contain the true concentration, but provide little information and may perturb the calculations.

In this paper, the calculations have been repeated with a constraint on the magnitude of the LOQs which can be accepted. The constraint imposed was that only LOQs are included which are equal to or less than the expectation value obtained for the set of numerical data with the normal distribution approximation of the model [12]. The advantage of this approach is that an unwanted effect on the calculations arising from high LOQs is prevented. The disadvantage, however, is that the cut-off point for LOQs introduces a subjective element into the calculations.

The outcomes of these calculations are also given in Table 1. For BDE 209 in eel there is only one LOQ that satisfies the criterion for inclusion. This observation, number 1, exhibits a small overlap with the numerical data (observations 9–12, Fig. 2), so that the means of the calculations with and without this LOQ differ little. However, the inclusion of the LOQs for BDE 209 in porpoise liver and for BDE 75 in sediments has a pronounced effect on the outcome of the calculations. In each case it is essential to use the calculated

Fig. 3. Overview of results and the summed measurement functions for -HCH and pp′-DDT in biological tissue, showing the need for inclusion of the left-censored values. The expectation value and standard deviation of PMF_i are indicated by horizontal bars in the bottom panels. The -HCH dataset contains 7 left-censored data (observation numbers 9–15); the pp′-DDT dataset contains 15 left-censored data (observation numbers 13–27).
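The cut-off rule described above is easy to state as code. The sketch below is ours: in the paper the threshold is the NDA expectation value of PMF_1 of the numerical data, for which the plain mean stands in here as an assumption, and the example numbers only loosely echo the BDE 209 in eel case.

```python
import numpy as np

def select_loqs(numeric, loqs, nda_mean=None):
    """Apply the cut-off rule: keep only less-than values whose LOQ does
    not exceed the mean of the numerical data.

    In the paper the threshold is the NDA expectation value of PMF_1;
    here the plain mean stands in for it (an assumption of this sketch).
    """
    threshold = np.mean(numeric) if nda_mean is None else nda_mean
    return [q for q in loqs if q <= threshold]

# Illustrative numbers: reported LOQs spanning orders of magnitude, of
# which only the smallest survives a threshold near the numerical mean.
loqs = [0.05, 0.32, 0.4, 0.88, 1.5, 50.0]
print(select_loqs([0.02, 0.05, 0.1, 0.14], loqs))   # -> [0.05]
```

The rule deliberately discards high LOQs such as the <50 for BDE 209 in eel, whose indicative information is virtually zero.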