Combination of dynamic time warping and multivariate analysis for the comparison of comprehensive two-dimensional gas chromatograms

Application to plant extracts

Journal of Chromatography A, 1216 (2009) 2866–2872

Jérôme Vial a,∗, Hicham Noçairi b, Patrick Sassiat a, Sreedhar Mallipatu a, Guillaume Cognon c, Didier Thiébaut a, Béatrice Teillet d, Douglas N. Rutledge b

a Laboratoire Environnement et Chimie Analytique, ESPCI ParisTech – CNRS UMR 7121, 10 rue Vauquelin, 75 005 Paris, France
b « Ingénierie Analytique pour la Qualité des Aliments », UMR INRA/AgroParisTech, Laboratoire de Chimie Analytique, 16 rue Claude Bernard, 75231 Paris Cedex 05, France
c Institut de Recherche Criminelle de la Gendarmerie, 1 Boulevard Théophile Sueur, 93 110 Rosny sous Bois, France
d Altadis, Centre de Recherche, 4 rue A. Dessaux, 45 404 Fleury les Aubrais, France

∗ Corresponding author (J. Vial). Tel.: +33 1 40 79 47 79; fax: +33 1 40 79 47 76.

Article history: Available online 13 September 2008

Keywords: GC × GC; Dynamic time warping; Multivariate analysis; Tobaccos; Mass spectrometry

Abstract

Comprehensive two-dimensional gas chromatography (GC × GC) is now recognized as the preferred technique for the detailed analysis and characterization of complex mixtures of volatile compounds. However, for comparison purposes, taking into account all the information contained in the chromatogram is far from trivial. In this paper, it is shown that the combination of peak alignment by dynamic time warping and multivariate analysis facilitated the comparison of complex chromatograms of tobacco extracts. The comparison is shown to be efficient enough to provide a clear discrimination among three types of tobacco. A tentative interpretation of loadings is presented in order to give access to the compounds which differ from one sample to another. Once these compounds were located, mass spectrometry was used to identify markers of tobacco type.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

GC × GC is now regarded as the prime analytical tool for the study of complex mixtures of volatile compounds [1–4]. In GC × GC, a two-dimensional separation system, orthogonal or not truly orthogonal [5], provides a huge peak capacity through the on-line hyphenation of two GC columns of different polarities using a modulator [6,7]. Depending on the nature of the stationary phases used for the two columns, structured two-dimensional (2D) chromatograms (visualized either as color plots or contour plots) can be obtained, in which spots whose color depends on the detector response replace the usual peaks of classical one-dimensional (1D) chromatograms. With an optimized set of columns, compounds are organized in the color plot according to both carbon number and chemical structure [8], which facilitates interpretation. This technique has proven its usefulness and reliability in various areas such as the petroleum industry [9,10], flavors and fragrances [11,12], environmental analysis [13,14], and food [15,16]. However, handling chromatograms with several hundred spots, and even more for petroleum samples [17,18], is not simple. It has been shown that individual or group quantification is possible on GC × GC chromatograms and gives results of similar quality to those obtained with GC [9,19]. In theory each compound can be quantified individually; however, the time required for an operator to do so rapidly becomes unacceptable, especially if response coefficients are to be taken into account. Paradoxically, petroleum samples, which are among the most complex samples, are in a rather favorable situation. In this case detailed molecular information is seldom required, and group quantification using flame ionization detection (FID), without the need for response coefficients, is sufficient for the characterization or comparison of samples [18].
doi:10.1016/j.chroma.2008.09.027

For other types of natural samples such as plant extracts or fragrances, the situation is much less favorable despite the lower number of spots, because the greater diversity of chemical structures present makes the organization of the chromatograms much less evident. In this case, a global comparison of chromatograms could highlight specific compounds whose concentration differs significantly from one sample to another and which could be used as markers. Up to now no general strategy has been described in the literature to carry out such a global comparison, except the individual quantification of each compound, which is not realistic in practice. The idea of the present study is to design a strategy based on chemometric tools to address this problem. A picture, i.e. a GC × GC chromatogram, can be considered a set of pixels, and each pixel is a variable with a given intensity for each sample. A GC × GC chromatogram can therefore be seen as a set of variables (one for each pixel) that can be handled with multivariate analysis techniques. Among them, the best known is principal component analysis (PCA). PCA is a powerful tool for the interpretation of large data tables [20–22]. This projection method extracts the main information from the original data set by projecting it onto a space of reduced dimension. This space is defined by linear combinations of the variables, called principal components (PCs). PCs are computed iteratively in such a way that they convey less and less information while being orthogonal.
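The pixel-as-variable view described above can be illustrated with a short Python/NumPy sketch. This is only an illustration of the data layout, not the authors' Matlab code; the array sizes and the names `stack` and `unfold` are hypothetical.

```python
import numpy as np

def unfold(stack):
    """Turn a stack of 2D chromatograms (samples x rows x cols)
    into a data table with one row per sample and one column per pixel."""
    stack = np.asarray(stack, dtype=float)
    return stack.reshape(stack.shape[0], -1)

# Hypothetical sizes: 12 samples, 40 modulations, 150 points per modulation.
rng = np.random.default_rng(0)
stack = rng.random((12, 40, 150))

X = unfold(stack)
# Pixel (row r, column c) of every sample ends up in column r * 150 + c of X.
```

Each column of `X` is then one variable in the sense used above, and each row is one sample.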
The plots of the individuals, i.e. the samples analyzed by GC × GC, in the newly defined set of coordinate axes are called score plots, whereas the representations of the initial variables constitute the loading plots. However, the difficulty lies less in choosing the multivariate analysis technique best suited to comparing color plots than in solving the problem originating from the unavoidable small offsets in retention times that always occur when the same analysis is repeated. Time alignment strategies have been proposed to overcome this problem with 1D chromatograms [23,24]. They were generally based on piecewise linear correlation optimized warping (COW) and applied to liquid chromatography (LC) with diode array detection data [23] or to gas chromatography for the characterization of gasoline samples [24]. With two-dimensional separations, alignment becomes even more critical because of the inherently higher relative variability, at least under repeatability conditions, of the retention times in the second dimension, whose duration is only a few seconds. Several alignment strategies have been proposed in the literature. Fraga et al. used an algorithm based on the minimization of the pseudorank of a matrix formed by the juxtaposition of a reference chromatogram and a chromatogram to be aligned [25], and Pierce et al. applied a piecewise alignment to GC × GC chromatograms of petroleum samples [26]. Quite recently Zhang et al. used the COW algorithm for aligning GC × GC–MS data [27] and Suits et al. applied Warp2D for aligning LC–MS data [28]. In the present paper a global strategy combining dynamic time warping (DTW) for alignment with principal component analysis or independent component analysis (ICA) is proposed for the comparison of GC × GC chromatograms. DTW was selected for its capacity to carry out alignment without requiring the a priori presence of reference compounds. This strategy was applied to the comparison of tobacco extracts. Special focus was given to the chemical interpretation of the loadings, and to how they could be used to detect potential markers.
The need to have interpretable loadings justified the implementation of both PCA and ICA. Mass spectrometry data could then be used to identify these markers, as shown using an example.

2. Experimental

2.1. GC × GC apparatus

A Trace GC × GC system from Thermo-Electron Corporation (Courtaboeuf, France) equipped with a Merlin Microseal injector (Merlin Instrument Company, CA, USA) was used. It was equipped with a double-jet carbon dioxide cryogenic modulator [29] and a split/splitless injector. To avoid discrepancies related to poor trapping of the compounds in the modulator, the two jets were moved closer to the column than in the original configuration. The set of columns presenting the best compromise in terms of both separation and ageing was as follows. The first column was an apolar capillary column VF-1ms, Varian (Les Ulis, France), 15 m × 0.25 mm, 1.0 μm. This column was connected to a DB 1701, 1.5 m × 0.1 mm, 0.1 μm, from Agilent Technologies (Waldbronn, Germany). Connections between columns were made using deactivated presstight connectors from Restek (Evry, France). The flow rate of the carrier gas was 1 mL/min and the injector was set at 240 °C. In order to preconcentrate the solutes at the beginning of the column by recondensation, cold trapping was applied to the splitless injection. The temperature program started at 40 °C for 40 s, then an increase of 60 °C/min was applied up to 70 °C, and after a plateau of 3 min, 2.5 °C/min was applied up to 240 °C. The injected volume was 2 μL. The injection was carried out in splitless mode with a surge of 400 kPa for 40 s. A typical modulation period of 5 s was used. Detection was carried out with the quadrupole mass spectrometer DSQI (Thermo-Electron). The transfer line was set at 250 °C. Classical electron impact ionization (70 eV) was used; the mass range was limited to m/z 40–240 so that the acquisition frequency (around 30 Hz) was compatible with GC × GC data.
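As a rough order-of-magnitude check on the data size implied by these settings, the short Python sketch below works through the arithmetic. It rests on assumptions that are not stated in the text: that acquisition covers exactly the oven program with no final hold, and that the rate is exactly 30 Hz (the text says "around 30 Hz"). The variable names are illustrative only.

```python
# Oven program from Section 2.1, expressed in seconds.
initial_hold_s = 40.0                    # 40 s at 40 degC
ramp1_s = (70 - 40) / 60 * 60            # 60 degC/min from 40 to 70 degC
plateau_s = 3 * 60.0                     # 3 min plateau at 70 degC
ramp2_s = (240 - 70) / 2.5 * 60          # 2.5 degC/min from 70 to 240 degC
run_s = initial_hold_s + ramp1_s + plateau_s + ramp2_s

modulation_period_s = 5.0                # stated modulation period
acquisition_hz = 30.0                    # ASSUMPTION: exactly 30 Hz

points_per_modulation = round(modulation_period_s * acquisition_hz)
n_modulations = int(run_s // modulation_period_s)
n_pixels = n_modulations * points_per_modulation

print(run_s / 60, n_modulations, points_per_modulation, n_pixels)
```

Under these assumptions the run lasts a little over 72 min, each second-dimension chromatogram holds 150 points, and one 2D chromatogram contains on the order of 10^5 pixels, which gives an idea of the number of variables handled by the multivariate analysis.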
Excalibur software was used for data acquisition, and the data were then imported into Hyperchrom S/W software for the visualisation of the 2D chromatograms. Hyperchrom S/W offers the possibility to export the 2D chromatogram as a text file. The matrix obtained could then be read into Matlab version R2007b (Mathworks, Natick, MA, USA) for alignment and PCA or ICA.

2.2. Gases

Liquid CO2 was of industrial grade and purchased from Air Liquide (Le Plessis Robinson, France). Pure gases, i.e. helium (99.9995%) and nitrogen (99.9995%), were purchased from Messer (Asnières, France). Air was obtained from a compressor after filtration.

2.3. Tobacco extracts

Tobacco extracts provided by the Altadis Company were in hexane. They were obtained by the "Likens Nickerson" process [30] directly from scaferlatis (tobacco leaves cut into small pieces). This steam distillation extracts the neutral and acidic fraction of the volatile compounds responsible for tobacco aroma while minimizing the extraction of nicotine. In 1D GC, nicotine could mask some minor compounds. Three types of tobacco were considered: Oriental, Burley and Virginia. For each type, four different samples were available, corresponding to four different batches of different origins. One extract was available for each sample. They were arbitrarily referred to as O2, O7, O9 and O10 for the 4 Oriental tobacco extracts, as B1, B2, B3 and B4 for the 4 Burley tobacco extracts, and as V1, V2, V3 and V4 for the 4 Virginia tobacco extracts. All samples were analyzed in the shortest possible period of time to limit chromatographic variability as much as possible. Typical GC × GC chromatograms of the extracts of the three types of tobacco obtained under the conditions previously described are given in Fig. 1.

3. Data processing

The comparison of GC × GC chromatograms is in some ways similar to the comparison of images. In a GC × GC data matrix, each "pixel" is characterized by three numbers: two time coordinates (retention times along the first and the second dimension) and a value for signal intensity (here the Total Ion Current). So if each pixel is considered a response, multivariate analysis can be used to compare samples in the space defined by these responses. As already indicated in the literature [24], and as will also be demonstrated below, applying PCA directly to raw GC × GC data leads to an unsatisfactory separation of the samples in the multivariate space of the responses. This problem comes from the unavoidable variations that occur in GC peak retention times. To compensate for these offsets, Dynamic Time Warping (DTW) was applied to the more variable second-dimension chromatograms to align the peaks. Matlab routines [31] were used for chromatogram alignment, PCA and ICA.

Fig. 1. Typical chromatograms of the three types of tobacco extracts obtained using the conditions described in Section 2: (a) Oriental, (b) Burley, (c) Virginia. The signal is the total ion current for m/z in the range 40–240.

3.1. Principal component analysis and independent component analysis

PCA [32–36] and ICA [37] are somewhat related mathematical techniques. They both provide a linear decomposition of a data set. PCA was selected because it is the historical and most commonly used technique and can be considered the first-choice approach. The fundamental difference is that PCA calculates vectors in the multivariate space that correspond to the directions of maximum dispersion of the samples and that are mutually orthogonal. PCA "scores" are the coordinates of the samples on these vectors and PCA "loadings" are the contributions of the original variables to the construction of the vectors.
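The definitions of scores and loadings given above can be made concrete with a minimal NumPy sketch of PCA computed by singular value decomposition. This is not the Matlab routine of ref. [31]; the function name `pca`, the synthetic data and the image size are assumptions made purely for illustration.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA by SVD of the column-centred table X (samples x variables).

    Returns the scores (sample coordinates), the loadings (one unit vector
    per component, expressed on the original variables) and the fraction
    of total variance carried by each retained component.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components]
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, loadings, explained

# Synthetic stand-in for 12 unfolded chromatograms of 20 x 25 pixels.
rng = np.random.default_rng(1)
stack = rng.normal(size=(12, 20, 25))
stack[:4, 5:8, 10:13] += 5.0     # a "spot" present only in samples 0-3
stack[4:8, 12:15, 3:6] += 5.0    # another spot, only in samples 4-7
X = stack.reshape(12, -1)

scores, loadings, explained = pca(X)
loading_image = loadings[0].reshape(20, 25)  # loading refolded as an image
```

Because each loading has one entry per pixel, it can be refolded to the shape of a chromatogram, as done for `loading_image`; the rows of `loadings` are orthonormal and `explained` is sorted in decreasing order, matching the properties of PCs stated in the text.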
ICA tries to extract the original signals from a set of mixed signals, based on the reasonable hypothesis that these original signals are independent. By contrast, PCA tries to find a sequence of uncorrelated variables (PCs) which contain as much of the variance of the data as possible. For this reason, PCA is often an effective compression technique: by keeping the first few PCs, most of the variance (and optimistically "information") in the data can be covered. The independent components calculated by ICA approximate the sources that are assumed to be independent. These ICA "loadings" are not just contributions of the original variables but an approximation of the "pure" signals. Although, as in PCA, the ICA "scores" are the coordinates of the samples on the vectors, they do not correspond to the maximum dispersion.

PCA or ICA is mainly used to represent the different samples in a space with a reduced number of dimensions, usually 2 or 3. This kind of representation, called a scores plot, shows to what extent samples are different. Very often, little attention is paid to the loadings, i.e. the vectors that define this space. In the present situation, loadings can be interpreted as virtual GC × GC chromatograms, so differences between loadings can be related to the presence or absence of compounds. Comparison of a given loading, in relation to the respective projections of the samples, is a means to track specific markers.

3.2. Dynamic time warping

A pre-treatment method has been developed based on a variant of dynamic time warping (DTW) [38–40]. The idea of time warping originally came from speech recognition, and its application to chromatography was first proposed by Wang and Isenhour [41] as a method of chromatographic peak matching.

3.2.1. Dynamic time warping algorithm

Suppose we have two chromatograms, A and B, of length $n$:

$$A = a_1, a_2, \ldots, a_n \qquad B = b_1, b_2, \ldots, b_n \tag{1}$$

To align two chromatograms using DTW, we construct an $n$-by-$n$ matrix where the ($i$th, $j$th) element of the matrix contains the distance $d(i,j) = (a_i - b_j)^2$. Each matrix element $(i,j)$ corresponds to the alignment between the points $a_i$ and $b_j$. This is illustrated in Fig. 2(a). A warping path $W$ is a continuous (in the sense stated below) set of matrix elements that defines a mapping between A and B. The $k$th element of $W$ is defined as $w_k = (i,j)$. So we have:

$$W = w_1, \ldots, w_k, \ldots, w_K, \qquad n \le K < 2n - 1 \tag{2}$$

The warping path is typically subject to several constraints:

(1) Boundary conditions:

$$w_1 = (1,1) \quad \text{and} \quad w_K = (n,n) \tag{3}$$

This requires the warping path to start and finish in diagonally opposite corner cells of the matrix.

(2) Continuity: given $w_k = (i,j)$, then $w_{k-1} = (i',j')$, where $i - i' \le 1$ and $j - j' \le 1$. This restricts the allowable steps in the warping path to adjacent cells (including diagonally adjacent cells).

(3) Monotonicity: given $w_k = (i,j)$, then $w_{k-1} = (i',j')$, where $i - i' \ge 0$ and $j - j' \ge 0$. This forces the points in $W$ to be monotonically spaced in time.

There are exponentially many warping paths that satisfy the above conditions. However, we are only interested in the path that minimizes the warping cost as indicated by Eq. (4), where $w_k$ inside the sum stands for the distance $d(i,j)$ at the $k$th element of the path:

$$\mathrm{DTW}(A,B) = \min \sum_{k=1}^{K} w_k \tag{4}$$

This path can be found using dynamic programming to evaluate the following recurrence, which defines the cumulative distance $\gamma(i,j)$ as the distance $d(i,j)$ found in the current cell plus the minimum of the cumulative distances of the adjacent elements, as indicated by Eq. (5):

$$\gamma(i,j) = d(a_i,b_j) + \min\{\gamma(i-1,j-1),\ \gamma(i-1,j),\ \gamma(i,j-1)\} \tag{5}$$

The warp path to $\gamma(i,j)$ must pass through one of those three grid cells, and since the minimum possible warp path distance is already known for them, all that is needed is to add the distance of the current two points to the smallest one. Since this equation determines the value of a cell in the cost matrix by using the values in other cells, the order in which they are evaluated is very important. The cost matrix is filled one column at a time from the bottom up, from left to right.

Fig. 2. (a) To align the chromatograms, a warping matrix was constructed; the search for the optimal warping path is shown with solid squares. (b) The resulting optimal warping path.

Once the entire matrix is filled, a warp path must be found from $\gamma(1,1)$ to $\gamma(n,n)$. The warp path is actually calculated in reverse order, starting at $\gamma(n,n)$. A "greedy" search is performed that evaluates the cells to the left, down, and diagonally to the bottom-left (Fig. 2b). Whichever of these three adjacent cells has the smallest value is added to the beginning of the warp path found so far, and the search continues from that cell. The search stops when $\gamma(1,1)$ is reached.

4. Results and discussion

4.1. Effect of peak alignment on principal component analysis results

First, PCA was applied to the raw chromatographic data. The resulting score plot on the first two components is given in Fig. 3. Despite visible differences between chromatograms of different types of tobacco, PCA is not able to discriminate the three types. This result confirmed the need for peak alignment.

Fig. 3. PCA scores plot obtained without DTW.

DTW was then applied to the results of the 12 GC × GC analyses. It was decided to carry out an alignment only along the second chromatographic dimension, as changes in the first dimension of less than the modulation period are of no consequence on the retention time in that direction. It is expected that the second dimension, which is both isothermal and short, is more likely to be affected by offsets than the first dimension, at least under repeatability conditions. Fig. 4 provides an illustration of the effectiveness of DTW on chromatographic profiles.

PCA scores based on aligned chromatograms are given in Fig. 5. After alignment by DTW, the score plots easily discriminate the three types of tobacco as three distinct groups. This result clearly demonstrates the necessity to align chromatograms along the second dimension before implementing PCA. However, wider data sets including more samples and presenting a higher variability could require alignment along both dimensions. Further studies will investigate this point.

Fig. 5. PCA scores plot obtained after DTW.

4.2. Independent components analysis of aligned chromatograms

As explained previously, the loadings should also be of use to interpret the nature of the samples. In the present case, the loadings may be visualized as virtual GC × GC chromatograms. Attempts to give a chemical meaning to PCA loadings were not convinc-

Fig. 4. Superposition of chromatograms obtained along the second dimension at a time of 50 min in the first dimension for the 4 samples of Burley type. Left: before DTW (mean correlation equal to 59.4%). Right: after DTW (mean correlation equal to 89.8%).
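The kind of before/after comparison summarized in the caption of Fig. 4 can be sketched in Python with the textbook form of the DTW recurrence from Section 3.2.1. This is a hedged illustration rather than the authors' implementation: the paper uses a variant of DTW whose details are not given here, and the synthetic profiles, the choice of the first profile as reference, the averaging used to build the warped signal, and all function names are assumptions.

```python
import numpy as np

def dtw_path(a, b):
    """Optimal warping path between equal-length profiles a and b."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    d = (a[:, None] - b[None, :]) ** 2        # local distance d(i, j)
    g = np.full((n + 1, n + 1), np.inf)       # cumulative distance, inf-padded
    g[0, 0] = 0.0
    # Row-wise filling; any order that respects the dependencies works.
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            g[i, j] = d[i - 1, j - 1] + min(g[i - 1, j - 1], g[i - 1, j], g[i, j - 1])
    # Backtrack from (n, n) to (1, 1), always taking the cheapest neighbour.
    i, j = n, n
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)], key=lambda s: g[s])
        path.append((i - 1, j - 1))
    return path[::-1], g[n, n]

def warp_to_reference(ref, b):
    """ASSUMPTION: the warped value at index i is the mean of the b points mapped to i."""
    path, _ = dtw_path(ref, b)
    total, count = np.zeros(len(ref)), np.zeros(len(ref))
    for i, j in path:
        total[i] += b[j]
        count[i] += 1
    return total / count

def mean_pairwise_correlation(profiles):
    c = np.corrcoef(profiles)
    return c[np.triu_indices_from(c, k=1)].mean()

# Synthetic second-dimension profiles: two peaks, shifted by a few points.
t = np.arange(150)
def profile(shift):
    return (np.exp(-0.5 * ((t - 50 - shift) / 3.0) ** 2)
            + 0.6 * np.exp(-0.5 * ((t - 100 - shift) / 4.0) ** 2))

raw = np.array([profile(s) for s in (0, 3, -4, 6)])
aligned = np.array([raw[0]] + [warp_to_reference(raw[0], p) for p in raw[1:]])
before = mean_pairwise_correlation(raw)
after = mean_pairwise_correlation(aligned)
print(f"mean correlation before: {before:.3f}, after: {after:.3f}")
```

On these noiseless synthetic profiles the mean pairwise correlation rises after alignment, which mirrors the direction of the change reported in Fig. 4; the actual values in the figure come from real Burley chromatograms and are not reproduced by this sketch.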