A framework to spatially cluster air pollution monitoring sites in US based on the PM2.5 composition

Elena Austin a,⁎, Brent A. Coull b, Antonella Zanobetti a, Petros Koutrakis a

a Department of Environmental Health, Harvard School of Public Health, Boston, MA 02115, USA
b Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA

Article history: Received 13 November 2012. Accepted 7 June 2013. Available online xxxx.

Keywords: Multi-pollutant mixtures; Cluster analysis; Effect modification; Air pollution profiles; k-Means

Abstract

Background: Heterogeneity in the response to PM2.5 is hypothesized to be related to differences in particle composition across monitoring sites which reflect differences in source types as well as climatic and topographic conditions impacting different geographic locations. Identifying spatial patterns in particle composition is a multivariate problem that requires novel methodologies.

Objectives: Use cluster analysis methods to identify spatial patterns in PM2.5 composition. Verify that the resulting clusters are distinct and informative.

Methods: 109 monitoring sites with 75% reported speciation data during the period 2003–2008 were selected. These sites were categorized based on their average PM2.5 composition over the study period using k-means cluster analysis. The obtained clusters were validated and characterized based on their physico-chemical characteristics, geographic locations, emissions profiles, population density and proximity to major emission sources.

Results: Overall 31 clusters were identified. These include 21 clusters with 2 or more sites which were further grouped into 4 main types using hierarchical clustering. The resulting groupings are chemically meaningful and represent broad differences in emissions. The remaining clusters, encompassing single sites, were characterized based on their particle composition and geographic location.

Conclusions: The framework presented here provides a novel tool which can be used to identify and further classify sites based on their PM2.5 composition. The solution presented is fairly robust and yielded groupings that were meaningful in the context of air-pollution research.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

First demonstrated by the Harvard Six-City study (Dockery et al., 1993) and the American Cancer Society Study (Pope et al., 1995), the association between PM and mortality has been replicated in many populations, both within the United States and abroad (Pope and Dockery, 2006). However, the magnitude of the effect has displayed considerable heterogeneity across studies (Bell et al., 2005; Janssen et al., 2002; Samet et al., 2000; Zanobetti et al., 2002). It is possible that this observed heterogeneity of effect may be attributed to the considerable differences in the PM composition across these study sites (Bell et al., 2007). This is further confirmed by investigations that attribute different levels of toxicity to particles from different sources (Laden et al., 2000; Mar et al., 2000; Zhou et al., 2011). Toxicological studies have demonstrated the toxic potential of many individual PM components including sulfate, zinc, nickel and lead (Chuang et al., 2007; Gao et al., 2004; Lippmann et al., 2006; O'Neill et al., 2005).

The importance of considering multi-pollutant mixtures in air pollution was highlighted in 2004 by the National Academies of Science (NAS) (NRC, 2004). In response, the EPA is in the process of developing a multi-pollutant air quality management plan as described in their Multi-Pollutant Report of 2008 (EPA, 2008). Adopting a multi-pollutant approach is extremely challenging due to the highly complex interactions between source emissions, atmospheric processes and effects on human health and ecosystems.
One of the key components of a multi-pollutant approach is the ability to capture the multivariate relationship between pollutants at a given site. A better grasp of this relationship will enhance our understanding of the interaction between pollutants as well as of the human health effects related to exposure to these complex mixtures.

The EPA has considered a variety of ways in which air pollutants might interact with each other (Table 1). Practically, however, because of a knowledge gap in the field, the EPA is forced to consider all pollutant interactions as additive (Mauderly et al., 2010). Populations are exposed daily to complex mixtures of pollutants, some of which are known or suspected to cause health effects at ambient concentrations. Understanding the effect of the mixtures on health, rather than the effect of the individual components, is a crucial step that must be undertaken in order to further our knowledge of this field. Therefore, it is essential that exposure assessment develop new tools to describe population exposures that move beyond relating individual pollutant concentration at a given site on a given day.

Environment International 59 (2013) 244–254. Contents lists available at SciVerse ScienceDirect.
Abbreviations: NAS, National Academies of Science; EPA, Environmental Protection Agency; PM2.5, Particle matter with a diameter of 2.5 μm or less; EC, Elemental Carbon; OC, Organic Carbon; CPC, Condensation Particle Counter; AQS, Air Quality Standards.
⁎ Corresponding author at: Harvard School of Public Health, Landmark Building, 4th Floor West, Boston, MA 02215, USA. Tel.: +1 617 384 8837; fax: +1 617 384 8859. E-mail address: (E. Austin).
0160-4120/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.

There are currently a limited number of approaches that allow for the investigation of multi-pollutant mixtures in epidemiological studies (Dominici et al., 2010; Vedal and Kaufman, 2011).
Exposure data is typically represented in high dimensionality data sets in which each pollutant is assigned a concentration for each time period of observation. Previously published multivariate approaches have included factor analysis methods and principal component methods such as specific rotation factor analysis (Koutrakis and Spengler, 1987), absolute principal-component analysis (Thurston and Spengler, 1985), UNMIX (Henry and Kim, 1990; Kim and Henry, 1999) and positive matrix factorization (Paatero and Tapper, 1994). These methods have been successful at identifying individual source contributions to integrated daily measurement samples at a specific site. The results of these multivariate methods are used by epidemiologists in time series analysis to investigate the health effects associated with specific sources (Schwartz et al., 2002; Thurston et al., 2005).

We propose an approach that uses cluster analysis to identify spatial patterns in air pollution data. Short- and long-term patterns in air pollution as well as spatial distribution patterns have been identified and described in the literature (Beelen et al., 2009; Jerrett et al., 2005; Koutrakis et al., 2005; Lefohn et al., 2010). At a single site, these patterns are the result of diurnal variations in UV intensity, season, temperature, cloud cover, mixing height as well as changes in source emissions such as higher traffic density on weekdays, increased power plant emissions during high demand periods and increased wood combustion in the winter. Between sites, differences in air pollution patterns can be attributed to different source types, different climatic conditions, distribution of regional pollutants over a geographic area and differences in soil composition.

Unsupervised cluster analysis encompasses a broad range of algorithms that identify multivariate patterns in data sets.
Two broad categories of these algorithms are hierarchical and partitioning algorithms. The output of the algorithm may be “hard” if each observation is attributed to only one cluster or “fuzzy” if an observation may be assigned to a certain degree to more than one cluster. In this analysis, we were interested in identifying a “hard” solution so that each site was uniquely assigned to a single cluster.

Recently, we used cluster analysis to identify distinct daily multi-pollutant profiles at a given site, Boston, MA (Austin et al., 2012). Clustering has been used previously to describe diurnal variation in gaseous and particle pollutants (Adame et al., 2012; Flemming et al., 2005). K-means clustering was used by Kim et al. (2008) in order to group sites based on the temporal fluctuation of PM2.5. Hierarchical clustering has also been used to identify distinct sources of volatile organic compounds based on the grouping of the measured concentrations (Kavouras et al., 2001). It has also been used to provide a description of regional chemical and transport processes associated with particular regimes and can inform which sources may be most important in the development of pollution episodes. Beaver and Palazoglu (2006) used an aggregated solution of k-means cluster analysis to characterize classes of ozone episodes occurring in the San Francisco Bay. Pakalapati et al. (2009) used hierarchical clustering and sequencing to group air flow patterns associated with elevated ozone concentrations. Cluster analysis has also been used to cluster back trajectories to identify different classes of synoptic regimes over the duration of the trajectories (Comrie, 1996; Taubman et al., 2006).

In this paper, cluster analysis will be used to group sites across the United States based on their PM2.5 composition profiles using data collected between 2003 and 2008. The main interest is identifying long-term differences in the composition of PM2.5 across the different sites.
These clusters of cities will then be characterized and validated based on physico-chemical characteristics, geographic locations, emission profiles, population density and position with respect to major emitter sources. It is anticipated that this novel approach will allow for a better understanding of the heterogeneity in PM2.5 composition across the United States. We hope that the identified clusters can be used to further investigate the heterogeneity in the relationship between PM2.5 concentration and mortality and morbidity across the United States.

2. Methods

2.1. Data collection

Data for this analysis was obtained from the HEI Air Quality Database (2010). This database includes pollutant concentrations from the EPA's AQS Particulate Matter Air Quality Data. The PM2.5 mass and speciation data is available for 54 CORE sites and 234 supplemental sites from 2000 to 2010. These are 24-h samplers, midnight to midnight local standard time, with different sampling frequencies depending on the site location. Emissions data for each site is obtained from the National Emissions Inventory Data of 2002 and Census population data from the 2000 Census. We require that sites have less than 25% missing observations for the elements of interest. In addition, we require that each season within the time period has less than 25% missing data. This is to ensure that the site means are not unduly influenced by missing data within a given season. This resulted in 109 sites with complete data sets between January 2003 and December 2008. These dates were chosen in order to maximize the number of sites with 5 years of complete data. At each site, sampling occurred every 3rd or every 6th day throughout the year. Fig. 1 presents the location of the sampling sites.

2.2. Data preparation

The variables used in the clustering were the following components of PM2.5: total EC, total OC, SO₄²⁻, NO₃⁻, Na⁺, NH₄⁺, Se, Si, Ca, Fe, Ni, V, Cu, Zn, Pb, Mn, As, Cr, and K. Other elements obtained as part of the speciation of the filters were considered but excluded, either because the analytical measurement was judged to be unreliable or because a large proportion of the measurements were below the detection limit. For each site, an overall site mean of each variable was obtained. These means were divided by the mean PM2.5 concentration of that site to create a unique set of species fractions used to characterize the PM2.5 composition. These species fractions reflect the unique interplay of sources and meteorology at each site and they describe the composition of PM2.5 in a given element at that site (Eq. 1). To eliminate differences in the order of magnitude between concentration levels of the measured pollutants, the species fractions were transformed to a robust z-score as described in Eq. 2.

Table 1. Interaction of pollutants (EPA, 2000).
Additivity: Effect of the combination equals the sum of individual effects
Synergism: Effect of the combination is greater than the sum of individual effects
Antagonism: Effect of the combination is less than the sum of individual effects
Inhibition: A component having no effect reduces the effect of another component
Potentiation: A component having no effect increases the effect of another component
Masking: Two components have opposite, canceling effects such that no effect is observed from the combination

2.3. Clustering

The main objective of this analysis was to cluster together cities with the most similar species fractions. Clustering of the mean values of the multi-pollutant profiles represents the overall population exposures in these cities over the study period.
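
The data preparation of Section 2.2 (Eqs. 1 and 2) can be illustrated with a minimal sketch. It is written in Python rather than the R environment named in the paper, the function names and the three-site input are hypothetical, and it assumes that the denominator of Eq. 2 is the median absolute deviation, i.e. that the absolute-value bars were lost in typesetting.

```python
from statistics import median


def species_fractions(site_means, pm25_means):
    """Eq. 1: divide each site-mean species concentration by that site's mean PM2.5."""
    return {
        site: {sp: conc / pm25_means[site] for sp, conc in species.items()}
        for site, species in site_means.items()
    }


def robust_z_scores(fractions):
    """Eq. 2 (as assumed here): center each species on its across-site median
    and scale by the median absolute deviation from that median."""
    sites = list(fractions)
    species = list(next(iter(fractions.values())))
    z = {site: {} for site in sites}
    for sp in species:
        values = [fractions[site][sp] for site in sites]
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad == 0:
            raise ValueError(f"zero median absolute deviation for species {sp!r}")
        for site in sites:
            z[site][sp] = (fractions[site][sp] - med) / mad
    return z


# Hypothetical site means (same concentration units for species and PM2.5).
site_means = {
    "site_A": {"SO4": 4.0, "EC": 1.0},
    "site_B": {"SO4": 2.0, "EC": 2.0},
    "site_C": {"SO4": 3.0, "EC": 4.0},
}
pm25_means = {"site_A": 8.0, "site_B": 8.0, "site_C": 8.0}

fractions = species_fractions(site_means, pm25_means)
z_scores = robust_z_scores(fractions)
```

Scaling by a median-based spread instead of the standard deviation keeps a single unusual site from compressing the scores of all other sites, which matters when some sites later end up as single-site clusters.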
These clusters may improve our understanding of the heterogeneity in the long-term effects of PM2.5 exposure among populations.

The k-means algorithm used was developed by Hartigan and Wong (1979). It seeks to partition M points in N dimensions into k clusters. This iterative algorithm searches for a local solution that minimizes the Euclidean distance between the observations and the cluster centers. Advantages of the k-means algorithm are that it is easily implemented, has been used in a wide range of applications and is computationally efficient (Jain et al., 1999; Steinley, 2006). It has also been suggested that this algorithm is somewhat less sensitive to outliers than hierarchical clustering methods (Punj and Stewart, 1983). The initial values of the k cluster centers can be randomly selected from the dataset being clustered, or they can be specified by the user. In this case, we chose to specify the initial values of the clusters in order to increase the stability of the solution. Several methods have been proposed to initialize k-means. We used hierarchical clustering (described below) to identify k centers and then used these centers to initialize k-means. Maitra et al. (2010) found that this method of initializing k-means performed best for small datasets. Following the hierarchical analysis with k-means had the advantage of minimizing the impact of outlier points on the solution.

A major obstacle in using k-means is that the number of clusters (k) must be assigned a priori based either on pre-existing knowledge of the data or observable characteristics of the data set. Although there was no pre-existing knowledge of the number of unique spatial clusters to expect, we used characteristics of air pollutant mixtures in order to make the best possible selection. This is consistent with the recommendation of Jain et al.
(1999) that subject-specific knowledge is the best way to select the number of clusters. We considered the variability of pre-defined pollutant ratios within each cluster. Solutions with less total variability within the clusters were judged to be better than solutions with more variability within each cluster. Pollutant concentration ratios considered were: SO₄²⁻/NO₃⁻, EC/OC, Ni/V and Fe/Si. The rationale was that solutions that were better at recognizing sites with similar pollution profiles would also minimize the variability of these important pollutant ratios within each cluster. The variability of the ratios was reduced by maximizing the decrease in overall deviation as described in Eq. 3. The percent change in overall deviation represents how effectively different solutions capture the unique sources that contributed to the mixture at the individual sites. The advantage of using this indicator is that it explicitly uses knowledge of air pollution sources and contributions to inform the decision of how many clusters best describe the data. The rationale for selecting these ratios is discussed below.

Fig. 1. Chemical speciation sites (n = 109). [Map; axes: longitude, latitude.]

Eq. 1. Species fraction.

$$ SF_{ij} = \frac{S_{ij}}{PM_{2.5,i}} $$

where:
SF_ij represents the Fraction of Species j at a site i
PM_2.5,i represents the mean PM2.5 concentration at site i
S_ij represents the mean concentration of Species j at site i

Eq. 2. Modified Z-score.

$$ Z_{ij} = \frac{SF_{ij} - \mathrm{Median}(SF_{j})}{\mathrm{Median}\left(\left|SF_{ij} - \mathrm{Median}(SF_{j})\right|\right)} $$

where:
Z_ij represents the robust z score of the Fraction of Species j at site i
SF_ij represents the Fraction of Species j at a site i
SF_j represents the Fraction of Species j at each site

Eq. 3. Change in overall deviation.

$$ \text{Decrease in overall deviation}\ (\%) = 100 \left(1 - \sum_{i=1}^{4} \sum_{j=1}^{k} \left[\frac{1}{SSE_{i}}\, SSW_{ij}\right]\right) $$

where:
SSW_ij represents the sum of squared errors for ratio i within cluster j
SSE_i represents the sum of squared errors for ratio i
i represents the diagnostic ratio (SO₄/NO₃, EC/OC, Ni/V, Fe/Si)
j represents the individual cluster (1 to k)

In addition to maximizing the decrease in the overall deviation, we sought to minimize the number of clusters containing a single site in each solution. As the total number of clusters increases, the number of clusters including only 1 site likewise increases. This leads to a decrease in the % change in overall deviation without necessarily resulting in a more interpretable solution. K-means was performed using the function kmeans in R v.2.15.1.

2.4. Hierarchical clustering (Ward's method)

Ward's hierarchical clustering method (Ward, 1963) is an agglomerative process that begins with 1 cluster for every observation and then iteratively combines the points that lead to the minimal increase in the sum of squares. Because this method is agglomerative, the solution reached is constrained by the previous choices made by the algorithm. Therefore, for a given number of clusters, the solution reached by the Ward method is often not the solution that has the minimal sum of squares error. An advantage of this method is that it produces clusters that are relatively compact. It is criticized for sometimes producing clusters that are too small for the given data (Cormack, 1971). In this paper, hierarchical clustering was used to initialize k-means. It was also used after the analysis was completed to group together the clusters with the most similar enrichment factors. Hierarchical clustering was performed using the function hclust in R v.2.15.1.

2.5. Enrichment factors

Enrichment factors were calculated in order to better compare the clusters. These enrichment factors represent the enrichment of a given constituent (element) of PM2.5 within a cluster as compared to the entire sample (Eq. 4).

Eq. 4. Enrichment factors.

$$ EF_{ij} = \frac{S_{ij} / PM_{2.5,i}}{S_{j} / PM_{2.5}} $$

where:
EF_ij represents the enrichment factor of species j at site i
S_ij represents the mean Species Concentration (Fe, OC, Na⁺, etc.) at site i
S_j represents the mean Species Concentration of species j over the entire sample
PM_2.5,i represents the mean concentration of PM2.5 at site i, and PM_2.5 the mean concentration of PM2.5 over the entire sample
i represents the different sites
j represents the different elements

2.6. Grouping clusters

Clusters were grouped together based on the enrichment factors within each cluster. The clustering was performed with hierarchical clustering using the hclust function in R v.2.15.1.

2.7. Comparing clustering solutions (Rand Index)

The Rand Index is a measure of similarity between two different partitions of the same data set. The Rand Index ranges between 0 and 1, where 0 indicates that the two clusterings do not agree on any pair of points and 1 indicates that the two clusterings are exactly the same. The Rand Index represents a weight of the sites classified together in the two solutions versus the sites classified separately (Rand, 1971). In this paper, we used the adjusted Rand Index in order to compare different clustering solutions. The adjusted Rand Index, first proposed by Hubert and Arabie (1985), corrects the Rand Index for the random chance that pairs are classified together. Steinley (2004) suggested that an adjusted Rand Index greater than 0.9 reflected excellent agreement, values greater than 0.8 reflected good agreement, values greater than 0.65 indicated moderate agreement and values less than 0.65 indicated poor agreement.

3. Results

3.1. Selecting the number of clusters k

Selecting the value of k is a balance between the advantage in decreasing the variability in the diagnostic ratios within clusters and minimizing the number of single city clusters in a given solution. Fig. 2 presents the overall variability of the pollutant ratios alongside the number of single city clusters for solutions containing between 1 and 50 clusters. Based on the desire to balance these two features, 31 clusters was selected as the optimal value as it represents a significant drop in the overall decrease in variability measure (55% as compared to the dataset as a whole) while there are 11 clusters that contain only a single site. Other possible values of k were explored including k = 26 and k = 37. The solution of k = 26 was judged to be unsatisfactory because it lacked good distinction between east and west coast cities. The solution for k = 37 was judged to be too unwieldy because of the high number of single city clusters.

3.2. Chemical characteristics

The chemical characteristics for the clusters containing 2 or more cities are presented as heatmaps in Fig. 4. These heatmaps represent the log of the enrichment factors of the pollutants of interest. For the heatmap representation, the enrichment factors were logarithmically transformed so that a value of 0 represents no enrichment, 1 represents 2.7 times enrichment and −1 represents 0.4 times enrichment. The clusters are presented in 4 groupings, where there are some overall similarities between the clusters in the same grouping. The similarities were determined based on hierarchical clustering of the enrichment factors in each cluster.

3.3. Geographic distribution

The locations of the 31 clusters identified are presented by group in Figs. 6–9. Evident in these maps is that, in some cases, sites that are geographically close belong to different clusters. This is due to differences in composition, even at nearby monitoring sites, and will be discussed further below.
There is a clear separation between coastal and interior monitoring sites as well as between western, central and eastern sites. This agrees with previous studies showing that PM2.5 composition is related to geographic location and reflects the impacting sources and climatic conditions (Bell et al., 2007; Zanobetti and Schwartz, 2008).

3.4. Concentration ratios

To aid in cluster interpretation, the log of the pollutant concentration ratios of selected species is presented as a heatmap in Fig. 5. Similar to the enrichment factors, the pollutant ratios have been normalized and represent the ratio in a particular cluster as compared to the entire sample. These normalized values have been log transformed so that a value of 0 represents no difference between the cluster and the sample as a whole, a value of 1 represents a 2.7 times increase of the ratio within this cluster relative to the sample as a whole and a value of −1 represents a 0.4 relationship between the ratio in this cluster and the whole sample.

Fig. 2. Selecting the number of clusters (k). [Two panels plotted against the number of clusters (k), 0 to 50: overall decrease in variability (%) and number of single sites.]

Fig. 3. Variability of the adjusted Rand Index as a function of the number of clusters selected. (The vertical line represents the number of clusters, k = 31, selected for this study.)

Fig. 4. Heatmap of the log of the species enrichment factors by cluster. [Rows: Na+, Cr, K, As, Mn, Pb, Zn, Cu, V, Ni, Fe, Ca, Si, NH4, NO3, Se, SO4, OC, EC; columns: clusters 1 to 20; group labels as printed: Eastern & Central Industrial Western Coastal; color key from −1 to 1.]
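
The adjusted Rand Index of Section 2.7, whose variability is plotted in Fig. 3, can be computed from the pair counts of two label vectors. The following is a minimal sketch of the Hubert and Arabie (1985) correction in Python; it is not the code used in the paper, the function name is hypothetical, and returning 1.0 for degenerate inputs (fewer than two sites, or two trivial partitions) is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two hard clusterings of the same sites."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same sites")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement by convention

    # Pairs of sites placed together in both solutions (contingency-table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each solution separately (table margins).
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2          # largest attainable agreement
    if maximum == expected:
        return 1.0  # both partitions are trivial (one cluster, or all singletons)
    return (together_both - expected) / (maximum - expected)


# Hypothetical cluster assignments for four sites under two solutions.
same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partly_matching = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
```

Because the index depends only on which sites share a cluster, relabeling the clusters does not change it. This lets solutions for different values of k, or from different starting centers, be compared directly against the thresholds quoted from Steinley (2004).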