Integrated Computer-Aided Engineering 23 (2016) 115–127
DOI 10.3233/ICA-150505
IOS Press

Fast and low cost prediction of extreme air pollution values with hybrid unsupervised learning

Ilias Bougoudis, Konstantinos Demertzis and Lazaros Iliadis∗
Department of Forestry and Management of the Environment and Natural Resources, Democritus University of Thrace, Orestiada, Greece

Abstract. Air pollution is the introduction of harmful substances or other agents into the atmosphere, and it is caused by industrial, transport or household activities. It is one of the most serious problems of our times, and determining the conditions under which pollutants reach extreme values is a crucial challenge for the modern scientific community. The innovative and effective hybrid algorithm designed and employed in this research effort is entitled Easy Hybrid Forecasting (EHF). The main advantage of the EHF is that a forecast does not require measurements from sensors or other hardware devices, or data that require the use of expensive software. This was done intentionally, because the motivation for this work was the development of a hybrid application that can be downloaded for free and used easily by the general public at no additional financial cost, running on devices like smartphones. From this point of view it does not require data from sensors or specialized software, and it can offer people reliable information about extreme cases.

Keywords: Fuzzy C-means, neural gas, self-organizing maps, feed forward neural networks, random forest, air pollution, feature selection

1. Introduction

The high concentration of chemicals or particles in the atmosphere alters its structure, composition and characteristics, and causes serious health problems to humans and, more generally, to all living creatures and ecosystems. The sources of this problem are man-made activities concentrated mainly in urban areas (e.g.
energy production from solid or liquid fuel, transport, industries, building heating systems and dust). There are primary air pollutants that are emitted directly (e.g. CO, NO, NO2, SO2) and secondary ones (e.g. O3) that result from chemical reactions. Though the atmosphere has its own mechanisms to eliminate pollutants, the extreme cases are mainly due to unfavorable meteorological conditions that impede the attenuation of the contaminants. Some of these conditions even accelerate the creation of air pollution.

∗ Corresponding author: Lazaros Iliadis, Department of Forestry and Management of the Environment and Natural Resources, Democritus University of Thrace, 193 Pandazidou st., N. Orestiada 68200, Greece. E-mail:; web:

ISSN 1069-2509/16/$35.00 © 2016 – IOS Press and the author(s). All rights reserved

The analysis and forecasting of high pollution levels in the atmospheric air is one of the most important tasks of environmental science and research, due to its impact on fauna and flora and on human health. This task is even more imperative for densely populated urban areas [29]. Though environmental science has evolved during the last decade, there is still great need for more reliable models, in order to develop reliable prevention and control strategies.

This research paper presents an innovative, accurate, fast and low cost forecasting system which allows the prediction of extreme air pollutant values.

1.1. State of the art

Some computational research efforts have been made in the past towards air pollution modeling. Iliadis et al. [19] have developed feed forward neural networks (FFNNs) in order to model ozone concentrations in Athens. Patterns of air quality have been developed for Mexico City by Neme and Hernandez [26], whereas Karatzas and Voukantsis have done the same for the city of Thessaloniki [20]. Skön et al.
have analyzed indoor air quality using self-organizing maps (SOM) [28]. Li and Chou have investigated the spatial variation of air pollution with SOM [22]. Glorennec applied SOM to forecast ozone peaks [13].

Vong et al. [31] have built a forecasting system based on support vector machines (SVMs); Xiao et al. [12] propose a novel hybrid model combining air mass trajectory analysis and wavelet transformation to improve the artificial neural network (ANN) forecast accuracy of daily average concentrations of PM2.5; and Zabkar et al. [32] have applied machine learning methods to the problem of ground level ozone forecasting. All of these proposals used measured data and data calculated by numerical weather prediction models or stations.

On the other hand, Lopez-Rubio et al. [23] introduce Bregman divergences in self-organizing models based on stochastic approximation principles, so that more general distortion measures can be employed. A procedure is derived to compare the performance of networks using different divergences. Moreover, a probabilistic interpretation of the model is provided, which enables its use as a Bayesian classifier. Experimental results show the advantages of these divergences with respect to the classical Euclidean distance. Menéndez et al. [25] propose a new algorithm, named genetic graph-based clustering (GGC), which takes an evolutionary approach by introducing a genetic algorithm (GA) to cluster the similarity graph. The experimental validation shows that GGC increases the robustness of spectral clustering and has competitive performance in comparison with classical clustering methods. Donos et al. [11] have presented a seizure detection algorithm that is relatively simple to implement on a microcontroller, so that it can be used in an implantable closed loop stimulation device. The classification of the features is performed using a random forest classifier. Finally, Quirós et al.
[27] have extended the traditional definitions of k-anonymity, l-diversity and t-closeness of fuzzy sets as a way to improve the protection of privacy in microdata. The performance of these new approaches is checked in terms of the risk index.

In this paper we propose a new, highly efficient hybrid model for the prediction of extreme values of air pollutants. Its input variables exclude pollutant values and any measurements from specialized hardware or software. More specifically, the prediction of a future pollutant value is determined by the attributes of the time period we want to predict (Year, Month, Day, Hour), the temperature (Air Temperature) and the values Cluster_Id and Station_Id, which are calculated automatically by geolocation based services.

This is done intentionally, because the motivation for this work is the development of a hybrid application that can be downloaded for free and used easily by the general public at no additional financial cost, running on devices like smartphones and tablets. This would result in low cost applications that offer people wide access, regardless of space and time and without requiring a specialized expensive device.

An application that uses the EHF model will run machine learning algorithms directly on the user's device. For example, the Android SDK [1] and the iOS SDK [2] support applications written in Java, Python and other programming languages, and special libraries exist for this purpose, like the MLP Neural Net [3] for iOS and Neuroph [4] for Android, which allow the development of machine learning applications on mobile devices. Another development approach could be based on cloud Software as a Service (SaaS), Platform as a Service (PaaS) or Infrastructure as a Service (IaaS).
According to this methodology, the application will function through the interaction of the device, which will provide the data vectors of the independent and dependent variables, with the cloud service, which will provide the machine learning mechanisms.

1.2. Innovation of the EHF project

In a previous work of our research team [16], SOM has been used in order to cluster the obtained pollutants' concentrations. The ultimate goal was to find the most isolated group, where all the extreme values of pollutants were concentrated. This vital group should comprise the meteorological and temporal characteristics of the extreme hazardous pollutants. In this clustering effort only the pollutants' records were used as inputs.

Table 1
Statistical analysis for the Athinas station (number of records: 89057)

            CO      NO      NO2     O3      SO2
MAX         21.4    908     377     253     259
MIN         0.1     1       1       1       2
MODE        0.8     8       60      3       2
COUNT       4925    2352    1500    6641    10434
AVERAGE     1.83    59.43   63.61   32.85   9.65
STDEV       1.48    90.12   27.09   28.61   9.39

Table 2
Statistical analysis for the Patision station (number of records: 102065)

            CO      NO      NO2     O3      SO2
MAX         24.6    953     533     145     445
MIN         0.1     1       1       1       2
MODE        1.1     14      83      1       2
COUNT       3315    816     1175    7635    6412
AVERAGE     2.52    116.46  88.01   21.67   20.79
STDEV       1.96    101.24  37.79   21.94   21.48

The use of the EHF model, capable of forecasting extreme air pollutants' values, will enable the development of an information system for recording and assessing the level of air pollution in urban centers.

The aim and the innovation are threefold. First, for each forecast the input parameters do not require measurements from sensors or other hardware devices. Moreover, the most important target of such a system will be its application and execution on everyday low cost devices like smartphones, which will use their embedded applications to forecast the extreme air pollutants' values in real time.
The potential users of such an application will have easy and continuous access to the above forecasting and will be able to face serious air pollution incidents. On the other hand, these devices will be functioning in an alternative way, with ecological consciousness. Finally, it will enable the determination of immediate prevention and protection measures and actions.

2. Data

The data used herein come from the following four measurement stations located in the wider Attica area: Athinas, Patision, Peristeri and Piraeus. The data records contain hourly measurements of the following air pollutants: CO, NO, NO2, O3, SO2. CO was measured in mg/m3, whereas all others were measured in µg/m3. For this research and for each station, the data records cover the period from 2000 to 2012. Data vectors from 2013 were used to test the developed forecasting framework.

The following tables contain a statistical analysis of the air pollution concentrations for the period 2000–2012.

Table 3
Statistical analysis for the Peiraias station (number of records: 75597)

            CO      NO      NO2     O3      SO2
MAX         13.3    902     296     217     293
MIN         0.1     1       1       1       2
MODE        0.5     5       47      4       2
COUNT       6442    2879    1082    3382    8130
AVERAGE     1.19    46.26   60.25   38.06   16.64
STDEV       0.94    56.65   28.24   31.31   20.25

Table 4
Statistical analysis for the Peristeri station (number of records: 75665)

            CO      NO      NO2     O3      SO2
MAX         11.5    447     353     284     272
MIN         0.1     1       1       1       2
MODE        0.3     1       12      4       2
COUNT       15002   13337   1709    1926    14126
AVERAGE     0.68    14.63   37.63   57.52   10.94
STDEV       0.68    30.92   26.75   38.41   13.86

Table 5
Overall statistical analysis (number of records: 342384)

            CO      NO      NO2     O3      SO2
MAX         24.6    953     533     284     445
MIN         0.1     1       1       1       2
MODE        0.4     1       60      3       2
COUNT       21942   14854   3979    15732   39101
AVERAGE     1.64    63.62   64.40   36.12   14.78
STDEV       1.58    86.73   35.77   32.65   17.75

Each data record contains the concentrations of the five available air pollutants and also the current Year, Month, Day, Day ID (which is equal to 1 for Monday, 2 for Tuesday and so on) plus the Hour of
measurement. Moreover, it contains 7 meteorological factors, namely: Air Temperature (T), Relative Humidity (RH), Air Pressure (PR), Solar Radiation (SR) (available for all years except 2013), Percentage of Sunshine (SUN) (up to 2010), Wind Speed (WS) and Wind Direction (WD).

2.1. Contribution of clustering

Initially, before the application of any type of clustering, there was an effort to forecast the actual values of each air pollutant (dependent feature) without using any other individual pollutant among the independent parameters. This effort was not successful and the performance was quite low, even though several algorithms were tried, namely Feed Forward Neural Networks, Random Forest and Support Vector Machines (SVM). Fig. 1 presents the performance evaluation of one of these unsuccessful efforts, for CO at the Athinas station.

Fig. 1. Air pollutant forecast before using clustering.

The clustering was done in order to determine the conditions under which extreme pollutant values emerge. Due to the specific physicochemical composition and to the adverse or benign conditions related to each pollutant, we have developed two types of groups of extreme values for each measuring station. The first category is related to the extreme values of the four primary pollutants (CO, NO, NO2, SO2) and the second one to the extreme values of ozone (O3), which is a secondary pollutant. The final result was the creation of the $EDF^{pr}_{ij}$ files ($i$ is the number of primary pollutants and $j$ is the number of measuring stations), containing all of the records related to high concentrations of primary pollutants for each station, and of the $EDF^{sec}_{ij}$ files ($i$ is the number of secondary pollutants and $j$ is the number of measuring stations), related to O3.

2.1.1. Clustering

Clustering has been performed using Fuzzy C-means, Neural Gas Artificial Neural Networks (NGANN), Unsupervised Self Organizing Maps (UNSOM) and Semi Supervised Self Organizing Maps (SEMSOM).

When SOMs are employed, the following three basic procedures are executed:

(i) Competition: For every training vector sample $x_n$ the neurons calculate the value of the similarity function, and the neuron with the highest similarity (that is, the smallest distance) is the winner. The similarity function is the Euclidean distance between the input vector $x = (x_1, \ldots, x_d)^T$, $x \in \mathbb{R}^d$, and the weight vector $w_i = (w_{i1}, \ldots, w_{id})^T$ of each competing neuron.

(ii) Cooperation: The winning neuron $i$ defines its topological neighborhood $h_{j,i}$ of surrounding neurons, which adjust their weights towards the input vector. The distance between the winning neuron $i$ and neuron $j$ is symbolized as $d_{j,i}$, so that the topological neighborhood $h_{j,i}$ is a function of $d_{j,i}$ which satisfies two conditions:

  (a) It should be symmetric about its maximum point (the winning neuron), where $d_{j,i} = 0$.

  (b) The amplitude of the function should decrease monotonically as the distance $d_{j,i}$ from the winning neuron increases.

The function that satisfies the above conditions and was used in this research is the following Gaussian:

$$h_{j,i(x)} = \exp\left(-\frac{d_{j,i}^2}{2\sigma^2}\right) \tag{1}$$

where $\sigma$ is the effective width of the topological neighborhood, which defines the degree to which the neighbors of the winning neuron participate in the training phase. The value of this parameter is reduced in every epoch according to the function below:

$$\sigma(n) = \sigma_0 \exp\left(-\frac{n}{\tau_1}\right), \quad n = 0, 1, 2, \ldots \tag{2}$$

It should be mentioned that $\sigma_0$ is the initial value of the effective width and

$$\tau_1 = \frac{n_0}{\ln(\sigma_0)} \tag{3}$$

(iii) Synaptic Weight Adaptation. In this last training stage the weight vectors of the competitive neurons are updated.
The value of this change is given by the following equation:

$$\Delta w_j = \eta\, h_{j,i(x)}\, (x - w_j), \tag{4}$$

where $i$ is the winning neuron and $j$ is a neuron in its neighborhood. Given the weight vector $w_j(n)$ at a specific time point $n$, we estimate the new vector at time $n + 1$ from Eq. (5):

$$w_j(n+1) = w_j(n) + \eta(n)\, h_{j,i(x)}(n)\, \bigl(x(n) - w_j(n)\bigr). \tag{5}$$

The learning rate $\eta(n)$ starts from a value around 0.1 and it is gradually reduced to 0.01 by using the above relation [17,21].

It is really important to clarify that the actual difference between UNSOM and SEMSOM is that UNSOM automatically determines the dimensions of the topological maps ($n \times n$, $n \times N$), whereas SEMSOM creates maps with dimensions defined by the ANN designer.

Table 6
Groups of extreme CO, NO, NO2, SO2 values based on Neural Gas clustering for the Athinas station (number of records: 2027)

            Year    M       D       D_Id    H       CO      NO      NO2     O3      SO2     T       RH      PR      SR      WS      WD
MAX         2012    12      31      7       24      21.4    908     323     75      125     36.05   96      1022    812     6.6     16
MIN         2000    1       1       1       1       0.2     288     4       1       2       0.11    23      984.5   0       0       0
MODE        2011    1       8       5       9       5       376     102     3       13      10.8    83      1007    0       1.3     3
MODE_COUNT  298     502     105     345     385     42      17      49      831     113     26      120     23      1122    166     356
AVER.       2006    6.90    15.32   3.86    9.71    6.98    452.3   99.49   3.55    25.95   11.58   76.77   1008    73.51   1.06    3.98
STDEV       3.94    4.81    8.93    1.90    6.69    2.61    110.0   33.18   4.15    16.80   4.60    10.65   5.43    121.13  0.56    3.59

On the other hand, the clustering process with the NGANN requires the monitoring of the parameters $\lambda_i$, $\lambda_f$ that control the learning rate, the parameters $\varepsilon_i$, $\varepsilon_f$ that define the initial and the final training rate of the ANN, and $t_{max}$, which defines the maximum number of epochs of the algorithm.

Considering that $t$ is the current epoch, $t_{max}$ is the total number of epochs, $x$ is the input signal produced in the beginning of each epoch, $n$ stands for every node, $n_w$ is the weight vector assigned to
each node and $k$ is the degree (rank) that characterizes each node after the sorting step, the algorithm goes as follows:

(i) A group of nodes is initially created and each one of them is assigned a random weight vector based on the distribution.

(ii) A vector $x$ is created, which is the input signal. All of the nodes are ranked in increasing order of the Euclidean distance $\|x - n_w\|_2$ between their corresponding vectors and the input signal.

(iii) The weights of the nodes are adjusted based on their ranking order, as expressed by Eq. (6):

$$\vec{n}_w \leftarrow \vec{n}_w + \varepsilon(t)\, h(k)\, (\vec{x} - \vec{n}_w) \tag{6}$$

where

$$h(k) = \exp\left(-\frac{k}{\sigma^2(t)}\right) \tag{7}$$

and

$$\sigma^2(t) = \lambda_i \left(\frac{\lambda_f}{\lambda_i}\right)^{t/t_{max}} \tag{8}$$

and finally

$$\varepsilon(t) = \varepsilon_i \left(\frac{\varepsilon_f}{\varepsilon_i}\right)^{t/t_{max}} \tag{9}$$

(iv) The final stage is an iterative process which increases the epoch counter $t$ every time, until the $t_{max}$ condition is satisfied [15,21,24,30].

Finally, in the cases where Fuzzy C-means was employed, every point $x$ had a set of coefficients $w_k(x)$ giving its degree of membership in the $k$th cluster. With Fuzzy C-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster [8,10]:

$$C_k = \frac{\sum_x w_k(x)^m\, x}{\sum_x w_k(x)^m} \tag{10}$$

The degree of belonging, $w_k(x)$, is inversely related to the distance from $x$ to the cluster center as calculated on the previous pass. It also depends on a parameter $m$ that controls how much weight is given to the closest center.

2.2. Clustering analysis

The statistical data of the extreme air pollution groups created by the NGANN algorithm for the Athinas station are presented in Tables 6 and 7. It should be clarified that in all of the tables M stands for month, D for date, D_Id is a code used to identify each day (where 1 stands for Monday) and H for the hour of the day.

During the study of the extreme pollutants' values (EXPV) we have reached the following general and environmental conclusions:

(i) The EXPV are divided into two basic groups, namely the group with the primary pollutants (CO, NO, NO2, SO2) and the one with the secondary pollutants (O3).
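
Since the groups discussed in this section come from the NGANN algorithm, the neural gas steps (i) to (iv) of Section 2.1.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the study. The update is taken to be the conventional neural gas rule, in which each node moves towards the input signal by the fraction ε(t)·h(k). Nodes are initialised on randomly drawn records. The function names, parameter names and sample records are all hypothetical.

```python
import math
import random


def decay(initial, final, t, t_max):
    """Schedule initial * (final / initial) ** (t / t_max), the form of Eqs. (8) and (9)."""
    return initial * (final / initial) ** (t / t_max)


def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def neural_gas(data, n_nodes, t_max, lam_i, lam_f, eps_i, eps_f, seed=0):
    """Fit n_nodes weight vectors to data, a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step (i): start every node on a randomly chosen record (an assumed initialisation).
    nodes = [list(rng.choice(data)) for _ in range(n_nodes)]
    for t in range(t_max):
        x = rng.choice(data)                     # step (ii): input signal for this epoch
        sigma2 = decay(lam_i, lam_f, t, t_max)   # Eq. (8)
        eps = decay(eps_i, eps_f, t, t_max)      # Eq. (9)
        # Rank the nodes by increasing distance to x; rank k = 0 is the closest node.
        ranked = sorted(range(n_nodes), key=lambda j: distance(nodes[j], x))
        for k, j in enumerate(ranked):
            h = math.exp(-k / sigma2)            # Eq. (7)
            # Assumed form of Eq. (6): move node j towards x by the fraction eps * h.
            nodes[j] = [w + eps * h * (xi - w) for w, xi in zip(nodes[j], x)]
    return nodes


if __name__ == "__main__":
    # Hypothetical (CO, NO) records: a background regime and an extreme regime.
    records = [(0.8, 10.0), (1.0, 14.0), (0.9, 12.0), (6.5, 400.0), (7.2, 450.0)]
    for node in neural_gas(records, n_nodes=2, t_max=500,
                           lam_i=1.0, lam_f=0.01, eps_i=0.5, eps_f=0.01):
        print(node)
```

Because every update is a convex combination of a node and a record (the step ε(t)·h(k) never exceeds 1 for these settings), the nodes always stay inside the range of the data. As λ decays, h(k) shrinks towards zero for all but the closest node, so late epochs mostly refine the nearest node only.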