Data modelling and the application of a neural network approach to the prediction of total construction costs

Construction Management and Economics (2002) 20, 465–472

MARGARET W. EMSLEY*, DAVID J. LOWE, A. ROY DUFF, ANTHONY HARDING and ADAM HICKSON

Manchester Centre for Civil and Construction Engineering, UMIST, PO Box 88, Manchester M60 1QD, UK

Received 17 January 2001; accepted 19 April 2002

Neural network cost models have been developed using data collected from nearly 300 building projects. Data were collected from predominantly primary sources using real-life data contained in project files, with some data obtained from the Building Cost Information Service, supplemented with further information, and some from a questionnaire distributed nationwide. The data collected included final account sums and, so that the model could evaluate the total cost to the client, clients’ external and internal costs, in addition to construction costs. Models based on linear regression techniques have been used as a benchmark for evaluation of the neural network models. The results showed that the major benefit of the neural network approach was the ability of neural networks to model the nonlinearity in the data. The ‘best’ model obtained so far gives a mean absolute percentage error (MAPE) of 16.6%, which includes a percentage (unknown) for client changes. This compares favourably with traditional estimating where values of MAPE between 20.8% and 27.9% have been reported. However, it is anticipated that further analyses will result in the development of even more reliable models.

Keywords: Cost modelling, neural networks, linear regression analysis

*Author for correspondence.

Construction Management and Economics ISSN 0144–6193 print/ISSN 1466-433X online © 2002 Taylor & Francis Ltd. DOI: 10.1080/01446190210151050

Introduction

The importance of models to estimate the cost of buildings has been highlighted by Ferry et al. (1999). Newton (1991) reviewed over 60 cost models and classified the techniques used to develop each model under eight headings, including regression techniques. However, in both cases no mention was made of the application of neural networks. Elhag and Boussabaine (1998) developed neural network models to predict the tender price of school buildings, and later (Elhag and Boussabaine, 1999a) they developed two models to predict the tender price of office buildings using linear regression and neural network techniques. They found that both techniques produced models that were able to map the underlying relationship between the cost factors and the tender price but, because the sample size was small (30 and 36 projects, respectively), concluded that more projects were required for meaningful conclusions to be drawn.

This paper describes the development of neural network models of total construction project cost based on recent historical project data. The initial impetus for the research was the paucity of data available that can provide reliable information about the relative costs of using different procurement routes. However, in attempting to develop a model to address this strategic decision, it immediately became apparent that this variable cannot be isolated from the many other cost significant variables in a building project (Harding et al., 1999a), and a model is required that incorporates all the cost significant variables, the values of which are known at the early stage of the project.

This work has been carried out in two stages.
• An initial pilot study was made, where potentially cost significant variables were identified, the availability of data was investigated and strategies for data collection were established. In addition, appropriate modelling strategies were examined, and preliminary testing of these methods was carried out, using a relatively small number (46) of data sets (Duff et al., 1998).
• A full scale study was made using data from nearly 300 projects, in which more sophisticated models were developed. The larger sample addresses one of the deficiencies in the model developed by Elhag and Boussabaine (1999a).

Both the data requirements and the data collection processes are described, and ways in which both the input variables (such as frame type) and output variables (cost) may be best represented in the model are discussed. For the purposes of comparison, linear regression models have also been developed, and the results obtained are given before the development of the neural network modelling process is explained. The first two sets of models to be developed use first the five and then the nine most significant variables identified by the linear regression modelling process, and the final set incorporates all the variables.

Data requirements

The model variables may be divided into input variables and output variables. Initially, 43 input variables were identified, subsequently reduced to 41, as two variables were eliminated (sanitary installations and disposal installations) because almost no variation in their definition was found among the project data collected. Input variables were further categorized as project strategic, site related or design related (Table 1).

The identification of potentially cost significant variables was achieved through a thorough literature search of over 60 publications, supplemented by discussions with the professional collaborators (see Acknowledgements).
In addition, with the exception of ‘quality of building’, all other input variables are encapsulated in the cost analyses published by the Building Cost Information Service (BCIS). This number of input variables is smaller than the 67 variables identified by Elhag and Boussabaine (1999b), but it should be noted that that list also includes factors affecting construction project duration and factors classified as contractor attributes, which would not be known at the stage when the present model is intended to be used.

A criticism of previous cost models is that they used only the tender price to evaluate cost, whereas in reality the cost to the client of a building contract is the final contract sum. This is very rarely the same as the tender price, and Corbett and Rowley (1999) suggested that the final account sum should be made available to cost planners, whereas the BCIS, for example, provide only the tender price. The model described here has been developed using final account figures as the output variable. In addition, the whole cost to the client includes not only the final contract sum but also the professional fees and whatever resources the client has provided to the project. Models were developed separately for construction costs and client costs, but they may be summed to give the total project cost to the client.

The variables of time and geographic location were accommodated through the application of the BCIS cost indices to bring all projects to a common location and base date. The costs predicted, using the model, were then adjusted by the appropriate indices for the time and location in question.

Where projects included external works, demolition, fittings or specialist services, their associated costs were removed from the final account figure and the appropriate proportions were removed from the contract preliminaries and clients’ costs (Harding et al., 2000a). This was done because these costs are subject to wide variation, largely independent of the main variables defining the building.
For example, for the projects where data were collected, the cost of external works varied from 1% to 30% of the total contract sum. Such variation makes these features impossible to model accurately and they are more reliably estimated independently.

Data collection

In total, the data collection programme resulted in the collection of 288 full data sets from predominantly primary sources, supplemented by some secondary data.

Primary sources

The professional collaborators provided a great deal of the data required, and contact was established also with organizations, primarily quantity surveying and project management practices, that were willing to provide data. A data pro-forma was developed to assist both the researchers and, more importantly, those collaborators willing to carry out the data retrieval themselves. This method of collection provided the great majority of building cost analyses. Thirty-nine offices were visited from 20 different organizations across the United Kingdom.

Secondary sources

The BCIS publishes cost analyses for construction projects, and these fulfilled the data requirements, except for the:

• final account;
• actual duration;
• quality of building;
• clients’ external costs; and
• clients’ internal costs.

In order to obtain this additional information, a questionnaire was sent to BCIS subscribers, yielding 29 sets of data.

In addition to these questionnaires, data were obtained from a much more extensive mail-shot administered to 1239 practising quantity surveyors, all of whom had been canvassed by telephone, but this yielded only six additional projects.

Data representation

Each of the variables has been analysed in order to determine the best way of representing that variable in the modelling process. The ways these variables are represented fell into four distinct groups.

The first of these groups comprised variables that are real numbers, for example, ‘duration’ and ‘no. lifts’. Where the range of these variables differed by more than one order of magnitude it was more appropriate to use the logarithm of that value, to ensure that the range of values was more evenly distributed.

The remaining variables are categorical variables that represent one of a choice of categories. As a general rule it is best that a single input is used for a variable only when that variable has some meaning as a single variable (Tarassenko, 1998), i.e. if the value of the variable increases then it must represent an increase in some factor that influences the outcome of the model. For some variables, obtaining such an order was simple. For example, clearly with ‘site access’ there is an order between ‘unrestricted’, ‘restricted’ and ‘highly restricted’, inasmuch as an increase in the restriction to access will be expected to cause an increase in cost. Therefore this variable can be represented by a single input.

There were a great many more variables for which no such order was immediately apparent, e.g. ‘internal wall finishes’, where the variable represents the cost of different material combinations that will make up the finish. The value of the input was set to be the standard cost per m² of each finish, which provides an order proportional to how much each finish is expected to impinge upon the final building cost.

For some categorical variables a consistent order could not be identified, because the actual differences in cost between the possible choices are uncertain. For example, for ‘frame type’ (where the choice was ‘in situ’, ‘masonry’, ‘precast’, ‘steel’ or ‘timber’), there is a lack of consensus on the comparative costs, and it is impossible to ascertain a consistent ordering, in terms of cost, that would apply in all circumstances. Therefore, a binary input coding (yes/no) was applied to each possible choice, thus treating each such categorical variable as a series of binary variables.
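The four representations described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names, the choice of GIFA as the log-transformed variable, the integer codes for ‘site access’ and the cost-per-m² figures for wall finishes are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical codings for illustration; the paper does not publish its own.
SITE_ACCESS_ORDER = {"unrestricted": 0, "restricted": 1, "highly restricted": 2}
FRAME_TYPES = ["in situ", "masonry", "precast", "steel", "timber"]
# Made-up standard costs per m2 for a few internal wall finishes.
WALL_FINISH_COST_PER_M2 = {"paint": 5.0, "plaster and paint": 12.0, "tiles": 30.0}


def encode_project(project):
    """Turn one project record (a dict) into a flat list of numeric inputs."""
    features = []
    # Group 1a: real-valued variable assumed to span several orders of
    # magnitude, so its base-10 logarithm is used.
    features.append(math.log10(project["gifa_m2"]))
    # Group 1b: real-valued variable with a narrow range, used as-is.
    features.append(float(project["no_lifts"]))
    # Group 2: categorical variable with a natural order -> one ordinal input.
    features.append(float(SITE_ACCESS_ORDER[project["site_access"]]))
    # Group 3: categorical variable ordered by its standard cost per m2.
    features.append(WALL_FINISH_COST_PER_M2[project["internal_wall_finishes"]])
    # Group 4: categorical variable with no consistent order -> one binary
    # (yes/no) input per possible choice.
    features.extend(1.0 if project["frame_type"] == t else 0.0 for t in FRAME_TYPES)
    return features


example = {
    "gifa_m2": 1000.0,
    "no_lifts": 2,
    "site_access": "restricted",
    "internal_wall_finishes": "tiles",
    "frame_type": "steel",
}
inputs = encode_project(example)
```

With the binary coding, ‘frame type’ alone contributes five inputs, which is why the number of network inputs can be considerably larger than the number of variables.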
Table 1 Classification of input variables

Project strategic variables:
  Contract form, Procurement strategy, Quality of building, Duration, Purpose, Tendering strategy

Site related variables:
  Site access, Type of location, Topography, Type of site

Design related variables:
  Air conditioning, Ceiling finishes, Electrical installations, Envelope, External doors, External walls, Floor finishes, Frame type, Function, GIFA, Height, Internal doors, Internal walls, Internal wall finishes, No. lifts, No. storeys above ground, No. storeys below ground, Mechanical installations, Piling, Protective installations, Roof construction, Roof finishes, Roof profile, Shape complexity, Special installations, Stair types, Substructure, Structural units, Upper floors, Wall-to-floor ratio, Windows

Comparison of modelling techniques

A comparison of linear regression analysis and neural networks has been made elsewhere (Harding et al., 1999b) and some preliminary analyses carried out (Harding et al., 2000b). However, in this situation, the main advantage that neural networks offer is their ability to capture the nonlinearity that inevitably will exist between variables. Nonlinear regression techniques can also be used to account for nonlinearity, but have the disadvantage that the user must have detailed knowledge about the appropriate nonlinear relationship between the predictor variables and the mean values of the observations (Christensen, 1996). However, when applying neural networks, these relationships are determined implicitly by the model and therefore do not need to be specified.

Representation of cost (output variable)

Previously, cost models have often used the raw cost as the dependent variable. However, there are a number of assumptions implicit in this choice of variable. First, it is assumed that the standard deviation of error remains constant. That is to say, the cost of a small project can vary by the same monetary amount as a large project. This is highly unlikely to be the case.
Further, regression model fitting minimizes the squares of the errors, so models developed using this technique will be inherently biased towards minimizing the errors for very large projects, where the errors are greatest. Therefore it is unlikely to be a good predictor of the cost of smaller projects. Given that the costs of projects in the data collected vary between £36,000 and £15,800,000, the influence of errors on the cost of the largest projects is several orders of magnitude more than those of the smallest projects, so the effect will be pronounced.

The second inherent assumption that is questioned is that the effects of any variable are best expressed as a fixed cost change. If, for example, the specification of the floor finishes changed to one of a higher cost, the cost of the building would be expected to rise. However, the cost of a small building would not be expected to rise by the same amount as the cost of a very large one, but as a proportion of the building size or cost.

These criticisms raise serious questions as to the meaningfulness of models produced by using raw cost as the predictor for a linear regression model. Therefore, three other possible models were tested.

Log of building cost

In order to address the problem of the large cost differences, a common solution is to model the log of the cost. This assumes that the log of cost is normally distributed, and that a change in any variable within the model will cause a proportionate change in cost. The distribution that corresponds to a normal distribution of the log of cost is the one whose mean is the project cost, and whose standard deviation is a fixed proportion of project cost. When the normal distribution is converted to raw cost for any project, it is a positively skewed distribution such that the peak of the probability density function is less than the mean.
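These properties follow from the standard formulas for the log-normal distribution, and a short Python sketch makes them concrete. It assumes natural logarithms and a hypothetical spread `sigma` for the log of cost; neither is specified in the text, and the function name is introduced here only for illustration.

```python
import math


def lognormal_from_mean(mean_cost, sigma):
    """Summarise a log-normal cost distribution with a given mean.

    Assumes ln(cost) is normal with standard deviation `sigma` (a
    hypothetical value) and a location chosen so that the mean of the
    raw cost equals `mean_cost`.
    """
    mu = math.log(mean_cost) - sigma ** 2 / 2   # location of ln(cost)
    mode = math.exp(mu - sigma ** 2)            # peak of the density
    median = math.exp(mu)
    # Standard deviation of raw cost: proportional to the mean.
    sd = mean_cost * math.sqrt(math.exp(sigma ** 2) - 1)
    return {"mode": mode, "median": median, "mean": mean_cost, "sd": sd}


# Smallest and largest project costs quoted in the text, same sigma.
small = lognormal_from_mean(36_000, 0.2)
large = lognormal_from_mean(15_800_000, 0.2)
```

For any positive `sigma` the mode lies below the median, which lies below the mean, and the ratio `sd / mean` depends only on `sigma`, so the small and the large project share the same proportional spread.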
It might be argued that this positively skewed distribution could be a better representation of the possible variation in project cost than an unskewed probability distribution, because generally there is more scope for the project costs to be much higher than expected rather than much lower.

Cost per m²

The cost per m² is the cost predictor most used by quantity surveyors, as it provides a measure of cost that is essentially independent of building size. If this value were to be used in a regression model, then it would assume that any variation in project cost is proportional to the size of the building, rather than the cost. This may seem to be an unrealistic solution, because projects that are of a higher specification (and hence a higher cost per m²) might be expected to show correspondingly higher variations in cost. However, it has the advantage of removing the understood linear relationship between GIFA and project cost from the model. This should allow the modelling to focus on other, less understood, influences on project cost.

Log of cost per m²

The log of cost per m² makes the same assumptions as the log of cost, in that variations in project cost are proportional to the expected cost. However, it also provides a variable that is devoid of the linear relationship between cost and size, in the same way as the cost per m² output. Although this makes little difference to regression models, it could be useful in neural network modelling, if the correlation between cost and GIFA creates difficulties for those models that use the log of cost in learning the relationship between cost and other variables, because this could stop the neural network learning only this relationship to the detriment of others.

Factor analysis

Factor analysis was used to establish the underlying dimensions of the input variables, confirming that all items should be retained.
Principal component extraction with varimax rotation was also used and, although this indicated that the true number of underlying dimensions lay between eight and nine factors, regression models created using the factor scores resulted in R² values less than those generated using the original input variables. Therefore a decision was reached to proceed using the original input variables.

Linear regression analysis

The first aim of the regression analysis was to develop a robust regression cost model that would make a useful benchmark against which neural network models could be measured. Second, it was desirable to identify those variables that demonstrated a strong linear relationship with the cost, to assist in the management of neural network training. The software used was SPSS 7.5 for Windows.

In order to create a predictive regression model, two methods were attempted: forwards and backwards modelling. Models to predict cost per m², log of cost per m² and log of cost were generated for each method. The number of statistically significant variables in each model varied between eight, in the forward stepwise log of cost model, and 14, in both the log of cost and log of cost per m² backward models. Throughout the models 19 different variables were used. A summary of the results is given in Table 2.

Five variables appeared in all six models: ‘GIFA’, ‘function’, ‘duration’, ‘mechanical installations’ and ‘piling’. This suggests that these are the key linear cost drivers in the data. A further four variables appeared in five models: ‘internal wall finishes’, ‘frame type’, ‘site access’ and ‘protective installations’.

The log of cost backwards models outperformed the other models by most of the percentage error measures. However, the differences between all the models were small.
One of the reasons the log of cost and log of cost per m² models performed so well was that they incorporated more variables of whose inclusion in the model we could be confident, although they may not necessarily be the most useful variables in terms of which to model, because their values may be uncertain at the early estimating stage.

As the models exhibited similar performance, it might be more appropriate to consider the spread of error. Neural networks minimize error using the least-squares approach. As this can be sensitive to non-uniformity of standard deviation and non-normality of error, the spread and normality of error were assessed for each model by considering scatter plots of error against the value of the independent variable, all of which displayed the same tendency for the models to overestimate the cost of cheaper projects and underestimate the cost of more expensive projects. The fact that as much as 30% of the error appeared to arise from this suggests that some key drivers of building cost were not being represented adequately. This arises either from their non-inclusion or, more likely, from nonlinearities in the data, in which case it may be expected that a neural network will perform better.

Neural networks

Three sets of models were developed, using:

• the five variables that were incorporated in all six of the linear regression models;
• the nine variables found in at least five of the six regression models; and
• all the variables for which data have been collected.

In addition, an optimum combination of variables was used, determined by using a combination of the forwards and backwards stepwise feature selection algorithm of a generalized regression neural network (GRNN) and a genetic algorithm (GA) global optimization search technique. Because theoretically a GA is better able to come up with an optimum combination of variables, and is a randomized process, this was repeated four times and the forwards and backwards algorithms were executed once each.
By excluding all the variables that do not appear in the feature selection results, the model was reduced to a 25-variable model. However, as the performance of this model was poorer on both measures of performance than the all-variable model, suggesting that one or more significant variables had been omitted from the analysis, this approach was abandoned. The software used was Trajan Neural Network Simulator Release 4.0E.

In order to assess the best approach to the modelling, a number of different networks were tried: three- and four-layer multi-layer perceptrons (MLPs), radial basis functions (RBFs) and generalized regression neural networks (GRNNs). Of these alternatives, three-layer MLP networks (with one hidden layer) offered the best performance, in terms of the associated values of R² and mean absolute percentage error (MAPE).

To train the network, backpropagation, conjugate gradient descent, Levenberg Marquardt, quick

Table 2 Linear regression model results

            Cost per m²       Log of cost per m²   Log of cost
            R²      MAPE      R²      MAPE         R²      MAPE
Forward     0.668   20.8%     0.648   20.0%        0.644   20.1%
Backward    0.666   21.7%     0.661   19.3%        0.661   19.3%
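The two performance measures reported for every model, R² and MAPE, can be computed as in the following minimal Python sketch. It uses the conventional definitions; the text does not state whether the authors evaluated errors on raw cost or on the transformed output, so the choice of scale is left to the caller, and the function names and sample figures are illustrative only.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Illustrative (made-up) final account sums and model predictions.
actual_costs = [100.0, 200.0, 400.0]
predicted_costs = [110.0, 180.0, 400.0]
error_pct = mape(actual_costs, predicted_costs)
fit = r_squared(actual_costs, predicted_costs)
```

Because MAPE weights every project by its own size, it is not dominated by the largest projects in the way a sum of squared raw-cost errors is, which makes it a natural complement to R² when project costs span several orders of magnitude.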