A Survey on Various Approaches for Weblog Analysis Using Hadoop

IJSRD - International Journal for Scientific Research & Development | Vol. 3, Issue 11, 2016 | ISSN (online): 2321-0613

Dhruv V. Kapatel and Dr. Premal J. Patel
Department of Computer Engineering, Ipcowala Institute of Engineering & Technology, Dharmaj, Anand, Gujarat, India - 388430

Abstract — Weblogs are a rich source of information: all the activities of a website's visitors are captured in weblogs and stored on the server. When many visitors hit a website every day, a massive volume of weblogs accumulates, and traditional systems fail to store and process such huge amounts of log data. Hadoop [7] is well suited to processing these logs. In this paper we survey various approaches to weblog analysis using Hadoop.

Key words: Hadoop, Weblog, Big Data, Analytics

I. INTRODUCTION

Big data has become a high-profile buzzword in the last few years. There has been a huge explosion of data: an estimated 90% of the data in the world was created in the last two years [6]. A weblog captures the actions performed by users on a website, and depending on the website's popularity the logs may grow to terabytes or even petabytes. Volume is not the only problem; weblogs are also semi-structured or unstructured [1], so storing and analyzing them is challenging. Website administrators and owners are always eager to know how visitors use their website. Weblog analysis provides deep insight into the website and helps them make better decisions.

Hadoop is a framework for storing and processing large data sets on clusters of commodity hardware [7]. Its two main components are HDFS (the Hadoop Distributed File System) and the MapReduce engine. Hadoop brings the computation to the data, which is the fundamental reason for its speed. Weblogs can be analyzed with native MapReduce code or with higher-level programming tools such as Pig [12] and Hive [13]; we discuss the different approaches in Section III.

II. HADOOP

A. Architecture of Hadoop

Fig. 1 shows a multi-node Hadoop cluster. Hadoop provides a location-aware file system: HDFS is the backbone of the Hadoop system, storing data by replication and placing the copies on different racks for fault tolerance. A Hadoop cluster consists of a master node and multiple worker nodes. The master node runs a JobTracker, TaskTracker, NameNode, and DataNode, while each slave node runs a TaskTracker and a DataNode. HDFS operates across the whole cluster and hosts the file-system metadata on the NameNode; a secondary NameNode periodically generates snapshots of the NameNode's state. Job scheduling is handled by the JobTracker. The NameNode knows where all the data on the DataNodes is stored, and when a user interacts with the Hadoop system it is the NameNode that supplies this information.

Fig. 1: Multi-node Hadoop cluster [8]

B. Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) exposes a Java API for storing data and provides a portable file system that is scalable and distributed across multiple machines. Fig. 2 shows the HDFS architecture. Users access and store their data through the NameNode, which therefore becomes a single point of failure. To mitigate this, a secondary NameNode is always active: it takes snapshots of the NameNode's metadata and stores that information itself, behaving like a checkpoint. If the NameNode fails, its state can be recovered from the secondary NameNode. HDFS also gives the JobTracker and TaskTracker data-location awareness, so tasks can be scheduled close to the data they process.

Fig. 2: Architecture of the Hadoop Distributed File System [9]
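As a small illustration of the Java API that HDFS exposes, the following minimal sketch writes one sample log line into HDFS and reads it back. The NameNode URI, file path, and class name are assumptions for illustration only; in a real deployment the file-system address comes from the cluster's core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLogExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally supplied by core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write one sample access-log line to an (assumed) HDFS path.
        Path logPath = new Path("/logs/access_sample.log");
        try (FSDataOutputStream out = fs.create(logPath, true)) {
            out.write("127.0.0.1 - - [10/Jan/2016:13:55:36] \"GET /index.html\" 200\n"
                    .getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the NameNode resolves which DataNodes hold the blocks.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(logPath), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```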
C. MapReduce

MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map function, which performs filtering and sorting of the input, and a Reduce function, which performs a summary operation that combines the intermediate results. The working of MapReduce is shown in Fig. 3. These functions run in parallel over large data, and all computation is expressed on key/value pairs.

- Map step: the master node takes the input, divides the large problem into smaller sub-problems, and distributes them among the worker nodes, possibly in a multi-level tree structure. The worker nodes process these sub-problems and pass the results back to the master node.
- Reduce step: the Reduce function accepts the intermediate output of the Map functions, combines the answers to all the sub-problems, and forms the final output.

Map operations run in parallel, and the reduce phase likewise runs in parallel across keys: every intermediate value produced for a given key is delivered to the same reduce call.

Fig. 3: MapReduce data flow with multiple reduce tasks [10]
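To make the model concrete, here is a minimal sketch of a native MapReduce job that counts hits per client IP address in a weblog, the kind of analysis used by several of the approaches surveyed in the next section. It uses the standard org.apache.hadoop.mapreduce API; the class name, the log format (client IP as the first space-separated field), and the input/output paths are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IpHitCount {

    // Map step: emit (clientIP, 1) for every log line.
    public static class HitMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text ip = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumes common log format: the client IP is the first field.
            String[] fields = value.toString().split(" ");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                ip.set(fields[0]);
                context.write(ip, ONE);
            }
        }
    }

    // Reduce step: sum the counts for each IP; all values for one key
    // arrive at the same reduce call.
    public static class HitReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "weblog ip hit count");
        job.setJarByClass(IpHitCount.class);
        job.setMapperClass(HitMapper.class);
        job.setCombinerClass(HitReducer.class); // local pre-aggregation
        job.setReducerClass(HitReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Even this small job hints at the development-cost argument made in the next section: an aggregation that takes a few lines of Pig Latin or HiveQL requires a full Java class with considerable boilerplate here.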
III. RELATED WORK

This section presents essential contributions on weblog analysis using Hadoop.

Hemant Hingave and Prof. Rasika Ingle [1] presented an approach to weblog analysis using Hadoop. They analyzed weblogs using both native MapReduce and an RDBMS and compared the time taken by the two approaches; their results show that Hadoop takes less time than the RDBMS. The drawback of this approach is that native MapReduce requires more development time and effort than higher-level language tools such as Pig and Hive [11]. It also requires proficient Java developers to write the MapReduce programs and additional QA to achieve efficiency, which increases the overall development cost.

Chen-Hau Wang, Ching-Tsorng Tsai, Chia-Chen Fan and Shyan-Ming Yuan [2] proposed a Hadoop-based weblog analysis system; the flow chart of their approach is shown in Fig. 4. They analyzed weblogs using Pig [12]. The con of this approach is that the administrator has to generate reports manually from the results of the Pig scripts.

Fig. 4: Flow chart of the system [2]

Sayalee Narkhede, Trupti Baraskar and Debajyoti Mukhopadhyay [3] computed hit counts from weblog files using Hadoop. They ran a Pig script over the results of MapReduce code to find the hit count, and presented the results in the form of charts and graphs.

ZhenQi Wang and HaiLong Li [4] analyzed weblogs to find interesting browsing patterns using a parallelized Apriori algorithm. They compared the execution time of the traditional Apriori algorithm on a single node against the parallel implementation on multiple nodes; their results show that parallel Apriori is faster than traditional Apriori.

In 2014, Savitha K and Vijaya MS [5] presented an approach to mining web server logs using a distributed cluster and big data technology. They analyzed weblogs for session identification and found the URL count for each unique IP address using native MapReduce code. They ran the same analysis in non-Hadoop, pseudo-distributed, and fully distributed modes and compared the results, which show that the fully distributed mode of Hadoop takes the least time. As with the first paper, the con of this approach is that development time and effort are high with native MapReduce code; we cannot expect every data analyst to have strong Java skills. Native MapReduce is faster than Pig and Hive, but the performance cost of Pig and Hive can be offset by adding extra machines to the cluster, and the development cost of MapReduce is much higher than the cost of those extra machines.

IV. CONCLUSION

In this paper we discussed various approaches to analyzing weblogs using Hadoop. Weblog analysis can reveal important details hidden in the logs, and Hadoop reduces the overall processing time. Native MapReduce code requires more development time and effort than higher-level language tools such as Pig and Hive, which are great tools for analysis and increase analyst productivity. Analyzing the logs of an e-commerce site can give retailers deep insight into how visitors use their website. An Apache benchmark shows that Hive is faster than Pig [14]. As future work, we can analyze e-commerce logs using Hive and compare its performance with Pig.

REFERENCES

[1] Hemant Hingave and Rasika Ingle, "An approach for MapReduce based log analysis using Hadoop", ICECS, pp. 1264-1268, Feb 2015.
[2] Chen-Hau Wang, Ching-Tsorng Tsai, Chia-Chen Fan and Shyan-Ming Yuan, "Hadoop based weblog analysis system", 7th International Conference on Ubi-Media Computing and Workshops, July 2014.
[3] Sayalee Narkhede, Trupti Baraskar and Debajyoti Mukhopadhyay, "Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment", IEEE, March 2014.
[4] ZhenQi Wang and HaiLong Li, "Research of massive Web log data mining based on cloud computing", 5th International Conference on Computational and Information Sciences (ICCIS), pp. 591-594, June 2013.
[5] Savitha K and Vijaya MS, "Mining of Web Server Logs in a Distributed Cluster Using Big Data Technologies", International Journal of Advanced Computer Science and Applications (IJACSA), 5(1), pp. 137-142, 2014.
[6] http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
[7] http://hadoop.apache.org
[8] https://en.wikipedia.org/wiki/Apache_Hadoop
[9] http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
[10] Tom White, Hadoop: The Definitive Guide, Yahoo Press, third edition, 2012.
[11] https://www.dezyre.com/article/-mapreduce-vs-pig-vs-hive/163
[12] https://pig.apache.org
[13] http://hive.apache.org
[14] https://issues.apache.org/jira/browse/HIVE-396