Big Data for Conventional Programmers
Big Data - Not a Big Deal
by Efraim Moscovich, Senior Principal Software Architect, CA Technologies

Processing large volumes of data has been around for decades (such as in weather, astronomy, and energy applications). It required specialized and expensive hardware (supercomputers), software, and developers with distinct programming and analytical skills. As the push for "Big Data" collection and analysis becomes more prevalent in the general business community, there is an increased demand for systems and language environments that can run on commodity, inexpensive hardware and software, and that can be programmed and operated by programmers and analysts with average, mainstream skills. This article examines several languages and tools that were designed for Big Data processing, and discusses the present and desirable attributes of data-intensive programming languages, in particular as they relate to ease of use, data abstraction, data flow, and data transformations.

Introduction

What is "Big Data"?

To paraphrase a section in McKinsey Global Institute's Big Data report, "Big Data" refers to datasets whose sizes are beyond the ability of typical software tools to capture, store, manage, and analyze them [1].

The definition is intentionally imprecise. Big Data is not defined in precise terabytes or petabytes, since the threshold can change as the technology advances, and by sector or type of data. In addition to the elusive size of the data, the term Big Data implies methodologies and tools for processing and analyzing the data to produce useful results that cannot be inferred or calculated using other methods in an efficient manner.

From personal experience in Capacity Management, I used to write programs to read, summarize, and analyze large mainframe datasets that spanned years' worth of performance data stored on tapes (SMF records). The programs would run for days and produce a few reports. That was certainly "large data" processing. The language in use, SAS, lacked many important attributes for data-intensive computing, such as parallel and distributed processing, but it did the job on the expected volume and in the expected time.

The ubiquitous use of Google Search, which users perceive to scan terabytes of data just for them and produce results instantly, raised the level of expectations much higher. If a teenager can easily search for her favorite band and get all relevant information in an instant, why can't a corporate analyst do the same with the company's mountains of data?

Problem statement: How to process "Big Data" using the current state-of-the-art technology within an agreed-upon time and budget?

About the author: Efraim Moscovich is a Senior Principal Software Architect in the Office of the CTO at CA Technologies. He has over 25 years of experience in IT and software development in various capacities. Efraim's areas of expertise include virtualization and cloud computing, automation, event management/complex event processing, internationalization and localization, performance management, Windows internals, clustering and high availability, large-scale software architecture, continuous integration, automated testing, scripting languages, and diagnostics techniques. He is an active participant in the OASIS TOSCA technical committee and the DMTF Cloud Management working group. Prior to joining CA Technologies, Efraim worked on large-scale performance management and capacity planning projects at various IT departments. Efraim has an M.Sc. in Computer Science from the New Jersey Institute of Technology.
The Road to Hadoop

Even though CPUs became faster and faster (Moore's law [2]), the speed of accessing data on disks or volumes was still orders of magnitude slower than the CPUs. Programs that needed to process large amounts of data didn't benefit much from the added speed (these are referred to as I/O-bound programs). Modern languages support multi-programming concepts such as multi-threading, implemented as libraries (e.g., POSIX pthreads) for C++, or built into the language, as in Java. Some specialized languages support co-routines or parallel collection processing. To take advantage of servers with multiple CPUs and multiple cores, parallel programming models can be used to reduce the time it takes to process a certain load of computation or data manipulation.

Task Parallelism

In general, certain segments of the program that are 'not sequential' are broken down into multiple chunks that can run simultaneously on different CPUs on the same machine, thus reducing the total time it takes to run the program. The program can be designed by the programmer to use parallelism explicitly, by the compiler based on program source analysis, or by the compiler with the aid of hints provided by the programmer. Task parallelism is suitable for many cases, but in general it is considered complex, especially when the programmers have to use it explicitly.

Issues and Limitations

Generic parallel computing greatly increases coding complexity, since the programmer has to deal with the additional overhead associated with it. Examples:

• Coordination of concurrent tasks: Extra code is needed to spawn multiple tasks, wait for their completion, wait until one signals to proceed, and pack/send/unpack data between tasks.
• Parallelization of algorithms: Creating parallel versions of algorithms is not straightforward. For example, see "Parallelization of Quicksort" [3].
• Shared memory locking and synchronization: Shared memory that is used by multiple threads has to be carefully managed and explicitly protected by locks or semaphores to avoid data corruption.

In addition, hardware that supports massive parallelism is specialized, complex, and expensive. In many cases the hardware and software are designed specifically for one project or target audience. For example, see IBM Blue Gene/Q [4].

Data Parallelism

The Data Parallel programming model is more suitable for scenarios where the exact same operations can be applied on multiple, independent data elements such as records, files, documents, and web pages. Turning sequential processing into parallel:

// Sequential version
foreach (var item in collection) {
    Process(item);
}

// Parallel equivalent
Parallel.foreach(collection, Process(item));

Figure 1: Sequential to Parallel

Simplistically, we want the programming language and its runtime to support a method of running loop iterations in parallel, and to handle the distribution of the processing functions with the correct item on any number of available processors automatically (a Java sketch of this idea appears after the SIMD discussion below).

SIMD

On specialized hardware this is called Single Instruction, Multiple Data (SIMD): the same instruction is performed on multiple data elements in lockstep (for example, incrementing all the entries in an array simultaneously). SIMD hardware is not widespread, and the number of operations that can be done in parallel is relatively small (<100) and fixed. Also, SIMD requires vector processors, which are expensive.
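To make Figure 1 concrete in a mainstream language, the sketch below uses Java parallel streams to run the loop body on multiple cores. It is only an illustration: the collection contents and the process() operation are made up for the example, and it assumes the per-item work is independent (no shared mutable state).

import java.util.List;

public class ParallelLoop {

    // Stand-in for the per-item operation Process(item) in Figure 1.
    static void process(String item) {
        System.out.println(item.toUpperCase());
    }

    public static void main(String[] args) {
        List<String> collection = List.of("alpha", "beta", "gamma");

        // Sequential version: one item at a time, in order.
        for (String item : collection) {
            process(item);
        }

        // Parallel equivalent: the runtime distributes items across available
        // cores; the order in which items are processed is not guaranteed.
        collection.parallelStream().forEach(ParallelLoop::process);
    }
}

The parallel version is only as safe as the independence of its iterations; if process() updated shared state, the locking issues described under "Issues and Limitations" would reappear.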
Parallel SQL

Since standard SQL is the de facto language for data manipulation across multiple platforms, it is natural to expect that, perhaps, a form of parallel SQL can solve the problem. Indeed, SQL database servers from Oracle and Microsoft have allowed some form of parallel execution of queries and bulk operations for years. However, it is complex (think nested outer and inner joins) when done explicitly by the programmer, or limited mostly to parallel execution on the same node [5].

For many computing domains, such as batch processing of log and text files for web crawling and page analysis, the computations on the data are relatively simple. These techniques use operations such as reading, filtering, extraction, counting, reordering, matching, and the like. However, when the input data is very large, these computations have to be distributed across thousands of machines in order to finish the job in a reasonable amount of time. To achieve this, a large amount of code has to be written to distribute the programs and data to multiple nodes, parallelize the computations, and handle coordination, load balancing, and failures. The actual 'useful' code is relatively small and is obscured by the amount of boilerplate 'overhead' code. Jeffrey Dean and Sanjay Ghemawat from Google tried to simplify this by creating MapReduce [6].

Distributed Data Parallelism

MapReduce

MapReduce was inspired by the Map and Reduce operators in functional languages such as LISP. Logically, MapReduce can be reduced (no pun intended) to the following (a toy in-memory sketch of this flow follows the list):

• Treat the data as a set of <Key, Value> pairs
• Input readers read the data and generate <Key, Value> pairs
• The user provides two functions, Map and Reduce, that are called by the runtime
• Map:
  o Take a list of <Key, Value> pairs, process them, and output a list of new <Key, Value> pairs
  o Each pair is processed in parallel
• Reduce:
  o Take all of the values associated with the same <Key>
  o Process and output a new set of values for this Key
  o Each reducer can be processed in parallel
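The names Mapper, Reducer, and run in the sketch below are illustrative only; this is neither the Google nor the Hadoop API, and it runs in a single process with no distribution or fault tolerance. It exists purely to show the data flow just listed: map over <Key, Value> pairs, group (shuffle) the intermediate pairs by key, then reduce each group.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniMapReduce {

    interface Mapper<K1, V1, K2, V2> {
        // May emit any number of intermediate <K2, V2> pairs per input pair.
        void map(K1 key, V1 value, List<Map.Entry<K2, V2>> out);
    }

    interface Reducer<K2, V2, R> {
        // Called once per distinct key, with all values grouped under that key.
        R reduce(K2 key, List<V2> values);
    }

    static <K1, V1, K2, V2, R> Map<K2, R> run(Map<K1, V1> input,
                                              Mapper<K1, V1, K2, V2> mapper,
                                              Reducer<K2, V2, R> reducer) {
        // Map phase: in a real framework each input pair could be processed in parallel.
        List<Map.Entry<K2, V2>> intermediate = new ArrayList<>();
        input.forEach((k, v) -> mapper.map(k, v, intermediate));

        // Shuffle phase: group intermediate values by key.
        Map<K2, List<V2>> grouped = new HashMap<>();
        for (Map.Entry<K2, V2> e : intermediate) {
            grouped.computeIfAbsent(e.getKey(), x -> new ArrayList<>()).add(e.getValue());
        }

        // Reduce phase: in a real framework each key could be reduced in parallel.
        Map<K2, R> results = new HashMap<>();
        grouped.forEach((k, vs) -> results.put(k, reducer.reduce(k, vs)));
        return results;
    }
}

For word counting, the mapper would emit a <word, 1> pair for each word in a document and the reducer would sum the values for each word, which is exactly what the pseudocode in Figure 2 below expresses.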
To illustrate MapReduce, let's consider the problem of producing a summary of word occurrences in a large collection of text documents. The programmer has to write two functions:

1. The Map function will iterate over the contents of a single doc and produce (word, count) pairs.
2. The Reduce function will combine the separate counts for each word and produce a single result.

Map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, 1);

Reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Figure 2: Map, Reduce Pseudocode

The MapReduce library and runtime handle all the rest: managing the nodes where the computations take place, 'serving' the documents to the 'map' operations, collecting and combining intermediate results, and running the reduce operations on the combined list.

The MapReduce programming model has several advantages:

• The conceptual model is simple to understand: just two basic operations
• It is generic enough for expressing many practical problems that deal with data analysis
• The implementation hides the messy details of parallelization and fault recovery
• It scales well on thousands of commodity machines and terabytes of data
• It can be incorporated into many procedural and scripting languages
• It automates data distribution and result aggregation
• It restricts the ways data can interact to eliminate locks (no shared state = no locks!)

Despite the simplicity of the concepts, MapReduce has several disadvantages:

• Using MapReduce with a conventional programming language such as Java, C++, or Python is not simple. The programmer has to code even simple operations
• The one-input, two-phase data flow is rigid, and hard to adapt to other applications
• The opaque nature of the map and reduce functions impedes optimization

Despite the shortcomings, MapReduce makes a large subset of distributed problems easier to code. MapReduce programming has gained in popularity after it was implemented by Apache in the Hadoop project.

Hadoop

Hadoop MapReduce is a programming model and software framework for writing applications that process large amounts of data in parallel on clusters of commodity machines.
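To show what it means that "the programmer has to code even simple operations," here is a sketch modeled on the canonical WordCount program from the Hadoop MapReduce tutorial. It assumes a recent Hadoop release with the org.apache.hadoop.mapreduce API on the classpath; input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit a <word, 1> pair for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all counts emitted for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Even for this trivial computation, most of the program is type declarations and job wiring rather than counting logic, which illustrates the disadvantage noted above.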