Data & Analytics

Big Data - HDInsight and Power BI

Description
Big Data is one of the hottest topics in the IT industry worldwide. The term describes the exponential growth and availability of data, both structured and unstructured. Big data may become as important to business, and to society, as the Internet has. More accurate analyses can lead to more confident decision making, and better decisions can mean greater operational efficiency, lower costs and reduced risk. This presentation covers the why, what and how of big data by exploring two of Microsoft's big data solutions, the HDInsight Azure service and Power BI.
Transcript
  • 1. Big Data - HDInsight and Power BI, Prasad Prabhu
  • 2. WHAT IS BIG DATA?
  • 3. WHAT IS BIG DATA?
  • 4. INFO IN TABULAR FORMAT ROWS & COLUMNS DEFINED SCHEMA PRIMARY KEY RELATIONSHIPS FOREIGN KEY
  • 5. How do we analyze this data?
  • 6. TRADITIONAL DW/BI ENVIRONMENT Data Warehouse ETL ERP/ CRM,
  • 7. EVOLUTION OF DATA [Chart: data volume plotted against velocity and variety, with storage cost per GB over time]
      Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
      ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
      WEB 2.0: Mobile, Advertising, eCommerce, Collaboration, Digital Marketing, Search Marketing, Web Logs, Recommendations
      Internet of things: Wikis / Blogs, Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
      Storage/GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07
  • 8. DATA IS GROWING
  • 9. 90% of the world’s data has been created in the last 2 years Source:SINTEF
  • 10. Source: IBM
  • 11. 3 ‘V’S OF BIG DATA VOLUME (Size) VARIETY (Structure) VELOCITY (Speed)
  • 12. How do we handle this massive amount of data, which arrives in different forms and at high speed?
  • 13. TOMORROW'S DW/BI ENVIRONMENT Business Critical Data Warehouse ETL New data sources
  • 14. WHAT IS HADOOP?
      Apache Hadoop is an open source system to reliably store and process a LOT of information across many commodity computers.
      Began life as an open source implementation of Google's MapReduce and GFS papers.
      Now used at many major web companies at massive scale (1000s of nodes, PBs of storage).
      Key attributes:
      • Open source
      • Highly scalable
      • Runs on commodity hardware
      • Redundant and reliable (no data loss)
      • Batch processing centric, using the "Map-Reduce" processing paradigm
  • 15. 2 CORE COMPONENTS OF HADOOP Distributed Processing (MapReduce) Distributed Storage (HDFS)
  • 16. HADOOP IS JUST A FILE SYSTEM Head Node Data Node Data Node Data Node Data Node Data Node File
  • 17. HADOOP IS JUST A FILE SYSTEM Head Node Data Node Data Node Data Node Data Node Data Node Replicated 3 times File Read Optimised & Failure Tolerant
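The idea on slides 16 and 17 (a file split into blocks, each block copied to three data nodes so that reads survive node failures) can be sketched in a few lines. This is only an illustration: the function names, node names and sizes are hypothetical, and the simple round-robin placement used here is an assumption, not the placement policy that HDFS itself applies.

```python
def place_blocks(file_size, block_size, data_nodes, replication=3):
    """Split a file into fixed-size blocks and pick `replication` distinct nodes per block.

    Round-robin placement is an illustrative assumption, not the real HDFS policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Start at a different node for each block and take the next `replication` nodes.
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_nodes):
    """The file stays readable if every block still has at least one live replica."""
    failed = set(failed_nodes)
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical five-node cluster, matching the five data nodes drawn on the slide.
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
layout = place_blocks(file_size=1000, block_size=256, data_nodes=nodes)
```

With three replicas per block, any two nodes can fail and `readable_after_failure(layout, ["dn1", "dn2"])` still holds; losing all three nodes that carry the same block does not.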
  • 18. MAP + REDUCE = EXTRACT, LOAD + TRANSFORM [Diagram: four Raw Data inputs each feed a Mapper (MAP); the Data emitted by the four mappers feeds a single Reducer (REDUCE), which produces the Output]
  • 19. MAP REDUCE ANALOGY: BLOGGER ANALYSIS
      Hi John,
      As you know, we are building the blogging platform blogger2.com, and I need some statistics. Across all blogs ever written on blogger.com, I need to find out how many times one-character words occur (like 'a', 'I'), how many times two-character words occur (like 'be', 'is'), and so on up to how many times ten-character words occur.
      • Occurrence of one-character words: around 937688399933
      • Occurrence of two-character words: around 23388383830753434
      • ... and so on up to 10
      I know it's a really big job, so I will assign all 50,000 employees working in our company to work with you on this for a week. I am going on vacation for a week, and it's really important that I have this when I return. Good luck.
      Regards, CEO
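The CEO's request on slide 19 maps naturally onto the map and reduce steps: each "employee" (mapper) reads some blogs and emits a pair (word length, 1) per word, and the reducer adds up the ones for each length. Below is a minimal single-process sketch of that flow. The function names and the two sample blog texts are made up for illustration, and stripping punctuation this way is a simplifying assumption; a real Hadoop job would distribute the mappers across nodes.

```python
from collections import defaultdict


def map_blog(text):
    """Mapper: emit (word_length, 1) for every word of 1 to 10 characters."""
    for word in text.split():
        word = word.strip(".,;:!?'\"()")  # crude punctuation removal (assumption)
        if 1 <= len(word) <= 10:
            yield len(word), 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reducer: total the occurrences for one word length."""
    return key, sum(values)


def word_length_counts(blogs):
    pairs = (pair for blog in blogs for pair in map_blog(blog))
    return dict(reduce_counts(k, v) for k, v in sorted(shuffle(pairs).items()))


# Two tiny hypothetical blog posts standing in for "all blogs ever written".
blogs = ["I am a blogger", "It is a big job"]
result = word_length_counts(blogs)
print(result)  # {1: 3, 2: 3, 3: 2, 7: 1}
```

Because each mapper only looks at its own blogs and the reducer only needs the grouped counts, the same logic scales out by adding more mappers, which is the point of the 50,000-employee analogy.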
  • 20. THE ECOSYSTEM Query (Hive) Distributed Processing (MapReduce) Distributed Storage (HDFS) ODBC Legend Red = Core Hadoop Blue = Data processing Purple = Microsoft integration points and value adds Orange = Data Movement
  • 21. HADOOP SOLUTIONS
  • 22. INTRODUCING HDINSIGHT
      • HDInsight is Microsoft's 100% Apache compatible Hadoop distribution
      • Available as a Microsoft Azure service
      • Develop in .NET and Java
      • Built on Hortonworks Data Platform (HDP)
      • Can be automated with PowerShell and Command Line
      • Empowers organizations with new insights on previously untouched unstructured data, while connecting to the most widely used BI tools on the planet
  • 23. HDINSIGHT ARCHITECTURE
  • 24. DEMO
  • 25. RUNNING A MAP REDUCE JOB
  • 26. USE C# - WORD COUNT
  • 27. CONTINUED..
  • 28. RUN SQL LIKE COMMANDS USING HIVEQL
  • 29. COMMON SCENARIOS
  • 30. SENSOR DATA IN NFL
  • 31. CLICKSTREAM & HEATMAP
  • 32. USING EXCEL TO CONNECT TO HDINSIGHT
  • 33. POWER BI = POWER PIVOT + POWER QUERY + POWER MAP
  • 34. NATURAL LANGUAGE USING POWER BI
  • 35. SUMMARY
      • Growing data, not necessarily structured
      • Storage is really cheap
      • Need systems that enforce structure on read, not on write
      • Don't just validate: analyze and find patterns, perform exploratory analysis, predict outcomes
      • Find ways to make big data simpler for business users; empower them so that the business can make more informed decisions
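The summary's point about enforcing structure on read rather than on write can be illustrated with a small sketch. Raw lines are stored exactly as they arrive, and a schema is applied only when the data is queried. Everything named here (`RAW_STORE`, `write_raw`, `read_with_schema`, the sensor records) is hypothetical, and silently skipping records that do not fit is one possible policy, chosen here for brevity.

```python
import json

RAW_STORE = []  # hypothetical stand-in for raw files landed in distributed storage


def write_raw(line):
    """Schema on write would validate here; schema on read stores the line untouched."""
    RAW_STORE.append(line)


def read_with_schema(lines, schema):
    """Apply `schema` (field name -> converter) at read time, skipping records that do not fit."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {field: convert(record[field]) for field, convert in schema.items()}
        except (ValueError, KeyError, TypeError):
            continue  # malformed or incomplete record: ignored for this query only


write_raw('{"sensor": "s1", "temp": "21.5", "note": "ok"}')
write_raw("not json at all")
write_raw('{"sensor": "s2", "temp": 19}')
write_raw('{"sensor": "s3"}')

rows = list(read_with_schema(RAW_STORE, {"sensor": str, "temp": float}))
print(rows)  # [{'sensor': 's1', 'temp': 21.5}, {'sensor': 's2', 'temp': 19.0}]
```

All four raw lines remain in storage, so a later query with a different schema (for example one that only needs `sensor`) can still make use of records this query skipped.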
  • 36. Data Hadoop Analytics
  • 37. Q&A
  • 38. REFERENCE LINKS http://azure.microsoft.com/bigdata http://www.microsoft.com/powerbi Sign up for a 30-day free trial
  • 39. THANK YOU