
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES

Description
This talk is part of the MIT Program on Information Science brown bag series (http://informatics.mit.edu). It reviews emerging big data sources for social scientific analysis and explores the challenges these present. Many of these sources pose distinct challenges for acquisition, processing, analysis, inference, sharing, and preservation. Dr. Micah Altman is Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, at the Massachusetts Institute of Technology. Dr. Altman is also a Non-Resident Senior Fellow at The Brookings Institution. Prior to arriving at MIT, Dr. Altman served at Harvard University for fifteen years as the Associate Director of the Harvard-MIT Data Center, Archival Director of the Henry A. Murray Archive, and Senior Research Scientist in the Institute for Quantitative Social Sciences. Dr. Altman conducts research in social science, information science and research methods, focusing on the intersections of information, technology, privacy, and politics, and on the dissemination, preservation, reliability and governance of scientific knowledge.
Transcript
  • 1. Sources of Big Data for the Social Sciences. Micah Altman, Director of Research, MIT Libraries. Prepared for the Program on Information Science Brown Bag Series, MIT, August 2015.
  • 2. Roadmap
      - What the @#%&! is “big data”?
      - Two examples of big data in social & health sciences
      - Open questions
      - Potential roles for libraries
      [Diagram: Big Data Challenges: Acquisition, Retention, Analysis, Access]
  • 3. Credits & Disclaimers
  • 4. DISCLAIMER These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators. Secondary disclaimer: “It’s tough to make predictions, especially about the future!” -- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
  • 5. Collaborators & Co-Conspirators
      - Workshop Series Co-Organizers – U.S. Census Bureau: Cavan Capps, Ron Prevost
      - Research Support: Supported by the U.S. Census Bureau
  • 6. Related Work
      Main Project:
      - Census-MIT Big Data Workshop Series: projects.informatics.mit.edu/bigdataworkshops
      Related publications (reprints available from informatics.mit.edu):
      - Altman, M., D. O’Brien, S. Vadhan, A. Wood. 2014. “Big Data Study: Request for Information.”
      - Altman M, Wood A, O'Brien D, Gasser M., Vadhan, S. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Journal of Technology Law. Forthcoming.
      - Altman M, McDonald MP. 2014. Public Participation GIS: The Case of Redistricting. Proceedings of the 47th Annual Hawaii International Conference on Systems Science.
  • 7. Workshops Series: Big Data and Official Statistics
      - Acquisition Challenges: Using New Forms of Information for Official Economic Statistics [August 3-4]
      - Privacy Challenges: Location Confidentiality and Official Surveys [October 5-6]
      - Inference Challenges: Transparency and Inference [December 7-8]
      Expected outcomes:
      - Workshop reports (September, October, December)
      - Integrated white paper (February)
      - Identifying new opportunities for statistical agencies
      - Inform the Census Big Data Research Program
      projects.informatics.mit.edu/bigdataworkshops
  • 8. What the @#%&! is Big Data?
  • 9. Small, Big, Massive & Ginormous
      Data characteristics: the k “V’s” of big data
      - Volume
      - Velocity
      - Variety
      - + Veracity
      - + Variability
      - + …
  • 10. “Big” is in the use, not just the data
      When do challenges of “big” exceed limits of well-selected traditional methods and practices?
      - Data Management – Workflow & Governance Challenge
      - Implementation – Performance Challenges
      - Analysis methods – Inferential Challenges
  • 11. Why pay attention now?
  • 12. Trends and Challenges
      Trends
      - Increasingly data-driven economy
      - Individuals are increasingly mobile
      - Technology changes data uses
      - Stakeholder expectations are changing
      - Agency budgets and staffing remain flat
      The next generation of official statistics
      - Utilize broad sources of information
      - Increase granularity, detail, and timeliness
      - Reduce cost & burden
      - Maintain confidentiality and security
      Multi-disciplinary challenges
      - Computation, Statistics, Informatics, Social Science, Policy
  • 13. Two examples (Good Cop, Bad Cop?)
  • 14. Strategies (and U.S. Debate Strategies)
      More Information
      - Grimmer, Justin, and Gary King. "General purpose computer-assisted clustering and conceptualization." Proceedings of the National Academy of Sciences 108.7 (2011): 2643-2650.
      - King, Gary, Jennifer Pan, and Margaret E Roberts. 2013. “How Censorship in China Allows Government Criticism but Silences Collective Expression.” American Political Science Review 107 (2, May): 1-18. Copy at http://j.mp/LdVXqN
      “Posts with negative, even vitriolic, criticism of the state, its leaders, and its policies are not more likely to be censored… the censorship program is aimed at curtailing collective action by silencing comments that represent, reinforce, or spur social mobilization, regardless of content.”
      Data Source: Social Media Messages
      Data Structure: Network, Unstructured Text, Structured metadata
      Unit of Observation: Individuals; Interactions
      Collection Design: Pure observational
      Desired Inferences
      - Causal inference: what censorship strategies cause observed reaction
      - Inference to Population Frame
      Performance Challenges
      - High volume
      - Complex network structure
      - Scaling bespoke algorithms
      - Sparsity
      - Systematic and sparse metadata
      Management Challenges
      - License
      - Replication
      - Revision Control
      Inferential Challenges
      - Measurement error – extracting topics from text
  • 15. Using Google Searches to Forecast Disease Outbreaks
      More Information
      - Ginsberg, Jeremy, et al. "Detecting influenza epidemics using search engine query data." Nature 457.7232 (2009): 1012-1014.
      - Lazer, David, et al. "The parable of Google Flu: traps in big data analysis." Science 343.14 March (2014).
      “Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.
      Data Source: Google search queries
      Data Structure: Quasi-tabular, structured metadata and unstructured text
      Unit of Observation: Interactions with a system
      Collection Design: Pure observational
      Desired Inferences
      - Predictive inference: where will flu clusters appear next; short-term (nearcasting); small-area (fine-spatial granularity)
      - Inference to general population
      Performance Challenges
      - Streaming algorithms
      Management Challenges
      - Replication
      - Transparency
      - Variability
      Inferential Challenges
      - External Validity
      - Measurement error – extracting topics from text
      - Overfitting
      - Sampling
  • 16. Comparing Cases: Chinese Censorship vs. Flu Prediction
      Data Source
      - Chinese Censorship: Social Media Messages
      - Flu Prediction: Google search queries
      Data Structure
      - Chinese Censorship: Network, Unstructured Text, Structured metadata
      - Flu Prediction: Quasi-tabular, structured metadata and unstructured text
      Unit of Observation
      - Chinese Censorship: Individuals; Interactions
      - Flu Prediction: Interactions with a system
      Collection Design
      - Chinese Censorship: Pure observational
      - Flu Prediction: Pure observational
      Desired Inferences
      - Chinese Censorship: Causal inference (what censorship strategies cause observed reaction); Inference to Population Frame
      - Flu Prediction: Predictive inference (where will flu clusters appear next; short-term nearcasting; small-area, fine-spatial granularity); Inference to general population
      Performance Challenges
      - Chinese Censorship: High volume; Complex network structure; Scaling bespoke algorithms; Sparsity; Systematic and sparse metadata
      - Flu Prediction: Streaming algorithms
      Management Challenges
      - Chinese Censorship: License; Replication; Revision Control
      - Flu Prediction: Replication; Transparency; Variability
      Inferential Challenges
      - Chinese Censorship: Measurement error – extracting topics from text
      - Flu Prediction: External Validity; Measurement error – extracting topics from text; Overfitting; Sampling
  • 17. Why is dealing with big data hard?
  • 18. Big Data Challenges
      - Acquisition: Sources, Incentives, Quality, Provenance
      - Retention: Change Management, Integration, Security, Storage
      - Analysis: Bias, Causation, Computation, Visualization
      - Access: Transparency, Reproducibility, Durable Access (Preservation), Confidentiality
      Challenges of Big Data
  • 19. Acquisition Challenges: Quality, Provenance, Sources
  • 20. Some Sources of Economic Information
      - Smartphone sensors – GPS +
      - Vehicle systems
      - IoT – smart thermostats, fire alarms
      - Transactions – online, internal
      - Search behavior – search engine queries
      - Social media – Twitter, Facebook, LinkedIn
      - Imagery – satellite, thermal, video
      - …
  • 21. Source Characteristics
      - Unit of Observation: Location, virtual service, communication network, individual
      - Context: Behavior, transaction, environment, statement
      - Measure characteristics: Measure scale; Measure structure; Accuracy, precision
      - Frame & Sample characteristics
  • 22. Analysis Challenges: Bias, Computation, Causation, Integration
  • 23. Some Potential Sources of Analysis Error
      [Diagram labels: Target Population; Frame; Selection; Super Population; Laws (structures) λ; β (generates); Parameters]
      - Selection bias
      - Frame uncertainty
      - Measurement error
      - Unknown measurement semantics
      - Non-independence of measures
      - Non-independence of samples
      - Model uncertainty
      - Unknown causal structure
      - Shift in measurements, samples, frames
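The selection-bias item on this slide can be illustrated with a small simulation. This is a minimal sketch, not taken from the talk: the population, the `inclusion_probability` rule, and the sample sizes are all hypothetical. It compares a large "found" sample, in which records with higher values are more likely to be captured, against a small designed random sample.

```python
import random
from statistics import fmean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 values centred on 50.
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
true_mean = fmean(population)


def inclusion_probability(x):
    """Assumed selection rule: larger values are more likely to be recorded."""
    return min(1.0, max(0.0, (x - 20.0) / 60.0))


# "Found" data: very large, but selected with probability tied to the value.
found = [x for x in population if rng.random() < inclusion_probability(x)]

# Designed data: small simple random sample from the same frame.
designed = rng.sample(population, 500)

found_error = abs(fmean(found) - true_mean)
designed_error = abs(fmean(designed) - true_mean)

print(f"found:    n={len(found):6d}  abs. error={found_error:.2f}")
print(f"designed: n={len(designed):6d}  abs. error={designed_error:.2f}")
```

Under these assumptions the found sample is roughly a hundred times larger, yet its estimate of the mean is further from the truth, because adding more records does not remove a bias built into how records are selected.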
  • 24. Access Challenge: Data Repeatability, Transparency, Preservation
  • 25. Many Initiatives to Improve Scientific Reliability
      - Retraction monitoring
      - Data citation
      - Clinical trial preregistration
      - Registered replication
      - Open data
      - Badges
  • 26. Some Types of Reproducibility Issues
      - Fraud
      - Misconduct
      - Negligence
      - Bit Rot
      - Versioning problem
      - Replication
      - Reproduction
      - Extension
      - Result Validation
      - Fact Checking
      - Calibration, Extension, Reuse
      - Underreporting
      - Data Dredging
      - Multiple Comparisons, P-Hacking
      - Sensitivity, Robustness
      - Reliability
      - Generalizability
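The multiple-comparisons item on this slide can be made concrete with a short simulation. This is a sketch under stated assumptions, not material from the talk: every hypothetical "study" runs 20 independent two-sided z-tests on pure noise with known unit variance, so any "significant" result is a false positive. The function names and parameter values are illustrative.

```python
import random
from statistics import NormalDist, fmean


def null_p_value(rng, n=20):
    """Two-sided z-test p-value for a sample drawn under the null (mean 0, sd 1)."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = fmean(sample) * n ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def familywise_error_rate(tests_per_study=20, studies=1000, alpha=0.05, seed=1):
    """Share of simulated studies reporting at least one false positive."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        if any(null_p_value(rng) < alpha for _ in range(tests_per_study)):
            hits += 1
    return hits / studies


simulated = familywise_error_rate()
expected = 1 - (1 - 0.05) ** 20  # analytic rate for 20 independent tests, about 0.64

print(f"simulated: {simulated:.3f}  analytic: {expected:.3f}")
```

With 20 uncorrected tests, roughly two studies in three would be expected to report at least one spurious finding, which is why data dredging across many measures undermines reliability.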
  • 27. Ensuring Repeatability & Transparency
      [Diagram labels]
      - Theory (Rules, Entities, Concepts)
      - Algorithms (Protocol, Operationalization)
      - Implementations (Software, Coding Rules, Instrumentation, Instrumentation Design)
      - Executions (Deployment, House Survey Style, Equipment Setting, Operating System, Hardware, Starting Values, PRNG seeds)
      - Data: Structure, Formats, Versions/Revisions, Selections, Integrations, Instantiations (copies)
      - Execution Context (weather, compiler, operating system, system load)
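The slide lists PRNG seeds and execution context among the things that must be captured for a result to be repeatable. A minimal sketch of that idea follows, assuming a hypothetical `run_analysis` function that stands in for any stochastic analysis step; the seed values are arbitrary.

```python
import platform
import random
import sys


def run_analysis(seed, n=5):
    """Run a toy stochastic 'analysis' and record what is needed to repeat it."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    draws = [rng.random() for _ in range(n)]
    return {
        "seed": seed,                        # starting value for the PRNG
        "python": sys.version.split()[0],    # interpreter version
        "platform": platform.platform(),     # operating system and hardware string
        "draws": draws,
    }


first = run_analysis(20150801)
second = run_analysis(20150801)
other = run_analysis(7)

print(first["draws"] == second["draws"])  # same seed, same draws
print(first["draws"] == other["draws"])   # different seed, different draws
```

Storing the seed and context alongside the output lets a later reader re-run the step in the same interpreter and check that the numbers match; across different software versions the stored context at least documents what might differ.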
  • 28. Access Challenge: Data Confidentiality, Security
  • 29. Durable, Long-Term Access
      Why durable access?
      - The rule of law requires maintaining authentic public records
      - Scientific advances rely on a cumulative, traceable evidence base
      - Art, history, culture require durable access to national heritage information
      - Our nation needs durable access to a strategic information reserve
      - Humanity needs durable long-term access to information in order to communicate to future generations
      Big data challenges to durability
      - Velocity – information is updated, sometimes overwritten
      - Many sources are commercial/private – not routinely archived, preserved
      - Modeling future value of information
      - Maintaining privacy and confidentiality
  • 30. Big data challenges…
      Anonymization can completely destroy utility
      - The “Netflix Problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]
      Observable Behavior Leaves Unique “Fingerprints”
      - The “GIS” problem: fine geo-spatial-temporal data are impossible to mask when correlated with external data [Zimmerman 2008]
      Big Data can be Rich, Messy & Surprising
      - The “Facebook Problem”: Possible to identify masked network data, if only a few nodes are controlled [Backstrom et al. 2007]
      - The “Blog Problem”: Pseudonymous communication can be linked through textual analysis [Novak et al. 2004]
      Source: [Calabrese 2008; Real Time Rome Project 2007]
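The linkage idea behind the “Netflix Problem” can be sketched on toy data. This is a deliberately simplified illustration, not the algorithm from the cited paper: it scores each released record by the number of exact (item, rating) matches with a small public profile and accepts the best match only when it clearly beats the runner-up. All identifiers, ratings, and the `min_gap` threshold are hypothetical.

```python
def similarity(aux, record):
    """Count items whose rating is identical in both records."""
    return sum(1 for item, rating in aux.items() if record.get(item) == rating)


def link(aux, released, min_gap=2):
    """Return the pseudonym that best matches aux, or None if the match is ambiguous."""
    scored = sorted(
        ((similarity(aux, rec), pid) for pid, rec in released.items()),
        reverse=True,
    )
    best_score, best_pid = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0
    return best_pid if best_score - runner_up >= min_gap else None


# Hypothetical "anonymized" release: pseudonym -> sparse ratings.
released = {
    "u1": {"film_a": 5, "film_b": 2, "film_c": 4, "film_d": 1},
    "u2": {"film_a": 5, "film_e": 3, "film_f": 4},
    "u3": {"film_b": 1, "film_c": 4, "film_g": 5},
}

# Hypothetical public information about one known person.
public_profile = {"film_a": 5, "film_b": 2, "film_d": 1}

match = link(public_profile, released)
ambiguous = link({"film_a": 5}, released)

print(match)      # u1
print(ambiguous)  # None
```

Because each record is sparse, three known ratings are enough to single out one pseudonym here, while a single popular rating is not. Removing names from such data therefore does not by itself prevent re-identification.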
  • 31. Little Data in a Big World
      - The “Favorite Ice Cream” problem: public information that is not risky can help us learn information that is risky
      - The “Doesn’t Stay in Vegas” problem: information shared locally can be found anywhere
      - The “Unintended Algorithmic Discrimination” problem: algorithms are often not transparent, and can amplify human biases
  • 32. Categorizing Challenges
      Implementation – Performance Challenges
      - Systems challenges: exceed capacity of locally managed storage; location and migration of data becomes critical for performance; standard backup, recovery and data integrity mechanisms ineffective; communication bandwidth
      - Algorithmic challenges: “in core” vs. “out-of-core” implementations; O(N^2) vs. O(log n) complexity; static vs. streaming algorithms; serial vs. massively parallel; distributed – shared-nothing algorithms
      Analysis methods – Inferential Challenges
      - Sources: Designed vs. “found” data
      - Model-based vs. data-based
      - Causal inference vs. descriptive/predictive (forecasting) inference
      Data Management & Workflow
      - Provenance
      - Data quality
      - Change management
      - Continuous integration
      - Accommodating variety – semantics, quality
      - Transparency and reproducibility
      - Privacy
      - Security
      Data Governance and Policy
      - Standards
      - Incentives
      - Certifications
      - Regulation
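To illustrate the "static vs. streaming" and "in core vs. out-of-core" contrast on this slide, the sketch below computes a mean and sample variance in a single pass with constant memory, using Welford's online update. It is one well-known streaming technique chosen for illustration; the talk does not name a specific algorithm, and the class name and input values are hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary; O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far (0.0 for fewer than 2 values)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
# The list stands in for a stream too large to hold in memory at once.
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.n, stats.mean, round(stats.variance, 3))
```

A static implementation would load every value before computing anything. The streaming version keeps only three numbers, so it can consume records as they arrive and never needs the full data set in core.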
  • 33. Some Open Questions About Data Sources
  • 34. Preliminary Observations from First Workshop
      Topic: Sources of Economic Big Data
      Use Case: Commodity Flow Survey
      Observations:
      - Different classes of decisions require different sources of data
          - E.g. much designed survey data contributes baseline data for decisions about infrastructure and strategic planning
          - Transaction-based big data could contribute frequency and granularity of estimates
      - In big data, data sources are stakeholders
          - Businesses need to react quickly and predict the future, and need frequently updated detailed data
          - Critical to provide a value proposition to business
          - Critical to develop a trust relationship
      - Some potential sources
          - ERP and DRP operations data
          - EDI
          - Mobile Phone
          - Traffic Data
  • 35. Some Non-Technical Questions About Sources
      - Who are the key stakeholders in a big data source, and what are the key stakeholder incentives?
          - What key decisions does this information support for stakeholders?
          - What are the gaps in data from the stakeholder perspective?
          - What are the barriers associated with new sources of information? Legal barriers; economic barriers; social/trust barriers
  • 36. Potential Roles for Libraries
  • 37. Potential Roles -- Infrastructure
      Dissemination
      - Catalog range of new statistics/indicators, sources
      - Selection based on quality
      - Guide proper use
      Durability
      - Ensure long-term accessibility of big data
      - Manage provenance, versioning
      - Provide transparency of new indicators/statistics
      Security & Confidentiality
      - Libraries could be a trusted and accountable 3rd party
      - Store and integrate data from multiple sources
      - Could develop expert implementation of privacy best practices
  • 38. Potential Roles -- Leadership
      Advocacy
      - Advocate for quality, transparency, replication, durable access
      Standardization
      - Develop new methods for big data management
      - Identify “best practices” for replication, transparency, long-term access
      - Standardize licenses for reuse, preservation
  • 39. Additional References
      - Einav, Liran, and Jonathan Levin. "Economics in the age of big data." Science 346.6210 (2014): 1243089. http://www.sciencemag.org/content/346/6210/1243089.short
      - Varian, Hal R. "Big data: New tricks for econometrics." The Journal of Economic Perspectives 28.2 (2014): 3-27. http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf
      - Reimsbach-Kounatze, C. (2015), “The Proliferation of “Big Data” and Implications for Official Statistics and Statistical Agencies: A Preliminary Analysis”, OECD Digital Economy Papers, No. 245, OECD Publishing. http://dx.doi.org/10.1787/5js7t9wqzvg8-en
      - Kriger, David S., et al. Freight Transportation Surveys. Vol. 410. Transportation Research Board, 2011. http://www.nap.edu/catalog/13627/nchrp-synthesis-410-freight-transportation-surveys
  • 40. Questions? E-mail: escience@mit.edu Web: informatics.mit.edu
  • 41. Creative Commons License: This work, Managing Confidential information in research, by Micah Altman (http://redistricting.info), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.