MIT BigData Living Lab - a testbed for innovation at MIT

A Testbed for Data Innovation at MIT

Why a living lab for data?

Exploring technical issues and social implications of big data

  • Impacts and benefits of big data with a plethora of new applications
  • Large scale access control, data integration, data governance, analytics, and visualization
  • Understanding incentives and drivers of data collection
  • Demonstrating new approaches to managing data privacy

Leading efforts to safely develop and use big data

  • Enabling innovation and ownership by providing members of the MIT community appropriate access to their own data
  • Developing and demonstrating organizational best practices for collecting and managing information
  • Demonstrating systems that provide useful services to the MIT Community
  • Architecting services so that they are extensible and adaptable by industry and others

The Challenge

A key issue today is that data is siloed, whether it’s personal data, data inside an organization, or data shared across different organizations. Data discovery and integration are difficult and present complex technical, organizational, and policy challenges. A Living Lab allows MIT to be a microcosm for many big data efforts, whether in government or in industry. One of our goals is to work with MIT in opening up repositories of information on campus that contain the data needed to discover valuable new insights about important topics such as wellness, innovation, learning, and sustainability. MIT is well positioned to take a leadership role in demonstrating not only how organizations can leverage data in the future, but also how we collect, manage, and use personal information, from setting appropriate privacy policies to demonstrating systems that can implement them in practice.

Research

The Living Lab is developing a scalable data management platform that allows us to collect and integrate multiple types of data, including personal data or “small data” (collected by smartphones, activity tracking devices, or new wearable sensors); MIT data (wifi data, campus maps, event data, etc.); and external data (social media data, transportation data, weather, city data, etc.).

ModelDB


Developed by the Database Group at CSAIL, led by Vartak, Madden, et al., ModelDB is an end-to-end system that tracks machine learning models as they are built, extracts and stores relevant metadata for each model (e.g., hyperparameters, data sources), and makes this data available for easy querying and visualization.

The codebase and instructions for getting set up are available for public use, and a short paper from the HILDA workshop, SIGMOD 2016, is available for reading.
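As a rough illustration of the idea of logging model metadata and querying it later, here is a minimal in-memory sketch. It is not ModelDB's API: the `ModelRecord` and `ModelRegistry` names, the field layout, and the accuracy figures are all hypothetical and made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ModelRecord:
    """Metadata captured for one trained model (hypothetical schema)."""
    name: str
    hyperparameters: Dict[str, Any]
    data_source: str
    metrics: Dict[str, float] = field(default_factory=dict)


class ModelRegistry:
    """In-memory stand-in for a model metadata store (not ModelDB itself)."""

    def __init__(self) -> None:
        self._records: List[ModelRecord] = []

    def log(self, record: ModelRecord) -> None:
        # Record the model's metadata at the time it is built.
        self._records.append(record)

    def query(self, predicate: Callable[[ModelRecord], bool]) -> List[ModelRecord]:
        # Return every logged model that satisfies the predicate.
        return [r for r in self._records if predicate(r)]

    def best(self, metric: str) -> ModelRecord:
        # Return the model with the highest value for the given metric.
        scored = [r for r in self._records if metric in r.metrics]
        return max(scored, key=lambda r: r.metrics[metric])


# Illustrative entries; names and numbers are invented.
registry = ModelRegistry()
registry.log(ModelRecord("logreg-a", {"C": 1.0}, "train_v1.csv", {"accuracy": 0.81}))
registry.log(ModelRecord("logreg-b", {"C": 0.1}, "train_v1.csv", {"accuracy": 0.84}))
registry.log(ModelRecord("tree-a", {"max_depth": 5}, "train_v2.csv", {"accuracy": 0.79}))

on_v1 = registry.query(lambda r: r.data_source == "train_v1.csv")
print([r.name for r in on_v1])
print(registry.best("accuracy").name)
```

Keeping hyperparameters, data source, and metrics in one record is what makes questions like "which models were trained on this dataset?" a simple filter rather than a search through notebooks.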

Aurum

Developed by the Database Group at CSAIL, led by Castro-Fernandez, Madden, et al., Aurum is a system for tackling data discovery problems at scale. It introduces a new discovery algebra, R2QL, that lets users declare their intuition of what is relevant through a set of data primitives that expose the relations in the underlying data. The algebra relies on a metaschema graph to answer queries in human-scale latencies. Furthermore, Aurum is scalable: it builds the metaschema graph in linear time, despite the difficulty of extracting complex relationships among thousands of data sources.

Aurum’s codebase is available for public use. A position paper is available from the ACM. You can test drive Aurum on the State of Massachusetts open data.
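To make the notion of a graph of column relationships with discovery primitives more concrete, here is a small sketch. It is not Aurum's code or R2QL syntax: the toy tables, the Jaccard threshold, and the `similar_columns` and `joinable_tables` primitives are assumptions for illustration. The sketch also compares every pair of columns, so it is quadratic rather than linear in the number of columns.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Column = Tuple[str, str]  # (table name, column name)

# Toy data sources; contents are invented for illustration.
tables: Dict[str, Dict[str, List[str]]] = {
    "students": {"id": ["s1", "s2", "s3"], "dorm": ["east", "west", "east"]},
    "enrollment": {"student_id": ["s1", "s2", "s3"], "course": ["6.01", "6.02", "6.01"]},
    "buildings": {"name": ["east", "west", "north"], "year": ["1950", "1970", "2004"]},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two value sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(threshold: float = 0.5) -> Dict[Column, Set[Column]]:
    """Link columns from different tables whose values overlap enough."""
    columns = {(t, c): set(v) for t, cols in tables.items() for c, v in cols.items()}
    graph: Dict[Column, Set[Column]] = {col: set() for col in columns}
    for (c1, v1), (c2, v2) in combinations(columns.items(), 2):
        if c1[0] != c2[0] and jaccard(v1, v2) >= threshold:
            graph[c1].add(c2)
            graph[c2].add(c1)
    return graph


def similar_columns(graph: Dict[Column, Set[Column]], column: Column) -> List[Column]:
    """Primitive: columns whose content resembles the given column."""
    return sorted(graph[column])


def joinable_tables(graph: Dict[Column, Set[Column]], table: str) -> List[str]:
    """Primitive: tables reachable from any column of the given table."""
    return sorted({n[0] for col, nbrs in graph.items() if col[0] == table for n in nbrs})


graph = build_graph()
print(similar_columns(graph, ("students", "dorm")))
print(joinable_tables(graph, "students"))
```

Because the relationships are computed once and stored as edges, each primitive is a graph lookup rather than a scan over the underlying data.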

BigDAWG


Developed by Lincoln Labs, led by Vijay Gadepally et al., the BigDAWG polystore is a federated database system for multiple, disparate data models. It supports the notions of location transparency and semantic completeness through islands of information, each of which supports a data model, a query language, and a candidate set of database engines. A prototype of the BigDAWG system has shown great promise when applied to diverse medical data.

A poster and paper are available for viewing.
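The island idea can be sketched as a thin routing layer: the caller names an island, and the system chooses one of that island's candidate engines. The code below is a simplified illustration under stated assumptions, not BigDAWG's interface: `Island`, `Polystore`, `fake_engine`, the engine names, the `patients` table, and the alphabetical engine-selection rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Engine = Callable[[str], str]  # takes a query string, returns a result string


@dataclass
class Island:
    """A data model, its query language, and its candidate engines."""
    data_model: str
    query_language: str
    engines: Dict[str, Engine]


class Polystore:
    """Routes a query to an engine; the caller only names the island."""

    def __init__(self) -> None:
        self._islands: Dict[str, Island] = {}

    def register(self, name: str, island: Island) -> None:
        self._islands[name] = island

    def execute(self, island_name: str, query: str) -> str:
        island = self._islands[island_name]
        # Placeholder policy: pick the alphabetically first candidate engine.
        engine_name = sorted(island.engines)[0]
        return island.engines[engine_name](query)


def fake_engine(label: str) -> Engine:
    """Stand-in for a real database engine; it only echoes the query."""
    return lambda query: f"{label} executed: {query}"


store = Polystore()
store.register("relational", Island("tables", "SQL", {
    "row_store": fake_engine("row_store"),
    "column_store": fake_engine("column_store"),
}))
store.register("array", Island("arrays", "array query language", {
    "array_store": fake_engine("array_store"),
}))

result = store.execute("relational", "SELECT count(*) FROM patients")
print(result)
```

The caller never states where the data lives, which is the sense of location transparency illustrated here; a real system would choose the engine based on data placement and cost rather than name order.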

DataHub

Developed by the Database Group at CSAIL, led by Madden et al., DataHub is a scalable, hosted platform for organizing, managing, sharing, collaborating on, and making sense of data. Think of it as a mashup of GitHub and PostgreSQL, accessible through your web browser. It provides an efficient platform and easy-to-use tools and interfaces for:

  • Publishing your own data (hosting, sharing, collaboration)
  • Editing and deleting your own data
  • Using others’ data (querying, linking)
  • Making sense of data (analysis, visualization)

DataHub’s documentation, codebase, and API are available for public use. The platform allows testing of new frameworks and applications for collecting and managing personal data, including the Open Personal Data Store (OpenPDS) architecture. Publications are viewable on the DataHub site.

OpenPDS

Developed by the Human Dynamics Group at the MIT Media Lab, led by Pentland et al., OpenPDS gives users control over how applications use their data. It:

  • stores data in a user-specified location
  • provides a secure computation space in which third-party applications interact with the data
  • allows users to audit when and how applications have used their data
  • allows users meaningful control over how their data is used by different applications

Our goal in building these platforms is to enable researchers and students to dream up and run new data-driven applications and projects at MIT. Some examples are below:

Wellness and Health

  • How can we use sensor data from smart phones and wearable devices to better measure and promote wellness on campus?
  • Can we help people track and manage chronic conditions?
  • Is sleep or exercise correlated with student performance?
  • How do students’ activity levels change over the course of a semester?
  • Can we identify patterns in the spread of the flu on campus and predict flu outbreaks?
  • What motivates students to participate in tracking their wellness?

Social Patterns and Human Behavior

  • How much do people in different departments, labs and organizations co-mingle? What are the informal social relationships between different groups on campus?
  • What data is useful in predicting which parts of the campus are under/over utilized? What are patterns in where people congregate?
  • How can we use data to better understand collaboration and innovation at MIT?

Transportation and Movement

  • What are patterns in how people get to/from campus, when, and via what routes? Can we predict these patterns and offer better on-demand, dynamic services?
  • Which parts of campus are under/over utilized?
  • What are the “traffic” patterns of how people move around campus? What factors most impact patterns of movement on campus? What can we learn about campus safety?
  • How might this inform long term facilities and campus planning?

Aggregating diverse data allows us to combine disparate data types and derive patterns from them. Even analyzing aggregate, anonymized data can reveal valuable new insights about trends and patterns within our community on campus.

Our Team

Faculty and Staff

David Karger

PI - MIT, CSAIL

Justin Anderson

Senior Programmer

Steve Buckley

Director

Prof. Sam Madden

PI - MIT, CSAIL

Prof. Alex “Sandy” Pentland

PI - MIT, Media Lab

Albert Carter

Programmer

Students

Kelly Zhang

Database MEng Student

Guy Zyskind

Media Lab Graduate Student

Brandon Carter

Undergraduate Research Student

Kimberly Toy

MEng Graduate Student

Contact

32 Vassar St, Building 32G-887, Cambridge, MA 02139, USA

+1 (617) 715-2282
