
Liberating Public Data

Rich Context

The Rich Context project is led by Julia Lane of the Coleridge Initiative (New York University) and Brian Granger of Jupyter (Cal State – San Luis Obispo). It addresses a critically important problem associated with massive amounts of research data: how can we find out how the data have been used, for what measurement, in what fields, by what researchers, with what code, and with what results? Manually reading through 80 million research publications is not the answer; automation and machine learning are. The project has three parts. The first is a competition in which 20 teams use text analysis and machine learning techniques on a series of pre-processed publication corpora to develop models for identifying the datasets, people, and research methods referenced in each publication. The second is to apply these machine learning models to a broader set of publications to validate the results, and then to iterate on the most promising models to improve the learning algorithms. The third is to use gamification approaches to a) incentivize human curation of the results so that patterns can be identified and b) incentivize humans to contribute new tacit knowledge that was hitherto not routinely shared. The approach is iterative, and we will develop a platform for collecting human feedback to improve algorithmic learning about the use of datasets.
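To make the task in the first part concrete, here is a minimal sketch of what "identifying the datasets referenced in a publication" could look like in its simplest form. It is a naive dictionary-matching baseline, not the method of any competing team, and the catalog `DATASET_ALIASES` and the function names are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical catalog mapping a dataset name to the strings it may appear as.
# These entries are illustrative assumptions, not the competition's data.
DATASET_ALIASES = {
    "American Community Survey": ["American Community Survey", "ACS"],
    "Current Population Survey": ["Current Population Survey", "CPS"],
}


def find_dataset_mentions(text, aliases=DATASET_ALIASES):
    """Return the set of catalog dataset names mentioned in `text`."""
    found = set()
    for name, variants in aliases.items():
        for variant in variants:
            # Whole-word, case-sensitive match, so that short acronyms
            # are not matched inside ordinary words.
            if re.search(r"\b" + re.escape(variant) + r"\b", text):
                found.add(name)
                break
    return found


def recall(predicted, actual):
    """Fraction of the truly referenced datasets that were identified."""
    if not actual:
        return 0.0
    return len(predicted & actual) / len(actual)
```

A baseline like this only finds datasets that are already in the catalog and are named in a known form, which is one reason the project turns to machine learning models and human feedback.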

This project is now on a clear path to expanding the community interested in solving a very hard problem facing empirical researchers and government analysts: for a given dataset, who else has worked with the data, on what topics, and with what results?

The first phase of the project, the Rich Context Competition, was won by the Allen Institute for Artificial Intelligence with a model that correctly identifies over 50% of named datasets in a corpus of publications. The competition attracted 20 teams from around the world, and four finalists presented their results at NYU on February 15, 2019 (see the agenda and video here).

The impetus for this project was the Commission on Evidence-Based Policymaking. The investigators were asked by the Census Bureau to support their decision-making by building an Administrative Data Research Facility that supports the integration of confidential government records. The facility has been featured on the FedRAMP Marketplace and won a 2018 award for government innovation. The construction of that data facility has been complemented with data science training that supports the Federal Data Strategy.

The Rich Context approach has also been presented in keynote addresses at multiple social science conferences, including: