Data Curation

The rapid growth in the scale of research data has vastly increased the challenge of managing data effectively to enable analysis and to drive discovery. Data curation, a lifecycle approach to data management, helps ensure data integrity, adds value, and facilitates reuse. ODAI is developing recommendations for a data curation program that will allow Yale, over time, to leverage shared infrastructure to drive common practices in data curation that support data management in active research, publication, citation, access, and reuse. 

Institutional Infrastructure for Data-Intensive Scientific Computing

Now that Yale researchers are generating terabytes of data per experiment, they are finding it increasingly difficult to manage this data. In addition, the University is amassing petabytes of data on its centrally supported systems, so the need to rationalize and modularize services and to establish a shared framework that can scale exponentially is becoming increasingly evident.

The emerging data curation program at Yale is based upon an appreciation for differences in methodologies and practices that help determine the longevity and value of information across disciplines. The purpose of the program is to leverage shared infrastructure to drive common practices in data curation, while respecting domain differences, in order to make more useful data available for research.

The following initial steps were undertaken to establish the need for the program:

  1. Research Data Task Force (2009-2010):
    The task force conducted interviews with 34 faculty members and developed recommendations addressing a range of needs, from a consultation center on intellectual property to digital repository services.
  2. Data Management Overview, Data Management Resources, and Data Management Plan Examples
  3. NSF Data Management Plan Resource (2010):
    These web pages provide guidance on the NSF Data Management Plan requirement, as well as examples, links, and referrals.
  4. Data Assessments and Use Cases (2010-2011):
    A series of in-depth interviews was conducted with nine research groups in the sciences and engineering and three research groups in the social sciences, arts, and humanities to identify the stages of data generated in each group's research areas and the requirements and issues associated with each stage. The findings from the assessments were used to develop cross-cutting use cases illustrating recurrent issues and themes.
  5. Mass Storage Working Group (2010-2011):
    The primary deliverable of this working group, sponsored by the Deputy Provost for Science and Technology, is a set of recommendations to the Scientific Computing Strategic Planning Advisory Committee on infrastructure required to support data-intensive research at Yale.

Conclusion

The University is taking steps to develop world-class infrastructure to support research at Yale and to broaden its impact. An emerging data curation program is a vital part of this effort. A mature program should yield considerable benefits to Yale researchers by increasing the competitiveness of their grant proposals, freeing up more of their time to do research, and making more of their data available for citation and reuse. The University will benefit from higher benefit-cost ratios for research-related expenditures, greater competitiveness in recruiting and retaining faculty, and reduced risk of noncompliance with funders' mandates.

To find out more about data curation at Yale, please contact the data management planning consultation group.