Presentation by Tony Hey

On March 23, 2011, Steve Girvin, Deputy Provost for Science and Technology, and Meg Bellinger, Director of the Office of Digital Assets and Infrastructure (ODAI), hosted Tony Hey, Corporate Vice President of Microsoft Research Connections, for a presentation on The Fourth Paradigm: Data-Intensive Scientific Discovery. Before joining Microsoft, Hey served as director of the U.K.’s e-Science Initiative and as head of the School of Electronics and Computer Science and dean of Engineering and Applied Science at the University of Southampton. Hey is well known in parallel computing for his work on the initial draft of the specification for the Message Passing Interface (MPI) standard. Hey was accompanied by Lee Dirks, Director of Education and Scholarly Communication in Microsoft’s External Research division.

The day’s events included a roundtable on data curation and management, meetings with faculty in Computer Science, and a working lunch with faculty in Math, Applied Math, Physics, and Computer Science. Attending the morning roundtable were faculty and researchers from Physics, Computer Science, Bioinformatics, and Molecular, Cellular, and Developmental Biology, along with staff from ODAI, Information Technology Services, and the Yale University Library. The discussion covered a range of topics. Sequencing output at the Yale Center for Genome Analysis, which began operations in January 2010 and hit its stride over the summer, has more than doubled since November, when several new sequencers were brought online, and is now approaching 10 trillion bases per month. Other points of discussion included the need for computing clusters in close proximity to data; trade-offs between ease of use and performance in research programming; cloud computing as an enabler of the culture of collaboration and openness in genomics (a culture that began with the Human Genome Project, when collaboration was a necessity); the popularization of data-intensive science; and the concomitant challenge of developing computational techniques and infrastructure to manage and make sense of growing volumes of data.

Hey’s afternoon presentation in Sudler Auditorium touched upon the advances in scientific instrumentation powering the data deluge (e.g., the use of sensor networks in seismography); the relative decline in the amount of scientific data that is publicly accessible and its implications for future research (i.e., for reproducibility and reuse); and the changes in policy, technology, and culture needed to disseminate results, reduce repetition in research, and increase the efficiency of funding.