Secondary analysis of publicly available omics data across almost 3 million publications
Nicholas Darci-Maher (0), Serghei Mangul (0), Kerui Peng (1), Dat Duong (0), Eleazar Eskin (0), Jaqueline Brito (1), Anushka Rajesh (1), Andrew Smith (1), Reid F. Thompson (2), Abhinav Nellore (2), Casey Greene (3), Jonathan Jacobs (4)
(0) University of California, Los Angeles
(1) University of Southern California
(2) Oregon Health & Science University
(3) University of Pennsylvania
(4) QIAGEN Digital Insights
Find me on Wed Nov 25th, 1:30-2:50pm AEDT in Remo, table 48
Abstract
As today’s high-throughput sequencing techniques become increasingly affordable and accurate, the number of publicly available omics datasets is growing rapidly. Bioinformatics methods provide unprecedented opportunities to analyze omics datasets in quantitative biological research. Traditionally, such research has relied on primary analysis of novel omics data generated as part of the study. However, these data can be reused and are often valuable beyond the scope of the study that introduced them. Data-driven research based on secondary analysis of existing datasets is becoming more important. The increased availability of public omics data represents an opportunity to find novel insights and make discoveries across different datasets.
This study presents a quantitative analysis of the reusability of omics datasets in two online repositories, the Sequence Read Archive (SRA) and the Gene Expression Omnibus (GEO). We downloaded over 2.5 million publications from the PubMed Central Open Access corpus and identified those that referenced SRA or GEO datasets. We used these papers to examine reusability based on various factors, including journal, repository, sequencing technology, and species. We find that most datasets are never reused: they are mentioned once in the study that introduced them and never referenced again. In recent years, however, data reuse has been rising. We aim to shed light on the landscape of data sharing in the quantitative biology research community and to illuminate the benefits of secondary analysis of omics data.
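The identification step described above can be illustrated with a minimal sketch. It is not the pipeline used in the study. It assumes that datasets are recognized by accession strings in the paper text (here only GEO series of the form GSE plus digits and SRA studies of the form SRP plus digits), and that a dataset counts as reused when more than one paper mentions it. The function names, the paper IDs, and the toy corpus are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: only GEO series (GSE + digits) and SRA study (SRP + digits)
# accessions are matched; a real analysis would likely cover more types.
ACCESSION_RE = re.compile(r"\b(GSE\d+|SRP\d+)\b")


def find_accessions(text):
    """Return the set of distinct accession IDs mentioned in one paper's text."""
    return set(ACCESSION_RE.findall(text))


def count_mentions(papers):
    """Map each accession to the set of paper IDs that mention it."""
    mentions = defaultdict(set)
    for paper_id, text in papers.items():
        for accession in find_accessions(text):
            mentions[accession].add(paper_id)
    return mentions


def reuse_summary(mentions):
    """Split accessions into those mentioned by one paper and by several papers."""
    reused = {acc for acc, ids in mentions.items() if len(ids) > 1}
    single = set(mentions) - reused
    return single, reused


# Hypothetical toy corpus standing in for full-text articles.
papers = {
    "PMC_A": "Data were deposited in GEO under GSE12345.",
    "PMC_B": "We reanalyzed GSE12345 together with SRP067890.",
    "PMC_C": "This paper mentions no accession.",
}

mentions = count_mentions(papers)
single, reused = reuse_summary(mentions)
```

In this toy corpus, GSE12345 appears in two papers and is treated as reused, while SRP067890 appears in only one. Counting a second mention as reuse is a simplification: it does not distinguish a genuine reanalysis from a passing citation.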