Who’s responsible?
Given the magnitude of -omics studies, the responsibility for ensuring that data are valid falls on everyone, says Omenn. He doesn’t let anyone off the hook: Students, postdoctoral fellows, principal investigators, departmental heads, institutional review boards, journal editors and funding agencies all have to take their roles seriously to ensure that data are sound.
In discussing responsibilities, however, points of contention arise. To validate data, researchers need access to data collected by others. What kinds of data should researchers make available? It is important to note, says Robert Chalkley at UCSF, that not every researcher likes the idea of releasing his or her data. It’s not just the risk of scrutiny that alarms these researchers but also the worry that someone else may discover something novel in the data that they missed, which can easily happen in -omics research because the data sets are so large.
But even if researchers see the need for releasing the data, what should they release? It shouldn’t be just raw data, argues Baggerly. He says researchers also should release the algorithms and code behind their bioinformatics tools, as well as the metadata: the information that denotes which samples belonged to which groups and how researchers selected those samples. Baggerly explains that, with -omics information, “The data are subject to several different types of pre-processing … In many of these pre-processing steps, any one of several different algorithms could be employed. There is not yet a consensus as to which one is best.” Because there isn’t a consensus, Baggerly argues, researchers have to state explicitly which algorithms they used.
Then comes the big question: Who should bear the responsibility of collecting, housing and making accessible all those data? In Baggerly’s view, journals should house the bioinformatics scripts through which researchers ran their data sets for a given publication, because those scripts don’t take up much server room. But what about raw -omics data files, which can be gigabytes, or even terabytes, in size?
Raw data access
Access to raw data is a thorny subject. One way to illustrate why is to look at proteomics. “Over the years, [raw] data have never left the laboratory in which they were collected,” explains Bradshaw. “It has been clearly the opinion of a lot of people in the proteomics field, and certainly the opinion of the editors of MCP, that these data need to be put somewhere where they can be interrogated by others.”
Websites like PRIDE collect processed proteomics data. But processed data, as Baggerly and Bradshaw are keen to emphasize, are not the same as the raw data spat out by analytical instruments.
So in 2010, MCP made it mandatory for its authors to deposit their raw data files in a repository designed specifically for the purpose. One example of a raw data repository is TRANCHE (https://proteomecommons.org/tranche/), operated by the laboratory of Philip C. Andrews at the University of Michigan.
“For some time, TRANCHE was basically the only show in town,” says Bradshaw. “The problem was that TRANCHE’s funding line eventually was dependent on a [federal] grant, which ultimately was not renewed.”