Why aren’t we sharing?
In April 2016, Vice President Joe Biden addressed the American Association for Cancer Research (AACR) annual meeting in New Orleans to expand on the US Government’s ‘moonshot’ for cancer. In a wide-ranging and heartfelt speech about the billion-dollar initiative, he chose to make the need for greater sharing of research data one of the key themes of his talk.
What once might have been dismissed by some as a bureaucratic distraction from getting on with the business of science, or even feared as a threat to future work, is now recognised as an essential element in beating cancer sooner. And with a technological revolution creating new opportunities – and raising new challenges – for sharing and reusing data, it’s a topic that’s increasingly exciting our research community.
At CRUK we know that good management and sharing of the data produced with our funds is crucial to achieving our ambitious goals. We must synthesise data from across disciplines to generate innovative insights into the origins, prevention and treatment of cancer. Data-sharing planning is now an established part of our policies and procedures for applying for funding, and our Funding Managers and Committee members are always on the lookout for opportunities to maximise the value of our research outputs.
But sharing data is also great for individual researchers. We caught up with three of our researchers to find out why they’ve placed data sharing at the heart of their research programmes.
One of the most conspicuous success stories for data sharing is the field of genomics, where mature systems and vast databases make sharing and reusing data easy, and where innumerable research projects rely in part or in whole on publicly available resources.
“In genomics the situation’s quite simple,” says Dr Florian Markowetz, group leader at the CRUK Cambridge Institute. “We have the infrastructure, and we’re all required to deposit all our genomics data in the databases to be able to publish. It’s second nature. And that has really demonstrated its value. Pretty much every dataset that I might want to access, I’m able to.”
While genomics has been blazing trails, pioneering technology and standards and developing a mature culture of data sharing, other fields of cancer research have struggled to overcome some of the barriers. Florian is exasperated by what he sees as excuses for not sharing data, but he is also mindful of the real challenges that need to be addressed.
“There are challenges when it comes to protecting patient data. When I first started out, I assumed that you could just anonymise data and you wouldn’t be able to identify patients. But actually with genetic data that might not be the case. It’s possible that you might be able to reverse engineer the data to work out which patients it came from. So it requires more careful consideration.”
It’s a challenge that Professor Mark Lawler, Chair in Translational Cancer Genomics at Queen’s University Belfast, is also tackling.
“Genomic data is very useful, but real value in terms of human disease comes from the ability to link genomic and clinical data,” says Mark. “The challenge is: how do you share clinical and genomic data in a way that’s effective and workable, but also ethically responsible? While we are able to collect massive amounts of data, the problem is that these datasets are stored in different silos which don’t currently talk to each other.”
Mark is co-lead of the Global Alliance for Genomics and Health (GA4GH) Clinical Cancer Task Team, which develops tools and standards for interoperable data sharing with the aim of synchronising sequencing and clinical research efforts, and enabling computer systems to exchange and make use of information.
“GA4GH is a coalition of the willing,” says Mark. “Scientists, clinicians, industry, patient advocacy groups, IT and life sciences companies – over 400 institutions from about 60 countries have come together to work on a catalogue of projects.”
And they can point to a number of successes so far, such as the BRCA Exchange database and website, which pools many disparate groups’ data on BRCA gene variants and their associated disease risk, and allows expert curators to link genetic and clinical data to improve our understanding of breast and ovarian cancer. Or the Matchmaker Exchange, which has built a similar resource for rare diseases, allowing clinical and genetic data on rare cases from around the world to be “matched”, leading to improved disease diagnosis and treatment.
At the Francis Crick Institute, Professor Nick Luscombe has little patience for data access restrictions, and is using supercomputing power to accelerate data processing.
“Access restrictions really hinder progress,” Nick argues. “Accessing medical records, for example, is really difficult: if you do manage to get hold of the data, you end up with these discrete, disparate pieces of data that you can’t put together. What little can be done takes a lot of manual work, so it just can’t scale.”
“There needs to be a cultural shift in medicine. There are some real reasons why people have negative views, including privacy and whether data can be misused. But there’s a mismatch between what we think the risks are and the public’s perception of risk.
“Also there are researchers who want to keep hold of their data and tightly control it, or they get strange ideas about credit and that they should be considered co-authors on any subsequent work that looks at ‘their’ data. That’s a real shame because sharing data speeds up progress – as genomics shows.”
Addressing some of these problems is the eMedLab project, funded by the Medical Research Council. “There are lots of standard datasets that people want to look at. In the past you had to go out and get that data and install your own local copy, and if the data wasn’t published openly that might mean a lot of effort before you even start your project.”
To use eMedLab, researchers set up a virtual machine on the high-performance computing cluster, from which they can run analyses on standard public datasets, plus data provided by industry and by organisations including the Farr Institute and Genomics England. “With eMedLab we’re turning the process on its head,” Nick says. “Rather than bring the data to your program, you take your program to the data.”
Data is not enough
The increasing need for and value of interoperable data will require more than just ensuring that the data is published. In fact, Florian Markowetz believes data analysis needs to be as transparent as the data itself.
“Just having a bunch of data is not the whole story,” says Florian. “Really, the data alone are not all that interesting. You need to know how people analysed it, which annotations they used, which computer programs. That’s what unlocks the value – the reproducibility, the reuse and the interoperability.”
But Florian doesn’t simply appeal to altruism, preferring to emphasise what he calls the ‘selfish’ reasons for sharing data and analysis. “If you’re not sharing the details, people won’t understand what you’re trying to do, and won’t be able to reproduce your results. That’s not just bad for science, it’s bad for you. You’ll have a hard time from peer reviewers who want to know this stuff, you’ll have difficulty maintaining continuity of projects when lab members move on, and in the worst case scenario, your experiments will have been wasted and you’ll have to withdraw a paper or a project because you haven’t documented how your results were achieved.”
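The kind of documentation Florian describes can be surprisingly lightweight. As a minimal sketch (not any specific tool used by the researchers in this article), the hypothetical script below records the parameters and computing environment alongside a toy analysis result, so that anyone picking up the project later can see exactly how the numbers were produced:

```python
import json
import platform
import sys

def run_analysis(values, threshold):
    """Toy 'analysis': count values above a threshold."""
    return sum(1 for v in values if v > threshold)

def save_with_provenance(result, params, path):
    """Write the result together with the settings and environment that produced it."""
    record = {
        "result": result,
        "parameters": params,             # every setting used in this run
        "python_version": sys.version,    # interpreter version
        "platform": platform.platform(),  # operating system and architecture
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record

params = {"threshold": 0.5}
result = run_analysis([0.2, 0.7, 0.9], params["threshold"])
record = save_with_provenance(result, params, "analysis_record.json")
print(record["result"])  # 2
```

In real projects the same idea extends to recording package versions, annotation database releases and random seeds, which is what makes a published analysis reproducible rather than merely published.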
Just as technological advances create new opportunities for data management, sharing and analysis, they also create opportunities for gathering ever more complex and difficult data.
“Some of the new ways that technology allows us to collect data on populations ‘in the wild’ are extremely exciting,” says Nick Luscombe. “But we can’t be complacent about progress. Genomics also shows that it takes a conscious effort to prevent secrecy and restrictions creeping back in.”
“The next challenge we’re starting to look at is how you collect and link in longitudinal data,” says Mark Lawler. “That’s especially vital in cancer research, where the evolution of tumours is such a big part of the story. But when you’re handling multiple data points and outcomes over long periods, that introduces new logistical problems in terms of ensuring continuity and accuracy.
“But we owe it to the patients who contribute their data to get this right – and if we do, the data will keep on giving back.”
In this article
Professor Mark Lawler, Chair in Translational Cancer Genomics, Queen’s University Belfast
Professor Nick Luscombe, Chair in Computational Biology, UCL and Winton Group Leader, Francis Crick Institute
Dr Florian Markowetz, Group Leader, CRUK Cambridge Institute
Five selfish reasons to work reproducibly. Florian Markowetz. Genome Biology (2015) 16:274.
This story is part of Pioneering Research: our annual research publication for 2015/16.