Biology is increasingly a ‘big data’ science as new high-throughput technologies support faster, cheaper generation of sequencing, metabolite and image data. This enables potentially exciting breakthroughs as researchers spot undiscovered patterns and make new discoveries of biological importance. However, many individual biologists, and in some areas the community as a whole, struggle to take full advantage of the data generated because of a lack of computing resource, appropriate support and technical skill. It is not only the output of data analyses, such as models, curated datasets, or raw data, that has value to the wider community, but also the tools generated during research projects to help researchers test and validate their hypotheses. Currently these tools often remain in prototype form, for use only within the group or laboratory that generated them, because there is comparatively little standardisation and no easy means of sharing an accessible, user-friendly version of the tool. To undertake world-class bioscience, researchers therefore need to be able to store and access datasets, models and analysis tools, ideally from different locations across the globe because of the need for international collaboration. The iPlant Collaborative was funded by the US National Science Foundation (NSF) in 2008 to help solve these issues, and iPlant became the CyVerse project in 2013.
The CyVerse Data Store is a cloud-based storage space, accessible via the CyVerse Discovery Environment (DE), a virtual bioinformatics lab workbench, and developer APIs such as the AGAVE API. In the DE, users can share datasets and tools to analyse data with as many or as few people as they wish. Tools to analyse data developed by CyVerse and CyVerse UK staff or built by others can be shared with the wider community, in a similar manner to ‘apps’ on smartphones.
CyVerse UK is hosted at the Earlham Institute (EI) in Norwich, UK. EI provides the National Capability in computational infrastructure and as such is well placed to provide the foundations for the CyVerse UK node. CyVerse UK provides independent versions of the CyVerse Data Store, computational nodes, and API access.
Physical resource alone is not sufficient for a successful infrastructure: it also needs to be used, maintained and expanded as demand increases. To demonstrate the versatility, power and value of CyVerse UK, three suites of tools that benefit key areas of UK plant science – sequencing, systems biology and image analysis – will be made available to the global plant research community.
In less than 10 years, CyVerse has built a global user base of over 18,500 users. As this continues to expand, the future sustainability of CyVerse must be considered. CyVerse UK will help ensure the future existence and reliability of CyVerse for UK users, spread expertise and best practice, allow the UK to contribute to the future direction of this valuable resource, and provide an exemplar project to others.
By establishing CyVerse UK and promoting access to a resource that allows users to readily store and analyse their data, this project will help support a wide range of research including genome-wide association projects exploiting natural variation in crops, predicting biological networks and pathways, and the high-throughput imaging and image analysis services that take researchers one step closer to bridging the genotype to phenotype gap.
The Earlham Institute maintains the main hardware for the CyVerse UK project: it originally added 16 256GB 16-core nodes and 16 128GB 12-core nodes to the CyVerse computational pool, together with 482TB of mirrored storage, which are fully reserved for UK pipelines.
Additional funding was granted to increase the computing resources with 40 additional compute nodes (~800 cores): 24 512GB 24-core nodes and 16 256GB 16-core nodes.
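As a quick check on the figures above, the node counts can be totalled in a few lines of Python. This is only an illustrative sketch: the `NodeGroup` and `totals` names are made up for this example, and it assumes the GB figures are memory per node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    """A group of identical compute nodes (illustrative type, not a CyVerse API)."""
    count: int      # number of nodes in the group
    memory_gb: int  # assumed to be memory per node, in GB
    cores: int      # cores per node


def totals(groups):
    """Return (total_nodes, total_cores, total_memory_gb) for a list of groups."""
    nodes = sum(g.count for g in groups)
    cores = sum(g.count * g.cores for g in groups)
    memory = sum(g.count * g.memory_gb for g in groups)
    return nodes, cores, memory


# Figures as quoted in the text.
ORIGINAL = [NodeGroup(16, 256, 16), NodeGroup(16, 128, 12)]
EXPANSION = [NodeGroup(24, 512, 24), NodeGroup(16, 256, 16)]

if __name__ == "__main__":
    for label, groups in (("original", ORIGINAL), ("expansion", EXPANSION)):
        nodes, cores, memory = totals(groups)
        print(f"{label}: {nodes} nodes, {cores} cores, {memory} GB")
```

On these numbers the expansion comes to 40 nodes and 832 cores, which is consistent with the "~800 cores" quoted above.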
cyverseuk and cyversewarwick organisations.
icommands (see some instructions on how to set it up). The settings in the docs point you to the US, but you can get in touch and we’ll help you change them to point to the UK. iRODS allows us to have full integration between the systems (together with providing fast and reliable data transfer and accessibility across the zones), so you’ll be able to see and use your data in the Discovery Environment in the same way, whether it is housed in the UK or the US.
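For orientation, the connection settings that icommands reads normally live in a small JSON file, `~/.irods/irods_environment.json`. The sketch below shows roughly what changing them involves. It is not an official CyVerse UK configuration: the helper name `irods_environment` is invented for this example, the host is a deliberate placeholder (ask the CyVerse UK team for the real value), and the zone name is an assumption read off the `/iplant` prefix of CyVerse collection paths.

```python
import json


def irods_environment(user, host, zone, port=1247):
    """Build the settings dictionary stored in ~/.irods/irods_environment.json.

    The four keys are standard iRODS client settings; 1247 is the default
    iRODS port. The values passed in here are supplied by the caller.
    """
    return {
        "irods_host": host,
        "irods_port": port,
        "irods_user_name": user,
        "irods_zone_name": zone,
    }


# Placeholder only: the real UK host name is not given here and must be
# obtained from the CyVerse UK team.
UK_HOST_PLACEHOLDER = "<uk-host-from-cyverse-uk>"

if __name__ == "__main__":
    env = irods_environment(
        user="your_username",
        host=UK_HOST_PLACEHOLDER,
        zone="iplant",  # assumption: taken from the /iplant/... path prefix
    )
    # Print the JSON rather than writing it, so nothing is overwritten.
    print(json.dumps(env, indent=4))
```

Printing rather than writing the file is deliberate: an existing icommands setup is left untouched until you have confirmed the correct values.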
Finally, we have a new Data Commons collection at /iplant/home/shared/uk_data_commons, which acts as a place where you can store and share your data publicly, for example in connection with a publication. This allows other users of CyVerse to gain access to and analyse your data immediately, which aids reproducibility and open science. Please contact us if you want to store your data openly in the UK Data Commons.