Biology is increasingly a ‘big data’ science as new high-throughput technologies support faster, cheaper generation of sequencing, metabolite and image data. This enables potentially exciting breakthroughs as researchers spot previously undetected patterns of biological importance. However, many individual biologists, and in some areas the community as a whole, struggle to take full advantage of the data generated because of a lack of computing resources, appropriate support and technical skills. It is not only the outputs of data analyses, such as models, curated datasets or raw data, that have value to the wider community, but also the tools developed during research projects to help researchers test and validate their hypotheses. Currently these tools often remain in prototype form, for use only within the group or laboratory that generated them, because there is comparatively little standardisation and no easy means of sharing an accessible, user-friendly version of the tool. To undertake world-class bioscience, researchers therefore need to be able to store and access datasets, models and analysis tools, ideally from different locations across the globe, given the need for international collaboration. The iPlant Collaborative was funded by the US National Science Foundation (NSF) in 2008 to help solve these issues, and iPlant became the CyVerse project in 2013.
The CyVerse Data Store is a cloud-based storage space, accessible via the CyVerse Discovery Environment (DE), a virtual bioinformatics lab workbench, and via developer APIs such as the Agave API. In the DE, users can share datasets and analysis tools with as many or as few people as they wish. Analysis tools, whether developed by CyVerse and CyVerse UK staff or built by others, can be shared with the wider community in a similar manner to ‘apps’ on smartphones.
CyVerse is currently distributed across three US locations; we have extended this into an international collaboration by building a CyVerse UK node at the Earlham Institute (EI) in Norwich, UK. EI provides the National Capability in computational infrastructure and as such is well placed to provide the foundations for the CyVerse UK node. CyVerse UK provides independent versions of the CyVerse Data Store, computational nodes and API access, but is also linked to the US nodes to share resources and expertise. Physical resource alone is not sufficient for a successful infrastructure: it also needs to be used, maintained and expanded as demand increases. To demonstrate the versatility, power and value of CyVerse UK, a dedicated team of programmers based at the Earlham Institute and the Universities of Warwick, Liverpool and Nottingham have adapted tools originally developed for a single project so that they can be adopted by the wider community. Three suites of tools to benefit key areas of UK plant science – sequencing, systems biology and image analysis – will be made available to the global plant research community via the CyVerse DE and programmatic interfaces.
In less than 10 years, CyVerse has built a global base of over 18,500 users. As this continues to expand, the future sustainability of CyVerse must be considered. CyVerse UK will help ensure the future existence and reliability of CyVerse for UK users, spread expertise and best practice between the UK and US, give the UK input into the future direction of this valuable resource, and provide an exemplar project for others wishing to establish international CyVerse nodes.
By establishing CyVerse UK and promoting access to a resource that allows users to readily store and analyse their data, this project will help support a wide range of research, including genome-wide association projects exploiting natural variation in crops, the prediction of biological networks and pathways, and the high-throughput imaging and image analysis services that take researchers one step closer to bridging the genotype-to-phenotype gap.