CyVerse UK

The story behind it

Biology is increasingly a ‘big data’ science as new high-throughput technologies support faster, cheaper generation of sequencing, metabolite and image data. This enables potentially exciting breakthroughs as researchers spot undiscovered patterns and make new discoveries of biological importance. However, many individual biologists, and in some areas the community as a whole, struggle to take full advantage of the data generated because of a lack of computing resources, appropriate support and technical skills. It is not only the output of data analyses, such as models, curated datasets, or raw data, that has value to the wider community, but also the tools generated during research projects to help researchers test and validate their hypotheses. Currently these tools often remain in prototype form, for use only within the group or laboratory that generated them, because there is comparatively little standardisation and no easy means of sharing an accessible, user-friendly version of the tool. To undertake world-class bioscience, researchers therefore need to be able to store and access datasets, models and analysis tools, ideally from different locations across the globe because of the need for international collaboration. The iPlant Collaborative was funded by the US National Science Foundation (NSF) in 2008 to help solve these issues, and iPlant became the CyVerse project in 2013.

Enter CyVerse…

The CyVerse Data Store is a cloud-based storage space, accessible via the CyVerse Discovery Environment (DE), a virtual bioinformatics lab workbench, and via developer APIs such as the Agave API. In the DE, users can share datasets and data analysis tools with as many or as few people as they wish. Analysis tools, whether developed by CyVerse and CyVerse UK staff or built by others, can be shared with the wider community, in a similar manner to ‘apps’ on smartphones.

…and CyVerse UK

CyVerse is currently distributed across three US locations; we have extended this into an international collaboration by building a CyVerse UK node at the Earlham Institute in Norwich, UK (EI).
EI provides the National Capability of computational infrastructure and as such is perfectly situated to provide the foundations for the CyVerse UK node. CyVerse UK provides independent versions of the CyVerse Data Store, computational nodes, and API access, but it’s also linked to the US to share resources and expertise.

Physical resource alone is not sufficient for a successful infrastructure: it also needs to be used, maintained and expanded as demand increases. To demonstrate the versatility, power and value of CyVerse UK, in the first stage of the project a dedicated team of programmers based at the Earlham Institute and the Universities of Warwick, Liverpool and Nottingham have adapted tools that have been generated for use in a single research group for wider community adoption. Three suites of tools to benefit key areas of UK plant science – sequencing, systems biology and image analysis – will be made available to the global plant research community via the CyVerse DE and programmatic interfaces.

In less than 10 years, CyVerse has built a global user base of over 18,500 users. As this continues to expand, the future sustainability of CyVerse must be considered. CyVerse UK will help ensure the future existence and reliability of CyVerse for UK users, spread expertise and best practice between the UK and US, allow the UK to contribute to the future direction of this valuable resource, and provide an exemplar project for others wishing to establish future international CyVerse nodes.

By establishing CyVerse UK and promoting access to a resource that allows users to readily store and analyse their data, this project will help support a wide range of research including genome-wide association projects exploiting natural variation in crops, predicting biological networks and pathways, and the high-throughput imaging and image analysis services that take researchers one step closer to bridging the genotype to phenotype gap.

Where

The Earlham Institute maintains the main hardware for the CyVerse UK project: it originally added 16 × 256GB 16-core nodes and 16 × 128GB 12-core nodes to the CyVerse computational pool, together with 482TB of mirrored storage, all fully reserved for UK pipelines.
Additional funding was granted to increase the computing resources with 40 additional compute nodes (~800 cores): 24 × 512GB 24-core nodes and 16 × 256GB 16-core nodes.
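As a quick check of the figures above, the node counts can be tallied in a few lines of Python. This is only a sketch: the `NodeGroup` class and its field names are illustrative, and the GB figure is assumed to be memory per node, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    """A batch of identical compute nodes (illustrative structure)."""
    count: int           # number of nodes in the batch
    gb_per_node: int     # GB per node as quoted (assumed to be memory)
    cores_per_node: int  # CPU cores per node


def total_cores(groups):
    """Sum the cores across all batches of nodes."""
    return sum(g.count * g.cores_per_node for g in groups)


# Figures quoted in the text above.
original = [NodeGroup(16, 256, 16), NodeGroup(16, 128, 12)]
additional = [NodeGroup(24, 512, 24), NodeGroup(16, 256, 16)]

print("original cores:  ", total_cores(original))    # 448
print("additional cores:", total_cores(additional))  # 832, i.e. the "~800 cores"
print("additional nodes:", sum(g.count for g in additional))  # 40
```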
We run a private OpenStack cloud that provides Virtual Machines to CyVerse UK and its collaborators. The execution system is deployed as an HTCondor cluster for the heavy-lifting HPC jobs.
EI’s Data Store is federated with the Data Store in the US, meaning that users can find and reuse datasets uploaded in the UK or the US without any extra steps. Currently the storage system is private and you’ll need to contact us to be added to the list of authorised users. Together with the maintenance of this hardware, EI provides support with the implementation of pipelines from the other universities as well as internal workflows.

Tools

If you only need to use CyVerse tools you can take a look at the Docker registry under the cyverseuk and cyversewarwick organisations.
We use Docker as a containerisation technology to ensure reproducibility and version control, and to decouple the applications from the underlying architecture. As well as running on the CyVerse infrastructure, the Docker images can be downloaded and run on your own machine if you choose to do so. If you are a developer, writing your own Dockerfile and making it available in the public registry is a great first step for your software to become available to the thousands of CyVerse users.
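To illustrate running one of these images locally, the following Python sketch assembles a `docker run` command. The image name `cyverseuk/example_tool:latest`, the host directory and the tool arguments are hypothetical placeholders; check the registry for the real image names and each tool's own options. The command is only printed here, not executed.

```python
import shlex


def docker_run_command(image, host_dir, tool_args):
    """Build a `docker run` command that mounts host_dir at /data in the container."""
    return [
        "docker", "run",
        "--rm",                      # remove the container when it exits
        "-v", f"{host_dir}:/data",   # expose local data to the tool
        image,
        *tool_args,
    ]


# Hypothetical image and arguments, for illustration only.
cmd = docker_run_command(
    image="cyverseuk/example_tool:latest",
    host_dir="/home/me/project",
    tool_args=["--input", "/data/sample.fastq"],
)

# Print the command for inspection; pass `cmd` to subprocess.run() to execute it.
print(shlex.join(cmd))
```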

Applications

Behind the scenes, tools packaged as Docker containers are normally embedded in a further wrapper for ease of use, and then registered on one or more CyVerse systems.
Running jobs this way requires the user to register for a CyVerse account (it’s free, and you only need one for both US and UK systems!). You can get a CyVerse login here. There are two different types of applications in the CyVerse ecosystem: Agave apps and Discovery Environment apps.
Currently all applications running in the EI cluster are Agave applications. This allows you to run jobs in a variety of ways:
  • Using the Discovery Environment: this is usually the preferred way, as the Discovery Environment provides you with a friendly graphical interface with additional functionality to manage your data, save your analyses and pipelines, and share jobs and data with collaborators.
    Unfortunately the Discovery Environment, while it usually integrates very well with the Agave API, does not let you select multiple files in the graphical interface even when the application accepts them. If this affects you, you can temporarily(*) rely on one of the following methods.
  • Using the Agave API from the command line (here’s the CLI): you submit jobs by writing your own JSON file. This can be a bit more tedious, but on GitHub we provide example JSON files as a guideline.
  • Using the CyVerse UK web interface: this allows you to submit multiple files. It also has the advantage of showing you only the applications running at EI, and it’s fully integrated with the UK Data Store (so you need to contact us to be added to the list of users before running your analyses).
Whatever you choose, you can also use the Discovery Environment to join the tools you need and create a workflow.
(*) In the near future all applications will be DE apps, and you won’t need to worry about any of the above. Jobs will run in the location most favourable to you, and your data will be stored in your preferred storage. It will also be easier for our users to create new applications and share them with the whole community.
If you are a developer, CyVerse is a good platform to share your software and allow the research community to use it and reproduce analyses. You are welcome to add your own tools, or you can contact us for help. If you need help getting to grips with setting up your own workflows, then please get in touch.
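For the command-line route described above, a job JSON file with several input files might be put together as in the following Python sketch. The top-level keys (`name`, `appId`, `inputs`, `parameters`, `archive`) follow the usual shape of an Agave job request, but the app ID, the input key `input_files` and the file paths are hypothetical; the example JSON files on GitHub remain the authoritative guide for each application.

```python
import json


def build_job_request(app_id, name, inputs, parameters=None, archive=True):
    """Assemble a dictionary in the general shape of an Agave job request."""
    return {
        "name": name,                    # free-text label for the job
        "appId": app_id,                 # ID of the registered application
        "inputs": inputs,                # input name -> path or list of paths
        "parameters": parameters or {},  # application-specific options
        "archive": archive,              # keep the outputs after the job ends
    }


# Hypothetical app ID, input name and paths, for illustration only.
job = build_job_request(
    app_id="example_tool-1.0.0",
    name="example-multi-file-job",
    inputs={
        "input_files": [
            "myuser/data/sample_1.fastq",
            "myuser/data/sample_2.fastq",
        ]
    },
)

# Serialise the request; the text can be saved to a file and passed to the CLI.
job_json = json.dumps(job, indent=2)
print(job_json)
```

Passing a list for one input is how more than one file is supplied here, which assumes the application itself has been registered to accept multiple files.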

Data Store

We are working on a full federation of our UK Data Store with the Data Store in the US. In the meantime you can store your data in our UK Data Store, which gives you geographical advantages in access speed and is useful if legal requirements dictate that your data must remain within UK/EU jurisdiction.
Our UK Data Store uses the iRODS data management system in the same way as the US version, and the best way to get started is to get comfortable with icommands (see some instructions on how to set it up). The settings in the docs point you to the US, but you can get in touch and we’ll help you change them to point to the UK. iRODS allows full integration between the systems (together with fast and reliable data transfer and accessibility across the zones), so you’ll be able to see and use your data in the Discovery Environment in the same way whether it is housed in the UK or the US.
Finally, we have a new Data Commons collection at /iplant/home/shared/uk_data_commons, which acts as a place where you can store and share your data publicly, for example in connection with a publication. This allows other users of CyVerse to gain access to and analyse your data immediately, which aids reproducibility and open science. Please contact us if you want to store your data openly in the UK Data Commons.
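As a rough illustration of the settings involved, icommands reads its connection details from a JSON file (by default ~/.irods/irods_environment.json) containing the host, port, user name and zone. The Python sketch below builds such a structure and a path inside the Data Commons collection. The host `irods.example.org`, the zone `exampleZone`, the user name and the sub-folder names are placeholders, not the real UK values; contact us for the correct settings.

```python
import json
import posixpath

# Collection path quoted in the text above.
UK_DATA_COMMONS = "/iplant/home/shared/uk_data_commons"


def irods_environment(host, zone, user, port=1247):
    """Return the core fields of an iRODS client environment file."""
    return {
        "irods_host": host,
        "irods_port": port,  # 1247 is the standard iRODS port
        "irods_user_name": user,
        "irods_zone_name": zone,
    }


def commons_path(*parts):
    """Join path components under the UK Data Commons collection."""
    return posixpath.join(UK_DATA_COMMONS, *parts)


# Placeholder values only: the real UK host and zone are not given here.
env = irods_environment("irods.example.org", "exampleZone", "myuser")
print(json.dumps(env, indent=2))
print(commons_path("my_publication", "reads.fastq"))
```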

Virtual Machines

If you require additional computational power or a full Linux environment for development and analyses, we can provide you with a custom Virtual Machine hosted in our private cloud.
Please contact us to discuss your requirements and the kind of support and capacity you need.