APPLES (Analysis of Plant Promoter-Linked Elements)
APPLES is a set of tools to analyse promoter sequences on a genome-wide scale. In this CyVerse-compatible version, two main modules are provided:
- APPLES_rbh: Find Orthologs as Reciprocal Best Hits
- APPLES_conservation: Find Non-Coding Conserved Regions
In addition, the following tools are also exposed to the user:
- APPLES_utr: Extract sequences based on FASTA and GFF3 files
The following diagram illustrates the structure of these modules:
The original APPLES package is described at this address
Nathaniel J. Davies, Peter Krusche, Eran Tauber and Sascha Ott, Analysis of 5’ gene regions reveals extraordinary conservation of novel non-coding sequences in a wide range of animals, BMC Evolutionary Biology, 2015, doi: 10.1186/s12862-015-0499-6
Laura Baxter, Aleksey Jironkin, Richard Hickman, Jay Moore, Christopher Barrington, Peter Krusche, Nigel P. Dyer, Vicky Buchanan-Wollaston, Alexander Tiskin, Jim Beynon, Katherine Denby, and Sascha Ott, Conserved Noncoding Sequences Highlight Shared Components of Regulatory Networks in Dicotyledonous Plants, Plant Cell, 2012, doi:10.1105/tpc.112.103010
:whale: Species String

This is a string of species names separated by ",".
- Note that there is no "," after the last species;
- The first species is the central species.
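The two rules above can be checked before submitting a job. The following is a minimal sketch, not part of APPLES itself; the function name `parse_species_string` is hypothetical, and the species names are the placeholders used in this README.

```python
def parse_species_string(species_string):
    """Split a comma-separated species string into (central, others).

    The first name is treated as the central species. A trailing or
    doubled comma yields an empty name and is reported as an error.
    """
    names = [name.strip() for name in species_string.split(",")]
    if any(not name for name in names):
        raise ValueError("empty species name (check for a trailing or doubled comma)")
    return names[0], names[1:]


central, others = parse_species_string("Species_1,Species_2,Species_3")
print(central)  # central species
print(others)   # remaining species, in the order given
```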
:whale: Sequence Database Folder
With Species_1 being the central species, you will have the following folder structure:
    <input_folder>
    +-- Species_1
    |   +-- PlantA.fa
    |   +-- PlantA.bed
    |   +-- PlantA_utr5.bed
    |   +-- PlantA_utr3.bed
    +-- Species_2
    |   +-- PlantA.fa
    |   +-- PlantA.bed
    |   +-- PlantA_utr5.bed
    |   +-- PlantA_utr3.bed
    |   +-- rbhSearch_result.txt
    +-- Species_3
    |   +-- PlantA.fa
    |   +-- PlantA.bed
    |   +-- PlantA_utr5.bed
    |   +-- PlantA_utr3.bed
    |   +-- rbhSearch_result.txt
    .
    .
See `/cyverseZone/home/shared/cyverseuk/apples_testdata/apples_conservation_multiple/app_short` for an example.
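A quick local check of the folder layout can catch a missing orthologs file early. This is a minimal sketch under the assumption that each species has a sub-folder named after it, as in the layout shown; `missing_ortholog_files` is a hypothetical helper, not an APPLES function.

```python
from pathlib import Path


def missing_ortholog_files(input_folder, species):
    """Return the non-central species whose folder lacks rbhSearch_result.txt.

    `species` is the ordered list of species names; the first entry is the
    central species and is not expected to have an orthologs file.
    """
    root = Path(input_folder)
    return [
        name
        for name in species[1:]
        if not (root / name / "rbhSearch_result.txt").is_file()
    ]
```

Calling `missing_ortholog_files(folder, ["Species_1", "Species_2", "Species_3"])` on a complete input folder should return an empty list.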
Please check the following in order to get correct results from the module:
:white_check_mark: Apart from the main species (e.g. Species_1 in our example), every other species must have an `rbhSearch_result.txt` file which annotates the orthologs between itself and the main species. This file needs to have a total of 4 tab-separated columns:
- Column 1: Species 1's protein ID;
- Column 2: Species 2's protein ID;
- Column 3: Species 2's gene ID;
- Column 4: Species 1's gene ID.
i.e. "Species_1_proteinID Species_2_proteinID Species_2_geneID Species_1_geneID". This is the format produced by the APPLES_rbh module.
:white_check_mark: The gene IDs in your `rbhSearch_result.txt` must match those in your `PlantA.fa` file. If these don't match, the program will not produce any result.
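Both checks can be scripted before running the module. The sketch below is illustrative only: the function names are hypothetical, the FASTA ID is assumed to be the first whitespace-delimited token after `>`, and which column to compare against which FASTA file is left to the caller (the example compares column 3, the Species 2 gene ID, against made-up IDs).

```python
def read_rbh_rows(lines):
    """Parse orthologs lines, insisting on exactly 4 tab-separated columns."""
    rows = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {number}: expected 4 tab-separated columns, found {len(fields)}"
            )
        rows.append(tuple(fields))
    return rows


def fasta_ids(lines):
    """Collect sequence IDs: the first token after '>' on each header line."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def unmatched_gene_ids(rows, ids, column):
    """Return gene IDs from the given 0-based column that are absent from `ids`."""
    return sorted({row[column] for row in rows} - ids)


# Hypothetical data: G2b is in the orthologs file but not in the FASTA file.
rows = read_rbh_rows(["P1\tQ1\tG2a\tG1a\n", "P2\tQ2\tG2b\tG1b\n"])
ids = fasta_ids([">G2a example description\n", "ACGT\n"])
print(unmatched_gene_ids(rows, ids, column=2))
```

An empty list from `unmatched_gene_ids` means every gene ID in that column has a matching FASTA record.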
The APPLES_rbh module finds Orthologs as Reciprocal Best Hits
Protein FASTA of Species A
Protein FASTA of Species B
The APPLES_utr module extracts sequences based on FASTA and GFF3 files of a species
- 1.1-stable Added parallelisation option [fa9ebdd]
- 1.0 Simple version adopted from Grannysmith
For a Species X:
Gene FASTA*- This is the file from which you wish to extract your sequences. Provided that you have the matching GFF3 annotation, this file may be genome-based, scaffold-based or otherwise.
GFF3*- This is the file which annotates the FASTA file.
Gene ID Identifier Text**- This is the text which prefixes the Gene ID in the 9th column of the GFF3 file. Check your GFF3 to see what goes here.
Sequence Length- The number of bases which you wish to extract upstream.
Stop at Neighbouring Gene- Check this if you wish the sequence extraction to stop at the neighbouring gene.
Include the 5-prime UTR region- Check this to start the upstream region at the TSS so that the sequence includes the UTR region. Otherwise, it starts at 5-prime.
* - Sequences of a species are queried from a pair of FASTA and GFF3 files. This requires the Sequence IDs in both files to match. In the FASTA file, this is the ID following the `>` character in the description lines; in the GFF3 file, this is the value stored in the first column of the gene lines (i.e. lines that say "gene" in the 3rd column).
** - To understand how the `Gene ID Identifier Text` works, here are a couple of examples:

Use "ID=" if your `gff3` file looks like this:
`Niben101Scf00059 maker gene 513034 528469 . + . ID=Niben101Scf00059g04019;Alias=maker-Niben101Scf00059-snap-gene-4.18`

Use "ID=gene:" if your `gff3` file looks like this:
`1 tair gene 31170 33153 . - . ID=gene:AT1G01050;Name=PPA1;biotype=protein_coding;description=Soluble inorganic pyrophosphatase 1 [Source:UniProtKB/Swiss-Prot%3BAcc:Q93V56];gene_id=AT1G01050;logic_name=tair`
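To see why the choice of prefix matters, the following sketch pulls a gene ID out of the 9th column of a GFF3 gene line. It is an illustration of the idea only and may differ from how APPLES parses the file; `gene_id_from_gff3_line` is a hypothetical helper, and the two lines are the examples quoted above, joined with tabs as GFF3 requires.

```python
def gene_id_from_gff3_line(line, id_prefix):
    """Return the gene ID that follows `id_prefix` in column 9, or None.

    Only lines with 9 tab-separated columns and "gene" in column 3 are
    considered. Attributes in column 9 are separated by ';'.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "gene":
        return None
    for attribute in fields[8].split(";"):
        if attribute.startswith(id_prefix):
            return attribute[len(id_prefix):]
    return None


niben = "\t".join([
    "Niben101Scf00059", "maker", "gene", "513034", "528469", ".", "+", ".",
    "ID=Niben101Scf00059g04019;Alias=maker-Niben101Scf00059-snap-gene-4.18",
])
tair = "\t".join([
    "1", "tair", "gene", "31170", "33153", ".", "-", ".",
    "ID=gene:AT1G01050;Name=PPA1;biotype=protein_coding;gene_id=AT1G01050",
])

print(gene_id_from_gff3_line(niben, "ID="))       # Niben101Scf00059g04019
print(gene_id_from_gff3_line(tair, "ID=gene:"))   # AT1G01050
print(gene_id_from_gff3_line(tair, "ID="))        # gene:AT1G01050
```

The last call shows the failure mode: with the shorter prefix, the extracted ID keeps the `gene:` part and would no longer match IDs written without it.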
The APPLES_conservation module finds Non-Coding Conserved Regions
There are three sections of inputs for the conservation module. The first two are identical to those of the utr module, one for each of the two species. In the third section:
Orthologs- A total of 4 columns (tab-separated) are required in this file. Column 1: Species A's protein ID; Column 2: Species B's protein ID; Column 3: Species B's gene ID; Column 4: Species A's gene ID. i.e. "SpeciesA_proteinID SpeciesB_proteinID SpeciesB_geneID SpeciesA_geneID". This is the format in which results from the APPLES_rbh module are produced.
Orthologs Mode- Results from the Pseudo-Orthologs option are used as a control, which is only useful when compared with the results produced using the corresponding (proper) orthologs. If you don't know what this means, please use the default mode.
Window Size- The Seaweed algorithm aligns one substring of each given sequence at a time (the length of each sequence is specified in that species' "Sequence Length" argument). The length of this substring is called the "Window Size". It is recommended to use one of these values: 30 / 60 (default) / 80 / 100
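To get a feel for how "Sequence Length" and "Window Size" relate, the sketch below enumerates fixed-length substrings of a sequence. It is not the Seaweed algorithm and makes no claim about how APPLES steps through a sequence; it assumes, purely for illustration, that windows start at every base.

```python
def window_count(sequence_length, window_size):
    """Number of windows of `window_size` bases, assuming a step of one base."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return max(sequence_length - window_size + 1, 0)


def windows(sequence, window_size):
    """Yield every substring of length `window_size`, assuming a step of one base."""
    for start in range(len(sequence) - window_size + 1):
        yield sequence[start:start + window_size]


print(window_count(2000, 60))      # 1941
print(list(windows("ACGTAC", 4)))  # ['ACGT', 'CGTA', 'GTAC']
```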
Use the following command to split the orthologs file:
split -d --number=l/$(nproc) rbhSearch_result_PlantA_PlantB.txt rbhSearch_result_PlantA_PlantB.txt
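Where GNU `split` is not available, a similar division can be sketched in Python. This is an approximation under stated assumptions: it balances chunks by line count, whereas `split --number=l/N` balances by size without breaking lines; `split_lines_evenly` is a hypothetical helper.

```python
import os


def split_lines_evenly(lines, parts):
    """Divide `lines` into `parts` contiguous chunks whose sizes differ by at most one."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(len(lines), parts)
    chunks, start = [], 0
    for index in range(parts):
        size = base + (1 if index < extra else 0)  # first `extra` chunks get one more line
        chunks.append(lines[start:start + size])
        start += size
    return chunks


# One chunk per available processor, mirroring $(nproc) in the command above.
parts = os.cpu_count() or 1
example_lines = [f"row{i}\n" for i in range(10)]  # placeholder rows
chunks = split_lines_evenly(example_lines, parts)
print([len(chunk) for chunk in chunks])
```

Each chunk could then be written to a file named with the original file name plus a two-digit suffix (`00`, `01`, ...), which matches the naming that `split -d` uses.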
As with all of the CyVerse UK applications developed at Warwick, there are 3 options when it comes to using our applications:
- Via the CyVerse Discovery Environment. This is the recommended approach for new users and the easiest option, since a full user interface is provided.
- Using the Docker images that are available on our Docker Hub repository :whale:. Each application/tool has a corresponding image.
- With the source code that is hosted on our GitHub repository :octocat:. This approach will give you more information about how the application actually works. We are always looking to improve our code, so feel free to send us a pull request.
The modules related to APPLES can be searched on the CyVerse Discovery Environment using the "apples" keyword in the application search box as shown in this screenshot: