Data analysis - documentation : Whole genome alignments using ELAND
This page last changed on Apr 04, 2007 by ac.
This page details how to do large scale alignments using ELAND.
("the software formerly known as IMPALA")
stands for fficient arge-Scale lignment of ucleotide atabases.
ELAND searches a set of large DNA files for a large number of short DNA reads allowing up to 2 errors per match.
Eland gets compiled automatically as part of the pipeline compilation.
All necessary source files now live in the Eland subfolder of the Pipeline CVS tree, with the exception of GlobalUtilities.h and GlobalUtilities.cpp (both also required by PhageAlign), which live in subfolder Misc.
Each read in your query set of reads must be of the same length and ELAND must be compiled to match. If you wanted a manual compilation, to do this go to Pipeline/Eland and type e.g. make eland_20 for reads of length 20 - eland_20 is the executable that you should run. Alternatively, set OLIGO_LENGTH in your environment to 20 and do make -e eland - in this case your executable will be called just eland.
Each read in your query set of reads must be of the same length. These reads are aligned to a target set of large DNA sequences. These sequences must be in Fasta format with one entry per file. An example:
>X dna:chromosome chromosome:NCBI36:X:1:154913754:1 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCTGAAAGTGGACCTATCAGCAG GATGTGGGTGGGAGCAGATTAGAGAATAAAAGCAGACTGCCTGAGCCAGCAGTGGCAACC CAATGGGGTCCCTTTCCATACTGTGGAAGCTTCGTTCTTTCACTCTTTGCAATAAATCTT
They must first be "squashed" into the 2-bits-per-base format that ELAND reads. This is done by
At the end of this procedure, your directory path/myGenome will look similar to this:
> ls path/myGenome fastAFile1.fa.2bpb fastAFile1.fa.vld
For speed you may want to make room for the squashed files on your machine's local hard drive rather than keeping them on an external analysis drive. As an example, the squashed files for the 3Gbase human genome will take up 750Mbytes of space (750 Mbytes x 4 bases per byte = 3 Gbases).
The design goal of ELAND was to match a large number of reads against the human genome, ie a small number of large pieces of DNA. The key to doing this efficiently is to fit as large a batch of reads into RAM as possible. Accordingly it stores a match position (chromosome number plus position in chromosome) in 4 bytes, which in turn limits the total length of DNA that ELAND can match against in a single run.
The most significant of the 4 bytes is used as a block number, and can range between 1 and 239 (codes 0 and 240 to 255 are used to denote repeats). The remaining 3 bytes are used to denote position in a block, which can therefore be up to 2^24=16Mb in size. Each file takes up a whole number of blocks, so if you have large number of short sequences (e.g. ESTs) to match against, the way forward for now is to stick them in a single large file and separate them with Ns, I am looking to support the squashing of multi-entry fasta files in a future release.
The command line syntax to run ELAND is
eland_executable queryFile.txt squashedGenomeDir outFile.txt [repeatFile.txt]
For historical reasons a slightly different file format is used if an output name ending in .vmf is specified.
This requires revision v1.12 or later of GERALD.pl in Pipeline/Gerald. For more information on Gerald see Gerald User Guide and FAQ.
The config file you run GERALD on should contain lines such as
(or wherever your directory of squashed genome files is held)
(or wherever your list of repeats is held)
to run ELAND on particular lanes.
Note that ELAND_GENOME referes to a directory, not a file. The usual Gerald variables GENOME_DIR and GENOME_FILE are not used for Eland analysis. As described above, Eland expects a different file format and not Fasta.
You should then be able to run make. Under the hood a script convertToFasta.pl converts and concatenates all the reads into a single large FASTA file which is then passed as an input to ELAND. A script convertFromELAND.pl converts the results back into the by-tile PhageAlign format expected by the rest of the pipeline.
Eland uses up to 1GB of memory. The pipeline starts one Eland job per lane. In order to prevent most computers from running out of memory, an artificial dependency in the GERALD Makefile prevents multiple instances of Eland from running at the same time. This limitation can be removed by using the option
in the Gerald config file. Be aware that this may use up to 8 GB of memory on your analysis computer; if not enough memory is available, the analysis is likely to crash. Allowed values for this option are 1, 2, 4, 8. (1 means no lane parallelisation, and uses up to 1 GB RAM, 2 means two parallel jobs and up to 2 GB etc.)
First note that missing bases need to be specified as "N" characters and not "." as in the sequence files. This conversion is handled automatically by GERALD but you need to be aware of it when running ELAND as a standalone program.
ELAND allows up to (read length/4) "external" Ns - i.e at the beginning or end of the read. These are ignored.
ELAND also allows up to 2 "internal" Ns - these are interpreted in one of two ways, which are given equal weight (i.e. if a read with one type 'D' match and one deletion-N match will be called as a repeat).
A 'type D' N is of the form:
Read: ACNGT Genome: ACCGT
i.e. the base is there but not detected.
A 'type I' N is of the form:
Read: ACNGT Genome: AC-GT
i.e. it has skipped a base.
'Type D' and 'type I' Ns are given equal weight
In lining up bases in your read with bases at the position it aligns to in the reference, you should
I do have a Perl script that correctly interprets the rules described above, but given that Ns shouldn't occur that often it might be simpler just to ignore N-containing reads.
For ancient historical reasons 'D' stands for 'detection error' and 'I' stands for 'incorporation error' - I realise this is confusing as the temptation is to interpret them as 'deletion' and 'insertion.' I spent considerable time trying to get the N handling to work as described above, but the way it handles Ns is based on very old assumptions about how the (then non existent!) machine would behave - at the time we were intending to sequence single molecules for which I assumed N errors would be a lot more common and would be likely to be of the type described. Most Ns we actually see in practice are due to clusters at the edge of tiles wandering off the edge of the image for a cycle or two due to imperfect re-mapping of the tile position at different cycles - hence giving a 'type D' error. Otherwise the pipeline software strongly prefers to make a base call if at all possible, even if the call is of low quality.
A low priority TBD is to allow the N handler to be switched to 'type D errors only' (which better reflects the way the platform behaves) or just restricted to external Ns.
|Document generated by Confluence on Apr 05, 2007 11:25|