Data analysis - documentation : Whole genome alignments using ELAND

This page last changed on Apr 04, 2007 by ac.

This page details how to do large scale alignments using ELAND.

("the software formerly known as IMPALA")

ELAND stands for E fficient L arge-Scale A lignment of N ucleotide D atabases.

ELAND searches a set of large DNA files for a large number of short DNA reads allowing up to 2 errors per match.

Here
"large" = human genome size and above; of course, it also works for small genomes
"large number" = at least 8 million on a PC with 1Gb of RAM
"short" = 32 bases or less

Compilation

Eland gets compiled automatically as part of the pipeline compilation.

All necessary source files now live in the Eland subfolder of the Pipeline CVS tree, with the exception of GlobalUtilities.h and GlobalUtilities.cpp (both also required by PhageAlign), which live in subfolder Misc.

Each read in your query set of reads must be of the same length and ELAND must be compiled to match. If you wanted a manual compilation, to do this go to Pipeline/Eland and type e.g. make eland_20 for reads of length 20 - eland_20 is the executable that you should run. Alternatively, set OLIGO_LENGTH in your environment to 20 and do make -e eland - in this case your executable will be called just eland.

Preparing the reference genome

Each read in your query set of reads must be of the same length. These reads are aligned to a target set of large DNA sequences. These sequences must be in Fasta format with one entry per file. An example:

>X dna:chromosome chromosome:NCBI36:X:1:154913754:1
CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCTGAAAGTGGACCTATCAGCAG
GATGTGGGTGGGAGCAGATTAGAGAATAAAAGCAGACTGCCTGAGCCAGCAGTGGCAACC
CAATGGGGTCCCTTTCCATACTGTGGAAGCTTCGTTCTTTCACTCTTTGCAATAAATCTT

They must first be "squashed" into the 2-bits-per-base format that ELAND reads. This is done by

making a folder for the squashed files
```
mkdir path/myGenome
```

going to where the files are and doing

Pipeline/Eland/squashGenome path/myGenome fastAFile1 [fastAFile2 ... ]

At the end of this procedure, your directory path/myGenome will look similar to this:

> ls path/myGenome
fastAFile1.fa.2bpb
fastAFile1.fa.vld

For speed you may want to make room for the squashed files on your machine's local hard drive rather than keeping them on an external analysis drive. As an example, the squashed files for the 3Gbase human genome will take up 750Mbytes of space (750 Mbytes x 4 bases per byte = 3 Gbases).

Limitations on the size of squashed genome files

The design goal of ELAND was to match a large number of reads against the human genome, ie a small number of large pieces of DNA. The key to doing this efficiently is to fit as large a batch of reads into RAM as possible. Accordingly it stores a match position (chromosome number plus position in chromosome) in 4 bytes, which in turn limits the total length of DNA that ELAND can match against in a single run.

The most significant of the 4 bytes is used as a block number, and can range between 1 and 239 (codes 0 and 240 to 255 are used to denote repeats). The remaining 3 bytes are used to denote position in a block, which can therefore be up to 2^24=16Mb in size. Each file takes up a whole number of blocks, so if you have large number of short sequences (e.g. ESTs) to match against, the way forward for now is to stick them in a single large file and separate them with Ns, I am looking to support the squashing of multi-entry fasta files in a future release.

So in a single ELAND run you can match against:

One file of 240x16=3824Mb
239 files, each up to 16Mb in size
Or something in between, eg 24 files of up to 160Mb each.
The NCBI human genome will fit. Please note that at this time (4th April 2007) ELAND does not check that the directory of squashed files it is matching against does not exceed these limits.

Running ELAND as a standalone program

The command line syntax to run ELAND is

eland_executable queryFile.txt squashedGenomeDir outFile.txt [repeatFile.txt]

queryFile.txt is a file of query sequences. This must be either a multi-entry Fasta format file or a one-sequence-per-line ASCII file. The length of each sequence must exceed the read length specified at compilation, else "undefined behaviour" will result. Unspecified bases in the reads must be denoted by an "N." IUPAC ambiguity codes are not handled (yet).
squashedGenomeDir is the path to the directory squashed genome files you made above.
outFile.txt is the name of the output file. One line of output per sequence, fields are:
1. Sequence name (derived from file name and line number if format is not Fasta)
2. Sequence
3. Type of match:
NM - no match found.
QC - no matching done: QC failure (too many Ns basically).
RM - no matching done: repeat masked (may be seen if repeatFile.txt was specified).
U0 - Best match found was a unique exact match.
U1 - Best match found was a unique 1-error match.
U2 - Best match found was a unique 2-error match.
R0 - Multiple exact matches found.
R1 - Multiple 1-error matches found, no exact matches.
R2 - Multiple 2-error matches found, no exact or 1-error matches.
4. Number of exact matches found.
5. Number of 1-error matches found.
6. Number of 2-error matches found.
Rest of fields are only seen if a unique best match was found (i.e. the match code in field 3 begins with "U").
7. Genome file in which match was found.
8. Position of match (bases in file are numbered starting at 1).
9. Direction of match (F=forward strand, R=reverse).
10. How N characters in read were interpreted: ("."=not applicable, "D"=deletion, "I"=insertion).
Rest of fields are only seen in the case of a unique inexact match (i.e. the match code was U1 or U2).
11. Position and type of first substitution error (e.g. 12A: base 12 was A, not whatever is was in read).
12. Position and type of first substitution error, as above.

For historical reasons a slightly different file format is used if an output name ending in .vmf is specified.

repeatFile.txt: Optionally you may want to specify a set of words that you know are highly repetitive in your target files at your read length of interest. You can then tell ELAND to ignore them, which greatly speeds up whole-human-genome alignments. There is no automatic way of generating a repeat file, but with a bit of Perl/shell scripting it is straightforward to extract a list of repeats from the output of a couple of ELAND runs to speed up future runs.
Also in the Eland directory is a basic test harness ELAND_test.pl which can be run to verify correct operation.

Running ELAND from the pipeline using GERALD

This requires revision v1.12 or later of GERALD.pl in Pipeline/Gerald. For more information on Gerald see Gerald User Guide and FAQ.

The config file you run GERALD on should contain lines such as

ELAND_GENOME /usr/local/share/eland/ncbi35

(or wherever your directory of squashed genome files is held)

ELAND_REPEAT /usr/local/share/eland/reps30_5

(or wherever your list of repeats is held)
and either

ANALYSIS eland

or use

34:ANALYSIS eland

to run ELAND on particular lanes.

Note that ELAND_GENOME referes to a directory, not a file. The usual Gerald variables GENOME_DIR and GENOME_FILE are not used for Eland analysis. As described above, Eland expects a different file format and not Fasta.

You should then be able to run make. Under the hood a script convertToFasta.pl converts and concatenates all the reads into a single large FASTA file which is then passed as an input to ELAND. A script convertFromELAND.pl converts the results back into the by-tile PhageAlign format expected by the rest of the pipeline.

For longer runs you will need to bear in mind that at the moment ELAND can cope with read lengths of at most 32.

Also note that at the moment you can only specify one ELAND_GENOME per experiment. This is a GERALD not an ELAND limitation but I'm unlikely to get round to fixing it in the near future.

CAUTION: ELAND will only detect matches having 2 differences or less from your reads. This means it is less sensitive than PhageAlign, which will always find a best match (although possibly not a unique one) for your reads. This is the price you pay for it being ~10000 times faster than PhageAlign. This means
- If your data is noisy not all of it is going to align. If this happens to a significant proportion of your data then it is probably the case that your data is too noisy to get good results anyway. "Talk to the hand," as they say.
- Estimates of error rates based on ELAND output will underestimate the true error rate (since the noisiest reads will not get aligned and hence won't contribute to the calculation).

Eland uses up to 1GB of memory. The pipeline starts one Eland job per lane. In order to prevent most computers from running out of memory, an artificial dependency in the GERALD Makefile prevents multiple instances of Eland from running at the same time. This limitation can be removed by using the option

ELAND_MULTIPLE_INSTANCES 8

in the Gerald config file. Be aware that this may use up to 8 GB of memory on your analysis computer; if not enough memory is available, the analysis is likely to crash. Allowed values for this option are 1, 2, 4, 8. (1 means no lane parallelisation, and uses up to 1 GB RAM, 2 means two parallel jobs and up to 2 GB etc.)

ELAND features, "features" and TBDs

Would like to process longer read lengths than 32
Quality value aware: at the moment it allows up to two mismatches per read, giving all mismatches equal weight. It would be better to be able to individually weight bases via either a score file or quality values, as is done by PhageAlign. This is on my to-do list: I've thought of a way round it but haven't had time to implement it.
More output format options, e.g. print all matches found, not just the unique best match.
I also have prototype versions to align paired reads and align bisulphite-treated DNA (to detect methylation sites).

Appendix: How ELAND handles missing bases

First note that missing bases need to be specified as "N" characters and not "." as in the sequence files. This conversion is handled automatically by GERALD but you need to be aware of it when running ELAND as a standalone program.

ELAND allows up to (read length/4) "external" Ns - i.e at the beginning or end of the read. These are ignored.

ELAND also allows up to 2 "internal" Ns - these are interpreted in one of two ways, which are given equal weight (i.e. if a read with one type 'D' match and one deletion-N match will be called as a repeat).

A 'type D' N is of the form:

Read:   ACNGT
Genome: ACCGT

i.e. the base is there but not detected.

A 'type I' N is of the form:

Read:   ACNGT
Genome: AC-GT

i.e. it has skipped a base.

'Type D' and 'type I' Ns are given equal weight

In lining up bases in your read with bases at the position it aligns to in the reference, you should
a. ignore any leading Ns
b. ignore any 'type I' Ns (because they are 'non existent' bases)

I do have a Perl script that correctly interprets the rules described above, but given that Ns shouldn't occur that often it might be simpler just to ignore N-containing reads.

For ancient historical reasons 'D' stands for 'detection error' and 'I' stands for 'incorporation error' - I realise this is confusing as the temptation is to interpret them as 'deletion' and 'insertion.' I spent considerable time trying to get the N handling to work as described above, but the way it handles Ns is based on very old assumptions about how the (then non existent!) machine would behave - at the time we were intending to sequence single molecules for which I assumed N errors would be a lot more common and would be likely to be of the type described. Most Ns we actually see in practice are due to clusters at the edge of tiles wandering off the edge of the image for a cycle or two due to imperfect re-mapping of the tile position at different cycles - hence giving a 'type D' error. Otherwise the pipeline software strongly prefers to make a base call if at all possible, even if the call is of low quality.

A low priority TBD is to allow the N handler to be switched to 'type D errors only' (which better reflects the way the platform behaves) or just restricted to external Ns.

Document generated by Confluence on Apr 05, 2007 11:25