Using a massively parallel sequencing approach, the Illumina Genome Analyzer (GA) can generate billions of bases of high quality sequence data in a single run. The system uses Solexa sequencing technology and novel reversible terminator chemistry, optimized to achieve unprecedented levels of cost effectiveness and throughput. More information describing this instrument and its application is available at two Illumina web sites, here and also here.
The DNA Technologies and Expression Analysis Cores of the Genome Center operate two of these instruments at GA IIX status: the latest version with up-to-date optics, electronics, and analytic capabilities. One of our GAIIs is equipped with a paired end module that allows sequencing from both ends of a defined-size DNA molecule. Data produced from such paired end reads greatly facilitate assemblies and investigation of genomic structure, and this capability, coupled with chemistry improvements allowing longer accurate reads, has expanded the utility of this platform for de novo assemblies. Illumina maintains a site with the most up-to-date citations about the experiments being done with these instruments, and we're always happy to discuss with you what's new and what's available at our core facilities.
Steps involved in a GA experiment can be broken down into a series of experimental manipulations, instrument runs, and data analyses. These steps include creation of a sequencing library, seeding and preparation of the flow cell on the cluster station, sequencing by synthesis, and bioinformatics.
The Sequencing Library
Library Quality Control and Quantitation
Cluster Generation and Sequencing by Synthesis
Does Length Matter?
Deliverables and Bioinformatics: The GA IIX and SLIMS
Did It Work?
Other Analysis Options
Prices
Sample Submission and Scheduling
In Closing
DNA. We offer DNA library preparation services, with performance specifications depending on the source material. Genomic DNA, double strand cDNA libraries, BACs, or other material available in microgram quantities will generate quality libraries nearly every time. Libraries can also be constructed from much less input material, e.g., chromatin-immunoprecipitated product or hybridization-selected DNAs, but in these technically challenging situations outcomes are not guaranteed. The following guidelines should be observed for submission of library-worthy material: for chromosomal, BAC, or related DNA libraries, provide 5 ug of high quality DNA (concentration > 100 ng/ul, OD 260/280 close to 1.8) in a neutral solution such as EB, TE ([EDTA] = 0.1 mM) or water.
RNA. We also offer mRNA seq library preparation. In this protocol, polyA is purified from total RNA, then fragmented, converted to ds cDNA, and ligated to the Illumina adapters. Starting material is ideally 10 ug of total RNA (at least 200 ng/ul), analyzed by gel or bioanalyzer to ensure integrity. We do not offer small RNA library construction services, but we do run these libraries on our sequencers if primer is provided.
Users interested in making their own libraries to expedite their studies and save money should download and read the current protocols for different kits offered by Illumina (see below). Ordering information for kits can be provided on request. DNA library production utilizes straightforward molecular biological techniques, and many groups have successfully produced libraries from protocols using assembled or kitted enzymes from other companies. An example of one such protocol for ChIP-seq DNA library construction is also provided below. We can provide (i.e., sell you) Illumina produced, purified, and certified oligonucleotides for construction of both single- and paired-end (PE) read libraries. Oligonucleotides from commercial vendors have also been used successfully for sequencing library preparation. Since the PE libraries work on both the single read and PE read flow cells, many users are opting to go with PE primers for library construction for maximum flexibility.
Making libraries from the Illumina mRNA seq kit is also a straightforward process. Other specialized RNA library kits are available from Illumina; check out their web site for details. Normalized mRNA libraries, those made following removal of highly abundant messages, are offered as a contract service; please inquire if this is of interest.
Library quality control is essential and merits its own section below.
Illumina Kit Protocols
The latest versions of Illumina validated protocols for all their kits are now at your fingertips:
Go to http://www.illumina.com/ftp.ilmn
Log on with
Username: guest
Password: illumina
Select the folder(s) of interest, in this case probably "Genome Analyzers."
Homemade Protocols
Genomic DNA library: This protocol is based on the Illumina kit protocol but uses the New England Biolabs reagents (see below). This is the protocol used in the core sequencing library workshops.
ChIP-seq DNA library: Protocol describing ChIP-seq DNA library construction using commercially available enzymes coupled with the Illumina oligonucleotides from the library kit. Protocol provided by Dr. Ghia Euskirchen. We can provide the purified, certified Illumina oligonucleotides for this protocol; you just need to assemble the enzymes and other required materials. Both standard genomic and paired end primers will be available.
Annotated ChIP-seq DNA library: Variation of the above protocol, with embedded comments regarding alternative procedures for genomic DNA and updated enzyme information.
Normalized RNA library: Not a detailed protocol, but an overview of a successful workflow for generating a normalized library.
We have not extensively validated these protocols. They do work for people but, as with any protocol, they're not necessarily perfected or exactly adaptable to your situation.
New England Biolabs sells a kit containing all the enzymes required for library construction. We have beta-tested these kits and they work well. They are economical and conveniently packaged. For more information on these and other NEB library-related products, please visit this site.
Other Resources
ChIP-Seq Data Technical Note and ChIP-Seq DataSheet. Two informational documents from Illumina describing the uses and applications of sequencing chromatin immunoprecipitation products. Good overviews with useful references.
Other Libraries
Illumina sells kits for mate-pair and indexing library construction (use the link above to get the protocols). Neither library type is generated as a core service yet, but we can provide ordering information and also library construction guidance.
Mate pair kits are for creating paired end libraries spanning distances greater than the approximately 600 bp maximum for cluster formation. DNA fragments of defined long length (800 bp to 10 kb or so) are generated, then circularized. The DNA is again fragmented and the junction region selectively cloned. Sequencing either end of the junction region effectively generates mate pair sequence separated by the size of the originally created fragments.
Indexing libraries allow for the sequencing of multiple libraries in a single lane, i.e., multiplexing. This strategy is an economical option for analysis of smaller genomes or other situations in which the typical lane output is greater than required. Short nucleotide "bar codes" are appended to each library, and the libraries are then pooled and sequenced. Deconvoluting the bar codes informatically allows multiple libraries to be sequenced in a single lane at a potential cost and time saving. To date, two methods have been exploited for this: using the Illumina indexing kit or synthesizing your own adapter oligos with your own bar codes. Illumina indexing technology has several operational (not scientific) drawbacks we can discuss with you. Homemade indexing has been utilized successfully by multiple core users, but there are two important factors to consider:
1. Indexed sequences cannot be easily aligned to a reference genome. The standard Illumina base calling pipeline we use does not allow for a one step splitting off of the index sequence while aligning the rest of the read. As a result, extra bioinformatic manipulations not included in standard fees will be required to get accurate alignment statistics while maintaining the index sequence.
2. The base composition of the first two bases in the library MUST be balanced in order for the image analysis software to identify clusters correctly. That is, the clusters should have a roughly equal representation of A, T, C, and G in the first two bases of the sequenced read. If only two samples are pooled, a single index sequence used on each sample will result in highly skewed base compositions and could prevent any data from being generated. The way around this is to make two different libraries for each sample, and balance the bases across each cycle. This is critical to get good data, so make sure you discuss these issues with us if there are questions.
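The balance requirement in point 2 can be checked before pooling. The following is a minimal Python sketch, not part of any Illumina or core pipeline: the function names, the example bar codes, and the 15% minimum-fraction threshold are all assumptions chosen for illustration, and it assumes the libraries are pooled in equal amounts.

```python
from collections import Counter


def base_composition_by_cycle(barcodes, n_cycles=2):
    """For each of the first n_cycles positions, return the fraction of
    A, C, G and T across the pooled bar codes (equal pooling assumed)."""
    composition = []
    for cycle in range(n_cycles):
        counts = Counter(bc[cycle].upper() for bc in barcodes)
        total = sum(counts.values())
        composition.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return composition


def is_balanced(barcodes, n_cycles=2, min_fraction=0.15):
    """True if every base reaches min_fraction at every checked cycle.
    The 0.15 threshold is an illustrative assumption, not a published spec."""
    return all(
        fraction >= min_fraction
        for cycle in base_composition_by_cycle(barcodes, n_cycles)
        for fraction in cycle.values()
    )


# Hypothetical bar code sets
skewed_pool = ["ACGT", "ACTG"]                    # both start with "AC"
balanced_pool = ["ACGT", "CATG", "GTAC", "TGCA"]  # each base once per cycle

print(is_balanced(skewed_pool))    # False
print(is_balanced(balanced_pool))  # True
```

If libraries are pooled in unequal amounts, the counts would need to be weighted by each library's share of the pool.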
For a sample of bar codes that have been used and a protocol for adapter preparation with these barcodes, download this document. These particular sequences have the advantage of enough inherent redundancy so that if one or two nucleotides are lost, the bar code can still be unambiguously assigned.
Hybridization Selection/Sequence Capture libraries are those in which particular genomic regions are pre-enriched prior to indexed library generation and sequencing. This strategy allows focused, very deep sequencing and can be implemented for a number of applications. A number of interesting articles have recently appeared exploiting this technique; check out the Illumina publications site for details. A number of companies offer services or platforms that can generate such material, including NimbleGen, Agilent, RainDance, Febit, and Fluidigm. A company called Expression Analysis (no relation to our core) has a service for sequence capture using the Agilent and RainDance platforms (depending on the particular experiment one is preferred). We have no direct experience with any of the platforms so consider this informational and not an endorsement or recommendation.
More on Fragmentation
DNA to be made into a sequencing library must first be converted into small fragments. There are several methods for doing this, each with attendant pros and cons. The original protocols utilized nebulization for fragmentation, and anyone desiring to pursue this technique can contact us for extra nebulization units because we abandoned this method very early. A related methodology, Hydroshear, can be accessed through the CAES Genomics facility. We primarily use a Diagenode Bioruptor because we are familiar with its operation through using it in chromatin immunoprecipitation experiments. Access to this instrument is available through the core, with the usual training and signup guidelines for core-available equipment in effect. Many protocols and centers rely on and recommend a fragmentation device from Covaris, which uses adaptive focused acoustics to break the DNA down into appropriately sized fragments. The core does not have imminent plans to acquire this device since in our opinion it does not offer a substantial advantage over the Bioruptor. A recent development is the availability of an enzymatic fragmentation product from New England Biolabs called (cleverly?) Fragmentase. Two advantages of this product are that it lends itself very well to high throughput construction of libraries, and that controlled digestion can be used to achieve gentle fragmentation of genomic DNA in order to recover very high MW fragments (3-10 kb, e.g.). Large MW fragments are important in the construction of mate pair libraries, and these can be difficult to produce using other methods. Our reliable protocol-optimizer Marta Matvienko has carried out some experiments in this regard and is willing to share her results; please download here.
Library Quality Control and Quantitation
Library quality is the single most important determinant of the success of your sequencing run, both in terms of the number of reads generated (quantitation) and the validity of the sequence obtained (content). Optimizing this is a hot topic in the community; a Dec '08 paper, Quail et al., from the Sanger Inst. lists a number of improvements over the standard Illumina protocols in library preparation and analysis. If you construct your own libraries you should download this paper and a supplementary methods table for the many practical issues covered. While the newer pipeline software allows analysis of ever-increasing cluster densities, optimal data generation depends on hitting the cluster values of 180K-220K clusters/tile.
Library quantitation is surprisingly challenging: nanodrop readings, picogreen, Agilent bioanalyzer, and qPCR are all used with varying degrees of accuracy. Lately we have settled on the bioanalyzer. This instrument has the advantage of providing a ng/ul value, an accurate molecular weight, and the calculated molarity for each peak, while displaying the presence of other unwanted library components like adapter and primer dimers. For just about all users we recommend bioanalyzer analysis before sequencing. We can run individual samples, and bioanalyzer access is also available through the core in different formats; see here for more details. Quantitation based on the bioanalyzer generally gets us within a good sequence read range (note: a related instrument from Bio-Rad called the Experion performs comparably). We do suggest getting a nanodrop reading first as a rough idea, because that will indicate whether a regular or high sensitivity chip should be used for subsequent more precise analysis.
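For readers who want to sanity-check a reported molarity, the relationship between a ng/ul value and molarity can be sketched as follows. This is a minimal Python illustration that assumes double-stranded DNA and the commonly used approximation of 660 g/mol per base pair; the function name and the example values are hypothetical, and the instrument's own calculation may differ in detail.

```python
AVG_MASS_PER_BP = 660.0  # g/mol per base pair of dsDNA (common approximation)


def library_molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA concentration (ng/ul) and mean fragment length (bp,
    adapters included) to nanomolar.

    ng/ul is 1e-3 g/L; dividing by the molecular weight (g/mol) gives mol/L,
    and multiplying by 1e9 gives nM, hence the overall factor of 1e6.
    """
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and fragment length > 0")
    return conc_ng_per_ul * 1e6 / (AVG_MASS_PER_BP * mean_fragment_bp)


# Hypothetical library: 10 ng/ul with a 300 bp peak
print(round(library_molarity_nM(10, 300), 1))  # 50.5
```

Note that the same mass concentration corresponds to a lower molarity as the fragments get longer, which is why an accurate size estimate matters as much as the ng/ul value.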
Periodically libraries appear not as predominant, well-defined peaks or humps, but as double peaks (one at the correct MW, the other higher), peaks with trailing shoulders, or peaks plus high molecular weight, broad humps. It's not clear what this material is, possibly single strand or heteroduplex conformers, but it does make productive clusters and doesn't necessarily mean the library is not good. It is not accurately quantified by the bioanalyzer, however, so if a substantial proportion of the library appears like this a different quantitative protocol based on qPCR is recommended. Our protocol utilizes SYBR green fluorescence detection during amplification with the library PCR sequences. The amplification efficiency of an uncharacterized library is simultaneously compared with the amplification efficiency of a previously sequenced library. The qPCR assay can also be carried out by individuals with access to a real time PCR machine; you just need a library of known behavior as a standard. The idea is straightforward: generate a standard curve plotting Ct vs. library amount using a known library, then compare a couple of dilutions of your unknown against this standard. You can download our protocol for qPCR quantitation if you want to give it a shot.
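The standard-curve idea can be sketched in a few lines of Python. This is an illustration only and is not taken from the core's qPCR protocol: the function names, the picomolar units, and the example Ct values are all hypothetical. It fits Ct against log10 of the known library amount by least squares, then reads an unknown off the fitted line.

```python
import math


def fit_standard_curve(amounts, cts):
    """Least-squares fit of Ct = slope * log10(amount) + intercept."""
    xs = [math.log10(a) for a in amounts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def amount_from_ct(ct, slope, intercept):
    """Interpolate the amount of an unknown from its Ct on the fitted curve."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope):
    """Efficiency implied by the slope; 1.0 means perfect doubling per cycle."""
    return 10 ** (-1.0 / slope) - 1.0


# Hypothetical ten-fold dilution series of a previously sequenced library (pM)
standard_amounts = [10.0, 1.0, 0.1, 0.01]
standard_cts = [10.00, 13.32, 16.64, 19.96]

slope, intercept = fit_standard_curve(standard_amounts, standard_cts)
unknown_dilution_pM = amount_from_ct(15.0, slope, intercept)
print(slope, intercept, unknown_dilution_pM)
```

Multiply the interpolated value by the dilution factor to recover the stock concentration, and only trust values whose Ct falls inside the range spanned by the standards.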
The perfect number of clusters cannot be guaranteed on the first run; we can only claim to get closer to optimal if bioanalyzer and/or qPCR information is available. Additional runs of the same library are more accurately diluted, since info from the first lane can be used to adjust the concentrations. Both the qPCR and bioanalyzer assays are particularly useful in identifying libraries that should not be run at all. To repeat, we recommend one of these two forms of QC on all libraries.
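As a rough illustration of how first-lane results can guide a re-run, the sketch below rescales the loading concentration toward the middle of the 180K-220K clusters/tile range mentioned earlier. It assumes cluster density scales linearly with loading concentration, which is a simplification that tends to break down at high densities; the function name, the picomolar units, and the example numbers are hypothetical and do not describe the core's actual procedure.

```python
TARGET_CLUSTERS_PER_TILE = 200_000  # midpoint of the 180K-220K range


def adjusted_loading_conc(first_conc_pM, observed_clusters_per_tile,
                          target=TARGET_CLUSTERS_PER_TILE):
    """Scale the loading concentration so the next lane lands near the target,
    assuming clusters/tile is proportional to loading concentration."""
    if first_conc_pM <= 0 or observed_clusters_per_tile <= 0:
        raise ValueError("concentration and cluster count must be positive")
    return first_conc_pM * target / observed_clusters_per_tile


# Hypothetical first lane: loaded at 4 pM, yielded 120,000 clusters/tile
print(round(adjusted_loading_conc(4.0, 120_000), 2))  # 6.67
```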
Library content is also difficult to assess. Care taken during preparation procedures, as outlined in the Sanger Inst. paper and the Illumina protocols, is essential. Gel or bioanalyzer determination of library size is important: inappropriate-sized fragments may lead to problems in cluster formation or sequence contamination, and can even indicate complete absence of intended inserts. Some users have developed their own qPCR tests on libraries to make sure, e.g., that a particular target promoter is present and suitably enriched in a ChIP library before sequencing.
The cores take no official position on the Illumina-recommended validation steps of cloning and (Sanger) sequencing a representative sampling of library molecules.
Cluster Generation and Sequencing by Synthesis
Flow cell preparation on the cluster station is carried out by core facility personnel. Each flow cell has eight lanes, corresponding to eight different libraries. One lane is always reserved for a control (see below, "Did it work?"). We recommend trying out a new library on one lane only. This will establish the amount required for optimal cluster generation in subsequent runs, and should give you a good feel for the quality of the library content. A library typically will provide enough material for many, many lanes so it's not a problem for us to archive and keep running that sample until enough data is produced to satisfy your requirements.
Each library has a particular sequencing primer used in conjunction with that library type. In the core we maintain stocks of the genomic DNA library sequencing primer so this primer does not need to be provided with your samples. However, for other types of libraries, e.g., the small RNA or RNA tag libraries, specific sequencing primers are required. These can be hard to acquire since Illumina does not sell them directly. However, the sequences are available so HPLC purified primers that will function successfully in the sequencing reaction can be purchased from the usual sources. If multiple primer types need to be used in a single flow cell additional charges will be applied due to the extra materials and time involved.
Does Length Matter?
Of course it does! When we got our first GA in August 2007, accurate read lengths barely exceeded 25 bases. The read length you require will naturally depend on your experimental needs, and discussions with the Bioinformatics Core will be essential to focus your decision. Numerous options are theoretically possible in the core, and prices for these options can be downloaded. Here are a few considerations:
A 40 bp single read will provide an unambiguous match to any genome. This read length is therefore suitable for most ChIP-seq and mRNA-seq applications where a reference genome is available.
It can also be suitable for certain SNP discovery applications, where a reference genome is available. It is the read length used for small RNA analyses. It can be suitable for some bacterial genome analyses, reduced representation or hybrid-selected libraries, depending on the application.
Longer single reads are better for SNP discovery, initial assemblies of complex genomes or transcriptomes, and complete assemblies of simpler, smaller genomes. Current chemistry and software, and a consensus from many big centers, suggest 85 bp runs are the best compromise between longest read length and lowest error.
Paired end reads substantially facilitate assemblies of all genome sizes. A combination of single and paired end reads is a good approach for many assembly projects. Paired end reads can also help resolve differences among repeat regions and as a result can be used in transcriptome projects to distinguish family members as well as identify alternative splicing.
Note that these are general guidelines, and examination of the literature coupled with discussions with Bioinformatics personnel will provide information on the particular options most suitable for your application. A short tech note from Illumina on assemblies using their instrument output might be a good place to begin.
Deliverables and Bioinformatics: The GA IIX and SLIMS
Each lane on a flow cell generates huge amounts of raw image data for each synthesis cycle. These images are processed into base calls and alignment files through the Solexa Pipeline, which is immediately initiated following run completion. We typically produce between 15 and 22 million good reads/lane. The total base output for the lane will depend on the number of cycles for that run, and whether it's a paired end read. For example, a 40 cycle single read run with 24 million good reads would give 960 Mb/lane, and a PE 40 cycle run nearly 2 Gb.
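The arithmetic behind these figures is simply reads times cycles, doubled for paired end runs. A small Python sketch (the function name is our own, for illustration) reproduces the example above and can be reused for other run configurations:

```python
def lane_yield_bases(good_reads, cycles, paired_end=False):
    """Total bases from one lane: good reads x cycles, doubled for PE runs."""
    reads_per_cluster = 2 if paired_end else 1
    return good_reads * cycles * reads_per_cluster


single = lane_yield_bases(24_000_000, 40)
paired = lane_yield_bases(24_000_000, 40, paired_end=True)
print(single / 1e6, "Mb")  # 960.0 Mb
print(paired / 1e9, "Gb")  # 1.92 Gb
```

With the more typical 15 to 22 million good reads, the same 40 cycle single read lane works out to 600 to 880 Mb.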
The Illumina pipeline, which generates base calls and alignment information, is run in standard fashion as part of our service. We maintain a number of genomes for alignment purposes; you can check with us to see if we have your genome of interest. For the purposes of the pipeline, a "genome" can be a genuine genome or a collection of EST sequences, unigene assemblies, plasmid sequences, or whatever you want to use to compare against your sequenced reads. Sequences (or links) can be provided to us and we'll prepare them for alignment at no extra cost. These sequences must be in fasta format for us to use.
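Before sending a reference, it can save a round trip to confirm the file really is well-formed fasta. The following Python sketch is a basic sanity check under stated assumptions: the function name is hypothetical, and the allowed alphabet (A, C, G, T, N) is a deliberately strict choice made here, so sequences using other IUPAC ambiguity codes would be rejected. It is not a check the core or the pipeline is known to perform.

```python
def check_fasta(text, alphabet="ACGTN"):
    """Return a list of (record_name, sequence_length) tuples.
    Raise ValueError if the text is not well-formed fasta."""
    allowed = set(alphabet)
    records = []
    name, length = None, 0
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if name is not None:
                if length == 0:
                    raise ValueError(f"record {name!r} has no sequence")
                records.append((name, length))
            fields = line[1:].split()
            if not fields:
                raise ValueError(f"line {lineno}: header has no name")
            name, length = fields[0], 0
        else:
            if name is None:
                raise ValueError(f"line {lineno}: sequence before first header")
            bad = set(line.upper()) - allowed
            if bad:
                raise ValueError(f"line {lineno}: unexpected characters {sorted(bad)}")
            length += len(line)
    if name is None:
        raise ValueError("no fasta records found")
    if length == 0:
        raise ValueError(f"record {name!r} has no sequence")
    records.append((name, length))
    return records


# Hypothetical two-record reference
example = ">plasmid1 test vector\nACGTACGT\nACGT\n>est_42\nNNACGT\n"
print(check_fasta(example))  # [('plasmid1', 12), ('est_42', 6)]
```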
We recommend you download the pipeline manual to familiarize yourself with the pipeline. The new version of the pipeline includes increased tertiary analysis options such as basic assemblies and allele calling. We don't intend to offer this through our cores, but if it's of interest you may want to read up on it in the CASAVA section of the manual. We do have a set of default analysis parameters we use for the pipeline, but custom parameters can often be invoked for no extra charge. A particularly important part of the manual is a description of the output files generated, so you will want to focus on that. You can also download this condensed "cheat sheet" that summarizes information on the most relevant output in the GERALD directory. These files are the raw material for your subsequent analyses, so it's important to understand what's in them.
Following analysis of each run, users have access to parsed output through the Solexa LIMS (SLIMS) created in conjunction with the Bioinformatics Core (to get some idea of the look of this interface, login here and use "craigventer@test.com" as the email and "slimsdemo" as the password). Subsequent download of all the data, including images and all the sequence files, can be arranged through us and the Bioinformatics Core. Users should be prepared to collect and store their own sequence and, if they wish, raw tiff image data, soon after running the sample. Images will be available for one month following data availability at which time they will be AUTOMATICALLY DELETED. If you wish to store images longer you must indicate it on your sample drop off form (see below in the Sample Submission section). Processed data will be stored free of charge for three months. At this time, data will begin accruing storage charges unless you indicate, through the SLIMS user interface, that it can be deleted. Processed data will never be automatically deleted. VERY IMPORTANT: please download and carefully read the Managing your Storage document for the full story.
A SLIMS account will be created for you on your first run, with information about how to set up and access your account distributed via email. Typically you will get this email before your actual files are available; you just need to be a little patient. The main SLIMS page can be reached here.
For the UNIX-savvy user, it is possible to use rsync to acquire all your non-image data directly. Please go to http://bioinformatics.ucdavis.edu/index.php/Archiving_solexa_data to read how.
The Bioinformatics Core is continually gaining expertise manipulating and analyzing these large sequence data sets, so for any assistance downstream of the initial basecalling these are the people to talk to.
For the do-it-yourselfer, many shareware options exist to carry out analyses on Illumina sequence data. This section is not intended to provide a comprehensive list of sites offering such software, but instead presents some of the tools developed and used by researchers here at the Genome Center (although there is a Cold Spring Harbor site that has been highly recommended; check out http://hannonlab.cshl.edu/fastx_toolkit/). The links provide varied amounts of documentation, and have been only minimally explored by core personnel. In other words, you're pretty much on your own!
Software developed in the Farnham lab and placed online by the Ludaescher group allows direct upload of ELAND files into a program called Sole Search for ChIP-seq analysis. The site can be accessed here:
http://chipseq.genomecenter.ucdavis.edu/cgi-bin/chipseq.cgi
Output from this program (and others) can be used to carry out motif analysis, using web accessible tools from the Jin lab at Ohio State (Victor Jin is a Farnham lab alum). Access this functionality at:
http://motif.bmi.ohio-state.edu/ChIPMotifs/
The Michelmore lab has also developed tools for processing Illumina data. Two sites to visit are:
http://code.google.com/p/atgc-illumina/
http://code.google.com/p/atgc-illumina/source/browse/#svn/trunk
Documentation is a little sparse on these last two sites; however, there is a PowerPoint presentation that explains the basic idea of their usage and more:
http://atgc-illumina.googlecode.com/files/ILLUPA_Overview_AKozik_090910_D.pdf
Have fun!
Prices
Prices for sequencing depend on the number of cycles requested (i.e., how long a sequence you want) and whether the runs are paired end (PE). Our official posted charges are here. However, additional options and services are continually becoming available, so for the latest prices and offered services please download this document. The listed fees include all the labor and reagents for cluster generation, cycle sequencing, and initial base calling analysis using the standard Solexa Pipeline. Failed lanes on an otherwise good run (as evidenced by the behavior of the control lane) will be charged full price. Even from sub-par runs, some information such as library integrity can be obtained; of course, if re-runs are necessary they will be carried out as soon as possible and not moved to the end of the queue.
While initial base calling is currently part of the service fee, the Bioinformatics Core has developed a menu of services relating to the access, manipulation and analysis of sequence data from these instruments. Please contact them for help in analyzing your data.