Genome Center

Ultra High Throughput Sequencing       
Updated 11/04/09

Using a massively parallel sequencing approach, the Illumina Genome Analyzer (GA) can generate billions of bases of high quality sequence data in a single run. The system uses Solexa sequencing technology and novel reversible terminator chemistry, optimized to achieve unprecedented levels of cost effectiveness and throughput. More information describing this instrument and its application is available at two Illumina web sites, here and also here.

The DNA Technologies and Expression Analysis Cores of the Genome Center operate two of these instruments at GA IIX status: the latest version with up-to-date optics, electronics, and analytic capabilities.  One of our GA IIX instruments is equipped with a paired end module that allows sequencing from both ends of a defined-size DNA molecule.  Data produced from such paired end reads greatly facilitate assemblies and investigation of genomic structure, and this capability, coupled with chemistry improvements allowing longer accurate reads, has expanded the utility of this platform for de novo assemblies. Illumina maintains a site with the most up-to-date citations about the experiments being done with these instruments, and we're always happy to discuss with you what's new and what's available at our core facilities.
Steps involved in a GA experiment can be broken down into a series of experimental manipulations, instrument runs, and data analyses. These steps include creation of a sequencing library, seeding and preparation of the flow cell on the cluster station, sequencing by synthesis, and bioinformatics.

The Sequencing Library
Library Quality Control
Cluster Generation and Sequencing by Synthesis
Does Length Matter?
Did It Work?
Deliverables and Bioinformatics:  The GA IIX and SLIMS
Other Analysis Options
Prices
Sample Submission and Scheduling
In Closing

The Sequencing Library

A high quality sequencing library is critical to obtaining good data, and a variety of approaches for different types of library preparations continues to develop. The basic approach is the same--obtain double stranded DNA and ligate adapters onto the ends to allow flow cell seeding and subsequent cluster generation.

DNA. We offer DNA library preparation services, with performance specifications depending on the source material.  Genomic DNA, double strand cDNA libraries, BACs, or other material available in microgram quantities will generate quality libraries nearly every time.  Libraries can also be constructed from much less input material, e.g., chromatin-immunoprecipitated product or hybridization-selected DNAs, but in these technically challenging situations outcomes are not guaranteed.  The following guidelines should be observed for submission of library-worthy material:  for chromosomal, BAC, or related DNA libraries, provide 5 ug of high quality DNA (concentration > 100 ng/ul, OD 260/280 close to 1.8) in a neutral solution such as EB, TE ([EDTA] = 0.1 mM) or water.

RNA. We also offer mRNA seq library preparation.  In this protocol, polyA RNA is purified from total RNA, then fragmented, converted to ds cDNA, and ligated to the Illumina adapters.  Starting material is ideally 10 ug of total RNA (at least 200 ng/ul), analyzed by gel or bioanalyzer to ensure integrity.  We do not offer small RNA library construction services, but we do run these libraries on our sequencers if the sequencing primer is provided.

Users interested in making their own libraries to expedite their studies and save money should download and read the current protocols for different kits offered by Illumina (see below). Ordering information for kits can be provided on request. DNA library production utilizes straightforward molecular biological techniques, and many groups have successfully produced libraries from protocols using assembled or kitted enzymes from other companies.  An example of one such protocol for ChIP-seq DNA library construction is also provided below.  We can provide (i.e., sell you) Illumina produced, purified, and certified oligonucleotides for construction of both single- and paired-end (PE) read libraries.  Oligonucleotides from commercial vendors have also been used successfully for sequencing library preparation.  Since the PE libraries work on both the single read and PE read flow cells, many users are opting to go with PE primers for library construction for maximum flexibility.

Making libraries with the Illumina mRNA seq kit is also a straightforward process.  Other specialized RNA library kits are available from Illumina; check out their web site for details.  Normalized mRNA libraries, those made following removal of highly abundant messages, are offered as a contract service; please inquire if this is of interest.

Library quality control is essential and merits its own section below.

Illumina Kit Protocols

The latest versions of Illumina validated protocols for all their kits are now at your fingertips:

Go to http://www.illumina.com/ftp.ilmn
Log on with
      Username: guest
      Password: illumina
Select the folder(s) of interest, in this case probably "Genome Analyzers."

Homemade Protocols

Genomic DNA library:  This protocol is based on the Illumina kit protocol but uses the New England Biolabs reagents (see below).  This is the protocol used in the core sequencing library workshops.

ChIP-seq DNA library:
Protocol describing ChIP-seq DNA library construction using commercially available enzymes coupled with the Illumina oligonucleotides from the library kit.  Protocol provided by Dr. Ghia Euskirchen.  We can provide the purified, certified Illumina oligonucleotides for this protocol; you just need to assemble the enzymes and other required materials.  Both standard genomic and paired end primers will be available.

Annotated ChIP-seq DNA library:  Variation of the above protocol, with embedded comments regarding alternative procedures for genomic DNA and updated enzyme information.

Normalized RNA library: Not a detailed protocol, but an overview of a successful workflow for generating a normalized library.

We have not extensively validated these protocols. They do work for people but, as with any protocol, they're not necessarily perfected or exactly adaptable to your situation.

New England Biolabs sells a kit containing all the enzymes required for library construction.  We have beta-tested these kits and they work well.  They are economical and conveniently packaged.  For more information on these and other NEB library-related products, please visit this site.

Other Resources

ChIP-Seq Data Technical Note and ChIP-Seq DataSheet. Two informational documents from Illumina describing the uses and applications of sequencing chromatin immunoprecipitation products.  Good overviews with useful references.

Other Libraries

Illumina sells kits for mate-pair and indexing library construction (use the link above to get the protocols).  Neither library type is generated as a core service yet, but we can provide ordering information and also library construction guidance.

Mate pair kits are for creating paired end libraries spanning distances greater than the approximately 600 bp maximum for cluster formation.  DNA fragments of defined long length (800 bp to 10 kb or so) are generated, then circularized.  The DNA is again fragmented and the junction region selectively cloned.  Sequencing either end of the junction region effectively generates mate pair sequence separated by the size of the originally created fragments.

Indexing libraries allow for the sequencing of multiple libraries in a single lane, i.e., multiplexing. This strategy is an economical option for analysis of smaller genomes or other situations in which the typical lane output is greater than required.  Short nucleotide "bar codes" are appended to each library, and the libraries are then pooled and sequenced.  Deconvoluting the bar codes informatically allows multiple libraries to be sequenced in a single lane at a potential cost and time saving.  To date, two methods have been exploited for this:  using the Illumina indexing kit or synthesizing your own adapter oligos with your own bar codes.  Illumina indexing technology has several operational (not scientific) drawbacks we can discuss with you.  Homemade indexing can be utilized successfully but there are parameters related to the selection of the bar code sequences that MUST be discussed with us for proper downstream analysis.  For a sample of bar codes that have been used and a protocol for adapter preparation with these bar codes, download this document.  These particular sequences have the advantage of enough inherent redundancy so that if one or two nucleotides are lost, the bar code can still be unambiguously assigned.
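
To give a feel for what "deconvoluting the bar codes informatically" involves, here is a minimal Python sketch. It is not the core's or Illumina's demultiplexing code. It assumes the bar code occupies the first bases of each read, it only tolerates miscalled bases (substitutions, not insertions or deletions), and the three bar codes and sample names are made up for illustration; they are not the sequences in the downloadable document.

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two bar codes in the set.

    A set with minimum distance d can tolerate (d - 1) // 2 miscalled
    bases per bar code and still be assigned unambiguously.
    """
    codes = list(barcodes.values())
    return min(hamming(a, b)
               for i, a in enumerate(codes)
               for b in codes[i + 1:])


def assign_barcode(read, barcodes, max_mismatches=1):
    """Return the sample whose bar code best matches the start of the read.

    Returns None when the best match exceeds max_mismatches or when two
    bar codes tie for best match (ambiguous read).
    """
    best_name, best_dist, tie = None, None, False
    for name, code in barcodes.items():
        prefix = read[:len(code)]
        if len(prefix) < len(code):
            continue  # read is shorter than this bar code
        dist = hamming(prefix, code)
        if best_dist is None or dist < best_dist:
            best_name, best_dist, tie = name, dist, False
        elif dist == best_dist:
            tie = True
    if best_dist is None or best_dist > max_mismatches or tie:
        return None
    return best_name


# Hypothetical bar codes, for illustration only.
EXAMPLE_BARCODES = {
    "sample_A": "ACGTAC",
    "sample_B": "TGCATG",
    "sample_C": "GATCCA",
}
```

Checking `min_pairwise_distance` on a candidate set before ordering oligos is one way to see how much redundancy the set really has, which is the kind of parameter we want to discuss with you beforehand.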

Hybridization Selection/Sequence Capture libraries are those in which particular genomic regions are pre-enriched prior to indexed library generation and sequencing.  This strategy allows focused, very deep sequencing and can be implemented for a number of applications. A number of interesting articles exploiting this technique have recently appeared; check out the Illumina publications site for details.  A number of companies offer services or platforms that can generate such material, including NimbleGen, Agilent, RainDance, Febit, and Fluidigm.  A company called Expression Analysis (no relation to our core) has a service for sequence capture using the Agilent and RainDance platforms (one or the other is preferred depending on the particular experiment). We have no direct experience with any of the platforms, so consider this informational and not an endorsement or recommendation.

More on Fragmentation


DNA to be made into a sequencing library must first be converted into small fragments.  There are several methods for doing this, each with attendant pros and cons.  The original protocols utilized nebulization for fragmentation, and anyone desiring to pursue this technique can contact us for extra nebulization units because we abandoned this method very early.  A related methodology, Hydroshear, can be accessed through the CAES Genomics facility.  We primarily use a Diagenode Bioruptor because we are familiar with its operation through using it in chromatin immunoprecipitation experiments.  Access to this instrument is available through the core, with the usual training and signup guidelines for core-available equipment in effect.  Many protocols and centers rely on and recommend a fragmentation device from Covaris, which uses adaptive focused acoustics to break the DNA down into appropriately sized fragments.  The core does not have imminent plans to acquire this device since in our opinion it does not offer a substantial advantage over the Bioruptor. A recent development is the availability of an enzymatic fragmentation product from New England Biolabs called (cleverly?) Fragmentase. Two advantages of this product are that it lends itself very well to high throughput construction of libraries, and that controlled digestion can be used to achieve gentle fragmentation of genomic DNA in order to recover very high MW fragments (3-10 kb, e.g.).  Large MW fragments are important in the construction of mate pair libraries, and these can be difficult to produce using other methods.  Our reliable protocol-optimizer Marta Matvienko has carried out some experiments in this regard and is willing to share her results; please download here.

Library Quality Control

Library quality is the single most important determinant of the success of your sequencing run, both in terms of the number of reads generated (quantitation) and the validity of the sequence obtained (content).   Optimizing this is a hot topic in the community; a Dec '08 paper, Quail et al., from the Sanger Inst. lists a number of improvements over the standard Illumina protocols in library preparation and analysis.  If you construct your own libraries you should download this paper and a supplementary methods table for the many practical issues covered.  While the newer pipeline software allows analysis of ever-increasing cluster densities, optimal data generation depends on hitting the cluster values of 180K-220K clusters/tile.

Library quantitation can be surprisingly difficult to assess: nanodrop readings, picogreen, Agilent bioanalyzer, and qPCR are all used with varying degrees of accuracy.  Lately we have settled on the bioanalyzer.   This instrument has the advantage of providing a ng/ul value, an accurate molecular weight, and the calculated molarity for each peak, while displaying the presence of other unwanted library components like adapter and primer dimers.  Use of a new high sensitivity chip from Agilent allows visualization and quantitation of very low concentrations, on the order of 1 ng/ul or less.  For just about all users we recommend bioanalyzer analysis before sequencing.  We can run individual samples, and bioanalyzer access is also available through the core in different formats; see here for more details.  The disadvantage of using the instrument yourself is that the DNA chip requires at least 11 samples; it is typically more convenient and cost effective to allow us to run your library with other samples.   Quantitation based on the bioanalyzer does not always give the perfect number of clusters but it has gotten us closer than other methods.  It actually does help to get a nanodrop reading first as a rough idea, because that will indicate whether a regular or high sensitivity chip should be used for subsequent more precise analysis.
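
If you only have a ng/ul value and a mean fragment size, the molarity can be estimated by hand. The sketch below is a generic conversion, not the bioanalyzer's own calculation. It assumes double-stranded DNA at an average of about 660 g/mol per base pair, and the function name is ours.

```python
# Approximate average mass of one base pair of double-stranded DNA (g/mol).
AVG_BP_MASS = 660.0


def library_molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Estimate the molarity (nM) of a dsDNA library.

    conc_ng_per_ul   -- mass concentration in ng/ul
    mean_fragment_bp -- mean fragment length in bp, adapters included
    """
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and length > 0")
    # 1 ng/ul = 1e-3 g/L; dividing by g/mol gives mol/L; 1e9 converts to nM.
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)
```

Because the result scales inversely with fragment length, a library with a broad size distribution is hard to quantify from a single mean value, which is one reason a per-peak molarity from the bioanalyzer is useful.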

A qPCR-based library QC assay similar to the one described in the Sanger Inst. paper may be more useful for very low concentration libraries or those consistently showing a dispersed molecular weight spread by bioanalyzer (the appearance of such material does not necessarily indicate a "bad" library; some controversy exists as to the exact nature of this material, ask us to BS about it if interested).  Our protocol utilizes SYBR green fluorescence detection during amplification with the library PCR sequences.  The amplification efficiency of an uncharacterized library is simultaneously compared with the amplification efficiency of a previously sequenced library. The qPCR assay can also be carried out by individuals with access to a real time PCR machine; you just need a library of known behavior as a standard.  The idea is straightforward--generate a standard curve plotting Ct vs. library amount using a known library, then compare a couple of dilutions of your unknown against this standard. You can download our protocol for qPCR quantitation if you want to give it a shot.
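
The standard curve idea can be sketched in a few lines of Python. This is an illustration rather than our downloadable protocol. It assumes the usual convention of fitting Ct against log10 of the library amount by least squares, and the function names are ours.

```python
import math


def fit_standard_curve(amounts, cts):
    """Least-squares fit of Ct = slope * log10(amount) + intercept.

    amounts -- known amounts of the standard library (any consistent unit)
    cts     -- measured Ct value for each amount
    """
    if len(amounts) != len(cts) or len(amounts) < 2:
        raise ValueError("need at least two (amount, Ct) pairs")
    xs = [math.log10(a) for a in amounts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def amplification_efficiency(slope):
    """Efficiency implied by the slope; 1.0 means doubling every cycle."""
    return 10 ** (-1.0 / slope) - 1.0


def estimate_amount(ct, slope, intercept, dilution_factor=1.0):
    """Amount of the undiluted unknown, read off the standard curve."""
    return 10 ** ((ct - intercept) / slope) * dilution_factor
```

Running a couple of dilutions of the unknown and checking that the back-calculated undiluted amounts agree is a quick sanity check that the unknown amplifies with an efficiency similar to the standard.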

The perfect number of clusters is never guaranteed on the first run; we can only claim to get closer to optimal than if dilutions are based on a nanodrop reading alone. Additional runs of the same library are more accurately diluted, since info from the first lane can be used to adjust the concentrations.   Both the qPCR and bioanalyzer assays are particularly useful in identifying libraries that should not be run at all.  As a result, we recommend one of these two forms of QC on all libraries.
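
As a rough illustration of how a first lane informs the next dilution, the sketch below rescales the loading concentration toward the 180K-220K clusters/tile window mentioned above. It assumes cluster density is roughly proportional to loading concentration, which is only an approximation, and the picomolar unit and function names are our own choices for the example.

```python
# Target window for clusters per tile, taken from the text above.
TARGET_LOW = 180_000
TARGET_HIGH = 220_000


def in_target_window(clusters_per_tile):
    """True if the observed density falls inside the target window."""
    return TARGET_LOW <= clusters_per_tile <= TARGET_HIGH


def adjusted_loading_conc(previous_conc_pM, observed_clusters_per_tile,
                          target_clusters_per_tile=200_000):
    """Suggest a loading concentration for the next lane.

    Assumes (approximately) that clusters/tile scales linearly with the
    concentration loaded onto the flow cell.
    """
    if observed_clusters_per_tile <= 0:
        raise ValueError("observed cluster count must be positive")
    scale = target_clusters_per_tile / observed_clusters_per_tile
    return previous_conc_pM * scale
```

For example, a lane loaded at 4 pM that yielded 100K clusters/tile would suggest trying about 8 pM next time, under the linearity assumption.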

Library content is also difficult to assess.  Care taken during preparation procedures, as outlined in the Sanger Inst. paper and the Illumina protocols, is essential.  Gel or bioanalyzer determination of library size is important: inappropriate-sized fragments may lead to problems in cluster formation or sequence contamination, and can even indicate complete absence of intended inserts.   Some users have developed their own qPCR tests on libraries to make sure, e.g., that a particular target promoter is present and suitably enriched in a ChIP library before sequencing.

The cores take no official position on the Illumina-recommended validation steps of cloning and (Sanger) sequencing a representative sampling of library molecules. 

Cluster Generation and Sequencing by Synthesis

Flow cell preparation on the cluster station is carried out by core facility personnel.  Each flow cell has eight lanes, corresponding to eight different libraries.  One lane is always reserved for a control (see below, "Did It Work?").   We recommend trying out a new library on one lane only.  This will establish the amount required for optimal cluster generation in subsequent runs, and should give you a good feel for the quality of the library content.  A library typically will provide enough material for many, many lanes so it's not a problem for us to archive and keep running that sample until enough data is produced to satisfy your requirements.

Each library has a particular sequencing primer used in conjunction with that library type.  In the core we maintain stocks of the genomic DNA library sequencing primer so this primer does not need to be provided with your samples.  However, for other types of libraries, e.g., the small RNA or RNA tag libraries, specific sequencing primers are required.  These can be hard to acquire since Illumina does not sell them directly.  However, the sequences are available so HPLC purified primers that will function successfully in the sequencing reaction can be purchased from the usual sources.  If multiple primer types need to be used in a single flow cell additional charges will be applied due to the extra materials and time involved.

Does Length Matter?

Of course it does!  When we got our first GA in August 2007, accurate read lengths barely exceeded 25 bases.  This was long enough to do many types of experiments but clearly limiting for others.  With improvements in chemistry, hardware, and software, read lengths up to 100-120 bp are achievable (note: at these read lengths familiarity with quality filtering is essential since error rates are higher).  This puts the output from our instruments, especially in conjunction with paired end reads, into a range suitable for de novo assembly of complex genomes.  The read length you require will naturally depend on your experimental needs, and discussions with the Bioinformatics Core will be essential to focus your decision.  Because of the new chemistries the read lengths on our price list are no longer accurate, and due to certain restrictions the newly available prices can't be displayed (but can be downloaded).  However, the new options are 20, 40, 60, 80, 100, and 120 bp in both single read and paired end read format. We are trying to develop general guidelines for length and "paired-endedness" depending on the application but this is a moving target.  Here are a few considerations:

A 40 bp single read will, in most cases, provide an unambiguous match to a reference genome outside of repetitive regions. This read length is therefore suitable for most ChIP-seq and mRNA-seq applications where a reference genome is available.
It can also be suitable for certain SNP discovery applications, where a reference genome is available.  It is the read length used for small RNA analyses.  It can be suitable for some bacterial genome analyses, reduced representation or hybrid-selected libraries, depending on the application.

Longer single reads are better for SNP discovery,  initial assemblies of complex genomes or transcriptomes, and complete assemblies of simpler, smaller genomes. There are no established guidelines yet on the differences between 80, 100, and 120 for assemblies.  The longer read lengths have many more errors, but the ability to have some long reads to help collapse contigs may offset this disadvantage and in the long run be more economical.

Paired end reads substantially facilitate assemblies of all genome sizes.  A combination of single and paired end reads is a good approach for many assembly projects.  Paired end reads can also help resolve differences among repeat regions and as a result can be used in transcriptome projects to distinguish family members as well as identify alternative splicing.

Note that these are general guidelines, and examination of the literature coupled with discussions with Bioinformatics personnel will provide information on the particular options most suitable for your application.   A short tech note from Illumina on assemblies using their instrument output might be a good place to begin.

Did It Work?

The question of what defines a "good" sequence is of some interest.  Every sequencing run we do contains a control lane of phiX174 DNA. All users have access to data generated in the control lane.  Based on the behavior of this commercially available library DNA, we determine whether the run meets our quality specifications.  This is evidenced by certain metrics: percent of the clusters that pass quality filtering, number of sequences that align to the phiX reference genome, and the overall percent error rate for those aligning phiX sequences.  The summary report of the run (in a file called Summary.htm) is available on SLIMS so you can see these metrics on the control for yourself.  Depending on your library and the quality of the reference genome used in your alignment, you may not have access to these same metrics, so the behavior of the control lane can be informative.  Even if you don't do an alignment or have a good reference genome there are other indicators of quality.  At the gross level, there is base composition, which tells you, e.g., whether the genome you sequenced matches the genome of your organism, or whether there's a preponderance of linker/adapter sequences in your output.  The Summary.htm file reports the percent of your clusters that passed filtering, which is indicative of the overall quality of the run.  Unfiltered reads still contain good data, however, and we recommend NOT restricting your analyses to filtered reads.  At a deeper level, quality scores for each base in each read are contained within some of the files themselves (see the pipeline manual for more info), although parsing the data at this level requires more bioinformatic knowledge.
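
The gross-level checks described above need very little code. The following Python sketch works on a plain list of read sequences and makes no assumption about the pipeline's file formats, so getting the reads out of the output files is left to you. The adapter fragment is something you supply for your own library type; none is built in.

```python
from collections import Counter


def base_composition(reads):
    """Fraction of each base (including N) across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(read.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: counts[base] / total for base in sorted(counts)}


def gc_fraction(reads):
    """Combined G + C fraction, for comparison with your organism's GC content."""
    comp = base_composition(reads)
    return comp.get("G", 0.0) + comp.get("C", 0.0)


def fraction_with_adapter(reads, adapter_fragment):
    """Fraction of reads that contain the given adapter fragment anywhere."""
    reads = list(reads)
    if not reads:
        return 0.0
    fragment = adapter_fragment.upper()
    hits = sum(fragment in read.upper() for read in reads)
    return hits / len(reads)
```

A GC fraction far from the expected value for your organism, or a large share of reads containing adapter, is worth raising with us before committing more lanes to that library.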

Deliverables and Bioinformatics:  The GA IIX and SLIMS

Each lane on a flow cell generates huge amounts of raw image data for each synthesis cycle.  These images are processed into base calls and alignment files through the Solexa Pipeline, which is immediately initiated following run completion.  Depending on the accuracy of library quantitation, we can achieve up to 24 million good reads/lane.  The total output will depend on the number of cycles for that run, and whether it's a paired end read.  For example, a 40 cycle single read run with 24 million good reads would give 960 Mb/lane, and a PE 40 cycle run nearly 2 Gb.
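
The arithmetic behind these numbers is simple enough to check for your own run configuration. The helper below is just that arithmetic written out in Python; the function name is ours.

```python
def lane_yield_bases(good_reads, cycles, paired_end=False):
    """Total bases per lane: good reads x cycles, doubled for paired end."""
    reads_per_cluster = 2 if paired_end else 1
    return good_reads * cycles * reads_per_cluster


# The example from the text: 24 million good reads at 40 cycles.
single_read = lane_yield_bases(24_000_000, 40)                 # 960,000,000 bases
paired = lane_yield_bases(24_000_000, 40, paired_end=True)     # 1,920,000,000 bases
```

Dividing the lane yield by your genome size gives a first estimate of average coverage per lane, keeping in mind that 24 million good reads is the upper end of what a lane delivers.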

To familiarize yourself with the pipeline, you should download the manual.  The new version of the pipeline includes increased tertiary analysis options such as basic assemblies and allele calling.  We don't intend to offer this through our cores, but if it's of interest you may want to read up on it in the CASAVA section of the manual. We do have a set of default analysis parameters we use for the pipeline, but custom parameters can often be invoked for no extra charge. A particularly important part of the manual is a description of the output files generated, so you will want to focus on that. You can also download this condensed "cheat sheet" that summarizes information on the most relevant output in the GERALD and Bustard directories. These files are the raw material for your subsequent analyses, so it's important to understand what's in them. 

Following analysis of each run, users have access to parsed output through the Solexa LIMS (SLIMS) created in conjunction with the Bioinformatics Core (to get some idea of the look of this interface, login here and use "craigventer@test.com" as the email and "slimsdemo" as the password).  Subsequent download of all the data, including images and all the sequence files, can be arranged through us and the Bioinformatics Core.  Users should be prepared to collect and store their own sequence and, if they wish, raw tiff image data, soon after running the sample.  Images will be available for one month following data availability, at which time they will be AUTOMATICALLY DELETED.  If you wish to store images longer you must indicate it on your sample drop off form (see below in the Sample Submission section).  Processed data will be stored free of charge for three months.  After that, data will begin accruing storage charges unless you indicate, through the SLIMS user interface, that it can be deleted.  Processed data will never be automatically deleted.  VERY IMPORTANT:  please download and carefully read the Managing your Storage document for the full story.

A SLIMS account will be created for you on your first run, with information about how to set up and access your account distributed via email. Typically you will get this email before your actual files are available, so you just need to be a little patient. The main SLIMS page can be reached here.

For the UNIX-savvy user, it is possible to use rsync to acquire all your non-image data directly. Please go to http://bioinformatics.ucdavis.edu/index.php/Archiving_solexa_data to read how.

The Bioinformatics Core is continually gaining expertise manipulating and analyzing these large sequence data sets, so for any assistance downstream of the initial basecalling these are the people to talk to.

Other Analysis Options

For the do-it-yourselfer, many shareware options exist to carry out analyses on Illumina sequence data. This section is not intended to provide a comprehensive list of sites offering such software, but instead presents some of the tools developed and used by researchers here at the Genome Center (although there is a Cold Spring Harbor site that has been highly recommended; check out http://hannonlab.cshl.edu/fastx_toolkit/).  The links provide varied amounts of documentation, and have been only minimally explored by core personnel.  In other words, you're pretty much on your own!

Software developed in the Farnham lab and placed online by the Ludaescher group allows direct upload of ELAND files into a program called Sole Search for ChIP-seq analysis.  The site can be accessed here:

http://chipseq.genomecenter.ucdavis.edu/cgi-bin/chipseq.cgi

Output from this program (and others) can be used to carry out motif analysis, using web accessible tools from the Jin lab at Ohio State (Victor Jin is a Farnham lab alum).  Access this functionality at:

http://motif.bmi.ohio-state.edu/ChIPMotifs/

The Michelmore lab has also developed tools for processing Illumina data.  Two sites to visit are:

http://code.google.com/p/atgc-illumina/


http://code.google.com/p/atgc-illumina/source/browse/#svn/trunk

Documentation is a little sparse on these last two sites; however, there is a PowerPoint presentation that explains the basic idea of their usage and more:

http://atgc-illumina.googlecode.com/files/ILLUPA_Overview_AKozik_090910_D.pdf

Have fun!

Prices

Prices for sequencing depend on the number of cycles requested (i.e., how long a sequence you want) and whether the runs are paired end (PE). Our official posted charges are here.  However, additional options and services are continually becoming available, so for the latest prices and offered services please download this document.  The listed fees include all the labor and reagents for cluster generation, cycle sequencing, and initial base calling analysis using the standard Solexa Pipeline.  Failed lanes on an otherwise good run (as evidenced by the behavior of the control lane) will be charged full price.  Even from sub-par runs, some information such as library integrity can be obtained; of course, if re-runs are necessary they will be carried out as soon as possible and not moved to the end of the queue.

While initial base calling is currently part of the service fee, the Bioinformatics Core has developed a menu of services relating to the access, manipulation and analysis of sequence data from these instruments.  Please contact them for help in analyzing your data.


Sample Submission and Scheduling

All sample drop offs must be accompanied by our submission form; please obtain a pdf or doc version.  The same form is to be used for submitting material to be made into libraries as well as for submitting libraries ready for sequencing.  Please contact us if you have any questions about the required information, but when you drop off your form we will go over it with you. It is essential that you fill out all the information!

Once your library is made and quantified, you should bring it over as soon as possible to get the next available slot.  We don't schedule reservations for the instrument, but set up runs as we fill up the seven lanes on a flow cell. Turnaround time for any sample is difficult to predict very far in advance since it depends on the requested option and what happens to be in the queue.  We can generally give some idea when your sample is delivered.   Two things are certain:  (a) the sooner you drop off a library, the sooner you will get data back, and (b) we will stay in contact and let you know the status of your project.  However, we do encourage scheduling discussions at the outset of your project since, depending on the requirements of your run, we may need to order particular reagents.

We have a number of available genomes and EST databases we can use for alignment, including human, mouse, zebrafish, Arabidopsis, and Drosophila strains.  We can also take your genome of interest, or any collection of sequences you would like to align your sequences against, and generate the appropriate files required by the pipeline for no extra charge. Please provide us with the sequences in fasta format when dropping off your sample.  Re-alignments done after the initial pipeline run will be subject to extra charges.
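
Before handing over a reference, it can save a round trip to confirm that the file really is well-formed fasta. The Python sketch below is a generic sanity check and not part of the pipeline. It takes the first word of each header line as the sequence name, and the function name is ours.

```python
def read_fasta(lines):
    """Parse fasta-formatted lines into a {name: sequence} dictionary.

    Raises ValueError for sequence data before the first header, empty
    headers, duplicate names, or records with no sequence.
    """
    records = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            if not header:
                raise ValueError("empty header line")
            name = header[0]
            if name in records:
                raise ValueError("duplicate sequence name: " + name)
            chunks = []
        else:
            if name is None:
                raise ValueError("sequence data before first header")
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    empty = [n for n, seq in records.items() if not seq]
    if empty:
        raise ValueError("records with no sequence: " + ", ".join(empty))
    return records
```

An open file handle can be passed directly as `lines`, and the returned dictionary gives you the sequence names and lengths to list on your submission form.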



In Closing . . .

The ability to utilize the Genome Analyzer as an extension of individual research programs will lead to unprecedented amounts of data and enable novel areas of investigation. However, because of the cost and complexity of the process, careful thought should be given to experimental design. We would be happy to discuss this with you.

Please watch this site for continued updates!