

Understanding your @10Xgenomics Cell Ranger reports

By (James@cancer) from The Genomics Core blog. Published on Feb 02, 2017.

The Cell Ranger analysis provided by 10X is an excellent start to understanding what might be going on in the single cells you just sequenced. It allows some basic QC and this can help determine how well your experiment is working. There is a high degree of variability in the number of cells captured and capture efficiency, but right now we cannot easily see if this is down to the sample (most likely) or the technology.

Some of the metrics are easy to interpret, e.g. the ‘Estimated Number of Cells’ (how many single cells were captured) – the more the merrier! Others need to be compared across runs to determine what the “correct” parameters for an experiment might be, e.g. the current 10X recommendation for ‘Mean Reads per Cell’ is 50,000, but you may find that more, or fewer, reads are required for your samples. You can use the other metrics, such as ‘Median Genes per Cell’ or ‘Sequencing Saturation’, to help determine whether more or less sequencing depth is required.

The most important metrics: 10X help by making the most important stuff big. You should already have an idea of the number of cells you expected to capture (because you carefully counted your cells before starting, didn't you?); hopefully the ‘Estimated Number of Cells’ matches what you were aiming for. Ideally this would be the same across your project, but it is likely to be quite variable if the cell types are very different.

The ‘Mean Reads per Cell’ and ‘Sequencing Saturation’ both tell you whether you've over-sequenced. Our recommendation is to run a single lane on HiSeq 4000 first and to use these numbers to determine whether more sequencing is worth it or not. Diving in for a lane per sample might turn out to be an expensive mistake (as it was in the example above).

The ‘Median Genes per Cell’ is likely to become a key metric for users. We've become used to detecting 10,000-15,000 genes in microarray and RNA-seq experiments on bulk tissue. What the figure is for single-cell remains to be seen. However, it is likely to be quite cell-type specific, and is also likely to increase as methods capture more of the transcripts.

The ‘Sequencing’ table metrics explained:
  • ‘Number of Reads’ equals the total number of single-end reads that were sequenced.
  • ‘Valid Barcodes’ equals the fraction of reads with barcodes that match the whitelist.
  • ‘Reads Mapped Confidently to Transcriptome’ equals the fraction of reads that mapped to a unique gene in the transcriptome with a high mapping quality score as reported by the aligner.
  • ‘Reads Mapped Confidently to Exonic/Intronic/Intergenic Regions’ equals the fraction of reads that mapped to the exonic/intronic/intergenic regions of the genome with a high mapping quality score as reported by the aligner.
  • ‘Sequencing Saturation’ equals the fraction of reads originating from an already-observed UMI. This is a function of library complexity and sequencing depth. More specifically, this is the fraction of confidently mapped, valid cell-barcode, valid UMI reads that had a non-unique (cell-barcode, UMI, gene). This metric was called "cDNA PCR Duplication" in versions of Cell Ranger prior to 1.2.
  • ‘Q30 Bases in Barcode/Sample Index/UMI Read’ equals the fraction of bases with Q-score at least 30 in the cell barcode/sample index/unique molecular identifier sequences.
  • ‘Q30 Bases in RNA Read’ equals the fraction of bases with Q-score at least 30 in the RNA read sequences. This is Illumina R1 for the Single Cell 3' v1 chemistry and Illumina R2 for the Single Cell 3' v2 chemistry.
  • ‘Estimated Number of Cells’ equals the total number of barcodes associated with cell-containing partitions, estimated from the barcode count distribution.
  • ‘Fraction Reads in Cells’ equals the fraction of barcoded, confidently mapped reads with cell-associated barcodes.
  • ‘Mean Reads per Cell’ equals the total number of sequenced reads divided by the number of barcodes associated with cell-containing partitions.
  • ‘Median Genes per Cell’ equals the median number of genes detected per cell-associated barcode. Detection is defined as the presence of at least 1 UMI count.
  • ‘Total Genes Detected’ equals the number of genes with at least one count in any cell.
  • ‘Median UMI Counts per Cell’ equals the median number of UMI counts per cell-associated barcode.
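
The ‘Sequencing Saturation’ definition above can be written down directly. Here is a minimal sketch (not Cell Ranger's actual implementation): count reads whose (cell barcode, UMI, gene) triple has already been seen, as a fraction of all confidently mapped, valid-barcode, valid-UMI reads.

```python
# Sketch of the 'Sequencing Saturation' metric described above.
# NOT Cell Ranger's code - an illustration of the definition only.

def sequencing_saturation(reads):
    """reads: iterable of (cell_barcode, umi, gene) triples for
    confidently mapped, valid-barcode, valid-UMI reads."""
    seen = set()
    duplicates = 0
    total = 0
    for triple in reads:
        total += 1
        if triple in seen:
            duplicates += 1  # non-unique triple = already-observed UMI
        else:
            seen.add(triple)
    return duplicates / total if total else 0.0

reads = [
    ("AAAC", "UMI1", "GAPDH"),
    ("AAAC", "UMI1", "GAPDH"),  # duplicate read of the same molecule
    ("AAAC", "UMI2", "ACTB"),
    ("TTTG", "UMI1", "GAPDH"),
]
print(sequencing_saturation(reads))  # 0.25
```

As the definition says, this number rises with sequencing depth and falls with library complexity: the more reads per molecule, the more duplicate triples.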

It will help to look at these numbers over time and across projects. Right now the data about each sample is limited, but collecting more sample/experiment metadata is likely to help determine whether an experiment has worked or not. For now it is difficult for us to give advice, as your experiment may be the first time we've ever run that type of cell!

How do I submit my index information into Lablink?

By (Hannah Haydon) from The Genomics Core blog. Published on Oct 24, 2016.

When accepting sequencing submissions in the Genomics Core, there may be instances where we have to contact you if there is an error with your submission form. The most common problems relate to index information. We have put together some instructions here that we hope should make things easier and help us to get started on your sequencing as soon as we can.

Please only follow these instructions if:
  • The index sequences you have used are visible in the index sequences tab of the submission form.
  • There are fewer than 384 samples within your pool.
If the points above are not true, please see the 'Unspecified Index' section further on in this blog.

1. Completing the sample/reagent label field

1a. Navigate to the index sequences tab of the sample submission form.

1b. Search for your index sequences

1c. Copy the index name from column C of the index sequences tab, e.g. A001-A005, to the Sample/Reagent Label column of the submission form tab.

Figure 1 - Index sequences tab of the sample submission form

Figure 2 - Submission form

2. Completing the UDF/Index type field

2a. Select the correct UDF/Index type from the drop down menu on the submission form tab.
IMPORTANT – please make sure the Index type field matches column B of the index sequences tab. This ensures that your library goes through our acceptance step. Please see the two following examples.

Example 1 - I am submitting a Truseq LT library consisting of 5 samples and used indexes A001-A005.
The sample/reagent label on the submission form should read A001-A005.
The UDF/index type should read Truseq LT.

Figure 3 - The index type field next to these indexes is Truseq LT, so this is what should be entered into the UDF/Index type field.

Figure 4 - Submission form

In most cases, the Index type will match the indexes you have used, as expected. However, there are now multiple kits available which share the same indexes.
Because of this, there may be some cases where the index sequences you select will have a different Index type to the library you have made (see Example 2 below). This may affect you if you are submitting for Nextera XT or Nextera.

Example 2 - I used indexes N701-N501, N702-N501, N703-N501, N704-N501. I prepared the libraries using a Nextera XT library prep kit.

The sample/reagent label on the submission form should read N701-N501, N702-N501, N703-N501, N704-N501. The UDF/index type should read Nextera and not Nextera XT. This is because the UDF/Index type needs to match column B on the index sequences tab.

Figure 5 - The index type field next to these indexes is Nextera so this is what should be entered into the UDF/Index type field
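
The matching rule in Examples 1 and 2 amounts to a simple lookup, which could be sketched as below. This is a hypothetical illustration only: the dictionary stands in for the real index sequences tab, and the function name is ours, not part of Lablink.

```python
# Hypothetical sketch of the check described above: the UDF/Index type on
# the submission form must match column B of the index sequences tab for
# each index name entered. The dict below stands in for the spreadsheet;
# column letters follow the post (B = index type, C = index name).

index_sequences_tab = {
    # index name (column C): index type (column B)
    "A001": "Truseq LT",
    "A005": "Truseq LT",
    "N701-N501": "Nextera",
    "N702-N501": "Nextera",
}

def check_submission(index_names, udf_index_type):
    """Return a list of problems for one submission row (empty = OK)."""
    problems = []
    for name in index_names:
        expected = index_sequences_tab.get(name)
        if expected is None:
            problems.append(f"{name}: not in index sequences tab - submit as unspecified")
        elif expected != udf_index_type:
            problems.append(f"{name}: index type should be {expected!r}, not {udf_index_type!r}")
    return problems

# Example 2 from the post: Nextera XT libraries, but the tab says 'Nextera',
# so entering 'Nextera XT' would be flagged.
print(check_submission(["N701-N501", "N702-N501"], "Nextera XT"))
```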

3. Unspecified Index

If your pool has index sequences not present in the index sequences tab OR if you have a pool which is made up of more than 384 samples, you will need to submit as unspecified index.
In the submission form:
  • Sample/reagent label should read - unspecified
  • UDF/Index type should read - Unspecified (other)
You should submit your pool as one row on the form. Libraries submitted as unspecified index cannot be demultiplexed by the Genomics Core, but we do have a demultiplexing guide on Lablink which should give some useful information.

Important - since we have no index sequence information, please write the index lengths for Index 1 and Index 2 in the comments section of the form. Without this information your sequencing may be delayed whilst we contact you to check these parameters.
Once you have submitted your libraries, the Genomics Core would like to start working on your sequencing as soon as we can.

If the incorrect index type has been selected, we will need to delete your submission and ask you to submit again after making changes to your sample sheet, following the instructions above. Of course, whilst this guide should help you, we are always here to discuss this with you in person if you have any questions. Alternatively you can contact us on our helpdesk:

Recent papers that the Genomics Core has helped with

By (James@cancer) from The Genomics Core blog. Published on Oct 10, 2016.

I like to highlight some of the really interesting work we've been involved with, or that has come out of the Institute, from time to time, and I recently updated our lab home page with links to a couple of papers. I thought I'd take the opportunity to write about them in a bit more detail here. Many of you will already know I run the Genomics Core facility at CRUK's Cambridge Institute. We do a lot of Illumina sequencing! The lab works on a huge number of projects for the research groups here in the Institute, and also across many groups in Cambridge via a long-running sequencing collaboration. We do some R&D work in my lab, but >90% of our efforts are working with, or for, other research groups.
Highlights from the last year's genomics research include work from the Caldas group, who have completed three projects over the last year that I've included here: 1) profiling of almost 2500 breast cancer patients for mutational analysis of 173 genes using a targeted pull-down (Pereira et al. Nature Communications 2016); 2) cancer exomes from Murtaza et al.; 3) PDXs from Bruna et al.; and the Balasubramanian group, who have shown that it is possible to capture and sequence double-strand DNA breaks (DSBs) in situ and directly map these at single-nucleotide resolution, enabling the study of DSB origin (Lensing et al. Nature Methods 2016). The rapid speed and unbiased nature of the genome-wide experiments being performed in the Institute, often prepped and sequenced in the Genomics Core, continue to increase our understanding of cancer biology.

1) Mutational analysis of 173 genes in 2433 tumours: Bernard Pereira and Suet-Feung Chin, in Carlos Caldas' research group, published a massive breast cancer gene resequencing project, which is helping to improve our understanding of patient classification into clinically relevant subtypes. They showed that using both mutation and copy-number analysis provides the best currently possible stratification. The project analysed almost 2500 tumours from the METABRIC study (see my previous blog), sequencing the 173 most frequently mutated breast cancer genes. They found 40 mutated genes that are instrumental in breast cancer progression. There was high variation in the mutational frequency of some of these genes, e.g. TP53 was mutated in 85% of IntClust10, and around half of IntClust4/5/6 and 9, but less than 15% of IntClust3/7/8, which are good-prognosis tumours.

Analysis of the clonal distribution of mutations (accounting for CNVs) showed that most drivers were present in nearly all tumour cells and probably occurred early in the evolution of the tumour. There was a lower number of apparent tumour clones in samples from patients with better prognosis than in patients with poorer outcomes. Inactivating mutations in SMAD4 were associated with worse outcomes across the IntClusts, but TP53 mutations were more strongly associated with worse outcome in ER+ disease. And TP53 DNA-binding domain mutations were associated with the worst outcomes. PIK3CA mutations were prognostic in ER-, but not in ER+ patients.

The prevalence of mutations across histological subtypes

The paper reported the finding of 10 new breast cancer driver genes that were previously known drivers in other cancer types. Hopefully this will allow the relatively quick migration of treatment from one setting to another. Cancer Research UK's chief clinician Professor Peter Johnson was quoted as saying "This study gives us more vital information about how breast cancer develops and why some types are more difficult to treat than others, and this information is a great resource for researchers all over the world" - the release of this data via cBioPortal is likely to improve breast cancer research, particularly as the METABRIC study has a large sample size and long-term clinical follow-up.

What did Genomics do: This project was a collaboration between the Caldas group, Sam Aparicio's group in Vancouver, and Illumina. The sequencing for this project was done by Illumina, but Michelle Pugh from my group helped prep the rapid capture libraries. Michelle is now working at Inivata.

2) Breast cancer PDX encyclopaedia: Alejandra Bruna and Oscar Rueda in the Caldas group have done a massive amount of work in creating one of the first, and largest series of Breast cancer patient derived xenografts (PDX). These PDX models allow far more than a single molecular analysis to be performed from a patient sample. Tumour tissue becomes possibly limitless, and follow up studies and even "clinical trials" can be carried out in a level of detail, and at a rate, that is very tough to do in a Human setting.

Cell lines have very limited inter- and intra-tumour heterogeneity and are adapted to growth on plastic, so it is not difficult to see their shortcomings in the development of new treatments. But generating PDX models is hard, and until now people have usually focused on making only one or two, or a handful. The Caldas group wanted to build something larger, and create a resource that reflected the full molecular pathology they had revealed in the METABRIC study. So far 83 PDX models have been created. All have been shown to re-establish after freezing and so are a long-term resource. Both primary and metastatic models have been created; 60% are from ER+ patients. The PDXs were subjected to extensive molecular characterisation: we developed the use of shallow whole-genome sequencing of pre-capture exome libraries for CNV analysis with the Caldas group (see this post from 2014), plus exome sequencing, reduced-representation bisulfite sequencing (RRBS) for DNA methylation, and gene expression arrays (the project stuck with arrays for the best possible correlation to METABRIC).

Importantly, the project was able to show that PDXs retained the same histological and molecular pathology through passaging, and that the intra-tumour heterogeneity and clonal architecture were maintained.

Perhaps most exciting was the demonstration that PDXs could be used for high-throughput drug screening, to test drug combinations, and could predict in vivo drug response. CRUK's Science Blog covered this paper and discussed how this work is likely to be seen as a better way to discover and develop new cancer drugs.

The Bioinformatics Core helped to create the data portal for this project. The Breast Cancer PDTX Encyclopaedia is an open resource that allows users to browse the data. The publication describes the project methods in detail, which will hopefully encourage others to create additional PDX models for breast and other cancers.

What did Genomics do: The core helped with much of the molecular characterisation. We prepped and sequenced all the sWGS and exome libraries, we sequenced the RRBS libraries (and did some useful work tweaking the amount of PhiX needed for these libraries), Illumina HT12 arrays were processed at the Department of Pathology. The Cambridge Institute Bioinformatics, Histopathology, Flow Cytometry, Biological Resource, and Bio-repository core facilities all helped with this project.

3) Mapping double-strand breaks: Stefanie Lensing in Shankar Balasubramanian's group (he co-founded Solexa) developed an improved method for mapping double-strand breaks. DSBs are one of the major causes of mutations and rearrangements, and several groups have previously characterised them. However, the methods used have not been ideal: ChIP-seq captures DSB proteins rather than the breaks themselves, and BLESS involves a relatively inefficient blunt-end ligation and lots of PCR, and produces low-diversity libraries (see this paper and my post). Stefanie developed DSBCapture to capture DSBs in situ using a modified Illumina P5 adapter, such that after ligation single-end sequencing could be used to map breaks at nucleotide resolution. She directly compared BLESS with DSBCapture and showed that the new method identified 4.5-fold more DSBs in normal human epidermal keratinocytes.

The paper also shows that G-quadruplex DNA secondary structures, which have previously been implicated as fragile sites in the genome, were 3-fold enriched over random within DSBCapture peaks. There was very large enrichment of DSBCapture peaks in regulatory, nucleosome-depleted regions, and many DSB sites were also sites for RNApolII revealing a relationship between DSBs and elevated transcription within nucleosome-depleted chromatin.

You can get a detailed protocol on Nature's Protocol Exchange.

What did Genomics do: The core performed the sequencing for this project.

It is looking like it will be possible to combine long-read, or synthetic-read, phasing methods with exome targeting to sequence DNA repair genes in patients. By using PacBio and/or 10X Genomics it would be possible to definitively test patients for mutations in cis or in trans, adding clinically relevant information currently not available through short-read genomes and exomes. DSBCapture and phasing of DNA repair gene mutations are likely to be useful methods that could be translated in the next few years.

4) Understanding metastatic disease with ctDNA sequencing: Muhammed Murtaza (TGen and the Rosenfeld group) and Sarah-Jane Dawson (Peter Mac and the Caldas group) published a detailed analysis of a single breast cancer patient using tumour- and liquid-biopsy analysis. They collected 8 tumour biopsies and 9 plasma samples over 1,193 days of clinical follow-up for an ER+/HER2+ breast cancer patient. They exquisitely characterised the tumour's evolution during therapy using exome and targeted amplicon sequencing (TAm-seq). This work is the first to really demonstrate that liquid biopsy truly recapitulates the tumour burden and metastatic heterogeneity - an important step along the road to using liquid biopsy in the clinic.

The paper reports the finding of over 350 candidate non-synonymous SNVs from the exome sequencing data, and just over 300 were successfully sequenced by TAm-seq to coverage of up to 8,000x! The amount of data generated from a single patient really allowed the group to delve deep into the evolution of the tumour and the relationships between the primary and metastatic sites (see Fig 1 above). The authors do point out that this kind of study needs to be replicated to show how well liquid biopsy can be used to track other breast cancer patients, and how it performs in other cancers, with potentially different evolutionary tracks and varying metastatic sites.

Importantly the results show again that if actionable mutations are identified in circulating DNA then this may inform the choice of targeted therapies.

What did Genomics do: This project was not processed by the Genomics Core, however it leads on from work we were involved with. It was also such a great paper I wanted to highlight it here!

Why is my HiSeq 2500 sequencing taking longer than usual?

By (James@cancer) from The Genomics Core blog. Published on Jul 15, 2016.

With the introduction of the HiSeq 4000 we're able to sequence faster and cheaper than ever before. But as we transition the larger projects over to the HiSeq 4000, a side-effect is fewer and fewer samples to run on the HiSeq 2500; and as we wait for samples to fill the 8-lane flowcell, that means longer wait times for you. We thought this post might help you determine if you still need to use the HiSeq 2500, or if you can migrate over to the HiSeq 4000. Most sequencing is taking under 2 weeks, but some people are now waiting up to one month for 2500 data.

We bought the new instruments in Genomics to do large RNA-seq gene expression and exome projects. The HiSeq 4000 has an increased maximum read-length (PE150 vs HiSeq 2500 PE125) and increased cluster density (312M clusters vs HiSeq 2500's 250M) so users can expect to see lower costs for sequencing. As a guide expect to run the following number of samples per application:

  • Genomes - 6 Human genomes (30x coverage) per flowcell in just 3 days
  • Exomes - 90 Nextera exomes (4Gb per exome) per flowcell in under 2 days.
  • RNA-seq - 125 mRNA-seq DGE (20M reads per sample) per flowcell in under 2 days.
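
As a rough sanity check on the guide numbers above, here is a sketch of the throughput arithmetic, assuming the 312M clusters quoted is per lane and there are 8 lanes per flowcell (real yields vary run to run):

```python
# Back-of-envelope throughput arithmetic behind the guide numbers above.
# Assumption (ours): ~312M clusters per lane, 8 lanes per HiSeq 4000 flowcell.

clusters_per_lane = 312e6
lanes = 8
flowcell_reads = clusters_per_lane * lanes  # single-read count per flowcell

# RNA-seq DGE at ~20M reads per sample:
reads_per_sample = 20e6
samples = int(flowcell_reads // reads_per_sample)
print(samples)  # 124 - close to the 125 samples per flowcell quoted above
```

The same division, with the reads (or gigabases) your application needs per sample, gives a quick first estimate of how many samples fit on a flowcell.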

How do instruments differ: The HiSeq 4000 performs very well for RNA-seq and exomes. Data are highly comparable, and certainly for new projects you should migrate to the HiSeq 4000. If you are in the middle of a project it is probably worth a discussion to decide on the best time to switch, or how to mitigate the longer wait times on the HiSeq 2500.

The main differences between the machines are the clustering chemistry (either random clusters or patterned flowcells) and the sequencing chemistry (either the original 4-colour SBS, or the NextSeq-only 2-colour version). The amount of data they each generate and the costs also vary, so I've listed them below.

Costs are based on paired-end 150bp reads for equivalence.
The NextSeq is the easiest system to run instead of the 2500, as it takes a single sample/pool per flowcell and generates about the same data as a HiSeq 4000 lane; however, it uses a different sequencing chemistry. Rapid runs cost the most but "should" generate data almost identical to the normal 8-lane flowcells.

Get in touch via the HelpDesk if you have any questions, or pop down for a chat.

PS: want to know  more about HiSeq 4000? Then read this post on my personal blog - (almost) everything you wanted to know about @illumina HiSeq 4000...and some stuff you didn't

Running a big RNA-seq project is easy(ish)

By (James@cancer) from The Genomics Core blog. Published on Jul 15, 2016.

Last year we completed our largest ever RNA-seq project: 528 samples of TruSeq mRNA, 60 lanes of HiSeq 2500 SE50, 13 billion reads - and all in 16 weeks. Being able to do such a large project in such a short time and get high quality data from nearly all samples really demonstrates the robustness of RNA-seq. If you're thinking that a project larger than 96 samples might be too much to consider, then come and talk to us (and Bioinformatics) at a Tuesday afternoon experimental design meeting - and we'll convince you it can be a pretty smooth process.

We've been using Illumina's TruSeq mRNA-seq, automated on our Agilent Bravo robot, and the sequencing was done on HiSeq 2500, although we're currently moving to HiSeq 4000.
  • 528 samples processed on six plates of RNA-seq
  • QC lanes sequenced and analysed
  • 60 lanes of SE50bp sequencing in total, 10 lanes per plate
  • 12,918,018,345 PF reads for this project (215M reads per lane on average)
  • 24M reads per sample on average
  • 16 weeks from start to finish
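
The per-lane and per-sample averages quoted above fall straight out of the project totals:

```python
# Checking the project arithmetic above: total PF reads over lanes and samples.

total_reads = 12_918_018_345
lanes = 60
samples = 528

reads_per_lane = total_reads / lanes
reads_per_sample = total_reads / samples
print(round(reads_per_lane / 1e6))    # 215 (M reads per lane on average)
print(round(reads_per_sample / 1e6))  # 24 (M reads per sample on average)
```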
This has been a large and complex project where we had lots of discussions along the way. I think that everyone involved has contributed to the success so far: the research group who asked us to do the project, my lab, and also our Bioinformatics Core. The ability to discuss the experiment at different stages, and to focus on QC issues as they arise really makes using the Cores a great place to do your projects.

Our first paper on the bioRxiv

By (James@cancer) from The Genomics Core blog. Published on Feb 07, 2016.

I just uploaded our paper, which has also been submitted to BioTechniques, onto the bioRxiv preprint server. The work we present comes from an idea I had shortly after first using Agilent's BioAnalyser in 2000. I was blown away by this piece of technology, which has become the de facto standard for RNA QC and has also pretty much replaced gel electrophoresis for DNA fragment analysis in NGS applications. When launched in 1999, it was the only microfluidics instrument for biology applications. The idea was a simple one: can BioAnalyser chips be swapped between assays?

Figure 1: qualitative analysis of the same sample across different chips

In "Bioanalyzer chips can be used interchangeably for many analyses of DNA or RNA" we show that, for RNA and NGS library sizing, they most certainly can. We evaluated the compatibility of two of the most commonly used BioAnalyser kits (RNA6000 and DNA-HS) with three BioAnalyser chip types (RNA6000, DNA1000 and DNA-HS). Importantly, the sticker displaying the chip layout was disregarded, and the loading pattern indicated in the assay-specific protocol was used in each experiment. The concentration and RIN of each RNA sample were highly comparable within and between chips, and well within the normal variability expected of samples submitted for RNA-seq experiments. For NGS libraries the average size estimated across all Bioanalyzer chips was 290bp with a 40bp range across all samples, while DNA concentration showed a 10-20% variation across chips.
Although the quantitative analysis of DNA (NGS libraries) was not so great, we'd always recommend quantitative PCR for that anyway. The higher variability in our concentration data came from inter-chip variability rather than from chip type. I've never been a fan of using the BioAnalyser for quantitative analysis, except perhaps in RNA-seq or microarrays where the amount of RNA being used is often less critical.

Following us on Twitter

By (James@cancer) from The Genomics Core blog. Published on Nov 20, 2015.

The Genomics Core now has two Twitter accounts, you can follow me @CIgenomics (James Hadfield, Head of Genomics) and hear about things I think are interesting, but which you might not necessarily be interested in; and/or you can follow our sequencing queue @CRUKgenomecore which puts out live Tweets directly from the sequencing LIMS.

How does the LIMS Tweet: Some clever work by Rich in Bioinformatics has allowed us to pull data directly from the Genologics Clarity LIMS queue using a script run every 24 hours, and the Twitter API then allows that script to post messages on our behalf. Because of this, the Tweets about our queue should happen every day and without manual intervention. Hopefully you'll be able to rely on these to give you a reasonable idea of how long you might have to wait for your sequencing results. Of course we can't predict what will happen with your particular sample, so please treat the Tweet as a guide.
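
The pipeline described above could be sketched roughly as follows. This is NOT the actual script: the queue fields, tweet format, and function name are all illustrative assumptions.

```python
# Hypothetical sketch of the daily queue Tweet: summarise the sequencing
# queue as text, then hand it to the Twitter API. Field names and the
# message format are assumptions for illustration, not the real script.

def format_queue_tweet(queue):
    """queue: dict mapping run type to number of lanes waiting."""
    parts = [f"{n} {run_type}" for run_type, n in sorted(queue.items())]
    return "In the queue today: " + ", ".join(parts) + " lanes waiting."

tweet = format_queue_tweet({"SE50": 12, "PE125": 4})
print(tweet)

# Posting would then be a single API call via a library such as tweepy,
# e.g. api.update_status(tweet), driven from a daily cron job as the
# post describes.
```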

Tweets explained: The Tweets have a format that we hope is pretty intuitive, but we've described what all the bits of information mean below...

Thanks especially to Rich Bowers in the Bioinformatics core for pulling all of this together from a vaguely described idea by me.

Improving DNA and RNA quant with plate based fluorimetry

By (James@cancer) from The Genomics Core blog. Published on Sep 04, 2015.

We quantify NGS libraries all the time and qPCR works brilliantly, but nucleic acids need to be handled differently. We don't actually run that much quantification on DNA and RNA, as most of our users have already done this; we asked them to do it so we could more efficiently run larger batches of library prep, to keep costs down and turnaround times as short as possible. Over the last few years we've been running the Nextera exome preps, and DNA quant has become more important than ever before; in fact we started running a secondary quant just to be certain about DNA concentration.

Most of the time DNA and RNA quant works well, and we've favoured the fluorescent Qubit assay recommended by Illumina in their protocols. A NanoDrop or plate-reading spectrophotometer measuring absorbance at 260/280 nm measures total nucleic acid and is confounded by ssDNA, RNA, and oligos, so can give inaccurate results. We run the Qubit dsDNA BR Assay from Molecular Probes on the PHERAstar fluorescent plate reader (here's their handy protocol). We have only been using 1ul of DNA (Illumina suggest 2) for each sample, but we run triplicate assays to get a high-quality quantitation.

Problems with the Qubit assay: Recently some users have reported problems with the accuracy of the Qubit assay on our plate reader, and the manager of our Research Instrumentation Core helped us get to the bottom of the issues, with some excellent results. The main problem turned out to be the addition of DNA into the working dye solution: DNA coating the outside of the tips appeared to be making the results so flaky. Changing the protocol to add DNA to the plate first fixed it, and the results are looking great.

It can also be very important to be certain which assay you should use: BR (Broad Range) or HS (High Sensitivity). If you are working with low-concentration nucleic acids then the HS assay is probably the one to use. For really accurate quant we'd suggest a quick quant check first, then normalisation of samples to about twice what you need; a second triplicate and robust quant will allow you to dilute the samples to the perfect working concentration.

Here are our top tips:
  • Add DNA to the measurement plate/tubes before anything else
  • Use a repeat pipette to make sure each well gets the same/right amount of dye solution
  • Shake the tubes/plate in the dark for at least 10 minutes (quant will be inaccurate if the dye has not intercalated properly, you can check your standard curve replicates to verify if this is an issue)
  • The triplicates really are worth the effort - especially if you're doing a Nextera prep
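
The triplicate tip boils down to a quick calculation: average the three readings, flag noisy wells, and work out the dilution to your target concentration. The sketch below is illustrative only; the CV threshold and names are ours, not our actual worksheet.

```python
# Illustrative sketch of triplicate quant: mean the readings, reject
# noisy replicates, and compute the dilution to a target concentration.
# The 10% CV limit is an assumption for illustration.
import statistics

def triplicate_quant(readings_ng_ul, target_ng_ul, cv_limit=0.1):
    mean = statistics.mean(readings_ng_ul)
    cv = statistics.stdev(readings_ng_ul) / mean  # coefficient of variation
    if cv > cv_limit:
        raise ValueError(f"replicates too variable (CV={cv:.0%}) - re-quant")
    dilution_factor = mean / target_ng_ul  # dilute this many fold
    return mean, dilution_factor

mean, factor = triplicate_quant([10.2, 9.8, 10.0], target_ng_ul=5.0)
print(mean, factor)  # ~10.0 ng/ul mean, dilute ~2-fold
```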

When will my sequencing be done?

By (Unknown) from The Genomics Core blog. Published on Aug 25, 2015.

Will my sequencing be done before the dying of the sun,
Will wildcats once more roam the land
Will the desert still have sand
Will Norfolk be swallowed by the sea
Do I have time for a cup of tea?
Will rhinos and the manatee
Be urban legend, just like me
Oh, it's done.

Nature reports on "careers in a core lab"

By (James@cancer) from The Genomics Core blog. Published on Mar 26, 2015.

In this week's issue of Nature, a feature by Julie Gould covers what life as a core lab manager is like: Core facilities: Shared support. She interviews several core lab managers/directors from the US and Europe, including me. If you've ever fancied a job in a core then I'd recommend the article.

If you have any questions about the realities of running a core, and what sort of career move it might be, feel free to get in touch. If you are in the CRUK-CI then you've got lots of other core managers who can give you their views as well.

Is your antibody any good?

By (James@cancer) from The Genomics Core blog. Published on Feb 06, 2015.

"Doesn't necessarily do what it says on the tin!" 

Probably not is the simple answer, and "only if you've verified it" is a more comprehensive one. The lack of reproducibility of antibody data in scientific publications is shocking: Nature published a commentary signed by over 100 researchers, Reproducibility: Standardise antibodies used in research, in which they describe the pretty poor state of antibody reproducibility. In this they cite a 2008 BioTechniques article, and a 2012 Nature commentary, that discuss the state of affairs with antibodies in particular, and with reproducibility in general. In the BioTechniques paper the authors finish by saying that "for the meantime, however, the responsibility ultimately lies with the researcher or laboratory director to ensure that the antibodies used in their labs are validated for specificity and reproducibility."

Antibodies sold as being specific for a protein are often not: they can be very promiscuous in what else they bind, and sometimes don't even bind the targeted protein. To make sure you are not affected by a poor choice of antibody, run some validation studies before diving into your ChIP-seq experiments!

Not doing this risks wasting money (a lot, according to the Nature article - see figure below). More importantly, you might waste your time, or even worse publish something that is erroneous. Hopefully you've already validated that your MCF7 cells are actually MCF7s with the BioRepository, so why not do the same with your antibody before starting your next experiment?

Figure from Bradbury and Plückthun Nature 2015.

Use your local support team

By (James@cancer) from The Genomics Core blog. Published on Jan 27, 2015.

We have a half-day workshop on Thursday for NGS newbies, the focus of which is library prep for next-generation sequencing. We organise seminars from commercial providers of new technologies throughout the year; but this is a semi-annual event where local users get a chance to present their work, and new users get to hear about what's possible with NGS.

This year we have presentations about RNA-seq, ChIP-seq, Exome-seq, FFPE genomes, DNA methylation, targeted resequencing and a talk on the UoC 10,000 Genomes Project; afterwards we'll wrap up with beer and pizza. These days require lots of organisation (thanks to Fatimah for organising this year's event) but, for the new users especially, turn out to be well worth the effort.

Making use of your local support teams: We also make sure we keep a good relationship with our local technical support teams and run a series of commercial presentations throughout the year. This works out to be much easier to organise as they do the prep work! While we're here in the Genomics Core to help our local users, we get lots of queries from people outside the Cambridge Institute, and this is one way we've found to increase the support we can offer.

Every other month we have Illumina come in to present on a specific library prep, or talk about recent updates. Sandra (Field Application Specialist), and Carla (Marketing Technology Specialist) generally talk for 30 minutes followed by Q&A, and then spend some time with users on a one-to-one basis troubleshooting their problems.

We also try to arrange a training session once per quarter with Thermo. We've been using their ABI 7900 qPCR instruments for eight years and buy in quite a lot of their SYBR and TaqMan master-mixes. Ever since we started working with them we've run "An introduction to qPCR" course for new users. The last one was run by Emma and everyone said it was a great introductory session.

What's in it for them: Neither Illumina nor Thermo would do this for free if there was nothing in it for them. They get to interact directly with potential new customers, and get feedback on how their technologies are working in the real world. Some of these conversations might end up as research collaborations. Some of the contacts might end up as new sales contracts too (I know why they are really here)!

What's in it for us: These talks have been reasonably well attended and increase the support we can offer (albeit indirectly), and the feedback from users has been almost universally positive. I'd encourage you to get in touch with your local sales or technical rep and ask if they can help you too. They might even supply doughnuts!

PS: Thanks very much to Carla and Sandra at Illumina for the seminars over the past 12 months. And to Emma for the most recent qPCR training.

PPS: If you missed the registration link to the event on Thursday, send us a message via a comment below!

How many reads do I need to sequence?

By (James@cancer) from The Genomics Core blog. Published on Jan 25, 2015.

A common question we're asked is "how many reads should I use to sequence a sample?" I'm going to focus on genomes, exomes and amplicomes in this post and introduce the Lander-Waterman equation [1]. Other applications are harder to pin down: for RNA-seq, ChIP-seq and other counting applications the answer is very much 'how long is a piece of string', as it depends on the complexity of your sample and the sensitivity you'd like to achieve, and is also affected by the number of replicates you have.

The Lander-Waterman equation
Lander-Waterman: Almost everyone doing NGS is using this equation, even if they are not aware of it. Anyone under 27 was born after it was published (1988), but it is an equation worth understanding if you are sequencing. Basically it allows you to estimate how many reads of a specific length you need to sequence your genome.

The general equation is C = LN/G where: C = redundancy of coverage, G is the haploid genome size, L is the sequence read length, and N is the number of sequence reads. It can be rearranged to N = CG/L allowing you to compute the number of reads to sequence a genome, exome or amplicome (amplicon-panel) to a desired coverage (this is what we typically discuss when designing experiments).

In the examples below paired-end reads of 125bp from each end of a fragment are used, but these are converted to single 250bp reads for simplicity.
  • Human genome (3Gb) 30x coverage = 360M reads.
  • Human exome (150Mb) 50x coverage = 30M reads.
  • Human amplicome (300x250bp amplicons, 0.075Mb) 1000x coverage = 0.3M reads.
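If you'd like to play with the numbers yourself, here is a quick Python sketch of the rearranged equation N = CG/L (the function name and example values are my own, mirroring the examples above):

```python
# Lander-Waterman read-count estimate: N = C * G / L
def reads_needed(coverage, genome_size_bp, read_length_bp):
    """Reads (N) needed for C-fold coverage of a G bp target with L bp reads."""
    return coverage * genome_size_bp / read_length_bp

# Paired-end 125bp reads treated as single 250bp reads, as above
L = 250
print(reads_needed(30, 3e9, L))    # human genome at 30x: 360 million reads
print(reads_needed(50, 150e6, L))  # human exome at 50x: 30 million reads
print(reads_needed(1000, 75e3, L)) # amplicon panel at 1000x: 300,000 reads
```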

[1] Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231–239 (1988).
Eric Lander founded both the Whitehead and Broad Institutes. Michael S. Waterman is one of the founders of computational biology and gave his name to another important algorithm, Smith-Waterman alignment. He also wrote Computational Genome Analysis with our Director Simon Tavaré while at the University of Southern California.

Is my NGS library any good?

By (Unknown) from The Genomics Core blog. Published on Dec 04, 2014.

We've all been there. You bought the extortionately priced kit, you ran the gels, you lovingly removed every single SPRI bead, you sweated in a lab coat for days, and finally you elute your first ever NGS libraries. The question is, how can you tell if you were wasting your time? What if your tube turns out to contain nothing but buffer? Or worse, what if it can be sequenced, but it produces nothing more than a load of expensive gobbledegook?

Never fear, if your experimental design is up to scratch, then you need only three simple quality checks to tell you if your library is a Science paper in the making, or a bit of a dud:
  1. Bioanalyzer for Size Distribution
  2. qPCR for Quantification
  3. Nanodrop for Chemical Contamination (optional)

1. Bioanalyzer for Size Distribution

The Agilent Bioanalyzer or Tapestation runs 1ul of your library in a microfluidics gel-like cartridge, and shows you the range of sizes in your library, as well as an estimate of library quantity.
A good Bioanalyzer trace will look different depending on the type of library you are assaying. Preferably, your library should appear as a single discrete peak approximating a bell curve. It should be larger than ~150bp, but smaller than ~700bp.
The Bioanalyzer trace is essential for detecting Illumina adapter contamination, which can be spotted as a sharp peak between 100 - 150bp. If you are a member of the CRUK Cambridge Institute, we can train you to run the Bioanalyzer and offer you advice on interpreting your trace.

A clean library on the Bioanalyzer: this will sequence like a dream

A problematic library on the Bioanalyzer: it will be difficult to sequence this library well.

Once you have run your library on the Bioanalyzer, use manual integration or the region table to select the entire trace and determine the average size of your library. You will need this to calculate your nanomolar concentration later. If you sequence with us, we will ask for this information at submission - it must be accurate in order for us to provide you with a high sequencing yield and quality.

Look out! Certain library prep types do not give an accurate length estimate on the Bioanalyzer due to the presence of secondary structures in the DNA (e.g. Truseq DNA PCR-free). If you're using a kit, the protocol should clearly state if this is the case - and should give you the length to use in quantification calculations.

I wouldn't recommend you use the Bioanalyzer nmol/l concentration for multiplexing, unless you really know what you are doing - or it is explicitly recommended in your protocol or kit. After all, the Bioanalyzer nmol/l value is only accurate for certain library prep types, and it is biased by any DNA in your sample which does not contain Illumina adapters.

2. qPCR for Quantification

I like to recommend quantification of libraries by qPCR, using primers designed to target the Illumina adapters. Our NGS service currently uses the KAPA library quantification kit (LQK) for this, and we find it very reliable - but there are alternative kits out there which we haven't tested.
A high quality library should be reasonably concentrated, ideally >10nM, but also not too concentrated, ideally <100nM.
If you find your libraries are consistently very high yield (>100nM), then it is likely that you are performing more cycles of PCR than you need; this is likely to give you unnecessarily high PCR duplicate rates in your data. Reduce your protocol by 1 PCR cycle at a time until you are reliably getting 10nM - 100nM libraries. And remember to dilute your library pools to within our submission requirements, currently 10nM - 20nM.

My top tips for high quality qPCR quantification:
  • Aliquot your qPCR mastermix and your standards into single-use batches prior to first use, to avoid template contamination and the effects of repeated freeze-thaw cycles
  • Wipe down all working surfaces and pipettes with a DNA degrading cleaning agent e.g. DNA Away/DNAoff/DNAZap, before starting work
  • Make a serial dilution and take triplicate measurements; use the median concentration result
  • Check your serial dilution and your replicate measurements give highly reproducible concentration values
  • Check that your results are all comfortably within the range of your standard curve
If you use our NGS service and you choose to use the KAPA LQK, we can provide you with aliquots of the recommended DNA dilution buffer (Tris-HCl with 0.05% Tween). Also, if you are within the CRUK-CI, we offer training on how to perform real-time PCR, and you can sign out a KAPA qPCR kit from the Genomics Core to take advantage of the Institute’s bulk discount.

If you must know about the Qubit...

Other quant methods like Qubit or Bioanalyzer can be great for some library types, as long as you know what you are doing - but both will over-estimate your library concentration if you have an inefficient adapter ligation reaction. So use them with care.

Our submission guidelines are in nmol/l (nM), so if you use the Qubit you need to convert ng/ul to nM using the following equation:

y = (x × 1,000,000) / (660 × L)

where:
x: concentration in ng/ul
L: average library length (bp)
y: concentration in nM

(660 g/mol is the approximate molecular weight of one double-stranded base pair.)
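If you prefer code to algebra, here is a minimal Python sketch of the same conversion, assuming double-stranded DNA at roughly 660 g/mol per base pair (the function name is my own):

```python
# Convert a Qubit-style ng/ul measurement to nM, assuming dsDNA
# at ~660 g/mol per base pair.
def ngul_to_nM(conc_ng_per_ul, mean_length_bp):
    return conc_ng_per_ul * 1e6 / (660 * mean_length_bp)

# e.g. a 400bp library measured at 10 ng/ul:
print(round(ngul_to_nM(10, 400), 1))  # ~37.9 nM
```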

3. Nanodrop for Chemical Contamination

The Nanodrop is a quick and dirty assay for protein and chemical contaminants which interfere with sequencing - including the real killers, ethanol and phenol. Test 1ul of each NGS library, preferably before you pool them for submission. I recommend you check that the 260/280 ratio is greater than 1.8, and that the 260/230 ratio is greater than 2.0. The trace should look like this:

A good Nanodrop profile

A bad Nanodrop profile. Do you see the peak at 230nm?

A library with a 260/280 ratio less than 1.8, or a 260/230 ratio less than 2.0, may cluster poorly, and therefore generate low quality data. If you're new to the library preparation process and you can spare the sample, I recommend you throw it away and start again - while paying very careful attention to each cleanup step.
Always use the recommended cleanup method, don't be tempted to swap a bead cleanup for a column, or vice versa, even if it is more convenient! That will waste your time in the long run.
If you've got a contaminant and your library is irreplaceable, consider whether your yield is sufficiently high for you to repeat the final cleanup step. If not, have a chat with your NGS provider and ask if they will try sequencing it anyway. If you sequence with us here at CRUK-CI, we will always try our best to get you sequence data - as long as you know that you run the risk of paying for a lane of data which you can't use.

Whatever happens, do NOT use the Nanodrop quantity measurement for quantifying your DNA/RNA prior to library preparation, OR your final library concentration. DON'T DO IT. This is the most easily avoidable mistake in NGS. Don't be that scientist!
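The purity thresholds above boil down to a very simple check; here is a hypothetical sketch (the function name is mine):

```python
# Flag libraries whose Nanodrop purity ratios fall below the
# recommended thresholds (260/280 > 1.8 and 260/230 > 2.0).
def looks_clean(ratio_260_280, ratio_260_230):
    return ratio_260_280 > 1.8 and ratio_260_230 > 2.0

print(looks_clean(1.95, 2.1))  # True: both ratios pass
print(looks_clean(1.95, 1.2))  # False: low 260/230 suggests contamination
```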

I hope that is enough to get you started. As ever, if you want advice on whether your library is going to sequence well on the Illumina platform, the best place to go is your local NGS facility (if you have one), or Illumina's technical support team.

Happy Sequencing!

Indexing 2: Troubleshooting a bad index balance

By (Unknown) from The Genomics Core blog. Published on Nov 13, 2014.

Indexes are one of the simplest improvements in the last five years of sequencing, with the most incredible far-reaching effects. Today I will share a complementary pair of posts tackling the problems our customers experience most frequently when submitting indexed libraries for sequencing.

Why did I get very different yields for the libraries in my pool?

We've seen this so many times. You think you have carefully quantified and pooled your libraries, and then your sequencing data comes back with a massive variation in the number of reads for each library in your pool. What a nightmare! 

Don't be fooled - there is nothing that your sequencing provider can do on the sequencer to cause a variable yield from your different indexes. An imbalance between indexes within your library pool arises during the pooling process, so an imbalanced pool indicates something has gone wrong during pooling.

Normally the problem is one of the following:

  1. Different libraries in the pool are of different lengths
  2. Quantification of the libraries prior to pooling was not accurate
  3. The process of mixing the libraries into the pool was not robust

First check #1: Are your libraries of different average size?

  • Measure the length of every library prior to pooling on the Bioanalyzer or Tapestation (or similar). 
  • Make sure you are including all of the visible peaks in your length measurement, including any adapter dimers, since they all contribute to the clustering.
  • Check that all of the libraries in your pool are a similar length to one another

Clustering efficiency is a non-linear function of length, because small fragments cluster disproportionately more efficiently than large ones. So if you mix a library of 200bp 50:50 with a library of 600bp, you will receive much more data for the short 200bp library.

As a guideline, all libraries should ideally be within +/- 50bp of one another. 

Then check #2: Was your quantification prior to pooling accurate?

If your quantification is not reproducible then your library balance will be way off, whatever else you do well. When troubleshooting an imbalanced pool, I recommend you repeat quantification on your individual libraries a second time, and see if you receive the same result.

It is worth asking your NGS provider to share their quantification results with you, so you can compare them to your own expectation. No two quantification measurements will ever be in precise agreement, but your NGS provider must have a very robust process in order to provide you with a reliable per-lane yield, so you can use their result as a gold standard during troubleshooting.

If you are quantifying by qPCR, here are some valuable tips to improve robustness:

  • Perform quantification measurements in triplicate on your plate
  • Check your triplicate measurements are within ~0.5 Ct values
  • Take the Median value of your triplicates
  • Quantify all libraries which you plan to pool together on a single qPCR plate
  • Always run a no-template control to check for nonspecific amplification or contamination
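The triplicate checks above can be sketched in a few lines of Python (the names and thresholds are my own, following the tips above):

```python
from statistics import median

# Summarise a triplicate of Ct values: flag poorly reproducible
# replicates, otherwise return the median.
def summarise_triplicate(cts, max_spread=0.5):
    spread = max(cts) - min(cts)
    if spread > max_spread:
        raise ValueError(f"replicates span {spread:.2f} Ct - repeat the assay")
    return median(cts)

print(summarise_triplicate([18.21, 18.30, 18.26]))  # -> 18.26
```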

If you are quantifying by Qubit or Bioanalyzer, I recommend that you swap to qPCR as soon as possible - and I bet you will see a better pooling balance afterwards.

Finally, have a look at #3: Was the process of mixing the libraries robust?

A common mistake when pooling is to quantify your library, perform a dilution, and then assume the diluted library will be exactly the concentration you aimed for. Unfortunately this is only true if your original concentration is close to your goal. As a guideline, any dilution greater than 1:5 is unlikely to be sufficiently robust for multiplexing. Using small volumes during dilution steps can really exacerbate this problem.

The best practice for diluting highly concentrated libraries prior to pooling is to dilute them to a low value just higher than your goal, then re-quantify, then do a final small dilution to reach your goal. Use large volumes for your dilution steps, and keep your final dilution step as small as possible - and definitely less than 1:5. I often aim for a final 1:2 dilution step.

Consider this simple example:

  • Library A is at 100nM, so I dilute 1ul in 9ul of buffer to give me 10nM
  • Library B is at 300nM, so I dilute 1ul in 29ul of buffer to give me 10nM
  • Library C is at 600nM, so I dilute 1ul in 59ul of buffer to give me 10nM
  • I then mix 10ul of the diluted A, B and C. 

Frankly, my pooling balance is going to be rubbish.

Here's what I should do instead:

  • Library A is at 100nM, so I dilute 10ul in 40ul of buffer to aim for 20nM, then I re-quantify and find out it is actually at 18nM. I mix 10ul of this with 8ul of buffer to give 10nM
  • Library B is at 300nM, so I dilute 10ul in 140ul of buffer to aim for 20nM, then I re-quantify and find out it is actually at 22nM. I mix 10ul of this with 12ul of buffer to give me 10nM
  • Library C is at 600nM, so I dilute 10ul in 290ul of buffer to aim for 20nM, then I re-quantify and find out it is actually at 15nM. I mix 10ul of this with 5ul of buffer to give me 10nM
  • I then mix 10ul of the diluted A, B and C

My pooling balance will be beautiful

For the true NGS novices out there, if you don't know how I calculated the dilution steps in the example above then check this out.
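For the arithmetic-lovers, the example above is just C1V1 = C2V2 applied twice; here is a rough Python sketch (the function name and volumes are my own):

```python
# How much buffer to add to a given volume of library to reach a
# target concentration, from C1*V1 = C2*(V1 + Vbuffer).
def buffer_volume(stock_nM, target_nM, sample_ul):
    return sample_ul * (stock_nM / target_nM - 1)

# Library A: coarse dilution from 100nM towards 20nM...
print(buffer_volume(100, 20, 10))  # 40.0 ul of buffer for 10ul of library
# ...re-quantified at 18nM, then one small final step to 10nM:
print(buffer_volume(18, 10, 10))   # 8.0 ul of buffer for 10ul of library
```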

If you have checked #1, #2, and #3 and everything looks perfect, then get in touch with Illumina's tech support team, or with your NGS provider.

Indexing 1: A Simple NGS Pooling How-To Guide

By (Unknown) from The Genomics Core blog. Published on Oct 17, 2014.

Indexes are one of the simplest improvements in the last five years of sequencing, with the most incredible far-reaching effects. Today I will share a complementary pair of posts tackling the problems our customers experience most frequently when submitting indexed libraries for sequencing.

How do I pool my library at a defined concentration?

I get asked this a lot. Our current submission requirements are 10nM - 20nM in 15ul, but what does this mean? Is the total DNA concentration in the pool 10nM, and each individual library therefore much less? Or is it that each library within the pool is at a final concentration of 10nM?

Simply put, our submission guidelines IGNORE your indexes. Quantification and clustering cannot differentiate between indexes on a sample, so all we are interested in is the total quantity of DNA in your pool.  So, for example, if you have five libraries in a pool, the final pool DNA concentration must be at least 10nM - which means that each library within that pool is at least 2nM.

Here is the simplest at-a-glance method to dilute and pool your libraries. For more detailed hints and tips read on to my next post!
  1. Quantify and quality check all of your libraries
  2. Select a goal concentration for pooling - at or below the lowest concentration of your set of libraries.
  3. Make sure this is within our current submission guidelines.
  4. Dilute all of your libraries to that concentration, using Illumina Resuspension Buffer, EB, or 10mM Tris pH 8.5 with 0.1% Tween.
  5. Combine an equal volume of all of your libraries in your pool tube

Ta-da! You are ready to submit your pool for sequencing.
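As a sanity check on the arithmetic (the function name is my own): with equal volumes of libraries all diluted to the same concentration, the pool sits at that total concentration, and each library contributes an equal share of it:

```python
# Per-library share of an equal-volume, equimolar pool.
def per_library_nM(pool_total_nM, n_libraries):
    return pool_total_nM / n_libraries

# Five libraries pooled to a total of 10nM:
print(per_library_nM(10, 5))  # each library is at 2.0 nM in the pool
```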

Science of the Yesteryear

By (Unknown) from The Genomics Core blog. Published on Sep 19, 2014.

Are you old enough to remember The Magic Roundabout, Hong Kong Phooey and Mr Benn?  Did you finish your degree barely touching a computer?  When you graduated was 'genomics' a mere glint in Fred Sanger's eye?  If you answered 'yes' to any of these questions then you, like me, may feel befuddled by the dizzying speed of technological advances.

Don't despair, even when you say you went to 'Glastonbury' in 1990 and realise the app-savvy, linked-in, 'omics'-brains you are talking to weren't even born.  If you have spent years in fusty, ill-funded labs only to stumble, blinded, into the light of modern science, here are my rules for survival:

1.  Don't cry

2.  Even if you don't know what the piece of data being flashed on the screen is telling you, you are still a good person

3.  You are still making a contribution, however small

4.  Don't waste money on expensive running shoes; you'll never be able to catch the latest advances

5.  Accept your limits.  You have fewer brain cells than you had when you were 20

6.  It is inevitable that one day you will be replaced by a robot, but as the great Loudon Wainwright III said (young folk may be more familiar with his famous son, Rufus), 'at least you've been a has-been and not just a never-was.'

7.  It's not your fault; you were born too soon.

qPCR quantification using our new Agilent Bravo robot

By (Unknown) from The Genomics Core blog. Published on Sep 17, 2014.

We have an exciting new instrument in the Genomics Core which will enable us to automate several of our protocols which until now have been quite labour intensive: the Bravo robot by Agilent. Based on some previous experience with automation, I think the results we have seen so far are really quite promising in terms of saving hands-on time and providing consistency and accuracy within protocols.

qPCR test run - One of the first tests we ran upon installation of the Bravo was qPCR quantification. We quantify all libraries which are submitted to our sequencing service so we can aim to generate cluster densities on the flowcell which will yield large amounts of high quality data.

For this test, a single RNA-seq library was quantified using our standard method with the Illumina quantification kit by Kapa Biosystems. To test the reproducibility of the robot, we performed qPCR on this one sample 24 times, as this is the maximum number of tubes which can be loaded per run.

In addition, the same 24 aliquots of this one library were set up in the same way but manually. This test was useful for determining how good the liquid handling on the Bravo is, by looking at the reproducibility of the 24 replicates, and for comparing manual versus automated set-up. The library had also been quantified previously, so we expected the concentration to be 50nM.
Agilent Bravo Robot

The results show that there is higher variation in the concentrations obtained manually, although both methods slightly overestimated the expected concentration of 50nM. We saw an average concentration of 55.6nM manually (an 11% increase from 50nM) in comparison to 52.7nM (a 5.5% increase from 50nM) on the Bravo.
Despite the modest difference between the average concentrations of the manual and automated set-ups, we can see that the Bravo yielded more consistent results, which is what we would expect and is also good news. Since this test, we are now quantifying all SLX library submissions using the Bravo.

Although the robot will not be for general use and we will be unable to run qPCR for individuals, we will be using it for all qPCR quantification of libraries submitted to our NGS service and for generating RNA-seq libraries. Once these protocols become robust within our lab, we will explore using the robot in other protocols, including Exome library prep. It is additionally going to be used for automation of ChIP by the Odom group.

Help! I really don’t know anything about Next Generation Sequencing…

By (Unknown) from The Genomics Core blog. Published on Aug 28, 2014.

Over the last three years, while managing the CRUK-CI NGS service, I have heard one phrase more times than I can count:

Help! I really don’t know anything about Next Generation Sequencing…

Yet I have not got bored of hearing this phrase, or of explaining the basics of NGS to fellow researchers and colleagues. That is quite simply because (i) I think NGS is the bee’s knees, and (ii) my answer is never the same twice.

I understand why NGS can be daunting to researchers who wish to use it for the first time; it is technically complex at the same time as being unforgiving if you make a mistake – quite frankly, it is difficult to do well, and expensive if you screw up. The principles underlying the Illumina SBS chemistry are very simple, but the vagaries of instrumentation, the concepts behind designing a good experiment, and the technical skill required to create a high quality library add a layer of complexity which can frighten entire labs away from NGS.

The one thing I always open with is “Don’t be put off; it’s not as complicated as it looks”. The rest of the story changes every month, as new technologies are released and old ones are updated or improved.

Regardless of the name of the current Illumina big-hitter, or the details I advise you take into account when using a particular kit, there are some tips that I give to everyone, which I can also share with you. I will be honest and say they are unashamedly Illumina-focused. I manage an NGS service which runs Illumina sequencers, what else did you honestly expect?

1. Start with the Basics

I have not yet found a better resource for NGS training than the Illumina website (let's overlook the rubbish search function and confusing navigation for the time being). The user-training videos are easy to follow but sufficiently comprehensive to get you started. Furthermore, you can re-watch them once you have some experience, to help with troubleshooting. Don’t even start your experimental design until you have understood how the process works, or you will make expensive mistakes.

2. Practice, Practice, Practice

This should be obvious, but you might be surprised how many folks get it wrong. Don’t ever do your first NGS library preparation with your precious, tender, super-rare, pride-and-joy samples! Don’t even think about it! Don’t expect to buy one 96-sample kit and get 96 high quality NGS libraries with your first prep. Start small. Practice until it works, and then perform your experiment.

When you are ready to try NGS for the first time: buy a small kit, and source a batch of DNA or RNA that you can waste while you learn how to make good libraries. Start by making 8 libraries for practice, following the kit instructions very carefully. Where possible, collect a small quantity of DNA/cDNA after each cleanup step which you could use to troubleshoot failures.

I would strongly recommend that you sequence your best batch of practice libraries, rather than relying solely on the library QC. It is easy to make an RNAseq library which looks quite convincing on the bioanalyzer but performs poorly in differential gene expression analysis, and very, very easy to make a library which looks like a nice small RNA library which contains only the Illumina adapter sequence.

3. It’s Not What You Know, it’s Who You Know

It is unlikely that your first adventure in NGS will be plain sailing, so it’s important to know where you can go for help. If you’re new to NGS and you’re based here at the CRUK Cambridge Institute, the best starting point is to come down to the Genomics Core and talk with us. We’re always happy to meet NGS novices and to help you get your experiments started. If you’re not based in the CRUK CI I’m afraid we can’t offer you much help – we’re a small team and we’re dedicated to supporting this Institute. If you can find a Genomics Facility in your institute then they are going to be your best bet for introductory information, as well as more targeted advice.

If you’re not lucky enough to have a dedicated team onsite then the next best resource for you is Illumina’s technical support team. They are well-informed and always happy to help with troubleshooting.

Other great sites to help you troubleshoot your NGS problems are:

PhiX Control - Phact or PhiXion?

By (Unknown) from The Genomics Core blog. Published on Jul 25, 2014.

Many of our users might have heard us talking about, or seen, a percentage of their reads aligning to the PhiX genome in the Multi Genome Alignment (see figure below). This is a result of Illumina's recommendation to use the PhiX genome control (see TechNote) for troubleshooting and quality control purposes. Many features make PhiX a good NGS control: it has a small 5386bp genome, the genome is well balanced (45% GC and 55% AT), and the control library averages 375bp, making it perfect for clustering and sequencing. The PhiX genome was also the first genome ever to be sequenced.

The Multi-Genome Alignment report
Why PhiX helps on the sequencer:
We use the PhiX control to assess the quality of sequencing runs in Sequencing Analysis Viewer or SAV (see image below for an example). Illumina ship the PhiX control at 10nM, which we dilute, denature using our standard protocol, and aliquot ready for clustering. As Illumina suggest, we spike 1% PhiX into lanes 1-7 and 5% into lane 8 of all our HiSeq runs, unless requested otherwise. We spike 5% into our MiSeq runs, as we see more variable libraries being sequenced there. When checking run performance metrics in SAV, we look at the cluster density, clusters passing filter (how many of the clusters are true clusters), error rate, phasing and pre-phasing, and the alignment rate, to check that the right amount of spiked-in PhiX is aligning to the PhiX genome. These results help us determine whether a problem is associated with the library or the machine, which is why we use PhiX to work out where the problem lies when a run or a lane doesn't perform well.

PhiX helps our troubleshooting: 
We check run performance at every stage possible: after the clustering step, to ensure the fluidics delivery is equal across all lanes; at the first base report, to confirm the run looks good at the start of sequencing; and then via the run metrics throughout the sequencing. When troubleshooting, we look at the same metrics as mentioned earlier, but in a lot more detail, along with many other metrics such as %base and the images, to check that the library is balanced and the machine is behaving. When a run hasn't performed as expected and we cannot figure out the cause, we may also get Illumina involved and discuss the run with them.

It can help get better results with funky libraries: There are many different library prep methods, making it difficult to predict the performance of every sequencing run. Some methods, such as bisulfite sequencing, iCLIP, BLESS and amplicon sequencing, can produce "funky" libraries that might require the use of up to 50% PhiX.

The Genomics Core recommends using a higher percentage of PhiX when:
  • You have a low-diversity library (under-clustering can also help here)
  • You have small amplicon pools
  • You are doing bisulfite sequencing, BLESS, or amplicon sequencing
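As a rough sketch of these rules of thumb, a lookup table like the one below could encode suggested spike-in fractions. The category names, the 20% figure for low-diversity libraries, and the 1% default are illustrative assumptions, not an official Illumina recommendation.

```python
# Hypothetical lookup reflecting the rules of thumb above; categories and
# percentages are illustrative only, and each run should still be judged
# on its own library characteristics.
PHIX_FRACTION = {
    "standard": 0.01,       # balanced library: 1% spike-in (HiSeq lanes 1-7)
    "variable": 0.05,       # mixed/unpredictable libraries, e.g. MiSeq runs
    "low_diversity": 0.20,  # amplicons, bisulfite, BLESS: consider 20-50%
}

def recommended_phix(library_type):
    """Return a suggested PhiX spike-in fraction, defaulting to 1%."""
    return PHIX_FRACTION.get(library_type, 0.01)
```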

    Here is the Illumina product code if you would like to order PhiX:
    PhiX Control v3

    Agilent’s Clinical Meeting, Haloplex and SureSelect

    By (Unknown) from The Genomics Core blog. Published on Jul 10, 2014.

    I recently attended the Agilent Clinical Meeting in London, which featured some very informative presentations on the extent to which next-generation sequencing is aiding disease diagnosis and screening in the clinic. Clinical genetics labs need to provide diagnostic tests with a very rapid turnaround time that are also cost effective.

    Many diagnostic tests are currently based on Sanger sequencing of a specific gene. However, many speakers described how they are developing panels for targeted sequencing, alongside studying exonic regions by exome sequencing. Despite the advances of NGS in the clinic, it was clear that whole-genome sequencing is where we want to be heading to get a complete picture. Unfortunately, right now it is not affordable enough.

    After attending this conference it was interesting to see how other labs are using Agilent’s enrichment and panel solutions, so I thought I would summarise the technologies here.

    Haloplex Target Enrichment

    What is Haloplex?
    HaloPlex is a target enrichment system which can be used for the analysis of genomic regions of interest and is aimed at studying a large number of samples.

    How does it work?
    1. The workflow appears quite simple, starting with DNA fragmentation using restriction enzymes.

    2. Probes designed to match both ends of a DNA fragment are hybridised to form circular DNA molecules.

    3. A clean-up step using magnetic streptavidin beads captures only those fragments containing the biotinylated HaloPlex probes. The circular molecules are then closed by ligation.

    4. Finally, PCR is used to amplify the targeted fragments ready for sequencing.

    What can I enrich for?
    SureDesign software can be used to design these custom panels for specific genes or for thousands of exons of interest.

    Several clinical labs described how Haloplex technology is enabling them to design diagnostic tests based on screening for specific disease causative genes. Its popularity seemed to be down to its ability to permit a fast turnaround time due to the reduced amount of sample preparation required.

    SureSelect Target Enrichment

    What is SureSelect?
    Agilent’s SureSelect technology enables you to look at the whole exome or at a targeted panel. It has become a very useful tool for focusing on familial disease loci and for validation of whole-genome sequencing.

    How Does it work?
    The SureSelect workflow involves shearing gDNA, followed by library preparation incorporating the adaptors required for sequencing and indexes for multiplexing. Regions of interest are selected by a 24-hour hybridisation step with biotinylated RNA library baits, followed by a clean-up step using magnetic streptavidin beads. The baits can be custom designed using Agilent’s SureDesign software. PCR is then used to amplify these regions, which are then ready for sequencing.

    Exomes in the Genomics Core

    Here in the Genomics core, we are currently using Illumina’s Nextera Rapid Exome kit for Exome sequencing and Fluidigm Access Arrays for generating libraries for targeted sequencing.

    Agilent have recently released a new SureSelect kit, SureSelectQXT which combines a transposase-based library prep, followed by target enrichment. We have just received one of these kits and will soon be testing this in the lab.

    To seq or not to seq, that is the DGE question.

    By (Unknown) from The Genomics Core blog. Published on Jul 03, 2014.

    The most common question asked in Differential Gene Expression (DGE) experimental design meetings at the CI is: "should we do RNA-seq or microarray processing?". It all boils down to what questions you want to answer and how the data will integrate into the bigger experiment. I have described some of the most common questions that come up, and hopefully this information will be useful in getting you thinking about the direction you want to go.
    1. Why are people doing RNA-seq?
    2. Isn't RNA-seq really expensive?
    3. What about analysis, does it take longer to analyse RNA-seq data?
    4. Do I need as many replicates? 
    5. How long does it take?
    6. How many samples can be processed at once?
    1. Why are people doing RNA-seq? RNA-seq gives you a greater dynamic range than microarray: RNA-seq is a digital readout (counting the number of reads) whereas microarray is an analogue readout (fluorescence units), which can be useful if you are looking at the extremes of expression. You may also, if you wish, later take your prepared library and do a different type of sequencing and analysis to look for splice junctions and other transcriptional changes, though it is important to remember that wanting more than DGE needs a completely different experimental design. For microarray processing you are constrained by the design of the array and the species it covers; with RNA-seq there is no such restraint, as long as you have an adequate reference genome/transcriptome to align your data to. There are lots of technical reasons why you would choose one method over the other, but I do not think you can ignore the fact that RNA-seq is the newer technology, and some people choose it because it is fashionable and may seem more attractive for publications.
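The digital-versus-analogue distinction can be illustrated with a toy model. The gain and the 16-bit scanner saturation ceiling below are assumptions for illustration only, not figures from either platform.

```python
# Toy model: an analogue scanner reading clips at its ceiling, while a
# digital read count keeps scaling with expression (illustrative numbers).
def microarray_signal(expression, gain=10, saturation=65535):
    """Analogue fluorescence reading, clipped at a 16-bit scanner ceiling."""
    return min(expression * gain, saturation)

def rnaseq_counts(expression, depth_factor=1.0):
    """Digital readout: read counts scale roughly linearly with expression."""
    return round(expression * depth_factor)

# A 100-fold difference at the high end is invisible on the saturated
# array but preserved in the read counts.
microarray_signal(10_000), microarray_signal(1_000_000)  # both clipped to 65535
rnaseq_counts(10_000), rnaseq_counts(1_000_000)          # 10000 vs 1000000
```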

    2. Isn’t RNA-seq really expensive? Currently the biggest cost in sequencing is the library preparation, and in the core we are investigating alternative suppliers to reduce this cost. Nonetheless, sequencing currently costs approximately the same as microarray for DGE analysis within the CIGC.

    3. What about analysis, does it take longer to analyse RNA-seq data? By its nature, RNA-seq creates more data, so it requires a significant amount of computing time just to process the information. That aside, similarly stringent workflows and pipelines are in place to produce a comprehensive gene list for the comparison in both processes.

    4. Do I need as many replicates? Yes. The design of a DGE experiment remains similar for both RNA-seq and microarray, including the replication requirements. Therefore the number of replicates recommended for the experiment will be the same for either RNA-seq or microarray processing.

    5. How long does it take? RNA-seq takes about the same amount of time to process samples in the lab as microarray samples. For both it takes just under a week to get to QC’ed cRNA (microarray) or normalised pooled libraries (RNA-seq). We process both protocols within the institute. However, as we no longer have a working microarray scanner on site, the guys at the Department of Pathology kindly perform the scanning step for us.

    6. How many samples can be processed at once? Microarray project designs are constrained to multiples of 12 to get the most out of the consumables, due to the way they are manufactured. RNA-seq utilises 96 individual indexes, so if you are processing fewer than 94 samples (we use 2 indexes for positive controls) all samples can be pooled together. It gets a little more complicated for larger projects, but this is also true for microarray processing.
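The pooling arithmetic above can be sketched as follows. The function names are ours, and we assume, as described, that 2 of the 96 indexes in every pool are reserved for positive controls and that microarray consumables come in units of 12 samples.

```python
import math

def pools_needed(n_samples, n_indexes=96, n_controls=2):
    """Number of RNA-seq pools needed when each pool shares one index set.

    Assumes n_controls indexes per pool are reserved for positive controls,
    leaving n_indexes - n_controls (94 here) usable per pool.
    """
    usable = n_indexes - n_controls
    return math.ceil(n_samples / usable)

def microarray_slides(n_samples, samples_per_slide=12):
    """Microarray consumables are used in multiples of 12 samples."""
    return math.ceil(n_samples / samples_per_slide)

pools_needed(94)       # -> 1 (everything fits in a single pool)
pools_needed(150)      # -> 2 pools
microarray_slides(30)  # -> 3 slides, with 6 positions unused
```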

    An Intro to BioNano's Irys System

    By (James@cancer) from The Genomics Core blog. Published on Jun 20, 2014.

    The BioNano Irys System is an alternative way to look at whole genomes, specifically structural variation. Although it isn't classed as an NGS instrument, it has some similarities and could be beneficial when used in conjunction with sequencing. Like NGS, the Irys system can be used for genome mapping and de novo assembly. BioNano boasts that the Irys can generate several Gb per hour and can run long sections of DNA at a time.

    [Image from the BioNano website]
    Like NGS, the sample requires some preparation prior to running, whereby the DNA is fluorescently labelled. Unlike NGS, this method does not require PCR, which eliminates the possibility of PCR bias. Single-stranded DNA is used to prevent it tangling on itself.

    The prepared DNA is fed into the Irys nanochannels; it untangles in solution and is moved along the flowcell by electrophoresis. The current is briefly turned off, the DNA stretches out, and the machine images it by exciting the fluorescent labels. Each image captures hundreds of thousands of bases of labelled sequence motifs, and this is the data used to create the genome map.

    The instrument costs $295,000, which is expensive, but not if you're used to purchasing NGS instruments.


    Welcome to the genomics core blog

    By (James@cancer) from The Genomics Core blog. Published on May 29, 2014.

    This is the welcome post for a new venture in the genomics core at the Cancer Research UK Cambridge Institute. This blog will be written by the members of the lab and is likely to focus on new technologies, interesting publications, tweaks to methods and anything else that takes our interest that we think you'll enjoy reading about.

    We're writing it for users of our core lab, but would be happy to get comments back from anyone who finds the content useful. I've been blogging personally for a couple of years now and have persuaded my group that it is something they will enjoy, and that it won't take up too much time. I'll report back in a year on whether it's worked out as we hope!

    You'll get posts from:
    • James Hadfield - Core Facility Manager
    • Sarah Leigh-Brown - Core Facility Deputy
    • Michelle Pugh - Senior Scientific Officer
    • Hannah Haydon - Scientific Officer
    • Fatimah Bowater - Scientific Officer
    • Rosalind Launchbury - Scientific Officer