Commonly Used Bioinformatics Methods

The generic text below can be used in grant applications that rely on bioinformatics pipelines or analysis methods. Feel free to use it, or contact us for more detailed information.

As of 06/21/22, GSEA has replaced GSA: https://www.gsea-msigdb.org/gsea/index.jsp.

Analysis of Genomic Profiling Data: Arrays and Next-Generation Sequencing Data

Mutation detection pipeline for MSK-IMPACT and WES capture data. Tumors will be profiled for genomic alterations either in key cancer-associated genes (the 341- or 410-gene panel) using the custom, deep-sequencing MSK-IMPACT assay (Integrated Mutation Profiling of Actionable Cancer Targets) or across the exome using the Agilent Whole Exome Array. Genomic DNA from tumor and patient-matched normal samples will be used for sequencing library preparation and target capture (IMPACT or Whole Exome) as described by Won HH et al.[1] Pooled libraries containing captured DNA fragments are sequenced on the Illumina HiSeq system as 2x100bp paired-end reads. The sequence data is demultiplexed using bcl2fastq Conversion Software (Illumina), which generates FASTQ files for input into the main pipelines.

The raw sequence data is then analyzed with the research sequence analysis pipeline, which was engineered to achieve mutation call parity with the pipeline used clinically by the Diagnostic Molecular Pathology Service at MSKCC. The pipeline is a flexible, automated system that has been extensively validated on thousands of specimens processed for MSK-IMPACT and for exome or genome sequencing. It is based on standard best practices using freely available open-source tools/programs [2-8] together with custom-written scripts and programs. The custom code is being made available in publicly accessible repositories on GitHub (github.com/soccin and github.com/mskcc). Currently, these code bases are not easily portable to other systems, but the full source code is available for inspection, and we will develop a portable version that others can use. The pipeline detects the full range of somatic and germline abnormalities, including point mutations; small insertions and deletions; and total, allele-specific, and integer DNA copy number. Pipeline results are available in multiple formats and are imported automatically into the cBioPortal [9, 10] for visualization and dissemination to the wider research community.

A brief summary of the variant pipeline is as follows. First, any adapter sequences are removed from the 3ʹ end of reads. Reads are then aligned to the hg19 b37 version of the human genome using the BWA-MEM algorithm.[4] Local realignment and quality score recalibration are performed using the Genome Analysis Toolkit (GATK) according to GATK best practices [3] and the ABRA assembly-based realigner.[5] Samples are then processed through a series of computational quality control steps to ensure genomic concordance between tumor and normal tissues from the same specimen, detect the presence of tumor DNA in the normal sample, and monitor contamination involving DNA from different patients. The quality control steps consist of standard metrics from the PICARD toolkit (github.com/broadinstitute/picard) and custom methods.
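
The tumor/normal concordance check mentioned above can be illustrated with a small Python sketch. This is not the pipeline's custom method, which is not described here. It assumes that read counts at a set of common SNP sites are already available for both samples, and the site names, the `hom_cutoff` value, and the function names are all hypothetical. The idea is that a site that is homozygous in the normal sample should show the same genotype in a tumor from the same patient, so a high discordance rate suggests a sample mismatch.

```python
def call_genotype(alt_reads, depth, hom_cutoff=0.1):
    """Classify a site from its alternate-allele fraction.

    hom_cutoff is an illustrative value, not a pipeline setting.
    Returns None when the site has no coverage.
    """
    if depth == 0:
        return None
    alt_fraction = alt_reads / depth
    if alt_fraction <= hom_cutoff:
        return "hom_ref"
    if alt_fraction >= 1 - hom_cutoff:
        return "hom_alt"
    return "het"


def discordance_rate(tumor, normal):
    """Fraction of sites homozygous in the normal whose tumor genotype differs.

    tumor and normal map a site name to an (alt_reads, depth) tuple.
    Returns None when no site can be compared.
    """
    compared = 0
    discordant = 0
    for site, (n_alt, n_depth) in normal.items():
        normal_gt = call_genotype(n_alt, n_depth)
        # Only sites that are homozygous in the normal are informative here.
        if normal_gt not in ("hom_ref", "hom_alt") or site not in tumor:
            continue
        tumor_gt = call_genotype(*tumor[site])
        if tumor_gt is None:
            continue
        compared += 1
        discordant += tumor_gt != normal_gt
    return discordant / compared if compared else None
```

Heterozygous sites in the normal are skipped on purpose, because loss of heterozygosity in the tumor can legitimately change their apparent genotype.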

Paired-sample variant calling was performed on tumor samples and their respective matched normal samples to identify point mutations/single nucleotide variants (SNVs) and small insertions/deletions (indels). MuTect [7] (version 1.1.4) was used for SNV calling and the HaplotypeCaller from the GATK package (v.3.x) was used for detecting insertion/deletion events. To call somatic indels, a custom post-processing script is used that mimics the behavior of the SomaticIndelDetector (no longer available). For IMPACT samples, a number of intronic regions are tiled, and the Pindel algorithm [8] is used to look for structural rearrangements. Variants are then annotated using the Variant Effect Predictor, and annotations relative to the canonical transcript for each gene (derived from a list of known canonical transcripts obtained from the University of California, Santa Cruz genome browser) are reported. In cases where variant calling was performed for a tumor without a matched normal sample, variants with minor allele frequency >1% in the 1000 Genomes Project cohort (www.1000genomes.org) were also removed, as they were more likely to be common population polymorphisms than somatic mutations. Annotated SNV and indel calls were subjected to a series of filtering steps to ensure that only high-confidence calls were admitted to the final step of manual review. First, prior knowledge from the literature was incorporated in the analysis through a two-tiered variant-filtering scheme: variants corresponding to known hotspot mutations with extensive supporting evidence in the literature (at least 5 mentions in the Catalog of Somatic Mutations in Cancer [COSMIC] database [http://cancer.sanger.ac.uk/cosmic]) were considered first-tier events. These variants were subjected to lower requirements on coverage, number of mutant reads, and variant frequency to be considered high-confidence calls.
Second, variants detected in more than 20% of a set of historical normal samples (i.e., ≥3 mutant reads and >1% variant frequency) were considered to be likely artifacts and removed. Third, we employed the following thresholds on coverage depth (DP), number of mutant reads (AD), and variant frequency (VF) for rejecting false positive calls. First-tier variants (i.e., well-characterized hotspot mutations) were considered a separate class from novel second-tier variants. First-tier variants were filtered using the following criteria: DP ≥ 20X, AD ≥ 8, and VF ≥ 2%, compared with second-tier variants: DP ≥ 20X, AD ≥ 10, and VF ≥ 5%. (These services will be used by RP-1, aims 1 and 2; RP-2, aim 3; and RP-3, aim 2.)
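
The filtering rules above can be written down compactly. The following Python sketch is an illustration under stated assumptions, not the pipeline's actual code: the `Variant` record and all function names are hypothetical, VF is assumed to be AD divided by DP, and the historical normals are assumed to be summarized as (mutant reads, depth) pairs at the variant site. The thresholds themselves are the ones quoted in the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text: (minimum DP, minimum AD, minimum VF).
TIER_THRESHOLDS = {
    "first": (20, 8, 0.02),    # well-characterized hotspot mutations
    "second": (20, 10, 0.05),  # novel variants
}
MIN_COSMIC_MENTIONS = 5      # at least 5 COSMIC mentions -> first tier
MAX_NORMAL_FRACTION = 0.20   # seen in more than 20% of normals -> artifact


@dataclass
class Variant:
    """Hypothetical minimal variant record; field names are illustrative."""
    depth: int               # DP
    alt_reads: int           # AD
    cosmic_mentions: int = 0

    @property
    def vf(self) -> float:
        # Assumption: variant frequency is AD / DP.
        return self.alt_reads / self.depth if self.depth else 0.0


def detected_in_normal(alt_reads: int, depth: int) -> bool:
    """Detection rule from the text: >=3 mutant reads and >1% variant frequency."""
    return depth > 0 and alt_reads >= 3 and alt_reads / depth > 0.01


def is_likely_artifact(normal_pileups) -> bool:
    """normal_pileups: list of (alt_reads, depth), one entry per historical normal."""
    if not normal_pileups:
        return False
    hits = sum(detected_in_normal(ad, dp) for ad, dp in normal_pileups)
    return hits / len(normal_pileups) > MAX_NORMAL_FRACTION


def passes_tier_filter(v: Variant) -> bool:
    """Apply the first-tier or second-tier DP/AD/VF thresholds."""
    tier = "first" if v.cosmic_mentions >= MIN_COSMIC_MENTIONS else "second"
    min_dp, min_ad, min_vf = TIER_THRESHOLDS[tier]
    return v.depth >= min_dp and v.alt_reads >= min_ad and v.vf >= min_vf
```

For example, a variant with DP = 100 and AD = 8 passes as a first-tier hotspot but fails as a novel second-tier call, because the second tier requires at least 10 mutant reads.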

RNA Sequencing Expression Analysis
The output from the sequencers (FASTQ files) is mapped to the reference genome using the rnaStar aligner (https://code.google.com/p/rna-star/), which supports the mapping of spliced reads. We use the 2-pass mapping method outlined in Engstrom PG et al.,[11] in which the reads are mapped twice. The first mapping pass uses a list of known annotated junctions from Ensembl. Novel junctions found in the first pass are then added to the known junctions, and a second mapping pass is done. After mapping, we compute the expression count matrix from the mapped reads. This is done using HTSeq (www-huber.embl.de/users/anders/HTSeq) and one of several possible gene model databases; we again use the gene models (GTF) from Ensembl. The raw count matrix generated by HTSeq is then processed using the R/Bioconductor package DESeq (www-huber.embl.de/users/anders/DESeq), which is used both to normalize the full dataset and to analyze differential expression between sample groups. (These services will be used by RP-4, aims 1 and 2.)
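
To give a sense of what the normalization step does, the sketch below illustrates the median-of-ratios idea behind DESeq size factors in plain Python. The actual analysis uses the DESeq package in R. This is only a simplified illustration: the function names and the input layout (one row of raw counts per gene, one column per sample) are assumptions made for the example.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample by the median-of-ratios method.

    counts: list of gene rows, each a list of raw counts (one per sample).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    # Log of the per-gene geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide each sample's counts by that sample's size factor."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]
```

If one sample was simply sequenced twice as deeply as another, its size factor comes out twice as large, and the normalized counts of unchanged genes agree between the two samples.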

Gene Set Analysis

An unsolved challenge in the analysis of RNA expression profile data is the identification of pathways that are differentially activated between groups of samples. When approaching a pathway analysis problem, we generally use the following commercial and academic pathway analysis software suites, websites, or methods: GeneGo, DAVID,[12] Gene Set Enrichment Analysis,[13] and GSA (statweb.stanford.edu/~tibs/GSA). While these tools are generally useful for getting an idea of global and broad RNA expression trends, truly specific success stories with them are few and far between. We have therefore developed a couple of novel approaches to pathway analysis that meet the needs of MSKCC investigators.

References

1. Won, H.H., et al., Detecting somatic genetic alterations in tumor specimens by exon capture and massively parallel sequencing. J Vis Exp, 2013(80): p. e50710.

2. Al-Ahmadie, H., et al., Synthetic lethality in ATM-deficient RAD50-mutant tumors underlies outlier response to cancer therapy. Cancer Discov, 2014. 4(9): p. 1014-21.

3. DePristo, M.A., et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet, 2011. 43(5): p. 491-8.

4. Li, H. and R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009. 25(14): p. 1754-60.

5. Mose, L.E., et al., ABRA: improved coding indel detection via assembly-based realignment. Bioinformatics, 2014. 30(19): p. 2813-5.

6. Shen, R. and V. Seshan, FACETS: Fraction and Allele-Specific Copy Number Estimates from Tumor Sequencing. Memorial Sloan-Kettering Cancer Center, Dept. of Epidemiology & Biostatistics Working Paper Series, 2015. Working Paper 29.

7. Cibulskis, K., et al., Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol, 2013. 31(3): p. 213-9.

8. Ye, K., et al., Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 2009. 25(21): p. 2865-71.

9. Cerami, E., et al., The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov, 2012. 2(5): p. 401-4.

10. Gao, J., et al., Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal, 2013. 6(269): p. pl1.

11. Engstrom, P.G., et al., Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods, 2013. 10(12): p. 1185-91.

12. Dennis, G., Jr., et al., DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol, 2003. 4(5): p. P3.

13. Subramanian, A., et al., Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A, 2005. 102(43): p. 15545-50.