Experimental Setup Questions

Have been treated/isolated in the same way as the target samples (i.e., if the target is FFPE then the controls must be FFPE, … etc)
Are diploid, copy number neutral.

It is not possible to analyze a single sample, or a series of samples that all have copy number alterations, without a control sample.

What are the standard recommendations for an RNASeq Differential Gene Expression project?

If you are interested in Differential Gene Expression then the following are the standard recommendations:

– Read Length 50bp Paired End

– Read Depth 20-30 million Reads per sample. But please see the special cases below.

– Technical replicates are not needed, but

– Biological replicates are _VERY_ important. You need at least 3 replicates, and that is the bare minimum. Studies have shown that more replicates increase the number of differentially expressed genes you can detect. For more details please see the following:

https://bioinformatics.mskcc.org/bic/how-many-biological-replicates-are-needed-in-an-rna-seq-experiment/

Some special cases to consider:

– If you are trying to detect lowly expressed transcripts such as transcription factors, you may need to sequence to a higher depth, perhaps 60-80 million reads. However, there is no simple way to estimate this precisely, and at some point there are diminishing returns in the number of transcripts detected.

Analysis Questions

How do I access my results?

Click here for information about accessing your results.

What tool can I use to work with BAM files?

There are a number of tools that will allow you to work with BAM files.

IGV (http://software.broadinstitute.org/software/igv/)
SAMTOOLS (http://www.htslib.org/)
PICARD (https://broadinstitute.github.io/picard/)

There are also many other tools that perform specific analyses.

Do you offer a bulk ATACseq pipeline?

We have a standard pipeline based on the ENCODE pipeline. Roughly it does the following:

Maps to the genome with BWA
Post-alignment processing to filter reads and apply the Tn5 nucleotide shift
Create normalized BigWig files
Peak calling with MACS2
Create a merged peak atlas
Count the reads in all samples in the atlas peaks
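
As a rough illustration of the Tn5 shift step, here is a minimal Python sketch. It assumes the commonly used convention of moving plus-strand reads by +4 bp and minus-strand reads by -5 bp. The function name, the interval representation and the offsets are assumptions for illustration; the pipeline's own implementation may differ.

```python
def tn5_shift(start, end, strand, plus_offset=4, minus_offset=5):
    """Shift a read interval to account for the Tn5 insertion offset.

    Assumed convention (not taken from the pipeline itself):
    plus-strand reads have their start moved by +4 bp,
    minus-strand reads have their end moved by -5 bp.
    """
    if strand == "+":
        return start + plus_offset, end
    if strand == "-":
        return start, end - minus_offset
    raise ValueError(f"unknown strand: {strand!r}")


# Hypothetical reads, given as (start, end, strand)
shifted_plus = tn5_shift(100, 150, "+")
shifted_minus = tn5_shift(100, 150, "-")
```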

If you want to see the actual pipeline it is available here: https://github.com/soccin/ATAC-seq

If you would like to request the service please go to https://bioinformatics.mskcc.org/bic/services/request/ and make sure to write that the data is ATAC-seq.

PCA and clustering analysis shows one sample looking very different from the replicates. Should I consider not analyzing this sample at all or is it OK to go ahead and include it in the analysis?

There is no simple answer to this question. It depends on your experimental system and how much variance is expected in it. The variance you are seeing could be typical of your system, or that sample may have failed some part of the sequencing process and be an outlier or corrupted data. Some questions to ask:

Did anything happen in the preparation of this sample that might be causing this difference?
Were the genomic core’s QC measures for this sample the same?
Are the pipeline QC metrics for this sample different in some way?

Mouse IMPACT pipeline for external academic users

Data: You do not need to transfer the data to us. Inform IGO that the Bioinformatics Core will be doing the analysis so that they grant us permission to access it.

Important design notes:

The pipeline runs somatic variant calling, which requires a pair of samples: target (tumor) and control (normal). We do have a reference control sample (normal pool) that can be used if a control is not available; however, we _strongly_ recommend that you sequence a matched control that is appropriate for your experiment. There can be significant strain variation in mouse samples, and it is difficult to filter strain variants from somatic variants without a matched control/normal.

Cost $/tumor (please email bicrequest@mskcc.org to request a quote)

Time estimate: 7-10 days after receipt of the full metadata associated with the project. The key metadata is the identification of sample type (tumor vs. normal) and the pairing of tumors with matched normals.

RNAseq Analysis Questions

How long will my RNAseq analysis take?

Once your project has been submitted to the pipeline, analysis takes 7-10 days.

Where can I find the RNASeq methods?

RNASEQ methods can be found here: https://soccin.github.io/pwg-docs/methods/rnaSeq.html

How do you define a differentially expressed gene?

Genes meeting a fold change cutoff of 2 (i.e., log2(2) = 1 on the log2 scale), an adjusted p-value cutoff of 0.05, and a mean coverage of at least 15. Note that genes with a count of 0 in one condition are also included even though their p-values are not significant.
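
The three cutoffs above can be sketched as a simple filter. This is only an illustration in Python, not the pipeline's code. The `GeneResult` class and its field names are hypothetical. Applying the fold change cutoff in both directions (absolute log2 fold change) and using a strict `<` for the adjusted p-value are assumptions. The special handling of genes with a count of 0 in one condition is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    # Hypothetical record; field names are illustrative only.
    gene: str
    log2_fold_change: float
    padj: float
    mean_count: float


def is_differentially_expressed(result, min_abs_log2fc=1.0,
                                max_padj=0.05, min_mean_count=15.0):
    """Apply the fold change, adjusted p-value and coverage cutoffs."""
    return (abs(result.log2_fold_change) >= min_abs_log2fc
            and result.padj < max_padj
            and result.mean_count >= min_mean_count)


# Made-up example values
genes = [
    GeneResult("geneA", 2.3, 0.001, 120.0),   # passes all three cutoffs
    GeneResult("geneB", 0.4, 0.001, 500.0),   # fold change too small
    GeneResult("geneC", -1.8, 0.20, 80.0),    # adjusted p-value too large
    GeneResult("geneD", -1.5, 0.01, 6.0),     # coverage too low
]
selected = [g.gene for g in genes if is_differentially_expressed(g)]
```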

What is pathway analysis?

Pathway analysis utilizes knowledge from molecular databases to detect pathways/gene-sets in high throughput data.
There are two major types of pathway analysis: over-representation analysis and gene-set (enrichment) analysis. There are numerous statistical methods and computational tools that perform pathway analysis and aim to identify pathways where a significant fraction of genes change concordantly between two conditions/groups.

Over-representation analysis, often called functional enrichment analysis, starts with a predefined list of genes. Usually, these are differentially expressed genes computed for a comparison of interest. The analysis then seeks over-represented pathways by examining whether the proportion of genes from each pathway in the list exceeds the proportion that is expected randomly.
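
One common way to test whether a pathway is over-represented in a gene list is a one-sided hypergeometric test. The Python sketch below is an illustration of that idea under stated assumptions, not a description of the method the core uses. The function and parameter names are hypothetical, and the numbers in the example are made up.

```python
from math import comb


def hypergeom_upper_tail(overlap, list_size, pathway_size, background_size):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from background_size genes, of which pathway_size are in the pathway."""
    upper = min(list_size, pathway_size)
    total = comb(background_size, list_size)
    favourable = sum(
        comb(pathway_size, i)
        * comb(background_size - pathway_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


# Made-up example: 200 differentially expressed genes out of 20000,
# a pathway of 100 genes, 8 of which are in the list.
p_value = hypergeom_upper_tail(8, 200, 100, 20000)
```

A small p-value here means the overlap is larger than expected by chance. In practice it would still need multiple hypothesis correction across all pathways tested.
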
Another type of pathway analysis is Gene Set Analysis, or Gene Set Enrichment Analysis (GSEA). Gene Set (Enrichment) Analysis works in the pathway/gene-set space. It requires (1) a collection of pathways/gene-sets and (2) a computational method for identifying significantly altered pathway/gene-sets. A commonly used collection of annotated pathways/gene-sets is the Molecular Signatures Database (MSigDB).

This approach uses values for all genes in all samples measured in the study and ranks them by a quantitative measure based on gene expression in each sample. The measure can be fold-change in the comparison between two conditions/groups. In the next step, a quantitative measure for each pathway/gene-set is computed. There are many computational techniques that differ on how the pathway measures (enrichment score, ES) are computed.
To summarize, over-representation analysis starts with a predefined list of genes and hence requires a priori decision making on how this list is defined. Specifically, a list of differentially expressed genes can be identified by specifying fold-change threshold (usually log2FC=1) and p-value cut-off (FDR=0.05). Both log2 fold-change and p-value cutoff can be adjusted depending on the experiment.

Gene Set (Enrichment) Analysis uses all genes, and computations are performed in the gene-set/pathway space. If expression of a (sufficiently) large number of genes from a certain pathway/gene-set changes by only about 30%, these genes will not be detected by differential expression analysis at the gene level, and hence will not be detected by the over-representation method. At the same time, GSA will identify a pathway with concordant, albeit small, changes in a large number of genes.
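
To make the ranking-based idea concrete, here is a minimal Python sketch of one simple enrichment score: an unweighted running sum over the ranked gene list, in the style of a Kolmogorov-Smirnov statistic. As noted above, methods differ in how the score is computed, so this is one possible variant chosen for illustration and not necessarily the one used by the core's GSA. All names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up for genes in the set and
    down for genes outside it; return the largest deviation from zero.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Made-up example: genes ranked from most up-regulated to most down-regulated
ranked = ["g1", "g2", "g3", "g4"]
score_top = enrichment_score(ranked, {"g1", "g2"})
score_bottom = enrichment_score(ranked, {"g3", "g4"})
```

A positive score indicates that the set is concentrated at the top of the ranking and a negative score that it is concentrated at the bottom. Significance is usually assessed separately, for example by permutation.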

Which linkage method is used in clustering? (i.e. single linkage, average linkage, or complete linkage)

For PCA/MDS and the dendrogram we use normalized counts of all genes, after filtering out very lowly expressed genes (we believe the current threshold is at least 15 counts in each group).

We use Euclidean distance and the default method of the R hclust() function, which is complete linkage.
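
For readers unfamiliar with the method: R's hclust() defaults to complete linkage, where the distance between two clusters is the largest pairwise distance between their members. The pure-Python sketch below illustrates that rule with Euclidean distance. It is not the pipeline's code (which uses R), and the function name and toy data are made up.

```python
from math import dist  # Euclidean distance, Python 3.8+


def complete_linkage(points):
    """Agglomerative clustering with complete linkage.

    Returns the merge history as (cluster_a, cluster_b, height) tuples,
    where clusters are tuples of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: farthest pair of members
                d = max(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy data: three samples described by two values each
history = complete_linkage([(0.0, 0.0), (0.0, 1.0), (5.0, 0.0)])
```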

Can I get FPKM’s as output for my RNAseq projects?

First, you should not use FPKM’s. Even the “creator” of FPKM’s does not think they are a good idea. The more appropriate measure is TPM. Please see https://bioinformatics.mskcc.org/bic/why-you-should-not-use-fpkms/.

Second, the default RNAseq pipeline works at the gene level not the transcript level. This is what most people need and this is a simpler pipeline that allows us to keep the cost of RNAseq analysis low. If you want to use TPM then you need to work at the transcript level. This service is not included in the price of standard RNAseq. If you do need transcript level analysis and normalization please contact the core for a price quote.
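
For reference, the difference between the two measures can be seen from their textbook definitions. The Python sketch below uses plain feature lengths rather than the effective lengths that tools such as RSEM estimate, so it is an illustration only. The function names and the numbers are made up.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of feature per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (total * length)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    rate_sum = sum(rates)
    return [r * 1e6 / rate_sum for r in rates]


# Made-up counts and lengths for two features
counts = [100, 200]
lengths = [1000, 2000]
fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

TPM values always sum to one million within a sample, whereas the sum of FPKM values varies from sample to sample. This is one reason TPM is easier to compare across samples.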

Does the bioinformatics core perform bulk RNAseq or single cell RNAseq?

The bioinformatics core can help with either. For bulk RNAseq we have a standard pipeline. Details can be found here:

https://soccin.github.io/pwg-docs/methods/rnaSeq.html

For single cell we are still developing methods, but we have an analysis workflow based on Seurat:

https://satijalab.org/seurat/

What is the gene set analysis (GSA) output?

Click here for the description of gene set analysis (GSA) output.

If I have a dataset with RPKM values, how can I compare this to my data?

That depends. First, RPKM values have fallen out of favor and really should not be used; see:

Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Wagner GP, Kin K, Lynch VJ. Theory Biosci. 2012 Dec;131(4):281-5. doi: 10.1007/s12064-012-0162-3. Epub 2012 Aug 8.

So if it is possible to re-analyze the previous dataset I would strongly recommend that.

However, if that is not possible, then you would need to convert your new dataset to RPKM. But here there is a second huge problem: unless that is done in exactly the same way as for the previous dataset, you may have severe batch effect problems. How you would deal with these depends on the details of the experiment.

Some potential issues.

(1) We have a method that will compute FPKM, but it is not Cufflinks. We no longer use that program; we use RSEM, which is faster and better and also computes the more accepted TPM measurement. Without knowing how your RPKM values were computed, I cannot know whether they are compatible with the FPKM values we would compute.

There are other ways to convert from counts to RPKM but again which way would be best is hard to know without more details.

(2) The gene model we use is GENCODE v18. The gene model is a critical input, and different gene model files are likely to introduce biases in a cross-comparison. Again, without knowing the gene model used in your dataset there is no way to know how compatible things will be.

(3) There is a final problem as to whether the data was sequenced in a compatible way. If the previous dataset was sequenced with a different library prep and sequencing mode (single vs. paired end, read length differences), then comparing the datasets and dealing with batch effects will be substantially harder, if not impossible.

How do I request a rerun of my RNAseq analysis?

If you need to rerun your analysis to change groupings, exclude a sample, or for some other reason, the core will provide one free re-analysis, if requested. Additional reruns will be charged at the hourly rate for time used. If you would like to request this service please go to: https://bioinformatics.mskcc.org/bic/services/request/. Alternatively, users are able to re-analyze their project on their own. Instructions can be found here: https://bicdelivery.mskcc.org/faqs/diffanalysis.

How are the pval and P.adj calculated in the ResDESeq files?

The p-value is computed using the Wald test, where, for each gene, the NULL hypothesis assumes no differential expression between two sample groups (e.g., treated vs. control, or tumor vs. normal). If the p-value is small (e.g., p<0.05), the NULL hypothesis is rejected. This is because p<0.05 means that, if the NULL hypothesis were true, a result at least as extreme as the one observed would be expected less than 5% of the time.

Since thousands of genes are being tested, a common approach is to adjust the p-values. This is called multiple hypothesis correction. The Bioinformatics pipeline uses the Benjamini-Hochberg (BH) correction to obtain adjusted p-values.
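
For illustration, the plain Benjamini-Hochberg procedure can be sketched in a few lines of Python. This is a generic sketch of the standard procedure, not the pipeline's code, and the values in a ResDESeq file may differ from it if the underlying software applies additional filtering before the correction. The function name and example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four genes
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Selecting genes with an adjusted p-value below 0.05 then controls the expected fraction of false discoveries among the selected genes at about 5%.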

What is the difference between dm6 and dm3 builds?

dm6 and dm3 are two different versions/releases of the fly genome. Although they are labeled 6 and 3, they are really consecutive releases:

dm3 == Release 5
dm6 == Release 6

dm6 is the newer release and as such is typically more accurate. I would recommend dm6 unless one of the following two conditions applies:

1) You have previous data already run under dm3 and you want to be able to combine/compare the new with the old.

2) You know of a specific annotation that is available for dm3 that is not available for dm6

Cluster/Server questions

How do I use the cluster?

Please click here for information about using the Luna cluster.

How do I download data from the NYGC?

Please click here for information about how to download data from the NYGC.

How do I access files on the HPC clusters from my MAC using FUSE mounts (sshfs)?

1. Create a folder in the desired directory

mkdir ifs

2. Mount the drive (in this case I mounted ifs)

sshfs -o auto_cache,reconnect,volname=ifs/,defer_permissions -o follow_symlinks -o IdentityFile=/path/to/.ssh/id_rsa youremail@mskcc.org:/ifs/ ifs/

3. Use any desired tool e.g. samtools, IGV

Contact the Bioinformatics Core

The Bioinformatics Core is located on the 4th Floor of the Zuckerman Research Center.

To make a request, please email bicrequest@mskcc.org.