View on GitHub

BioinfTraining

Repository for bioinformatics training at the Norwegian Veterinary Institute

Bioinformatics training: transcriptomics

Protocol

We will follow the protocol described in Tophat2 bioinformatic protocol published in Nature protocol, 2012 for this course.

Trans_pipeline

For simulated dataset:

Tophat2 —> cufflinks —> cuffmerge —> cuffdiff —> cummeRbund
Tophat2 —> cuffdiff —> cummeRbund
Tophat2 —> featureCounts —> DESeq2/edgeR

For real life dataset

(Just because HISAT2 is new and much faster than Tophat2):

HISAT2 —> StringTie —> Ballgown (There is an option to follow something similiar to cuffmerge as well)
HISAT2 —> cufflinks pipeline (with or without cuffmerge) —> cummeRbund
HISAT2 —> featureCounts —> DESeq2/edgeR
HISAT2 worklow pipeline Nature protocol publication

Day 1

Introduction of transcriptomic analyses
Discusssion regarding reference-based and de novo approaches
How to use tophat v2 - based on the Nature protocol (2012) paper
Download genome reference and create an index using bowtie2

in /work/projects/nn9305k/home//transcriptome/ref

$ ln ../../../../bioinf_course/transcriptomics/ref/Dm.BDGP6.dna.toplevel.fa .
$ ls ../../../../bioinf_course/transcriptomics/ref/Dm.BDGP6.91.gtf .
$ bowtie2-build Dm.BDGP6.dna.toplevel.fa Dm_BDGP6_genome

USAGE

$ bowtie2-build <Reference_fasta_file_name> <bowtie2-build_ref_index_name_that_you_will_use_later>

To do

Align the given reads to the Drosophila genome using tophat v2
create a new folder called tophat in /work/projects/nn9305k/home//transcriptome/

create a slurm script with time=12:00:00, ntasks=16 and mem-per-cpu=12Gb

$ tophat -p 16 -G ../ref/Dm.BDGP6.91.gtf -o <tophat_output_folder_name> ../ref/Dm_BDGP6_genome <read1> <read2>

Day 2

Discuss results from tophat v2 alignemnt
Read about cufflinks and differential expression analysis pipeline

To do

Run cufflinks on the tophat output

Within tophat folder create a slurm script with time=12:00:00, ntasks=16 and mem-per-cpu=12Gb

$ cufflinks -p 16 -o <cufflinks_output_folder_name> <tophat_output_folder_name>/accepted_hits.bam

Day 3

Discuss about (long) using tophat -> cufflinks -> cuffmerge -> cuffdiff pipeline (To find novel transcripts and genes)
Discuss about (short) using tophat -> cuffdiff pipeline (To calculate differential expression for only known genes and transcripts)

To do

Run cuffmerge and cuffdiff on Day 2’s cufflinks output
create a text file and call it assemblies.txt and it should contain the information below

$ cat assemblies.txt
<cufflinks_output_folder_name_for_Con1_Rep1>/transcripts.gtf
<cufflinks_output_folder_name_for_Con1_Rep2>/transcripts.gtf
<cufflinks_output_folder_name_for_Con1_Rep3>/transcripts.gtf
<cufflinks_output_folder_name_for_Con2_Rep1>/transcripts.gtf
<cufflinks_output_folder_name_for_Con2_Rep2>/transcripts.gtf
<cufflinks_output_folder_name_for_Con2_Rep3>/transcripts.gtf

Within tophat folder create a slurm script with time=12:00:00, ntasks=8 and mem-per-cpu=12Gb

$ cuffmerge -o <cuffmerge_output_folder_name> -g ../ref/Dm.BDGP6.91.gtf -s ../ref/Dm.BDGP6.dna.toplevel.fa -p 8 assemblies.txt

For the long pipeline, create a slurm script

cuffdiff -o cuffdiff_long_output -p 16 -L Con1_l,Con2_l <cuffmerge_output_folder_name>/merged.gtf Con1_Rep1_tophat/accepted_hits.bam,Con1_Rep2_tophat/accepted_hits.bam,Con1_Rep3_tophat/accepted_hits.bam Con2_Rep1_tophat/accepted_hits.bam,Con2_Rep2_tophat/accepted_hits.bam,Con2_Rep3_tophat/accepted_hits.bam

For the short pipeline, create a slurm script

cuffdiff -o cuffdiff_short_output -p 16 -L Con1_s,Con2_s ../ref/Dm.BDGP6.91.gtf Con1_Rep1_tophat/accepted_hits.bam,Con1_Rep2_tophat/accepted_hits.bam,Con1_Rep3_tophat/accepted_hits.bam Con2_Rep1_tophat/accepted_hits.bam,Con2_Rep2_tophat/accepted_hits.bam,Con2_Rep3_tophat/accepted_hits.bam

Check the above two scripts and identify the difference

Day 4

Load cuffdiff output from short and long pipeline in R using cummeRbund

To do

Link to cummeRbund manual: Manual

Day 5

Introduction to counting reads

To do

Link to featureCounts manual: Manual

Day 6-7

DESeq2 and cummeRbund

To do

Link to DESeq2 webpage: Link
Link to DESeq2 manual: Link
Link to cummeRbund webpage: Link
Link to cummeRbund manual: Link
Link to DESeq2 Rscript: DESeq2.R
Link to cummeRbund Rscript: cummeRbund.R
Link to video on FPKM, RPKM and TPM): Youtube video
Link to video on DESeq2 Normalisation: Youtube video

Day 8

Gene set enrichment analysis
Gene Ontology
Pathway analyses - KEGG
Gorilla
g:Profiler
Amigo
DAVID
PANTHER

Also, one can get the ortholog information from ENSEMBL

Day 9

De novo assembly using Trinity
Link to slide: presentation slides
Trinity website: Github wiki
Trinity worklow pipeline: Nature protocol publication