一文学会常规转录组分析
数据来源:Cytosolic acetyl-CoA promotes histone acetylation predominantly at H3K27 in Arabidopsis;GSE79524
我只试做了转录组分析那一部分。简单概括就是为了评估乙酰化对基因表达的影响,对野生型和突变体都做了转录组分析(基因差异表达分析和GO注释)。
1. 数据获取及质控
#1.脚本查看 $ cat dir_6.txt SRR3286802 SRR3286803 SRR3286804 SRR3286805 SRR3286806 SRR3286807 $ cat 1.sh dir=$(cut -f1 dir_6.txt) for sample in $dir do prefetch $sample done $ cat 2.sh for i in `seq 2 7` do fastq-dump --gzip --split-3 --defline-qual '+' --defline-seq '@$ac-$si/$ri' ~/ncbi/public/sra/SRR328680${i}.sra -O /ifs1/Grp3/huangsiyuan/learn_rnaseq/mRNA-seq/data/ & done #2.下载及解压 sh 1.sh sh 2.sh #解压之后是这样的,可以看出是双端测序 $ ls 1.sh SRR3286802_1.fastq.gz SRR3286803_2.fastq.gz SRR3286805_1.fastq.gz SRR3286806_2.fastq.gz 2.sh SRR3286802_2.fastq.gz SRR3286804_1.fastq.gz SRR3286805_2.fastq.gz SRR3286807_1.fastq.gz dir_6.txt SRR3286803_1.fastq.gz SRR3286804_2.fastq.gz SRR3286806_1.fastq.gz SRR3286807_2.fastq.gz #3.质量检测:fastqc,multiqc ls *fastq.gz | while read id do fastqc ${id} & done multiqc *fastqc* #4.过滤接头序列及低质量reads片段 ##就这组数据来说,质量检测的结果表明数据质量很好,因此这里省略的过滤这一步。
2. 下载gff/gtf注释文件并提取出感兴趣的基因/转录本区间
一个基因可能对应不同的转录本,不同的转录本可能对应不同的功能。
以拟南芥的gff注释文件为例:
#提取基因 $ less Arabidopsis_thaliana.TAIR10.42.gff3 | awk '{ if($3=="gene") print $0 }' | cut -d ";" -f1 | head -n 5 1 araport11 gene 3631 5899 . + . ID=gene:AT1G01010 1 araport11 gene 6788 9130 . - . ID=gene:AT1G01020 1 araport11 gene 11649 13714 . - . ID=gene:AT1G01030 1 araport11 gene 23121 31227 . + . ID=gene:AT1G01040 1 araport11 gene 31170 33171 . - . ID=gene:AT1G01050 #提取转录本 $ less Arabidopsis_thaliana.TAIR10.42.gff3 | awk '{ if($3=="mRNA") print $0 }' | cut -d ";" -f1 | head -n 10 1 araport11 mRNA 3631 5899 . + . ID=transcript:AT1G01010.1 1 araport11 mRNA 6788 8737 . - . ID=transcript:AT1G01020.6 1 araport11 mRNA 6788 8737 . - . ID=transcript:AT1G01020.2 1 araport11 mRNA 6788 9130 . - . ID=transcript:AT1G01020.3 1 araport11 mRNA 6788 9130 . - . ID=transcript:AT1G01020.5 1 araport11 mRNA 6788 9130 . - . ID=transcript:AT1G01020.4 1 araport11 mRNA 6788 9130 . - . ID=transcript:AT1G01020.1 1 araport11 mRNA 11649 13714 . - . ID=transcript:AT1G01030.1 1 araport11 mRNA 11649 13714 . - . ID=transcript:AT1G01030.2 1 araport11 mRNA 23121 31227 . + . ID=transcript:AT1G01040.1
从ID可以看出AT1G01020基因有6个转录本。这篇文献比较的是基因层面的表达差异,所以这里我提取的是基因gff,算上线粒体和叶绿体上的基因一共27655个。
$ less Arabidopsis_thaliana.TAIR10.42.gff3 | awk '{ if($3=="gene") print $0 }' > gene27655.gff
3. 将reads比对到参考基因组
3.1 为参考基因组建立索引
nohup ~/mysoft/hisat2-2.1.0/hisat2-build Arabidopsis_thaliana.TAIR10.dna.toplevel.fa Arabidopsis_thaliana &
3.2 比对
$ cat 3.sh for i in `seq 2 7` do hisat2 -x ~/learn_rnaseq/mRNA-seq/ref/Arabidopsis_thaliana -p 8 \ -1 ~/learn_rnaseq/mRNA-seq/data/SRR328680${i}_1.fastq.gz \ -2 ~/learn_rnaseq/mRNA-seq/data/SRR328680${i}_2.fastq.gz \ -S ~/learn_rnaseq/mRNA-seq/map/SRR328680${i}.sam done $ nohup sh 3.sh &
从报告文件来看比对率都挺高的,97%以上。
3.3 sam转bam, 并排序
$ cat 1.sh for i in `seq 2 7` do /ifs1/Software/biosoft/samtools-1.9/samtools view -@ 8 -Sb SRR328680${i}.sam > SRR328680${i}.bam /ifs1/Software/biosoft/samtools-1.9/samtools sort -@ 8 -n SRR328680${i}.bam > SRR328680${i}.n.bam done $ nohup sh 1.sh &
4. 计算表达量
Usage: featureCounts [options] -a-o input_file1 [input_file2] ... nohup ~/mysoft/subread-1.6.3-source/bin/featureCounts -F GTF -t gene -g gene_id -T 4 -a ~/learn_rnaseq/mRNA-seq/ref_ht/gene27655.gff -o ~/learn_rnaseq/mRNA-seq/count/test ~/learn_rnaseq/mRNA-seq/map/SRR328680*.n.bam & #上面这一步运行完之后会多出test.summary,test这两个文件 nohup ~/mysoft/subread-1.6.3-source/bin/featureCounts -F GTF -t gene -g gene_id -T 4 -p -a ~/learn_rnaseq/mRNA-seq/ref_ht/gene27655.gff -o ~/learn_rnaseq/mRNA-seq/count/test2 ~/learn_rnaseq/mRNA-seq/map/SRR328680*.n.bam 2> log2 & #上面这行只多了-p选项,为了看看在双端测序中统计fragments和reads有什么区别。运行完了多出test2.summary,test2两个文件。
从test2和text两个矩阵文件中,可以看出加了p之后求的是fragments的计数,数值大概是reads计数的1/2,这很好理解,因为fragment的定义中就是包含了一对reads的。
$ lsx test2 |head -n 20 | tail -n 5 AT1G01140 1 64166 67774 - 3609 2309 2073 2323 1522 1742 1254 AT1G01150 1 69911 72138 - 2228 3 0 0 1 2 4 AT1G01160 1 72339 74096 + 1758 766 730 857 745 819 747 AT1G01170 1 73931 74737 - 807 1796 1758 1748 1061 1540 1340 AT1G01180 1 75390 76845 + 1456 374 315 311 358 284 312 $ lsx test |head -n 20 | tail -n 5 AT1G01140 1 64166 67774 - 3609 4587 4118 4613 3025 3469 2484 AT1G01150 1 69911 72138 - 2228 6 0 0 2 4 8 AT1G01160 1 72339 74096 + 1758 1449 1383 1637 1412 1538 1409 AT1G01170 1 73931 74737 - 807 3220 3121 3169 1914 2771 2369 AT1G01180 1 75390 76845 + 1456 745 619 619 708 558 621
将test2文件中的这7列提取出来即为表达矩阵。
grep "^AT" test2 | cut -f1,7,8,9,10,11,12 > 6sample_count_matrix.txt
5. 差异表达分析
使用DESeq2 包。
a <- as.matrix(read.table("6sample_count_matrix.txt",sep="\t",header = T,row.names=1)) condition <- factor(c(rep("WT",3),rep("acc1.5",3)), levels = c("WT","acc1.5")) colData colData condition WT1 WT WT2 WT WT3 WT acc1.51 acc1.5 acc1.52 acc1.5 acc1.53 acc1.5 #确保 > all(rownames(colData) == colnames(a)) [1] TRUE dds <- DESeqDataSetFromMatrix(a, colData, design= ~ condition) dds <- DESeq(dds) #运行结束会报告 estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing #得到初步结果并查看 res <- results(dds, contrast=c("condition", "acc1.5", "WT")) ##log2(fold change)也是一个衡量差异显著性的指标 resLFC <- lfcShrink(dds, coef="condition_acc1.5_vs_WT", type="apeglm") resLFC ##按照p值由小到大排列 resOrdered <- res[order(res$pvalue),] resOrdered ##矫正p值小于0.001的有多少个差异基因 sum(res$padj < 0.001, na.rm=TRUE) ##画个MA-plot看看大体趋势 plotMA(res, ylim=c(-2,2)) plotMA(resLFC, ylim=c(-2,2)) #按照自定义的阈值提取差异基因并导出 diff_gene_deseq2 <-subset(res, padj 1) write.csv(diff_gene_deseq2,file= "DEG_acc1.5_vs_WT.csv")
6. 简单的GO注释
首选clusterProfiler包。
library(”clusterProfiler“) library("org.At.tair.db") e <- read.table("DEG_acc1.5_vs_WT.csv", header = T,sep = ",") f <- e[,1] #转换ID并将ENTREZID编号作为后面的识别ID ids <- bitr(f, fromType="TAIR", toType=c("SYMBOL","ENTREZID"), OrgDb="org.At.tair.db") f <- ids[,3]
按照GO系统给基因分类
ggo head(ggo) ID Description Count GeneRatio GO:0005886 GO:0005886 plasma membrane 237 237/718 GO:0005628 GO:0005628 prospore membrane 0 0/718 GO:0005789 GO:0005789 endoplasmic reticulum membrane 3 3/718 GO:0019867 GO:0019867 outer membrane 6 6/718 GO:0031090 GO:0031090 organelle membrane 63 63/718 GO:0034357 GO:0034357 photosynthetic membrane 23 23/718
富集分析
ego2 head(ego2) ID Description GeneRatio BgRatio pvalue p.adjust qvalue GO:0009505 GO:0009505 plant-type cell wall 20/654 274/24998 3.989815e-05 0.002907596 0.002713546 GO:0048046 GO:0048046 apoplast 22/654 328/24998 5.995043e-05 0.002907596 0.002713546 geneID GO:0009505 ATBXL2/ATPMEPCRA/LRX1/AtWAK1/ATPME2/GLL22/LRX2/SS3/ATPAP1/ATC4H/CASP1/BGLU15/RHS12/BGAL1/ATGRP-5/ATPCB/EARLI1/ATPGIP2/FLA13/PME5 GO:0048046 iPGAM1/ATDHAR1/AOAT1/ANN1/ATPEPC1/GDPDL1/CLE12/AGT/ENO2/GOX1/ATPCB/AtBG2/CRK9/CORI3/PMDH2/BETA/UDG4/ATSBT4.12/LTP3/AtPRX71/ATBXL4/ANNAT2 Count GO:0009505 20 GO:0048046 22
GO基因集富集分析
geneList = 2^e[,3] names(geneList)=as.character(ids[,3]) geneList = sort(geneList, decreasing = TRUE) ego3 <- gseGO(geneList = geneList, OrgDb = org.At.tair.db, ont = "CC", nPerm = 1000, minGSSize = 100, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE) head(ego3)
enrichGO和gseGO这两个都用得比较多。
可视化
barplot(ggo, drop=TRUE, showCategory=12) dotplot(ego2)
自己的分析略显简单,没有过多的细化,后面随着学习的深入再来补充。注:第2步提取gene_feature这一步没有必要,因为软件可以自动识别feature。——2019.8.10
reference
Statistical analysis and visualization of functional profiles for genes and gene clusters:
Analyzing RNA-seq data with DESeq2:
http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
因水平有限,有错误的地方,欢迎批评指正!