数据差异分析_TCGA数据挖掘（四）：表达差异分析（4）

在之前我们的文章：TCGA数据挖掘(三)：表达差异分析中，我们利用的是TCGAbiolinks包中的TCGAanalyze_DEA函数进行差异表达分析，我们也提到可以选择基于limma或edgeR包进行分析，TCGA数据挖掘(三)：表达差异分析这一讲中我们利用的是edgeR包，之后我们在文章：TCGA数据挖掘(四)：表达差异分析(2)和TCGA数据挖掘(四)：表达差异分析(3)中分别也介绍了其他方法的差异分析，包括edgeR和DESeq包，今天这一讲，我们就利用TCGAbiolinks包中的TCGAanalyze_DEA函数基于limma包进行差异分析。

数据下载

基因表达数据的下载

数据下载代码和之前的一样，这里再提供一次。避免出错不知道原因。

setwd("E:\\BioInfoCloud\\TCGABiolinks\\case_study1")# 加载相应的包，可能会需要其他包，提示错误就安装缺少的包。# 因为每个人已经安装的包不一样。library(TCGAbiolinks)library(plyr)library(limma)library(biomaRt)library(SummarizedExperiment)# 请求数据。query "TCGA-BRCA",                  data.category = "Transcriptome Profiling",                  data.type = "Gene Expression Quantification",                   workflow.type = "HTSeq - Counts")# 从query中获取结果表，它可以选择带有cols参数的列，并使用rows参数返回若干行。# 1222个barcode                 samplesDown "cases"))# samplesDown中筛选出TP(主要实体瘤)样本的barcode# 1102个barcodedataSmTP                                   typesample = "TP")# 113个barcodedataSmNT                                   typesample = "NT")# 选择100个正常组织和100个肿瘤组织样本作为研究对象dataSmTP_short dataSmNT_short # 根据前面的筛选，再次请求数据queryDown "TCGA-BRCA",                       data.category = "Transcriptome Profiling",                      data.type = "Gene Expression Quantification",                       workflow.type = "HTSeq - Counts",                       barcode = c(dataSmTP_short, dataSmNT_short))#读取下载的数据并将其准备到R对象中，在工作目录生成brca_case1.rda文件，#同时还产生Human_genes__GRCh38_p12_.rda文件GDCdownload(query = queryDown)dataPrep1 "brca_case1.rda")

数据处理

# 去除dataPrep1中的异常值，dataPrep数据中含有肿瘤组织和正常组织的数据dataPrep object = dataPrep1,                                       cor.cut = 0.6,                                      datatype = "HTSeq - Counts")#使用EDASeq进行标准化dataNorm                                       geneInfo = geneInfoHT,                                      method = "gcContent")#根据分位数一处count较低的基因。dataFilt                                   method = "quantile",                                     qnt.cut =  0.25)# 将计数数据转换为log2计数/百万(LogCPM)，估计均值-方差关系并使用此关系计算适当的观测级别权重。# 然后，数据即可用于线性建模。v.dataFilt#in order to have 2 batches with multiple samples and avoid batches with one sample#the user need to call first the get_IDs function on the top of this script if this has not been done alreadyc1"E9")c2"E2")#taking log transformed data for exploration of batch effectsTCGAbatch_Correction(tabDF = v.dataFilt$E[,c(c1,c2)], batch.factor="TSS", adjustment=NULL)#if you get error in plot.new():figure margins too large - please use:#par(mar=c(1,1,1,1))#and then rerun the command above

#### After exploration of the batches we can continue to work on the original gene matrix for DEA ##################Tumor purity filtering##############vector containing all TCGA barcodes that hhave 60% tumor purity or morePurity.BRCA################DEA with Molecular subtypes################Molecular subtypes:####Filtering data so all samples have a pam50 subtype for BRCA:#diff contains TCGA samples that have an available molecular subtype###Also Applying Tumor purity filtering by making a set difference#documentation for TCGA_MolecularSubtype is available##########Important############in this example no change happens because all the TCGA barcodes####have available subtype info###setdiff() is here used to remove samples with no subtype info from the TCGA BRCA samples that satisfy###tumor purity criteriadiffdataFilt.brca.cancerdataFilt.brca.normaldataFilt.brcamol_subtypes# All barcodes have available molecular subtype infomol_subtypes# to convert ENSEMBL id to gene namerownames(dataFilt.brca)# we redo here the normalization and filtering steps dataNorm.brca                                            geneInfo = geneInfo,                                           method = "gcContent")# 将标准化后的数据再过滤，得到最终的数据dataFilt.brca.final                                        method = "quantile",                                        qnt.cut =  0.25)

表达差异分析

这里利用的是TCGAbiolinks包中的TCGAanalyze_DEA函数，是基于limma包的差异分析。

DEG.brca.TSS final,                            pipeline="limma",                            batch.factors = c("TSS"),                            Cond1type = "Normal",                            Cond2type = "Tumor",                            fdr.cut = 0.01 ,                            logFC.cut = 1,                            voom = TRUE,                            Condtypes = mol_subtypes,                            contrast.formula = "Mycontrast = (BRCA.LumA+BRCA.LumB)/2 -Normal")#in this DEA we use Normal as a reference, thus genes with LogFC > 1 are up regulated in the subtypes#with respect to the normal samples and viceversa for the LogFC < -1.write.csv(DEG.brca.TSS, "DEGsBRCA_limma_091018.csv", quote = FALSE)#3241 genes identified#to check how many with logFC > 5 (for plotting purposes)DEG.brca.filt.TSS= 6), ]#47 genes identified

可视化

利用火山图进行可视化。

TCGAVisualize_volcano(DEG.brca.TSS$Mycontrast$logFC, DEG.brca.TSS$Mycontrast$adj.P.Val,                      filename = "LuminalABvsNormal_FC6.TSS.pdf", xlab = "logFC",                      names = rownames(DEG.brca.TSS$Mycontrast), show.names = "highlighted",                      x.cut = 1, y.cut = 0.01,                       highlight = rownames(DEG.brca.TSS$Mycontrast)[which(abs(DEG.brca.TSS$Mycontrast$logFC) >= 6)],                      highlight.color = "orange")

在之前利用edgeR包获得的结果图如下，我们简单比较一下2中方法的差异。

很明显看到2中方法得到的结果是有差异的，这有时候我们会纠结那种方法好，这个就看你自己的研究啦，什么结果符合自己的用什么，当然这有点投机取巧的感觉，最好是2中方法得到后取交集。

不过下面我们对这2种方法得到的结果进行一个相关性评估。

limma和edgeR结果的相关性比较

library(ggplot2)DEGs_limma "DEGsBRCA_limma_091018.csv", row.names = DEGs_edgeR "DEGsBRCA_edgeR_091018.csv", row.names = genes_toTest 1:genes_common dataset select = dataset colnames(dataset) "logFC_edgeR", pdf("scatterplot_logFC_limma_edgeR_top1000.pdf", width=5, height=3)ggplot(dataset, aes(x=logFC_limma,y=logFC_edgeR))+geom_point()+  xlab("logFC_limma")+ylab("logFC_edgeR")+  theme(legend.title=element_blank(),axis.title=element_text(size=10),axis.text=element_text(size=10),        legend.text=element_text(size=20))dev.off()cor(dataset$logFC_edgeR, dataset$logFC_limma)  #0.9936

这2种方法得到的结果虽然有差异，但很小得到的大多数差异基因是一样的。

数据差异分析_TCGA数据挖掘（四）：表达差异分析（4）

浏览过的版块