sRNA-seq for Alzheimer’s disease diagnosis

Summary

Within the biomarkers research field, miRNA have become a promising molecule for diagnosis of different diseases like cancer (Loke et al. 2019) or neurodegenerative diseases (Pan et al. 2016; Kumar, Vijayan, and Reddy 2017), among others. One of the advantages of these molecules is that they are present on reachable tissues like blood, which is easier to sample than tumors or cerebrospinal fluid. Moreover, the analysis of small RNA could also provide insights of disease regulation and reveal new therapeutic approaches.

Therefore, it is important to extract as maximum information as possible from its analysis. In this page I would provide with an analysis done by myself that could serve as model or inspiration for others. If so, please refer to the web page or the article published (Not published yet, under review).

Bioinformatic processing

We used docker4seq version 2.4.0 in order to process fastq files and obtain the count matrix (Calogero et al. 2020).

library(docker4seq)
downloadContainers(group="docker")

mirnaCounts(group = "docker", fastq.folder = getwd(), 
            scratch.folder = "/data/scratch/", mirbase.id = "hsa", 
            download.status = FALSE, adapter.type = "ILLUMINA", trimmed.fastq = TRUE)

This function generates, among others, 2 text files for further processing: all.counts.txt and cpm.txt. As we had 4 fastq for each sample (that is, technical replicates), we unified same samples into a single column, summing up the counts. We also changed the identifier of each sample. Both archives were stored in the corresponding project folder for further processing.

mexpr <- read.table("all.counts.txt", header = TRUE, sep = "\t")
cpms <- read.table("cpm.txt", header = TRUE, sep = "\t", dec = ".")

indices <- seq(2,193, by = 4)
muestras <- character()
conteos <- numeric()
srna.cpm <- numeric()

for (i in indices){
  j <- i-1
  muestra <- strsplit(colnames(mexpr)[i], "_")[[1]][2]
  muestras <- c(muestras, muestra)
  conteos <- cbind(conteos,rowSums(mexpr[,i:(i+3)]))
  srna.cpm <- cbind(srna.cpm, rowSums(cpms[,j:(j+3)]))
}

conteos <- as.data.frame(conteos)
srna.cpm <- as.data.frame(srna.cpm)
colnames(conteos) <- muestras
colnames(srna.cpm) <- muestras
rownames(conteos) <- mexpr$X

save(conteos, file = "resultados/1_matriz-expresion-docker.Rdata")
save(srna.cpm, file = "resultados/1_cpm-docker.Rdata")

Differential expression analysis

In the present project, there were 48 samples coming from patients belonging to 3 different groups: Control (n=20), Early-Disease (n=8), Late-Disease (n=20). To further identify differentially expressed miRNA in both disease cases, we used 3 differential expression packages from Bioconductor: DESeq2, edgeR and NOISeq. Both DESeq2 and edgeR are based on a parametric assumption that NGS generates counts that follow a negative binomial distribution defined by the mean (μ) and the overdispersion parameter (α). In contrast, NOISeq is a package based on a non-parametric assumption. We used all of them to generate a suitable set of miRNA candidates.

K_ij ∼ NB(μ_ij, α_i)

μ_ij = β₀Control_j + β₁Early_j + β₂Late_j
As we had an unbalanced dataset, we performed the analysis in two parallel ways:

Using a binary outcome (composed of the 20 Controls and 20 Late cases).
Using the 3 outcomes (Controls, Early and Late).

First of all, we filetered out miRNAs to reduce noise, we selected these miRNA having more than 5 counts in at least 10 samples, which resulted in a total of 284 miRNA in case of the binary outcome and 347 miRNA in the dataset with all outcomes.

load("resultados/1_matriz-expresion-docker.Rdata")
grupo <- condicion$`grupo experimental` != "case_1"
grupo <- condicion$`grupo experimental`[grupo]
grupo <- factor(grupo)
conteos.bin <- conteos[,condicion$`grupo experimental` != "case_1"]
keep <- rowSums(conteos.bin >5) >= 10
conteos.bin <- conteos.bin[keep,]
sum(keep) #selects 284

## [1] 284

keep <- rowSums(conteos > 5) >= 10
conteos <- conteos[keep,]
sum(keep) #selects 347

## [1] 347

So, now we carry out the 3 differential expression analysis.

Exploratory Data Analysis

Before differential expression testing, we performed an exploratory data analysis (EDA) in order to look for outliers. As it can be seen in both datasets (PCA plots), there may be an extreme point towards the left of the plots, belonging to a Late-Disease sample (S17). This was initially considered as outlier but then the analysis was repeated and it showed no differences among the differentially expressed miRNA (de-miRNA) detected. Thus, it was not identified as outlier.

Exploratory data analysis. PCA plots to identify outliers.

DESeq2

Binary analysis identified 10 de-miRNA and multiple outcome identified 41 de-miRNA (pvalue < 0.05).

bin.model <- DESeq(bin.dds)
mcivscont.results.bin <- results(bin.model[order(results(bin.model)$pvalue),], name = resultsNames(bin.model)[2], pAdjustMethod = "fdr")
write.csv(mcivscont.results.bin[order(mcivscont.results.bin$pvalue),], file = "resultados/2-DEmiRNA_bin_MCIvsCONT.csv")

multi.model <- DESeq(multi.dds)
mcivscont.results.multi <- results(multi.model, name = resultsNames(multi.model)[2])
prevscont.results.multi <- results(multi.model, name = resultsNames(multi.model)[3])
write.csv(mcivscont.results.multi[order(mcivscont.results.multi$pvalue),], file = "resultados/2-DEmiRNA_MCIvsCONT.csv")
write.csv(prevscont.results.multi[order(prevscont.results.multi$pvalue),], file = "resultados/2-DEmiRNA_PREvsCONT.csv")

edgeR

Similar to the DESeq2 procedure, we performed the differential expression analysis using default parameters in the edgeR documentation. As a result, binary analysis identified 4 de-miRNA (one in common to DESeq2) and multiclass analysis identified 15 de-miRNA.

library(edgeR)
bin.dge <- DGEList(counts = conteos.bin)
bin.dge$samples$group <- grupo
#sum(duplicated(rownames(bin.dge)))
bin.dge <- calcNormFactors(bin.dge)
desing <- model.matrix(~grupo, data = bin.dge$samples)
bin.dge <- estimateGLMCommonDisp(bin.dge,design = desing)
bin.dge <- estimateGLMTrendedDisp(bin.dge, design = desing)
bin.dge <- estimateGLMTagwiseDisp(bin.dge, design = desing)
bin.result <- glmFit(bin.dge)
bin.result <- glmLRT(bin.result)
bin.demir.glm <- topTags(bin.result,adjust.method = "fdr")
write.csv(bin.demir.glm, file = "resultados/4-DEmiRNA-glm_bin_MCIvsCONT.csv")

multi.dge <- DGEList(counts = conteos)
multi.dge$samples$group <- as.factor(condicion$`grupo experimental`)
#sum(duplicated(rownames(multi.dge))) #no duplicates
desing <- model.matrix(~group, data = multi.dge$samples)
multi.dge <- estimateGLMCommonDisp(multi.dge,design = desing)
multi.dge <- estimateGLMTrendedDisp(multi.dge, design = desing)
multi.dge <- estimateGLMTagwiseDisp(multi.dge, design = desing)
result.glm <- glmFit(multi.dge,design = desing)
result.MCI <- glmLRT(result.glm,coef = 2)
result.precli <- glmLRT(result.glm, coef = 3)
write.csv(topTags(result.MCI), file = "resultados/4-DEmiRNA-glm_MCIvsCONT.csv")
write.csv(topTags(result.precli), file = "resultados/4-DEmiRNA-glm_PRECvsCONT.csv")

NOISeq

Finally, we performed a third analysis following a non-parametric approach, resulting in 20 miRNAs with probability higher than 0.85.

library(NOISeq)
load("resultados/1_matriz-expresion-docker.Rdata")
conteos.bin <- conteos[,condicion$`grupo experimental` != "case_1"]
bin.nsq <- readData(data = conteos.bin, factors = data.frame(grupo = grupo))
#QCreport(bin.nsq, samples = NULL, factor = "grupo", file = "resultados/5-QCreport.pdf")
set.seed(2244) #needed for reproducible result given that it is based on simulation
result.nsq <- noiseq(bin.nsq, k = 0.5, norm = "rpkm", factor = "grupo", replicates = "no") #es un noiseq-sim

## [1] "Computing (M,D) values..."
## [1] "Computing probability of differential expression..."

deg.bin <- degenes(result.nsq, q = 0.85)#prob is not pvalue!!

## [1] "20 differentially expressed features"

write.csv(deg.bin, file = "resultados/5-DEmiRNA_bin_MCI-Cont.csv")

Distribution of de-miRNA in binary analysis

Distribution of de-miRNA in multiclass analysis

Common de-miRNA identified between binary and multiclass outcome

Functional annotation

In order to get insights regarding the biological meaning of this miRNAs, we first obtained the predicted gene targets, which are scored (rank_product) from more probable to less probable (the lower the rank_product value, the better the prediction). We did it twice, one annotation with the de-miRNAs identified in the binary analysis and other annotation with the de-miRNAs identified in the multiclass analysis.

library(miRNAtap)
library(miRNAtap.db)
library(topGO)
library(org.Hs.eg.db)
library(data.table)
rm(list = ls())

miRNAs identified in the binary output

demirna.dsq <- read.csv(file = "resultados/2-DEmiRNA_bin_MCIvsCONT.csv", row.names = 1)
demirna.edg.glm <- read.csv(file = "resultados/4-DEmiRNA-glm_bin_MCIvsCONT.csv", row.names = 1)
demirna.nsq <- read.csv(file = "resultados/5-DEmiRNA_bin_MCI-Cont.csv", row.names = 1)
#sum(demirna.dsq$pvalue<0.05) # 10 selected
demirna.dsq <- demirna.dsq[demirna.dsq$pvalue<0.05,] #selects 10 miRNA
#sum(demirna.edg.glm$PValue < 0.05) # 4 selected
demirna.edg.glm <- demirna.edg.glm[demirna.edg.glm$PValue < 0.05,] #selecciona 4 miRNA
# sum(demirna.nsq$prob > 0.85) # 20 selected
demirna.bin.all <- unique(c(rownames(demirna.dsq), rownames(demirna.edg.glm), rownames(demirna.nsq)))
predictions.bin <- data.table(getPredictedTargets(demirna.bin.all[1], species = "hsa", 
                                                  method = "geom", min_src = 3), keep.rownames = TRUE)
for (i in 2:length(demirna.bin.all)) { #
  mir = demirna.bin.all[i]
  predictions.bin <- rbind(predictions.bin,data.table(getPredictedTargets(mir, species = 'hsa',
                                                                  method = 'geom', min_src = 3), 
                                                        keep.rownames = TRUE), fill = TRUE)
}
#hay dups??
#sum(duplicated(predictions.bin$rn)) #check for duplicates
predictions.bin <- predictions.bin[,lapply(.SD, mean), by = rn] #join duplicates by rank_product mean (!)
# sum(duplicated(predictions.bin$rn)) #check for duplicates
predictions.bin <- predictions.bin[order(predictions.bin$rank_product)] #reordering
predictions.bin <- predictions.bin[predictions.bin$rank_product < 10,] #rank_product filtering

GO enrichment

rankedGenes.bin <- predictions.bin$rank_product
rankedGenes.bin <- setNames(rankedGenes.bin, predictions.bin$rn)
selection = function(x) TRUE # we do not want to impose a cut off, instead we are using rank information
ontos <- c("BP", "CC", "MF")
# onto <- ontos[1]
for (onto in ontos) {
  allGO2genes = annFUN.org(whichOnto=onto, feasibleGenes = NULL,
                           mapping="org.Hs.eg.db", ID = "entrez")
  
  GOdata =  new('topGOdata', ontology = onto, allGenes = rankedGenes.bin, 
                annot = annFUN.GO2genes, GO2genes = allGO2genes, 
                geneSel = selection, nodeSize=10)
  #nodeSize es un parámetro que se usa para jerarquizar los términos GO. Es decir, se descartan aquellos GO que tengan menos de 10 genes en la lista de genes
  results.ks = runTest(GOdata, algorithm = "weight01", statistic = "ks")
  
  allRes = GenTable(GOdata, KS = results.ks, orderBy = "KS", topNodes = max(results.ks@geneData[[4]],20))
  allRes[,c('GO.ID','Term','KS')]
  write.csv(allRes[,c('GO.ID','Term','KS')], file = paste("resultados/6-GO-DEmi-bin_",onto,".csv", sep = ""))
}

miRNAs identified in the multiclass output

rm(list = ls())
demirna.dsq1 <- read.csv(file = "resultados/2-DEmiRNA_MCIvsCONT.csv", row.names = 1)
demirna.dsq2 <- read.csv(file = "resultados/2-DEmiRNA_PREvsCONT.csv", row.names = 1)
demirna.edg.glm1 <- read.csv(file = "resultados/4-DEmiRNA-glm_MCIvsCONT.csv", row.names = 1)
demirna.edg.glm2 <- read.csv(file = "resultados/4-DEmiRNA-glm_PRECvsCONT.csv", row.names = 1)
demirna.dsq1 <- demirna.dsq1[demirna.dsq1$pvalue<0.05,]
demirna.dsq2 <- demirna.dsq2[demirna.dsq2$pvalue<0.05,]
demirna.edg.glm1 <- demirna.edg.glm1[demirna.edg.glm1$PValue < 0.05,]
demirna.edg.glm2 <- demirna.edg.glm2[demirna.edg.glm2$PValue < 0.05,]
demirna.all <- unique(c(rownames(demirna.dsq1),rownames(demirna.edg.glm1),rownames(demirna.dsq2), rownames(demirna.edg.glm2)))
predictions.multinom <- data.table(getPredictedTargets(demirna.all[1], species = "hsa", method = "geom", min_src = 3), keep.rownames = TRUE)
for (i in 2:length(demirna.all)) { #
  mir = demirna.all[i]
  predictions.multinom <- rbind(predictions.multinom,data.table(getPredictedTargets(mir, species = 'hsa',
                                                                          method = 'geom', min_src = 3), keep.rownames = TRUE), fill = TRUE)
}
# sum(duplicated(predictions.multinom)) #check for duplicates
predictions.multinom <- predictions.multinom[order(predictions.multinom$rank_product)] #reordering
predictions.multinom <- predictions.multinom[predictions.multinom$rank_product < 10,] #rank_product filtering

GO enrichment

rankedGenes <- predictions.multinom$rank_product
rankedGenes <- setNames(rankedGenes, predictions.multinom$rn)
selection = function(x) TRUE # we do not want to impose a cut off, instead we are using rank information
ontos <- c("BP", "CC", "MF")
# onto <- ontos[2]
for (onto in ontos) {
  allGO2genes = annFUN.org(whichOnto=onto, feasibleGenes = NULL,
                           mapping="org.Hs.eg.db", ID = "entrez")
  
  GOdata =  new('topGOdata', ontology = onto, allGenes = rankedGenes, 
                annot = annFUN.GO2genes, GO2genes = allGO2genes, 
                geneSel = selection, nodeSize=10)
  #nodeSize es un parámetro que se usa para jerarquizar los términos GO. Es decir, se descartan aquellos GO que tengan menos de 10 genes en la lista de genes
  results.ks = runTest(GOdata, algorithm = "weight01", statistic = "ks")
  
  allRes = GenTable(GOdata, KS = results.ks, orderBy = "KS", topNodes = max(results.ks@geneData[[4]],20))
  allRes[,c('GO.ID','Term','KS')]
  write.csv(allRes[,c('GO.ID','Term','KS')], file = paste("resultados/6-GO-DEmi-multi_",onto,".csv", sep = ""))
}

GO visualization

In order to visualize GO terms, we used the treeMap representation. But first, it was necessary to reduce the GO terms, grouping them by similarity. For this, we used rrvgo package.

library(rrvgo)
rm(list = ls())
go.bin.bp <- read.csv("resultados/6-GO-DEmi-bin_BP.csv", row.names = 1)
go.bin.cc <- read.csv("resultados/6-GO-DEmi-bin_CC.csv", row.names = 1)
go.bin.mf <- read.csv("resultados/6-GO-DEmi-bin_MF.csv", row.names = 1)
go.multi.bp <- read.csv("resultados/6-GO-DEmi-multi_BP.csv", row.names = 1)
go.multi.cc <- read.csv("resultados/6-GO-DEmi-multi_CC.csv", row.names = 1)
go.multi.mf <- read.csv("resultados/6-GO-DEmi-multi_MF.csv", row.names = 1)
go.bp <- rbind(go.bin.bp, go.multi.bp)
go.cc <- rbind(go.bin.cc, go.multi.cc)
go.mf <- rbind(go.bin.mf, go.multi.mf)

go.bp <- go.bp[go.bp$KS<0.1,]
simatrix <- calculateSimMatrix(go.bp$GO.ID, orgdb="org.Hs.eg.db", ont="BP", method="Rel")
scores <- setNames(-log10(go.bp$KS), go.bp$GO.ID)
reducedTerms <- reduceSimMatrix(simatrix, scores, threshold=0.7, orgdb="org.Hs.eg.db")
treemapPlot(reducedTerms = reducedTerms)

go.cc <- go.cc[go.cc$KS<0.1,]
simatrix <- calculateSimMatrix(go.cc$GO.ID, orgdb="org.Hs.eg.db", ont="CC", method="Rel")
scores <- setNames(-log10(go.cc$KS), go.cc$GO.ID)
reducedTerms <- reduceSimMatrix(simatrix, scores, threshold=0.7, orgdb="org.Hs.eg.db")
treemapPlot(reducedTerms = reducedTerms)

go.mf <- go.mf[go.mf$KS<0.1,]
simatrix <- calculateSimMatrix(go.mf$GO.ID, orgdb="org.Hs.eg.db", ont="MF", method="Rel")
scores <- setNames(-log10(go.mf$KS), go.mf$GO.ID) 
reducedTerms <- reduceSimMatrix(simatrix, scores, threshold=0.7, orgdb="org.Hs.eg.db")
treemapPlot(reducedTerms = reducedTerms)

Statistical modelling

Finally, we decided to select all de-miRNA identified by both analysis and build generalized linear models. That is a binomial (strictly bernoulli) regression and a multinomial regression, respectively. Then, in order to evaluate the prediction accuracy, we validated the area under the curve (AUC) with a bootstrap resampling using boot package.

rm(list=ls())
library(nnet)
library(pROC)
library(boot)
load("resultados/1_matriz-expresion-docker.Rdata")
condicion <- readxl::read_xlsx("../Muestras_ConsueloChafer2.xlsx")
grupo <- factor(condicion$`grupo experimental`)

Binomial model

demirna.dsq <- read.csv(file = "resultados/2-DEmiRNA_bin_MCIvsCONT.csv", row.names = 1)
demirna.edg.glm <- read.csv(file = "resultados/4-DEmiRNA-glm_bin_MCIvsCONT.csv", row.names = 1)
demirna.nsq <- read.csv(file = "resultados/5-DEmiRNA_bin_MCI-Cont.csv", row.names = 1)
demirna.dsq <- demirna.dsq[demirna.dsq$pvalue<0.05,] #selecciona 10 miRNA
demirna.edg.glm <- demirna.edg.glm[demirna.edg.glm$PValue < 0.05,] #selecciona 4 miRNA
demirna.all <- c(rownames(demirna.dsq),rownames(demirna.edg.glm),rownames(demirna.nsq))
#demirna.all[duplicated(demirna.all)] #aparece el hsa-miR-4259, identificado por DESeq2 y edgeR
demirna.all <- unique(demirna.all)
conteos.bin <- conteos[grupo != "preclinico"]
glm1 <- glm(grupo ~ ., family = binomial(), data = datos)
glm2 <- step(glm1, trace = 0)
write.csv(coef(glm2), file = "resultados/9-coef_bin_inter.csv")

Bootstrap validation

boot.glm <- function(data, indices){
  data.b <- data[indices,]
  glm.b <- glm(grupo ~ ., family = binomial(), data = data.b)
  roc.b <- pROC::roc(as.numeric(data$grupo)-1, predict(glm.b, newdata = data ,type = "response"))
  return(roc.b$auc)
}
glm.boot <- boot(datos, boot.glm, R = 1000, parallel = "multicore", ncpus = parallel::detectCores())

Multinomial model

demirna.dsq1 <- read.csv(file = "resultados/2-DEmiRNA_MCIvsCONT.csv", row.names = 1)
demirna.dsq2 <- read.csv(file = "resultados/2-DEmiRNA_PREvsCONT.csv", row.names = 1)
demirna.edg.glm1 <- read.csv(file = "resultados/4-DEmiRNA-glm_MCIvsCONT.csv", row.names = 1)
demirna.edg.glm2 <- read.csv(file = "resultados/4-DEmiRNA-glm_PRECvsCONT.csv", row.names = 1)
demirna.dsq1 <- demirna.dsq1[demirna.dsq1$pvalue<0.05,]
demirna.dsq2 <- demirna.dsq2[demirna.dsq2$pvalue<0.05,]
demirna.edg.glm1 <- demirna.edg.glm1[demirna.edg.glm1$PValue < 0.05,] 
demirna.edg.glm2 <- demirna.edg.glm2[demirna.edg.glm2$PValue < 0.05,] 
demirna.all <- unique(c(rownames(demirna.dsq1),rownames(demirna.edg.glm1),rownames(demirna.dsq2), rownames(demirna.edg.glm2)))
datos <- data.frame(grupo = grupo, t(conteos[demirna.all,]))
modelo <- multinom(grupo ~ ., data = datos)
modelo2 <- step(modelo, trace = 0)
write.csv(coef(modelo2), file = "resultados/7-coef_multi_inter.csv")

Bootstrap validation

multiclass.roc(as.numeric(grupo), as.numeric(predict(modelo))) #Overstimated AUC

## 
## Call:
## multiclass.roc.default(response = as.numeric(grupo), predictor = as.numeric(predict(modelo)))
## 
## Data: as.numeric(predict(modelo)) with 3 levels of as.numeric(grupo): 1, 2, 3.
## Multi-class area under the curve: 1

multinom.fun <- function(data, indices){
  data.b <- data[indices,]
  model.b <- multinom(grupo ~ ., data = data.b)
  model.roc.b <- multiclass.roc(as.numeric(data$grupo), as.numeric(predict(model.b, newdata = data)))
  return(model.roc.b$auc)
}
multinom.boot <- boot(datos, multinom.fun, R = 1000,
                      parallel = "multicore", ncpus = parallel::detectCores())

## # weights:  141 (92 variable)
## initial  value 52.733390 
## iter  10 value 18.419041
## iter  20 value 10.403683
## iter  30 value 8.581056
## iter  40 value 7.408737
## iter  50 value 6.723852
## iter  60 value 6.298944
## iter  70 value 5.588832
## iter  80 value 3.217361
## iter  90 value 0.961881
## iter 100 value 0.005292
## final  value 0.005292 
## stopped after 100 iterations

par(mfrow=c(1,2))
plot(density(glm.boot$t), main = "Bootstrap AUC distribution", xlab = "Binomial model")
abline(v=quantile(glm.boot$t, c(0.025, 0.95), names = FALSE), lty = 2)
text(x = mean(glm.boot$t), y = 0.5,labels = paste0("AUC=",signif(mean(glm.boot$t,digits = 3))), cex = 0.8)
mtext("A", adj = 0)
plot(density(multinom.boot$t), main = "Bootstrap AUC distribution", xlab = "Multinomial model")
text(x = mean(multinom.boot$t), y = 0.5,labels = paste0("AUC=",signif(mean(multinom.boot$t,digits = 3))), cex = 0.8)
abline(v = quantile(multinom.boot$t, probs = c(0.025,0.975)), lty =  2)
mtext("B", adj = 0)

References

Calogero, Raffaele, Neha Kulkarni, Luca Alessandri, Marco Beccuti, Greta Romano, and Francesca Cordero. 2020. Docker4seq: Running Ngs Computing Demanding Applications, E.g Reads Mapping and Counting, Wrapped in Docker Containers.

Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21. http://www.nature.com/nmeth/journal/v12/n2/full/nmeth.3252.html.

Kumar, Subodh, Murali Vijayan, and P. Hemachandra Reddy. 2017. “MicroRNA-455-3p as a potential peripheral biomarker for Alzheimer’s disease.” Human Molecular Genetics 26 (19): 3808–22. https://doi.org/10.1093/hmg/ddx267.

Loke, Sau Yeen, Prabhakaran Munusamy, Geok Ling Koh, Claire Hian Tzer Chan, Preetha Madhukumar, Jee Liang Thung, Kiat Tee Benita Tan, et al. 2019. “A Circulating miRNA Signature for Stratification of Breast Lesions Among Women with Abnormal Screening Mammograms.” Cancers 11 (12). https://doi.org/10.3390/cancers11121872.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.

Pajak, Maciej, and T. Ian Simpson. 2020. MiRNAtap: MiRNAtap: MicroRNA Targets - Aggregated Predictions.

Pan, Yaoqian, Ruizhu Liu, Erin Terpstra, Yanqing Wang, Fangfang Qiao, Jin Wang, Yigang Tong, and Bo Pan. 2016. “Dysregulation and Diagnostic Potential of microRNA in Alzheimer’s Disease.” Journal of Alzheimer’s Disease 49: 1–12. https://doi.org/10.3233/JAD-150451.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Sayols, Sergi. 2020. Rrvgo: A Bioconductor Package to Reduce and Visualize Gene Ontology Terms. https://ssayols.github.io/rrvgo.

Tarazona, Sonia, Pedro Furio-Tari, David Turra, Antonio Di Pietro, Maria Jose Nueda, Alberto Ferrer, and Ana Conesa. 2015. “Data Quality Aware Analysis of Differential Expression in Rna-Seq with Noiseq R/Bioc Package.” Nucleic Acids Research 43 (21): e140.

Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. https://www.stats.ox.ac.uk/pub/MASS4/.

miRNA signature for Alzheimer

sRNA-seq for Alzheimer’s disease diagnosis

Summary

Bioinformatic processing

Differential expression analysis

So, now we carry out the 3 differential expression analysis.

Exploratory Data Analysis

DESeq2

edgeR

NOISeq

Functional annotation

miRNAs identified in the binary output

GO enrichment

miRNAs identified in the multiclass output

GO enrichment

GO visualization

Statistical modelling

Binomial model

Bootstrap validation

Multinomial model

Bootstrap validation

References