The data set for this analysis comes from GSE14193, where we will use the 6-weeks wild type and PiZ mice data to perform analysis on a predetermined set of genes.
GSE141593_RAW.tar
and unzip it. There
will be one .txt.gz
file per sample - since we are only
interested in the first 10 samples, we can delete or set aside the
others. Unzip the individual text files and take a look at their format
so that they can be properly read into R later.Below is a list of R packages needed and how to install them. I
highly recommend looking at the websites for each to get an idea of how
they work. Additionally, more information on any functions used can be
found by typing ?function.name()
in the console.
BiocManager – used to install Bioconductor packages;
install.packages("BiocManager")
DESeq2 – tool for differential expression analysis; used here to normalize expression data
BiocManager::install("DESeq2")
openxlsx - write dataframes to excel files
install.packages("openxlsx")
bnlearn - for bayesian network analysis
install.packages("bnlearn")
Rgraphviz - for visualizing graphs
BiocManager::install("Rgraphviz")
Some Bayesian network analysis tools require a specific format for the input, so the count data will need to be processed into the correct format. The end goal of this step is a matrix of discrete expression values for each gene per sample, in which the rows represent the samples, and the columns are the genes plus a column for the experimental condition (wild type or PiZ mutant).
Since the counts are in separate files, we will need to read in each file in R and combine them into a single matrix based on the transcript IDs. Below is a summary of the steps taken to merge multiple files into one data frame.
list.files()
function to create a variable
containing all the file names for each samplelist()
function create an empty list variable
that will contain each sample’s file contentsfor
loop to (1) read in each file, (2) assign
names to the columns in the data, and (3) save it in the list. The first
column, which contains the gene IDs, should all have the same name. The
second column, which contains the counts for that particular sample,
should be given a unique name that helps identify which sample it is
once they are merged.Reduce(function(x,y) merge(x,y, all = TRUE, by = "transcript.id"), individual.files)
to merge all of the data frames contained in the list
individual.files
by the transcript.id column.gse <- "GSE141593"
count.files <- list.files("./data/", pattern = "counts.txt")
gsm <- gsub(pattern = "_.*", replacement = "", count.files)
# read sample files into a list
individual.files <- list()
for (f in 1:length(count.files)){
filepath <- paste0("./data/", count.files[f])
tmp <- read.table(filepath)
# give the count column a unique name for merging later
colnames(tmp) <- c("transcript.id", gsm[f])
individual.files[[f]] <- tmp
}
# merge the datasets into one matrix by the transcript id column
merged <- Reduce(function(x,y) merge(x,y, all = TRUE, by = "transcript.id"), individual.files)
# remove first 5 rows since we don't need them
merged <- merged[-c(1:5),]
# move transcript.id column to rownames and remove
rownames(merged) <- merged$transcript.id
merged <- merged[,-1]
head(merged)
Normalize the count matrix using DESeq2’s median of ratio’s method then convert to Z-score.
create a DESeq2 object from the newly merged count matrix
filter low-expression genes
use estimateSizeFactors()
to estimate size factors
for each sample
return normalized counts using DESeq2’s counts()
function with normalized set to TRUE
standardize using the scale()
function on the
transposed count matrix
library(DESeq2)
# make a quick metadata table for DESeq2
metadata <- data.frame(sampleID = gsm,
treatment = c(rep("wild type", 5), rep("PiZ mutant", 5)))
metadata$treatment <- factor(metadata$treatment, levels = c("wild type", "PiZ mutant"), labels = c("wt", "piz"))
# create DESeq object from count matrix, filter low expression genes and estimate size factors
diffxp <- DESeqDataSetFromMatrix(countData = merged,
colData = metadata,
design = ~ treatment)
keep <- rowSums(counts(diffxp) > 10) >= 3
diffxp <- diffxp[keep,]
diffxp <- estimateSizeFactors(diffxp)
#normalize and standardize
norm.counts <- counts(diffxp, normalized = TRUE)
standardized.counts <- scale(t(norm.counts))
# make sure its in the expected orientation
standardized.counts[1:5,1:5]
Label the gene expression values as
“No Change” if the Z score is between -1 and 1,
“Low” if the Z score is less than or equal to -1
“High” if the Z-score is greater than or equal to 1
discretize_gene_vector <- function(gene_vector){
is_low <- gene_vector < -1
is_high <- gene_vector > 1
is_none <- !is_low & !is_high
gene_vector[is_none] <- "noChange"
gene_vector[is_low] <- "low"
gene_vector[is_high] <- "high"
return(gene_vector)
}
discrete.counts <- apply(standardized.counts, MARGIN = 2, discretize_gene_vector)
# double check
discrete.counts[1:5,1:5]
standardized.counts[1:5,1:5]
read in file containing the genes of interest and find the index of the matching column names of the new discrete count matrix
create and save a data frame using the “treatment” column in the metadata and the relevant subset of the count matrix
I saved the processed data set in 2 locations and formats: The first one is an R object in the folder that will contain the bnlearn analysis, which will make it easy to load and use. The second one is optional, but I also kept an Excel copy in case I need to refer to it later or want to use it in a separate analysis.
library(openxlsx)
# filter for genes of interest, and add treatment column ------------------
genes <- read.table("./data/zn_transporter_genes.txt", sep = "\n")$V1
zn.goi <- which(colnames(discrete.counts) %in% genes)
pizmice.counts <- data.frame(pizMutation = metadata$treatment,
as.data.frame(discrete.counts[,zn.goi]))
save(pizmice.counts, file = paste0("./analysis/bnlearn/", gse, "_zn_transporter_gene_expr.RData"))
write.xlsx(pizmice.counts, file = paste0("./data/", gse, "_zn_transporter_gene_expr_clean.xlsx"))
Learn a structure using the data generated from the previous step.
There are different types of algorithms available in bnlearn and other
tools learn structures from data. The score-based Hill Climbing
algorithm is used here via the function hc()
. Note that
for bnlearn, the data type must be a factor. Also, if you know from
literature certain relationships between your genes of interest, you can
set the direction of specific arcs using the set.arc()
function.
library(bnlearn)
library(Rgraphviz)
#keep track of sample ids
sample.id <- rownames(pizmice.counts)
pizmice.counts <- data.frame(lapply(pizmice.counts, FUN = function(x) factor(x)))
str(pizmice.counts)
# re-add sample ids to row names
rownames(pizmice.counts) <- sample.id
rm(sample.id)
# Hill Climbing
pzm.hc <- hc(pizmice.counts, restart = 500)
# quick look via Rgraphviz
graphviz.plot(pzm.hc, layout = "dot")
Once we have a structure, the distribution parameters can be estimated using the structure and the data. For discrete data, we can use MLE or Bayes method’s to fit the model.
pz.mle.fit <- bn.fit(pzm.hc, data = pizmice.counts, method = "mle")
pz.mle.fit