The data set for this analysis comes from GSE14193, where we will use the 6-weeks wild type and PiZ mice data to perform analysis on a predetermined set of genes.
Read the GEO page on the data set to determine what data they provide and in what formats, which might affect which preprocessing procedures are needed. For this data set, the raw count data from an RNA-seq experiment is available, so we can normalize the counts using DESeq2’s median of ratios method.
Download the file GSE141593_RAW.tar
and unzip it.
There is one .txt.gz
file per sample - since we are only
interested in the first 10 samples, we can delete or set aside the
others. Unzip the individual text files and take a look at their format
so that they can be properly read into R later.
Below is a list of R packages needed and how to install them. I
highly recommend looking at the websites for each to get an idea of how
they work. Additionally, more information on any function can be found
by typing ?function.name()
in the console.
BiocManager – used to install Bioconductor packages;
install.packages("BiocManager")
DESeq2 – tool for differential expression analysis; used here to normalize expression data
BiocManager::install("DESeq2")
openxlsx - write dataframes to excel files
install.packages("openxlsx")
bnlearn - for bayesian network analysis
install.packages("bnlearn")
Rgraphviz - for visualizing graphs
BiocManager::install("Rgraphviz")
Some Bayesian network analysis tools require a specific format for the input, so the count data will need to be processed into the correct format. The end goal of this step is a matrix of discrete expression values for each gene per sample, in which the rows represent the samples, and the columns are the genes plus a column for the experimental condition (wild type or PiZ mutant).
Since the counts are in separate files, we will need to read in each file in R and combine them into a single matrix based on the transcript IDs. Below is a summary of the steps taken to merge multiple files into one data frame.
list.files()
function to create a variable
containing all the file names for each samplelist()
function create an empty list variable
that will contain each sample’s file contentsfor
loop to (1) read in each file, (2) assign
names to the columns in the data, and (3) save it in the list. The first
column, which contains the gene IDs, should all have the same name. The
second column, which contains the counts for that particular sample,
should be given a unique name that helps identify which sample it is
once they are merged.Reduce(function(x,y) merge(x,y, all = TRUE, by = "transcript.id"), individual.files)
to merge all of the data frames contained in the list
individual.files
by the transcript.id column.Normalize the count matrix using DESeq2’s median of ratio’s method then convert to Z-score.
create a DESeq2 object from the newly merged count matrix
filter low-expression genes
use estimateSizeFactors()
to estimate size factors
for each sample
return normalized counts using DESeq2’s counts()
function with normalized set to TRUE
standardize using the scale()
function on the
transposed count matrix
Label the gene expression values as
“No Change” if the Z score is between -1 and 1,
“Low” if the Z score is less than or equal to -1
“High” if the Z-score is greater than or equal to 1
I recommend writing a function that will evaluate each column and use
the apply()
function to apply it to the matrix column-wise:
apply(matrix, MARGIN = 2, function)
read in file containing the genes of interest and find the index of the matching column names of the new discrete count matrix
create and save a data frame using the “treatment” column in the metadata and the relevant subset of the count matrix
I saved the processed data set in 2 locations and formats: The first one is an R object in the folder that will contain the bnlearn analysis, which will make it easy to load and use. The second one is optional, but I also kept an Excel copy in case I need to refer to it later or want to use it in a separate analysis.
Learn a structure using the data generated from the previous step.
There are different types of algorithms available in bnlearn and other
tools learn structures from data. Start with the score-based Hill
Climbing algorithm via the function hc()
. Review the
arguments for the function via ?hc()
, and potentially try
different restart
values to see how it affects the
resulting network.
Note that for bnlearn, the data types of the columns in your data
frame must be a factor, which can be converted using
data.frame(lapply(data.frame, FUN = function(x) factor(x)))
Also, if you know from literature certain relationships between your
genes of interest, you can set the direction of specific arcs using the
set.arc()
function.
View the network with the Rgraphviz package by using the
graphviz.plot()
function on the results generated by
hc()
Once we have a structure, the distribution parameters can be
estimated using the structure and the data via bn.fit()
.
For discrete data, we can use MLE or Bayes method’s to fit the model.
Reviewing the function arguments via ?bn.fit()
is
recommended to understand how to adjust it to a particular analysis.