SMLG (Statistical Machine Learning Group) Discussion Forum

by **cwyoo** » Wed Mar 13, 2024 12:10 pm

Class project correspondence for Advanced Bayesian Inference, Spring 2024

by **woodbench** » Sun Mar 24, 2024 12:18 pm

3/24 Update - Natalie and I met with Samantha this past Thursday to go over our progress with the RNA-seq workflow on our chosen datasets (thank you, Samantha!!). We were able to begin the merge process using the raw counts from two of our datasets, GSE239379 (16 patients) and GSE211554 (12 patients): we composed a metadata table to reference in the merged DESeq object, filtered low-expression genes, and estimated size factors. Natalie and I will be meeting again on Monday to focus on formatting the merged data table, creating separate columns for the Ensembl ID and the gene name, since one of the datasets combines these into one value. Then we will merge by Ensembl ID.
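For anyone following along, here is a minimal Python sketch of the three steps mentioned above: splitting a combined ID/name label, filtering low-count genes, and estimating size factors. Our actual work is in R, so this is only an illustration. The `_` separator, the `min_total=10` cutoff, and all function names are assumptions, and the size-factor function is a plain median-of-ratios calculation rather than a call into any package.

```python
from math import exp, log
from statistics import median


def split_gene_label(label, sep="_"):
    """Split a combined label such as 'ENSG00000141510.12_TP53'.

    The separator is an assumption; check the actual file before reusing this.
    """
    ensembl_id, _, gene_name = label.partition(sep)
    # Drop a version suffix such as '.12' so IDs line up across datasets.
    return ensembl_id.split(".")[0], gene_name


def filter_low_counts(counts, min_total=10):
    """Keep genes whose raw counts summed over samples reach min_total."""
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts maps gene -> list of raw counts (same sample order for every gene).
    """
    # Only genes with strictly positive counts in every sample are usable,
    # because the geometric mean is computed on the log scale.
    usable = [row for row in counts.values() if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has positive counts in every sample")
    n_samples = len(usable[0])
    log_geo_means = [sum(log(c) for c in row) / len(row) for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(exp(median(log_ratios)))
    return factors
```

Dividing each sample's counts by its size factor puts samples with different sequencing depths on a comparable scale before merging.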

by **woodbench** » Mon Apr 15, 2024 10:05 am

4/15 Update - Natalie and I have successfully merged the phenotype data available for each dataset with the raw count expression data. The two datasets that had treatment information available within the phenotype data files (GSE186332 and GSE154041) were converted to long format, and gene IDs were converted from Entrez/NCBI IDs to Ensembl IDs using the hgnc package. The reference genome is hg38 for both datasets. Natalie has reached out to the authors of the datasets to request additional patient demographic information. We have received a few responses but are still waiting on the information itself.
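A rough Python sketch of the reshaping and ID conversion described above, in case it helps others doing something similar. This is not the hgnc package workflow we used in R. The lookup table is passed in by hand, the sample names are made up, and the function names are hypothetical.

```python
def wide_to_long(wide, sample_ids):
    """Turn gene -> [count per sample] into one row per (gene, sample) pair."""
    return [
        {"gene": gene, "sample": sample, "count": count}
        for gene, row in wide.items()
        for sample, count in zip(sample_ids, row)
    ]


def map_gene_ids(long_rows, id_map):
    """Replace each gene ID using id_map and drop rows with no mapping.

    id_map is a plain dict (for example Entrez ID -> Ensembl ID) that would
    be built from an annotation table; here it is supplied by the caller.
    """
    mapped = []
    for row in long_rows:
        new_id = id_map.get(row["gene"])
        if new_id is not None:
            mapped.append({**row, "gene": new_id})
    return mapped


# Illustrative input: Entrez ID 7157 is TP53; "999999999" is a made-up ID
# standing in for a gene that has no mapping.
wide = {"7157": [12, 30], "999999999": [5, 8]}
long_rows = wide_to_long(wide, ["sample_1", "sample_2"])
converted = map_gene_ids(long_rows, {"7157": "ENSG00000141510"})
```

Unmapped genes are dropped silently here, so it is worth comparing row counts before and after the conversion.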

by **woodbench** » Mon Apr 29, 2024 12:03 am

4/28 Update:
Natalie and I have successfully merged the datasets after normalizing with the TMM method. For the dataset that was in raw counts (GSE186332), this involved converting it to a DGEList, converting reads to counts per million (CPM), transposing the matrix to scale it, and transposing it back to long format. For the dataset that was already TMM normalized (GSE154041), the only alteration was a log2(x + 1) transformation.
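To make the CPM and log2(x + 1) steps concrete, here is a small Python sketch. It does not compute TMM factors. The optional `norm_factors` argument is where per-sample factors from another tool would go, following the usual convention of multiplying library size by the factor to get an effective library size. Function names are hypothetical.

```python
from math import log2


def counts_per_million(counts, norm_factors=None):
    """Convert raw counts to CPM.

    counts maps gene -> list of raw counts per sample. norm_factors, if
    given, holds one scaling factor per sample (for example TMM factors
    computed elsewhere); it defaults to 1.0 for every sample.
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    lib_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    if norm_factors is None:
        norm_factors = [1.0] * n_samples
    effective = [size * f for size, f in zip(lib_sizes, norm_factors)]
    return {
        g: [counts[g][j] / effective[j] * 1e6 for j in range(n_samples)]
        for g in genes
    }


def log2_plus_one(values):
    """Apply log2(x + 1) to every entry; the +1 keeps zeros finite."""
    return {g: [log2(v + 1) for v in row] for g, row in values.items()}
```

Both datasets need to end up on the same scale before the merge, which is why the already-normalized dataset only gets the log transform.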

We used an `inner_join` to merge the transformed datasets on HUGO gene symbol, which gave us a tibble with 13,799 genes to analyze. We then removed potential batch effects between the two studies using the `removeBatchEffect()` function from the limma package and discretized the expression values to the range -1 to 1. We did a quick trial run of the hill-climbing procedure using some of the genes heavily cited in the literature ("POT1", "HERC2", "BRIP1", "POLE", "EGFR", "SOX9", "PGK1", "CA9", "VEGFA", "SPP1", "HIF1A", "HP1BP3", "ZC3H7B", "BRAF", "PTEN", "MGMT") to obtain the following result. Of course, we still need to adjust for each treatment and select our genes of interest using differential expression analysis.
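Here is a simplified Python sketch of the merge, batch adjustment, and discretization steps, for illustration only. The per-batch mean centering is a crude stand-in for limma's linear-model approach, not a reimplementation of it. The three-level coding (-1, 0, 1) with a z-score threshold of 0.5 is an assumption about how one might discretize, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def inner_join(a, b):
    """Keep genes present in both datasets and concatenate their samples."""
    return {gene: a[gene] + b[gene] for gene in a if gene in b}


def center_by_batch(expr, batches):
    """Subtract each gene's per-batch mean from the samples in that batch.

    batches holds one batch label per sample, in the same order as the
    expression values.
    """
    centered = {}
    for gene, row in expr.items():
        batch_means = {
            b: mean(v for v, label in zip(row, batches) if label == b)
            for b in set(batches)
        }
        centered[gene] = [v - batch_means[b] for v, b in zip(row, batches)]
    return centered


def discretize(row, threshold=0.5):
    """Code values as -1 (low), 0 (middle), or 1 (high) by z-score."""
    mu, sd = mean(row), pstdev(row)
    if sd == 0:
        return [0] * len(row)
    levels = []
    for v in row:
        z = (v - mu) / sd
        levels.append(-1 if z < -threshold else 1 if z > threshold else 0)
    return levels


# Toy values, not real data: two studies with two samples each.
study_a = {"EGFR": [1.0, 3.0], "PTEN": [2.0, 2.0]}
study_b = {"EGFR": [11.0, 13.0], "BRAF": [5.0, 5.0]}
merged = inner_join(study_a, study_b)
adjusted = center_by_batch(merged, ["a", "a", "b", "b"])
levels = {gene: discretize(row) for gene, row in adjusted.items()}
```

In the toy example only EGFR survives the join, and the large offset between the two studies disappears after centering, so the discretized levels reflect within-study variation.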
