SMLG (Statistical Machine Learning Group) Discussion Forum

by **cwyoo** » Mon Aug 20, 2018 8:29 pm

Information related to class projects for Probabilistic Graphical Models, Fall 2018, will be posted here. Every Friday, please post what you have been working on for your class project.

by **cpere117** » Fri Oct 26, 2018 6:42 pm

Hello all,

For my week one progress, I've attached my proposal paper along with a few datasets I've been working with over the past week to help advance my project proposal. I've highlighted in red font the portion of my proposal that I updated during the first week, and I plan to update you all each Friday accordingly. Attached you can find a dataset containing all samples of the RNA-Seq data from the Allen Database Repository discussed in the Alzheimer's section of the SMLG forum. Here I've highlighted males in red and females in blue, along with their unique identifiers in the ACT Cohort study. The rationale was to organize the data before beginning my cross-gender comparison of gene expression in brain regions known to be crucial in dementia pathology. Next week the goal is to update my table accordingly and to further interpret and analyze the results of the cross-comparison analysis I will perform in R. Thank you for your attention and support of my project over these next few weeks.

by **cpere117** » Fri Nov 02, 2018 9:00 pm

This week for my class project I added an RNA-Seq dataset from GEO to the table of Alzheimer's datasets posted in my proposal. The GSE accession number of the study is GSE104704, and it is titled "Dysregulation of the epigenetic landscape of normal aging in Alzheimer's disease [RNA-Seq]." I recently discovered this dataset and was happy to see that it provides raw single strand RNA fastq files that I could input into the Galaxy bioinformatics platform. The dataset has 30 samples divided according to age and disease status (AD or no AD). The three groups consist of 8 young healthy brains (Young), 10 aged healthy brains (Old), and 12 aged diseased brains (AD Old), determined by the neuropathological presence of Lewy bodies, amyloid beta plaques, and neurofibrillary tangles.

After uploading the raw files of all post mortem brain tissue samples into Galaxy, I proceeded to the quality assurance stage of the RNA-Seq workflow. Here I used FastQC to check that none of my samples showed contamination or GC bias. I found a problem with overrepresented sequences, which led me to use Trim Galore, a program that cuts off undesired sequence often left over from the vendor primers used for the RNA-Seq analysis (Illumina, Affymetrix, etc.). The program fixed this problem, so I moved on to aligning the reads with Bowtie2. Following alignment, I compiled the RNA-Seq read counts against the hg38 reference genome using featureCounts, a tool provided on Galaxy. Next, my 30 samples were gathered into their groups by age and disease status. I performed an RNA-Seq differential gene expression analysis for three contrasts: Young vs. Old, Young vs. AD Old, and Old vs. AD Old.

I ran both DESeq2 and edgeR so that I could cross-compare the two programs' results, and I was able to output datasets with logFC, p-values, mean normalized counts, and gene identifiers (Entrez ID numbers). I've attached some of my output results to this post for your reference. Notice that the volcano plot produced by edgeR shows many more significant genes for the Young vs. AD Old contrast than for the other group contrasts. I've also downloaded and color-coded the Allen database RNA-Seq data according to gender and disease status (Normal, AD, Dementia, Multiple Etiologies, Vascular Dementia, and other). Attached is the file with color coding. Next week I plan to have the results for the Allen data, if possible, so they can be discretized for BANJO, and also to make use of the results from the new GSE study described in this post.
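For anyone wanting to cross-compare the two tools' calls outside Galaxy, here is a minimal sketch in plain Python. It assumes each result table has already been loaded into a dictionary keyed by Entrez ID with a (logFC, p-value) pair per gene. The function names and the cutoffs (|logFC| >= 1, p < 0.05) are illustrative assumptions, not settings taken from the analysis above.

```python
from typing import Dict, Set, Tuple

# Hypothetical in-memory form of one result table:
# Entrez ID -> (log fold change, p-value).
DETable = Dict[str, Tuple[float, float]]


def significant_genes(table: DETable,
                      lfc_cutoff: float = 1.0,
                      p_cutoff: float = 0.05) -> Set[str]:
    """Return gene IDs passing both cutoffs (cutoffs are assumed, not from the post)."""
    return {
        gene
        for gene, (lfc, p_value) in table.items()
        if abs(lfc) >= lfc_cutoff and p_value < p_cutoff
    }


def compare_calls(deseq2: DETable, edger: DETable,
                  lfc_cutoff: float = 1.0,
                  p_cutoff: float = 0.05) -> Dict[str, Set[str]]:
    """Split significant genes into those called by both tools or by only one."""
    a = significant_genes(deseq2, lfc_cutoff, p_cutoff)
    b = significant_genes(edger, lfc_cutoff, p_cutoff)
    return {"both": a & b, "deseq2_only": a - b, "edger_only": b - a}
```

Genes that land in the "both" set are the ones least likely to be an artifact of either tool's model, which may make them a reasonable first shortlist before discretizing for BANJO.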

by **DanielTira** » Sun Nov 04, 2018 9:52 pm

This past week I changed my topic, hence the delay in posting. I am attaching a fully revised proposal. I will be focusing on the glioblastoma dataset from the lab and will be using both BiDAG and the lab's algorithm. This week I also made updates to the Python cleaning code, but I have run into a yellow-light issue that will need to be resolved soon. The issue arises when using the GPL platform annotations to clean the dataset: some entries list multiple gene names for a single expression value, and those same gene names also appear in other entries, so a gene name can occur multiple times within a single patient. We need to decide how to handle these values. One option is to separate the names, give each one the entry's value, and then average all occurrences of a gene into a single value. I will also need to read about why this happens from an implementation perspective. Are the machines picking up overlapping signals, and this is simply the output? Is that why we would want to average? I will attach the code in its next iteration, once I add averaging, normalization, and discretization.
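One way the split-then-average option could look is sketched below. This is not the attached cleaner code. It assumes one patient's data is a list of (gene label, expression value) pairs and that multi-gene labels are joined with "///"; the separator and the function name `average_by_gene` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def average_by_gene(rows: Iterable[Tuple[str, float]],
                    sep: str = "///") -> Dict[str, float]:
    """Collapse one patient's rows to a single averaged value per gene name.

    A label such as "A /// B" is split, and the row's value is credited
    to both A and B before averaging. The separator is an assumption.
    """
    per_gene: Dict[str, List[float]] = defaultdict(list)
    for label, value in rows:
        for name in label.split(sep):
            name = name.strip()
            if name:  # skip empty pieces from stray separators
                per_gene[name].append(value)
    return {gene: mean(values) for gene, values in per_gene.items()}
```

For example, rows `[("A /// B", 2.0), ("A", 4.0), ("B", 6.0)]` collapse to `{"A": 3.0, "B": 4.0}`. Note that averaging treats a shared multi-gene entry as equally informative for each gene it names, which is itself a modeling decision worth confirming.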

by **hallelu7** » Mon Nov 05, 2018 11:18 pm

1. Retinal detachment study

My first project is about retinal detachment (RD). Currently, my fellow is doing a chart review in Korea, and it seems it will take some time to get the data (3-4 months). I will update later.

by **hallelu7** » Mon Nov 05, 2018 11:30 pm

2. Diabetic Retinopathy Study

This study is based on the fourth and fifth KNHANES (2008–2012), a national health survey conducted in Korea. The survey used a stratified, multistage, clustered sampling method based on 2005 National Census data to randomly select a population-based sample across 500 national districts, representing the civilian, noninstitutionalized South Korean population. The sample design and size were estimated so that the annual survey results could represent the whole population of Korea.

by **albert07** » Tue Nov 27, 2018 6:34 pm

Hello. Attached is my updated proposal using one of the GEO datasets looking at medulloblastoma. I am having some difficulty with how exactly to proceed so any help would be appreciated. Thanks!

-Albert

by **DanielTira** » Wed Dec 12, 2018 10:14 pm

Attaching semi-production-ready Python cleaner code. It's still a little messy in terms of how the functions are deployed. It's run from the command line, and you also need to specify where you want your files to be dropped, since Python doesn't have a nice work area function like RStudio has and I prefer not to clutter my home area.
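For readers who have not set up this kind of command-line interface before, a minimal sketch of taking the output location as an argument is below. It is not the attached code. The argument names (`input`, `--out-dir`) and the default folder name are hypothetical.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the input file and the folder where cleaned files are written."""
    parser = argparse.ArgumentParser(description="Clean an expression dataset.")
    parser.add_argument("input", type=Path, help="raw dataset file")
    parser.add_argument(
        "--out-dir",
        type=Path,
        default=Path("cleaned_output"),  # hypothetical default folder name
        help="folder for cleaned files (created if missing)",
    )
    return parser.parse_args(argv)


def prepare_out_dir(out_dir: Path) -> Path:
    """Create the output folder if needed, so nothing lands in the home area."""
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

A script built this way could be invoked as `python cleaner.py raw.csv --out-dir results/run1`, where `cleaner.py` is a placeholder script name.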

by **cpere117** » Mon Feb 11, 2019 1:18 pm

Hey all,

I'm just providing an update on my proposal project, which centers on ID3 regulatory targets involved in the onset of Alzheimer's disease. Over the past week I've compiled more datasets from GEO and added them to a master table, where they are classified according to study demographics, clinical variables provided, etc. Post mortem tissue analysis and neuroimaging results show evidence of neovascularization and volume loss in certain brain regions (hippocampus, frontal cortex, temporal cortex, and entorhinal cortex). I'm therefore going to center my project on datasets that include any of these four brain regions, along with clinical variables that may affect pathological vascular remodeling in AD patients (Braak stage, age, gender, and possibly ethnicity, depending on the number of data samples I can acquire). In addition, I've added more AD RNA-Seq datasets, bringing my total to 5 RNA-Seq datasets along with 16 microarray datasets. These additional samples will help in testing the hypothesis of my project by adding statistical weight and robustness to my analysis. We hypothesize that ID3 and/or its target genes are significantly involved in causal networks associated with neovascularization of brain regions shown to be highly impacted by AD pathogenesis. Furthermore, we hypothesize that ID3 will be more highly expressed in samples with a higher age and Braak score. Attached is my revised proposal, and I will keep you all updated as I move forward with my project.

by **cpere117** » Tue Mar 12, 2019 10:40 pm

I've begun the cleaning, processing, and discretization of my datasets. Attached are the cleaned files for GSE1297, the first dataset on my Alzheimer's master table, along with an updated master table.
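For reference, one simple way to discretize a gene's expression values is equal-frequency binning into three levels. The sketch below is only an illustration under that assumption. The post does not say which discretization scheme or how many levels were used, and `discretize_tertiles` is a hypothetical name.

```python
from typing import List, Sequence


def discretize_tertiles(values: Sequence[float]) -> List[int]:
    """Map each value to 0 (low), 1 (middle), or 2 (high) by rank-based cut points.

    Three equal-frequency levels are an assumed scheme, not the one used
    for the attached files.
    """
    if not values:
        return []
    ranked = sorted(values)
    n = len(ranked)
    low_cut = ranked[n // 3]         # values below this are level 0
    high_cut = ranked[(2 * n) // 3]  # values at or above this are level 2
    return [0 if v < low_cut else 1 if v < high_cut else 2 for v in values]
```

Applied per gene across samples, this keeps the levels roughly balanced, at the cost of ignoring how far apart the raw values actually are.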
