Probabilistic Graphical Models, Fall 2020

Posted: Sat Oct 10, 2020 3:45 pm
by vsteb002
Probabilistic Graphical Models, Fall 2020 class projects

Re: Probabilistic Graphical Models, Fall 2020

Posted: Sat Oct 10, 2020 5:12 pm
by vsteb002
The data was taken from the following post:
viewtopic.php?f=84&t=145&start=20#p1892
(assembled and cleaned by Zhenghua Gong)

Number of samples: 1255

Variables with missing data:
DiseaseGrade: 1206 missing samples
age: 53 missing samples
gender: 31 missing samples

Because so few samples have a DiseaseGrade value, this variable was removed.

Number of samples after dropping DiseaseGrade and dropping the samples for which gender or age is missing: 1202

The cleaned data is attached.
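
For anyone reproducing the cleaning step, here is a minimal sketch of the two operations described above (drop the DiseaseGrade variable, then drop samples with missing age or gender). The row layout, the `sample` key and the use of `None` for missing values are assumptions for illustration; the attached file may be organised differently.

```python
def clean_samples(rows, drop_columns=("DiseaseGrade",), required=("age", "gender")):
    """Drop unwanted columns and any sample missing a required field.

    `rows` is a list of dicts, one per sample; None marks a missing value
    (an assumption of this sketch, not the format of the attached data).
    """
    cleaned = []
    for row in rows:
        # Skip samples where a required variable is missing.
        if any(row.get(col) is None for col in required):
            continue
        # Keep every column except the ones being dropped.
        cleaned.append({k: v for k, v in row.items() if k not in drop_columns})
    return cleaned


# Hypothetical toy rows, not taken from the real dataset.
rows = [
    {"sample": "s1", "DiseaseGrade": None, "age": 61, "gender": "F"},
    {"sample": "s2", "DiseaseGrade": 2, "age": None, "gender": "M"},
    {"sample": "s3", "DiseaseGrade": None, "age": 48, "gender": None},
    {"sample": "s4", "DiseaseGrade": None, "age": 55, "gender": "M"},
]
cleaned = clean_samples(rows)
print(len(cleaned), [r["sample"] for r in cleaned])
```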

Re: Probabilistic Graphical Models, Fall 2020

Posted: Tue Oct 13, 2020 5:33 am
by vsteb002
I selected the 40 genes that correlate most strongly with the disease variable. For the discrete dataset I used the Kendall rank correlation; for the continuous dataset I used the Spearman rank correlation.

The data with the selected genes is attached, along with the Jupyter notebook containing the corresponding source code.
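
The selection idea can be sketched without any dependencies as follows. This is not the code from the attached notebook: it shows only the Spearman case, ranks genes by the absolute value of the correlation (an assumption, since the post does not say whether sign matters), and uses hypothetical gene names and values. In practice `scipy.stats.spearmanr` and `scipy.stats.kendalltau` would be the usual choice.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def top_genes(expression, disease, k):
    """Return the k gene names with the largest |Spearman correlation|."""
    scores = {g: abs(spearman(v, disease)) for g, v in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: 5 samples, 3 genes.
disease = [0, 0, 1, 1, 1]
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with disease status
    "geneB": [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated to disease status
    "geneC": [9.0, 8.0, 3.0, 2.0, 1.0],  # falls with disease status
}
selected = top_genes(expression, disease, k=2)
print(selected)
```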

Re: Probabilistic Graphical Models, Fall 2020

Posted: Sun Dec 13, 2020 9:39 pm
by rtanv003
The given dataset was filtered to remove the samples with other diseases. It now contains 414 samples: 310 cases and 104 controls. There were 160 male patients and 254 female patients. Among the male patients there were 129 cases and 39 controls, and among the female patients there were 181 cases and 73 controls.

On the continuous dataset, limma was used to perform differential gene expression analysis. I selected 176 genes based on |logFC|>=3 and adjusted P-value <= 0.05.
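
As a rough illustration of the thresholding step only (the limma fit itself is done in R and is not reproduced here), the filter on an exported results table could look like the sketch below. The `logFC` and `adj.P.Val` keys follow the column names of limma's `topTable` output; the `gene` key and all values are hypothetical.

```python
def select_de_genes(results, min_abs_logfc=3.0, max_adj_p=0.05):
    """Keep genes with |logFC| >= min_abs_logfc and adjusted p <= max_adj_p."""
    return [
        row["gene"]
        for row in results
        if abs(row["logFC"]) >= min_abs_logfc and row["adj.P.Val"] <= max_adj_p
    ]


# Hypothetical rows standing in for an exported limma results table.
results = [
    {"gene": "g1", "logFC": 3.4, "adj.P.Val": 0.001},   # passes both cutoffs
    {"gene": "g2", "logFC": -3.1, "adj.P.Val": 0.04},   # passes (down-regulated)
    {"gene": "g3", "logFC": 2.9, "adj.P.Val": 0.0001},  # fold change too small
    {"gene": "g4", "logFC": 4.2, "adj.P.Val": 0.2},     # not significant
]
de_genes = select_de_genes(results)
print(de_genes)
```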
Using these 176 genes and the gender information, the discrete data was divided into male and female groups. Two score-based structure learning algorithms (Hill-Climbing and Tabu) were then run on each group and their BIC scores were compared. For each group, the structure with the higher score was kept for further analysis.
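
To make the comparison concrete, here is a minimal sketch of scoring two candidate structures with BIC on discrete data and keeping the higher-scoring one. It does not perform the Hill-Climbing or Tabu search itself, and it is not the code used for the project. It assumes the convention where BIC is the log-likelihood minus a complexity penalty, so that higher is better, and it counts parent configurations as the product of the observed parent cardinalities. The variable names and data are hypothetical.

```python
import math
from collections import Counter


def bic_score(data, parents):
    """BIC of a discrete Bayesian network (higher is better in this convention).

    data:    list of dicts mapping variable name -> observed state
    parents: dict mapping each variable -> tuple of its parent variables
    """
    n = len(data)
    states = {v: {row[v] for row in data} for v in parents}
    total = 0.0
    for node, pa in parents.items():
        # Counts of (parent configuration, node state) and of parent configuration.
        joint = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        loglik = sum(c * math.log(c / marginal[cfg]) for (cfg, _), c in joint.items())
        # Free parameters: (#parent configurations) * (#node states - 1).
        q = math.prod(len(states[p]) for p in pa)
        n_params = q * (len(states[node]) - 1)
        total += loglik - 0.5 * math.log(n) * n_params
    return total


# Hypothetical toy data in which B always copies A.
data = [{"A": 0, "B": 0}] * 4 + [{"A": 1, "B": 1}] * 4

# Two candidate structures, standing in for the outputs of two search algorithms.
candidates = {
    "with_edge": {"A": (), "B": ("A",)},
    "empty": {"A": (), "B": ()},
}
scores = {name: bic_score(data, dag) for name, dag in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```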

Further analysis consisted of extracting the edges common to the BNs learned from the male and female datasets and extracting the Markov blanket genes from both BNs. Their significance was then assessed using survival analysis and GO term and pathway enrichment analysis.
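
The two extraction steps can be sketched as below, with each network represented as a set of directed (parent, child) edges. The node names, the choice of a `Disease` target node for the Markov blanket, and the decision to treat edges as common only when their direction matches are all assumptions of this sketch; the post does not specify them.

```python
def common_edges(edges_a, edges_b):
    """Directed edges present in both networks (direction must match)."""
    return set(edges_a) & set(edges_b)


def markov_blanket(edges, node):
    """Parents, children, and the children's other parents of `node`."""
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    spouses = {u for u, v in edges if v in children and u != node}
    return parents | children | spouses


# Hypothetical learned structures, given as (parent, child) pairs.
male_edges = {("G1", "Disease"), ("Disease", "G2"), ("G3", "G2"), ("G4", "G5")}
female_edges = {("G1", "Disease"), ("Disease", "G2"), ("G5", "G4")}

shared = common_edges(male_edges, female_edges)
mb_male = markov_blanket(male_edges, "Disease")
mb_female = markov_blanket(female_edges, "Disease")
print(sorted(shared), sorted(mb_male), sorted(mb_female))
```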

The data files and the paper are in the attached zip file.