Probabilistic Graphical Models, Fall 2020

Class Projects from courses such as Probabilistic Graphical Network, Biostatistics II, etc.

Post by vsteb002 » Sat Oct 10, 2020 3:45 pm

Probabilistic Graphical Models, Fall 2020 class projects

Re: Probabilistic Graphical Models, Fall 2020

Post by vsteb002 » Sat Oct 10, 2020 5:12 pm

The data was taken from the following post:
viewtopic.php?f=84&t=145&start=20#p1892
(assembled and cleaned by Zhenghua Gong)

Number of samples: 1255

Variables with missing data:
DiseaseGrade: 1206 missing samples
age: 53 missing samples
gender: 31 missing samples

Because so few samples have a DiseaseGrade value, this variable was removed.

Number of samples after dropping DiseaseGrade and dropping samples for which gender or age is missing: 1202
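
The cleaning steps above can be sketched in pandas roughly as follows. This is only an illustration on a made-up five-row table, not the code used to produce the attached files. The column names DiseaseGrade, age, and gender follow this post; the `disease` and `GENE1` columns and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged table; every value here is invented.
df = pd.DataFrame({
    "disease": [1, 0, 1, 1, 0],
    "DiseaseGrade": [np.nan, np.nan, 2.0, np.nan, np.nan],
    "age": [61.0, np.nan, 70.0, 55.0, 48.0],
    "gender": ["F", "M", None, "M", "F"],
    "GENE1": [0.3, 1.2, -0.4, 0.9, 0.1],
})

# Count missing values per column to see which variables are affected.
missing = df.isna().sum()

# Drop the mostly empty column, then drop rows lacking age or gender.
cleaned = df.drop(columns=["DiseaseGrade"]).dropna(subset=["age", "gender"])

print(missing[missing > 0])
print(len(df), "->", len(cleaned), "samples")
```

Dropping the sparse column first matters: if rows with any missing value were dropped while DiseaseGrade was still present, almost every sample would be lost.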

The cleaned data is attached.
Attachments
Merged-discret-dropNA.csv
(78.68 MiB) Downloaded 65 times
Merged-continue-dropNA.csv
(290.16 MiB) Downloaded 65 times

Re: Probabilistic Graphical Models, Fall 2020

Post by vsteb002 » Tue Oct 13, 2020 5:33 am

I selected the top 40 genes that correlate best with the disease variable. For the discrete dataset, I used the Kendall rank correlation, while for continuous data I used the Spearman correlation approach.
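
For readers without the notebook at hand, here is a minimal sketch of this kind of ranking using scipy. It is not the attached notebook's code. The helper name `top_genes`, the toy column names, the 0.5 discretization threshold, and the choice to rank by absolute correlation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import kendalltau, spearmanr


def top_genes(df, target, k, method):
    """Return the k columns with the largest |correlation| to `target`."""
    corr_fn = {"kendall": kendalltau, "spearman": spearmanr}[method]
    scores = {}
    for gene in df.columns.drop(target):
        stat, _pvalue = corr_fn(df[gene], df[target])
        scores[gene] = abs(stat)  # assumption: rank by magnitude, not sign
    ranked = pd.Series(scores).sort_values(ascending=False)
    return list(ranked.index[:k])


# Synthetic data: gene_a and gene_c track the disease label, gene_b is noise.
rng = np.random.default_rng(0)
disease = rng.integers(0, 2, size=200)
continuous = pd.DataFrame({
    "disease": disease,
    "gene_a": disease + rng.normal(0, 0.1, 200),
    "gene_b": rng.normal(0.5, 1.0, 200),
    "gene_c": 1 - disease + rng.normal(0, 0.1, 200),
})

# Crude discretized copy of the same toy data (threshold is arbitrary).
genes = ["gene_a", "gene_b", "gene_c"]
discrete = continuous.copy()
discrete[genes] = (continuous[genes] > 0.5).astype(int)

print(top_genes(continuous, "disease", 2, "spearman"))
print(top_genes(discrete, "disease", 2, "kendall"))
```

On the real data the call would use k=40 and the name of the disease column in the CSV files.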

The datasets restricted to the selected genes are attached, along with the Jupyter notebook containing the source code.
Attachments
vitalii_selecting_top_genes.pdf
(865.68 KiB) Downloaded 72 times
Merged-discret-top40.csv
(213.17 KiB) Downloaded 70 times
Merged-continue-top40.csv
(649.87 KiB) Downloaded 66 times

Re: Probabilistic Graphical Models, Fall 2020

Post by rtanv003 » Sun Dec 13, 2020 9:39 pm

The given dataset was filtered and the samples with other diseases were removed. It now contains 414 samples: 310 cases and 104 controls, from 160 male and 254 female patients. Among the male patients there were 129 cases and 39 controls, and among the female patients there were 181 cases and 73 controls.

On the continuous dataset, limma was used to perform differential gene expression analysis. I selected 176 genes with |logFC| >= 3 and adjusted p-value <= 0.05.
Using these 176 genes and the gender information, the discrete data was divided into male and female groups. Two score-based structure learning algorithms (Hill-Climbing and Tabu) were then run and their BIC scores were compared. The networks with the higher scores were kept for further analysis.
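
To make the BIC comparison concrete, below is a small pure-Python sketch that scores a given discrete network structure. It does not implement the Hill-Climbing or Tabu search itself and is not the code used for the project. The function name `bic_score`, the toy variables A and B, and the two candidate structures are hypothetical stand-ins for the networks returned by the two searches. The score is written so that higher is better (log-likelihood minus penalty).

```python
import math
from collections import Counter


def bic_score(rows, parents):
    """BIC of a discrete Bayesian network; higher is better.

    rows:    list of dicts mapping variable name -> observed state
    parents: dict mapping each variable -> tuple of its parent names
    """
    n = len(rows)
    states = {v: {r[v] for r in rows} for v in parents}
    log_lik = 0.0
    n_params = 0
    for child, pa in parents.items():
        # Counts per (parent configuration, child state) and per configuration.
        joint = Counter((tuple(r[p] for p in pa), r[child]) for r in rows)
        marginal = Counter(tuple(r[p] for p in pa) for r in rows)
        # Maximised log-likelihood: sum of N_jk * log(N_jk / N_j).
        log_lik += sum(c * math.log(c / marginal[cfg])
                       for (cfg, _), c in joint.items())
        # Free parameters: (child states - 1) per parent configuration.
        n_cfg = math.prod(len(states[p]) for p in pa)
        n_params += (len(states[child]) - 1) * n_cfg
    return log_lik - 0.5 * math.log(n) * n_params


# Toy data: B mostly copies A, so a structure with the edge A -> B should win.
rows = [{"A": i % 2, "B": (1 - i % 2) if i % 10 == 0 else i % 2}
        for i in range(40)]
empty = {"A": (), "B": ()}
with_edge = {"A": (), "B": ("A",)}

candidates = {"empty": empty, "A->B": with_edge}
best = max(candidates, key=lambda name: bic_score(rows, candidates[name]))
print(best)
```

The penalty term grows with the number of parent configurations, which is why a denser network only wins when the extra edges improve the likelihood enough to pay for their parameters.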

Further analysis consisted of extracting the common edges of the BNs learned from the male and female datasets and extracting the Markov blanket genes from both BNs. Their significance was then assessed using survival analysis and GO term and pathway enrichment analysis.
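
The two extraction steps can be illustrated on plain edge lists. This is a sketch, not the project's code. The edge lists and gene names below are invented, and taking the Markov blanket around a node called `disease` is an assumption about which node was of interest.

```python
def common_edges(edges_a, edges_b):
    """Directed edges present in both networks."""
    return set(edges_a) & set(edges_b)


def markov_blanket(edges, node):
    """Parents, children, and the children's other parents of `node`."""
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    spouses = {u for u, v in edges if v in children}
    return (parents | children | spouses) - {node}


# Hypothetical learned structures, given as (from, to) pairs.
male_edges = [("G1", "disease"), ("disease", "G2"), ("G3", "G2"), ("G4", "G5")]
female_edges = [("G1", "disease"), ("disease", "G2"), ("G5", "G4")]

shared = common_edges(male_edges, female_edges)
blanket_male = markov_blanket(male_edges, "disease")
blanket_female = markov_blanket(female_edges, "disease")

print(sorted(shared))
print(sorted(blanket_male), sorted(blanket_female))
```

Note that edge direction is respected here, so G4 -> G5 and G5 -> G4 do not count as a common edge; comparing undirected skeletons instead would be a different design choice.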

The data files and the paper are in the attached zip file.
Attachments
PHC 6067 Project RBT.zip
(99.82 MiB) Downloaded 69 times

