Glioblastoma Gene Expression Prediction analysis.

Analyses related to genomic, proteomic, clinical and environmental interactions in brain tumor


Post by schen072 » Sat Mar 19, 2016 5:42 pm

Attached are the datasets used for the naive Bayes and logistic regression analyses to predict glioblastoma. The results are also attached for reference.
A total of 17813 genes were present in TCGA and more than 20,000 in GEO. Matching TCGA against GEO yielded 6769 genes, of which the 19 genes below were significantly correlated (P < 0.05) with the disease in 184 patients.
Gene       R value     P value
ABCB5      -0.30821    <.0001
CD36       -0.26107    0.0002
CIT        -0.26494    0.0002
FCRLB      -0.26777    0.0002
GOLT1A     -0.25901    0.0003
SNRPE      -0.28957    <.0001
ADAM7      -0.15648    0.0298
ALPK2      -0.15346    0.0331
FAM26F     -0.17889    0.0128
MCOLN3     -0.24179    0.0007
MDM4       -0.19583    0.0063
PAPPA2     -0.17849    0.013
PCDHB18    -0.18433    0.0103
PERP       -0.22455    0.0017
PLEKHA5    -0.15757    0.0286
RSPO3      -0.17301    0.0161
SLC5A9     -0.14229    0.0484
TMTC1      -0.17525    0.015
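
For readers who want to reproduce this kind of table, here is a minimal Python sketch of correlating one gene's expression with a binary disease label. The analysis in this thread was run in R, so this is not the code that produced the numbers above. The toy data are hypothetical, and the p-value uses a normal approximation to the t statistic (an assumption made here to stay within the standard library), so it will differ slightly from an exact t-test.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value for H0: r = 0.

    Uses a normal approximation to the t statistic with n - 2 degrees
    of freedom; this is an assumption that is reasonable only when n
    is in the hundreds.
    """
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))

# Hypothetical toy data: expression z-scores and disease status (1 = tumor).
expression = [-1.2, -0.8, -0.5, 0.1, 0.4, 0.9, 1.1, 1.5]
disease = [1, 1, 1, 1, 0, 0, 0, 0]

r = pearson_r(expression, disease)
print(f"toy r = {r:.4f}")

# Applying the approximation to an R value from the table, with n = 184
# as stated in the post.
print(f"approximate p for r = -0.30821: {approx_p_value(-0.30821, 184):.2e}")
```

With a binary label, the Pearson coefficient is the point-biserial correlation, so a negative R value means expression tends to be lower in the group coded 1.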
These data were analysed in R with a naive Bayes classifier (NBC) and logistic regression.

NBC: prior probabilities are attached.
Training error for NBC: 0.04663212

Logistic regression:
Training error: 0.02590674
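
As a point of reference, training error here is taken to mean the fraction of training samples that the fitted model misclassifies. The Python sketch below illustrates that definition with hypothetical labels; it is not the R code used for the results above. The post does not give misclassification counts, but the two reported values equal 9/193 and 5/193 to the printed precision, which is an arithmetic observation rather than something stated in the thread.

```python
def training_error(y_true, y_pred):
    """Fraction of training samples whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Hypothetical labels: 193 samples, 9 of which are flipped to simulate
# misclassification. The class split (100 vs 93) is made up.
y_true = [1] * 100 + [0] * 93
y_pred = list(y_true)
for i in range(9):
    y_pred[i] = 1 - y_pred[i]

err = training_error(y_true, y_pred)
print(round(err, 8))
```

Because this is measured on the same data the model was fitted to, it is an optimistic estimate of how the model would do on new patients.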
Attachments
Logistic Regression Results.txt (65.91 KiB): Prediction probabilities
NBC Results.txt (51.25 KiB): A priori probabilities, Pred. and Pred.1
diseaseb.txt (11.25 KiB): Binary data used for the NBC approach
disease.txt (48.35 KiB): Standardised data
Last edited by schen072 on Tue Apr 05, 2016 2:09 pm, edited 3 times in total.
schen072
 
Posts: 13
Joined: Thu Nov 05, 2015 12:09 pm

Re: Glioblastoma Gene Expression Prediction analysis.

Post by cwyoo » Wed Mar 23, 2016 7:28 am

schen072 wrote:Attached are the Datasets used for Naive bayes and logistic regression analysis to predict Glioblastoma. Results are also attached for the reference.
Total of 17813 genes were present and 6399 were selected based on matching the genes between TCGA and GSM datasets. These 19 genes were highly correlated to the disease in 193 patients.


Great progress. Please implement the calculation of the information criterion (IC), natural-log likelihood, and training error for NBC and logistic regression, and report them with each model. Also fit the model with all the common genes included, and test the forward selection and backward elimination algorithms. Extend the NBC model so that gene expression is modeled with three levels: {low (Z <= -1), no change (-1 < Z < 1), high (Z >= 1)}.
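
The three-level coding and the information criteria requested above can be sketched as follows. This is a Python illustration under stated assumptions, not the R code used in the project: the function names are hypothetical, IC is taken to mean AIC and BIC in their standard forms, and the log-likelihood and parameter count in the example are made-up numbers.

```python
import math

def discretize_z(z, threshold=1.0):
    """Map a z-score to 'low' (z <= -threshold), 'high' (z >= threshold),
    or 'no change' (strictly between the two)."""
    if z <= -threshold:
        return "low"
    if z >= threshold:
        return "high"
    return "no change"

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

# Hypothetical z-scores for one gene across five patients.
levels = [discretize_z(z) for z in [-1.7, -1.0, 0.2, 1.0, 2.3]]
print(levels)

# Hypothetical fit: natural-log likelihood of -50.0 with 19 parameters
# on 184 patients.
print(aic(-50.0, 19), bic(-50.0, 19, 184))
```

Lower AIC or BIC indicates a better trade-off between fit and model size, which is the quantity a forward selection or backward elimination loop would compare at each step.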
cwyoo
Site Admin
 
Posts: 385
Joined: Sun Jun 22, 2014 2:38 pm

Re: Glioblastoma Gene Expression Prediction analysis.

Post by cwyoo » Sun Apr 03, 2016 5:11 pm

schen072 wrote:Attached are the Datasets used for Naive bayes and logistic regression analysis to predict Glioblastoma. Results are also attached for the reference.
Total of 17813 genes were present and 6399 were selected based on matching the genes between TCGA and GSM datasets. These 19 genes were highly correlated to the disease in 193 patients.


What is the total number of subjects that you will use? In your final class project Wiki you state there are 667 subjects, and here there are 193. What is the dataset with 184 subjects that you emailed me (named finalC.csv)? Please post here the cleaned dataset that you are going to use in the class project.

Re: Glioblastoma Gene Expression Prediction analysis.

Post by schen072 » Sun Apr 03, 2016 7:34 pm

Attached are the final datasets for the project. Data for 184 patients were selected from the TCGA and GEO databases. Genes shared across the datasets were matched and missing data were removed, leaving 184 patients. The two datasets combined contained more than 20,000 genes. Each dataset was standardised individually and the two were later merged for analysis. 6769 variables (6766 genes + Age + Gender + Race) matched between the datasets; these were passed through R for log-likelihood analysis.
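
The preprocessing described above (standardise each dataset on its own, then keep only the genes present in both and merge) can be sketched in Python as follows. This is an illustration, not the R workflow used for the attached files. The dictionary layout, the function names, the expression values, and the two "ONLY" gene names are all hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption.

```python
from statistics import mean, stdev

def standardize(dataset):
    """Z-score every gene within a single dataset.

    dataset maps gene name -> list of expression values, one per patient.
    A gene with constant expression would have zero standard deviation
    and would need to be dropped before calling this.
    """
    scaled = {}
    for gene, values in dataset.items():
        m = mean(values)
        s = stdev(values)  # sample standard deviation (n - 1)
        scaled[gene] = [(v - m) / s for v in values]
    return scaled

def merge_common(a, b):
    """Keep only genes present in both datasets and stack their patients."""
    common = sorted(set(a) & set(b))
    return {gene: a[gene] + b[gene] for gene in common}

# Hypothetical toy data: 3 TCGA patients and 2 GEO patients.
tcga = {"ABCB5": [2.0, 4.0, 6.0], "CD36": [1.0, 2.0, 3.0], "TCGA_ONLY": [5.0, 5.5, 6.0]}
geo = {"ABCB5": [10.0, 20.0], "CD36": [7.0, 9.0], "GEO_ONLY": [1.0, 3.0]}

merged = merge_common(standardize(tcga), standardize(geo))
print(sorted(merged))
print(merged["ABCB5"])
```

Standardising before merging puts each platform on its own z-scale, so a value of 1 means one standard deviation above that dataset's mean rather than above the pooled mean.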

Logistic regression: the log-likelihood did not converge, and R showed an error.

Naive Bayes: the log-likelihood was negative infinity.
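
One possible explanation for a log-likelihood of negative infinity with several thousand predictors is floating-point underflow: multiplying thousands of probabilities gives exactly 0.0 in double precision, and the log of 0 is negative infinity. The post does not say how the likelihood was computed, so this is only a hypothesis. The Python sketch below uses a made-up probability of 0.5 for each of 6769 predictors to show the effect and the usual remedy of summing logs instead.

```python
import math

def log_of_product(probs):
    """Multiply first, then take the log. Underflows for long inputs."""
    product = 1.0
    for p in probs:
        product *= p
    return math.log(product) if product > 0 else float("-inf")

def sum_of_logs(probs):
    """Take logs first, then add. Stays finite when every p > 0."""
    return sum(math.log(p) for p in probs)

# Hypothetical per-predictor conditional probabilities for one patient.
probs = [0.5] * 6769

naive = log_of_product(probs)
stable = sum_of_logs(probs)
print(naive)
print(stable)
```

If the value is still negative infinity after summing logs, some individual probability is exactly zero, which points to a different cause than underflow.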

For the training error on the significantly correlated genes, see the previous posts.
Log-likelihood and AIC/BIC for the significantly correlated genes using naive Bayes and logistic regression will be reported next.
Attachments
Logistic regression AIC.PNG (6.22 KiB): AIC for logistic regression, full model
DiseaseF2.csv (10.14 MiB): Dataset for logistic regression
Disease.Binary.csv (2.42 MiB): Dataset with binary transformation
lastsave.txt (133.44 KiB): Individual probabilities using the naive Bayes algorithm
NBCfinal.JPG (90.18 KiB): Log-likelihood, naive Bayes
Logistic regression output.JPG (21.86 KiB): Log-likelihood, logistic regression

Re: Glioblastoma Gene Expression Prediction analysis.

Post by schen072 » Tue Apr 05, 2016 3:19 pm

I found an interesting output.
Observation #1
I ran the non-discretized data through the naiveBayes function and got the same likelihood.
The value may be extreme, but this suggests that the R e1071 package also accepts non-discretized data in the formula.
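
For context, the e1071 documentation describes naiveBayes as assuming a Gaussian distribution, given the class, for metric predictors, which is why continuous columns are accepted. The Python sketch below shows that idea for a single continuous feature. It is not the package's implementation, the function names are hypothetical, and the z-scores and labels are made up.

```python
import math
from statistics import mean, stdev

def fit_gaussian_feature(values, labels):
    """Estimate a (mean, sd) pair per class for one continuous feature."""
    params = {}
    for c in sorted(set(labels)):
        vals = [v for v, label in zip(values, labels) if label == c]
        params[c] = (mean(vals), stdev(vals))
    return params

def gaussian_log_pdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and sd sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Hypothetical z-scores for one gene; label 1 = tumor, 0 = control.
z = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5]
y = [1, 1, 1, 0, 0, 0]

params = fit_gaussian_feature(z, y)
print(params)

# Class-conditional log densities for a new observation at z = -1.0.
for c, (mu, sigma) in params.items():
    print(c, gaussian_log_pdf(-1.0, mu, sigma))
```

With discretized input the same role is played by a table of conditional probabilities per level, so the two codings will not in general give the same likelihood.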

Observation #2
When a missing value is present in the dataset, the conditional probabilities are shown individually, as in the attachment below for Race.
Attachments
MIssing value.PNG (4.42 KiB)
NBC output.PNG (5.18 KiB): Output for discretized and non-discretized data
NBC with Continuous variables.txt (135.4 KiB): Conditional probabilities with non-discretized data
NBC Binary Probabilities.txt (131.65 KiB): Conditional probabilities for NBC with the binary dataset

