
Glioblastoma Gene Expression Prediction analysis.

Posted: Sat Mar 19, 2016 5:42 pm
by schen072
Attached are the datasets used for the naive Bayes and logistic regression analyses to predict glioblastoma. The results are also attached for reference.
A total of 17,813 genes were present in TCGA and more than 20,000 in GEO. Matching TCGA against GEO yielded 6769 genes, of which the 19 genes below were highly correlated with the disease in 184 patients.
Gene       R value     P value
ABCB5      -0.30821    <.0001
CD36       -0.26107    0.0002
CIT        -0.26494    0.0002
FCRLB      -0.26777    0.0002
GOLT1A     -0.25901    0.0003
SNRPE      -0.28957    <.0001
ADAM7      -0.15648    0.0298
ALPK2      -0.15346    0.0331
FAM26F     -0.17889    0.0128
MCOLN3     -0.24179    0.0007
MDM4       -0.19583    0.0063
PAPPA2     -0.17849    0.013
PCDHB18    -0.18433    0.0103
PERP       -0.22455    0.0017
PLEKHA5    -0.15757    0.0286
RSPO3      -0.17301    0.0161
SLC5A9     -0.14229    0.0484
TMTC1      -0.17525    0.015
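
For readers who want to see how an R value like the ones in the table can be obtained, here is a minimal sketch. It assumes the R value is a Pearson correlation between a gene's expression and a 0/1 disease label, which the post does not state explicitly. The analysis in this thread was run in R; Python is used here only for illustration, and the toy data and function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical toy data: expression Z-scores and disease status (1 = glioblastoma).
expression = [-1.2, -0.8, -0.5, 0.1, 0.4, 0.9, 1.3, 1.6]
disease = [1, 1, 1, 1, 0, 1, 0, 0]

r = pearson_r(expression, disease)
t = t_statistic(r, len(expression))
print(f"r = {r:.4f}, t = {t:.4f}")
```

The P value would then come from comparing the t statistic with a t distribution on n - 2 degrees of freedom, a step left out here because it needs a statistics library.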
These data were analyzed in R with a naive Bayes classifier (NBC) and logistic regression.

NBC: prior probabilities are attached.
Training error for NBC: 0.04663212


Logistic regression:
Training error: 0.02590674
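
As a reference point, training error is the fraction of training subjects the fitted model misclassifies. The sketch below is a hypothetical Python illustration, not the R code behind the numbers above. One observation from the arithmetic alone: 9/193 = 0.04663212 and 5/193 = 0.02590674 to eight decimals, so the reported errors are consistent with 9 and 5 misclassified subjects out of 193. That is an inference, not something stated in the attachments.

```python
def training_error(y_true, y_pred):
    """Fraction of training samples whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Hypothetical labels: 193 subjects, 9 of them misclassified.
y_true = [1] * 193
y_pred = [0] * 9 + [1] * 184

nbc_like_error = training_error(y_true, y_pred)
print(round(nbc_like_error, 8))
```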

Re: Glioblastoma Gene Expression Prediction analysis.

Posted: Wed Mar 23, 2016 7:28 am
by cwyoo
schen072 wrote: Attached are the datasets used for the naive Bayes and logistic regression analyses to predict glioblastoma. The results are also attached for reference.
A total of 17,813 genes were present, and 6399 were selected by matching the genes between the TCGA and GSM datasets. These 19 genes were highly correlated with the disease in 193 patients.
These data were analyzed in R with a naive Bayes classifier (NBC) and logistic regression.

NBC: prior probabilities are attached.
Training error for NBC: 0.04663212

Logistic regression:
Training error: 0.02590674


Great progress. Please implement the calculation of the information criteria (IC), the natural log-likelihood, and the training error for NBC and logistic regression, and report them with each model. Work on the model with all the common genes included. Test the forward selection and backward elimination algorithms. Extend the NBC model so that gene expression is modeled as {low (Z <= -1), no change (-1 < Z < 1), high (Z >= 1)}.
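
The three-level coding requested above can be sketched as follows. This is a minimal Python illustration under stated assumptions: Z is taken to be the expression value standardized with the sample mean and standard deviation, and the function names and toy values are hypothetical. The project itself uses R.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize raw values to Z-scores using the sample mean and standard deviation."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]

def discretize_z(z, threshold=1.0):
    """Map a Z-score to 'low' (Z <= -1), 'no change' (-1 < Z < 1), or 'high' (Z >= 1)."""
    if z <= -threshold:
        return "low"
    if z >= threshold:
        return "high"
    return "no change"

# Hypothetical raw expression values for one gene across six subjects.
raw = [2.1, 3.4, 3.6, 3.9, 4.0, 6.2]
levels = [discretize_z(z) for z in z_scores(raw)]
print(levels)
```

The discretized levels can then be fed to a categorical naive Bayes model in place of the continuous values.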

Re: Glioblastoma Gene Expression Prediction analysis.

Posted: Sun Apr 03, 2016 5:11 pm
by cwyoo
schen072 wrote: Attached are the datasets used for the naive Bayes and logistic regression analyses to predict glioblastoma. The results are also attached for reference.
A total of 17,813 genes were present, and 6399 were selected by matching the genes between the TCGA and GSM datasets. These 19 genes were highly correlated with the disease in 193 patients.


What is the total number of subjects that you will use? In your final class project Wiki, you state that there are 667 subjects, and here there are 193. What is the dataset that you emailed me (named finalC.csv) with 184 subjects? Please post here the cleaned dataset that you are going to use in the class project.

Re: Glioblastoma Gene Expression Prediction analysis.

Posted: Sun Apr 03, 2016 7:34 pm
by schen072
Attached are the final datasets for the project. Data for a total of 184 patients were selected from the TCGA and GEO databases. Genes common to both datasets were matched and missing data were removed, leaving 184 patients. More than 20,000 genes were present in the two datasets combined. Each dataset was standardized separately and the two were later merged for analysis. A total of 6769 variables (6766 genes plus age, gender, and race) matched between the datasets; these were passed to R for log-likelihood analysis.

Logistic regression: the log-likelihood did not converge, and R reported an error.

Naive Bayes: the log-likelihood was negative infinity.
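
One possible cause of a negative-infinity log-likelihood with several thousand predictors is floating-point underflow: multiplying thousands of probabilities below 1 reaches exactly 0, and the log of 0 is negative infinity. Whether this is what happened here cannot be told from the post; a zero conditional probability for some category would give the same result. The sketch below is a hypothetical Python illustration that compares the product-then-log route with summing the logs.

```python
import math

def log_likelihood_product(probs):
    """Multiply the probabilities first, then take the log (numerically fragile)."""
    product = 1.0
    for p in probs:
        product *= p
    # A product that has underflowed to 0.0 corresponds to log(0) = -infinity.
    return math.log(product) if product > 0.0 else float("-inf")

def log_likelihood_sum(probs):
    """Sum the logs of the individual probabilities (numerically stable)."""
    return sum(math.log(p) for p in probs)

# Hypothetical case: 6769 predictors, each with conditional probability 0.5.
probs = [0.5] * 6769

fragile = log_likelihood_product(probs)
stable = log_likelihood_sum(probs)
print(fragile, stable)
```

Working on the log scale throughout keeps the value finite as long as no single probability is exactly zero.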

For the training error on the significantly correlated genes, see the previous posts.
Log-likelihood and AIC/BIC for the significantly correlated genes, using naive Bayes and logistic regression, will be reported next.
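
For reference, AIC and BIC follow directly from the maximized log-likelihood: AIC = 2k - 2 logL and BIC = k ln(n) - 2 logL, where k is the number of estimated parameters and n the number of subjects. The sketch below is a hypothetical Python illustration; the log-likelihood value and parameter count are made up for the example and are not results from this project.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k * ln(n) - 2 * log-likelihood."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical inputs: log-likelihood of -40.0, 19 parameters, 184 subjects.
log_lik, k, n = -40.0, 19, 184

model_aic = aic(log_lik, k)
model_bic = bic(log_lik, k, n)
print(model_aic, round(model_bic, 2))
```

Lower values indicate a better trade-off between fit and complexity. BIC penalizes each extra parameter more heavily than AIC once n exceeds about 7.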

Re: Glioblastoma Gene Expression Prediction analysis.

Posted: Tue Apr 05, 2016 3:19 pm
by schen072
I found an interesting output.
Observation #1
I ran the non-discretized data through naiveBayes and got the same value for the likelihood.
The value may be extreme, but it suggests that the R e1071 package can also accept non-discretized data in the formula.
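
This fits the e1071 documentation, which describes naiveBayes as assuming a Gaussian distribution, given the class, for metric predictors. The sketch below is a hypothetical Python illustration of that idea with toy data. It is not the e1071 implementation.

```python
import math
from statistics import mean, stdev

def gaussian_log_pdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def fit_gaussian_nb(X, y):
    """Estimate a class prior and per-feature (mean, sd) pairs for each class."""
    model = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        params = [(mean(col), stdev(col)) for col in zip(*rows)]
        model[label] = (len(rows) / len(y), params)
    return model

def predict(model, row):
    """Return the class with the highest log prior plus summed Gaussian log densities."""
    scores = {}
    for label, (prior, params) in model.items():
        scores[label] = math.log(prior) + sum(
            gaussian_log_pdf(x, mu, sd) for x, (mu, sd) in zip(row, params)
        )
    return max(scores, key=scores.get)

# Hypothetical continuous expression values for two genes; 1 = disease, 0 = control.
X = [[-1.5, 0.2], [-1.1, 0.4], [-0.9, -0.1], [1.0, 0.3], [1.2, -0.2], [0.8, 0.1]]
y = [1, 1, 1, 0, 0, 0]

model = fit_gaussian_nb(X, y)
print(predict(model, [-1.0, 0.0]), predict(model, [1.0, 0.0]))
```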

Observation #2
When a missing value is present in the dataset, the conditional probabilities are shown individually, as in the output attached below for Race.