SMLG (Statistical Machine Learning Group) Discussion Forum

by **lsand039** » Mon Apr 17, 2017 3:05 pm

A continuation of the last post

by **lsand039** » Tue Apr 18, 2017 10:09 am

The files below have the scores we expect on both GeNIe and BaNJO (the GMLS derived from Bene scores better than the CS), and their parameters match the data.

Here are their significance values:
BaNJO:
CS: -10836.4009
GMLS: -10827.6995
min: 0.012897781
max: 0.113568398

GeNIe:
CS: -8517.694735
GMLS: -8493.975566
min: 7.07E-06
max: 0.002659

by **lsand039** » Tue Apr 18, 2017 11:40 am

I went to validate the GMLS & CS files previously posted since they have the correct parameters on GeNIe. I'm getting the same Accuracy, ROC, and prediction values from both of them using the Leave One Out test. The only difference is that now some of the AD/Non AD prediction values on the GMLS and CS files actually match the prediction values in the validation output files. I'm still not sure why not all of the AD/Non AD prediction values match the values in the output files, for either the CS or the GMLS.

Questions I still have:
How is GeNIe scoring and predicting structures?
Why are only some of the prediction values from the validation file matching the structure prediction values for the GMLS & CS files?

by **lsand039** » Tue Apr 18, 2017 10:26 pm

I tried to change the ESS on BaNJO from the default value of 1.0 to 0.001 to match what I've been using on GeNIe. Unfortunately, I don't think I can specify anything lower than 1.0. Attached are the setting files and output summary files from my attempt.

by **lsand039** » Wed Apr 19, 2017 11:08 am

Here are the dot structures for the GMLS & CS. The thickness of each arc corresponds to the magnitude of its influence score. Influence scores can be found in the results summary in the previous post.
blue with arrow: positive influence score
red with perpendicular end: negative influence score
black: influence score of 0
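
For anyone who wants to reproduce this styling, here is a rough sketch of how the legend above could be turned into Graphviz DOT edge lines. It is not the script used for the attached structures. The function names, the node names, the width range, and the assumption that influence scores fall between -1 and 1 are all made up for illustration. `penwidth`, `color`, and `arrowhead="tee"` are standard Graphviz edge attributes.

```python
def edge_to_dot(parent, child, influence, max_width=5.0):
    """Format one arc as a DOT edge line (hypothetical helper).

    Assumes |influence| <= 1; larger magnitudes are clamped to max_width.
    """
    magnitude = min(abs(influence), 1.0)
    penwidth = 1.0 + (max_width - 1.0) * magnitude
    if influence > 0:
        style = 'color="blue", arrowhead="normal"'  # positive: blue arrow
    elif influence < 0:
        style = 'color="red", arrowhead="tee"'      # negative: red, perpendicular end
    else:
        style = 'color="black"'                     # zero influence: black
    return f'  "{parent}" -> "{child}" [{style}, penwidth={penwidth:.2f}];'


def to_dot(edges):
    """Build a whole digraph from (parent, child, influence) tuples."""
    lines = ["digraph structure {"]
    lines += [edge_to_dot(p, c, s) for p, c, s in edges]
    lines.append("}")
    return "\n".join(lines)


# Toy edges with placeholder node names and scores
print(to_dot([("GeneA", "AD", 0.5), ("GeneB", "AD", -1.0), ("GeneC", "AD", 0.0)]))
```

The minimum width of 1.0 keeps weak arcs visible, and the clamp keeps strong ones from getting too thick.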

Please let me know if they are difficult to read. I had to play around with the thickness of arcs so all of them could be visible and not overly obnoxious.

by **lsand039** » Thu May 11, 2017 12:09 pm

I found that out of the 12 datasets we have, there were only 8257 genes in common. I removed the extra genes from the files in the previous post, which contained the data for what I thought were 8286 common genes. Attached are the datasets with those 8257 common genes and the gene expression levels already discretized.
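
For reference, the common-gene step is a set intersection across the datasets. Below is a minimal sketch, assuming each dataset has already been reduced to a list of gene symbols. The function name and the toy gene lists are placeholders, not the real data.

```python
def common_genes(gene_lists):
    """Return the sorted genes that appear in every dataset."""
    gene_sets = [set(genes) for genes in gene_lists]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))


# Toy example with three small placeholder datasets
datasets = {
    "dataset_a": ["APP", "PSEN1", "APOE", "MAPT"],
    "dataset_b": ["APOE", "APP", "BACE1", "MAPT"],
    "dataset_c": ["MAPT", "APP", "APOE"],
}
shared = common_genes(datasets.values())
print(len(shared), shared)
```

Turning each list into a set first means duplicate symbols within one dataset (for example from multiple probes) don't affect the count.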

The data from GSE48350 with the 8257 common genes are not up yet since I need Access or Base to find the common genes in this dataset. Next, I'll be posting the input files for BaNJO that have Age, Sex, AD, and Brain Region discretized.

by **lsand039** » Thu May 11, 2017 12:10 pm

A continuation of the last post.

by **lsand039** » Thu May 11, 2017 5:29 pm

Here I've discretized the age, sex, brain region, and Alzheimer's status of the samples.
Age>65=1, Age<65=0
Hippocampus=1, Non-hippocampus=0
Female=1, Male=0
Alzheimer=1, Control/Non-Alzheimer=0
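
The coding above can be written as a small function, sketched below. The rules don't say how an age of exactly 65 is coded, so this sketch assumes it maps to 0. The text labels it matches on ("female", "hippocampus", "alzheimer"/"ad") and the output field names are also assumptions about how the sample annotations are written.

```python
def discretize_sample(age, sex, region, status):
    """Map one sample's annotations to 0/1 codes (hypothetical helper)."""
    return {
        "Age": 1 if age > 65 else 0,  # assumption: exactly 65 is coded 0
        "Sex": 1 if sex.strip().lower() == "female" else 0,
        "BrainRegion": 1 if region.strip().lower() == "hippocampus" else 0,
        "AD": 1 if status.strip().lower() in {"alzheimer", "ad"} else 0,
    }


print(discretize_sample(72, "Female", "Hippocampus", "AD"))
print(discretize_sample(65, "Male", "Entorhinal cortex", "Control"))
```

Any value that doesn't match the "1" label falls through to 0, so samples with missing or ambiguous annotations should be filtered out before this step.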

Sheet3 for GSE84422 contains only the samples that were definitely AD or Normal.

by **lsand039** » Thu May 11, 2017 5:30 pm

Continued from last post

by **lsand039** » Fri May 12, 2017 2:56 pm

Here is a table of the 12 datasets I plan to be using. They have 8257 genes in common.

Attachment: Dataset Summary.png (44.77 KiB)

The GSE # refers to their GEO accession number. GSE84422 used two platforms, GPL96 and GPL570. I only counted the samples that were definitively AD or controls. I still need to clean up GSE48350 using Base/Access, but right now I'm having issues opening the file on either of those.

To find out how much of each dataset was included in the list of common genes, I went to the list of genes in the GPL file. The column labeled "Original # of genes in GPL" refers to the number of genes I found in the GPL file. The number within the parentheses is the GPL#.

Not all the genes in the GPL file are always shown in the GSE dataset. Because multiple probe IDs can match with the same gene, I couldn't directly determine how many genes were available in each dataset. I could find out using Base or Access, but I'm running into a couple of issues. Base needs the Java Runtime Environment, which doesn't seem to be installed in Path 3 and maybe Path 5. Java Runtime Environment is installed in Path 4, but Base keeps freezing up. I think it's because of the size of the files I'm using.
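
As a possible alternative to Base/Access for this count, here is a sketch of collapsing probe IDs to distinct genes. It assumes the GPL annotation has been loaded into a probe-to-gene dictionary and that the probe IDs present in the GSE file are available as a list. The names and toy values are placeholders.

```python
def genes_in_dataset(probe_to_gene, dataset_probe_ids):
    """Return the set of distinct genes covered by a dataset's probes."""
    genes = set()
    for probe_id in dataset_probe_ids:
        gene = probe_to_gene.get(probe_id)
        if gene:  # skip probes with no gene annotation
            genes.add(gene)
    return genes


# Toy annotation: two probes map to the same gene, one has no gene symbol
probe_to_gene = {"p1": "APP", "p2": "APP", "p3": "APOE", "p4": "", "p5": "MAPT"}
dataset_probes = ["p1", "p2", "p3", "p4"]

available = genes_in_dataset(probe_to_gene, dataset_probes)
print(len(available), sorted(available))
```

Because the result is a set, several probes pointing at the same gene are counted once, and genes listed in the GPL file but missing from the dataset are left out.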

I think Access lets me work with larger files, but the Virtual Machine on Path-3 is too low on disk space. I've tried to increase the memory and delete any unnecessary files, but I can't get enough free space to open my files. I will also eventually need Excel so I can include all 2221 samples during a BaNJO run. LibreOffice Calc has a 1024 column limit, so there won't be enough room to format the data in either variables as columns/samples as rows or samples as columns/variables as rows.
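
One way around the spreadsheet column limit might be to skip the spreadsheet and do the transposing and writing as plain delimited text. The sketch below assumes a tab-delimited text file is acceptable for the next step, which would need to be checked against what BaNJO expects. The helper names and the toy table are placeholders.

```python
import csv
import io


def transpose_rows(rows):
    """Swap rows and columns (variables-as-rows <-> samples-as-rows)."""
    return [list(col) for col in zip(*rows)]


def write_tsv(rows, handle):
    """Write rows as tab-delimited text; no spreadsheet column limit applies."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Toy table: one row per variable, one column per sample
variables_as_rows = [
    ["gene", "s1", "s2"],
    ["APP", 0, 1],
]
samples_as_rows = transpose_rows(variables_as_rows)

buffer = io.StringIO()
write_tsv(samples_as_rows, buffer)
print(buffer.getvalue())
```

For real files, `handle` would be an open file (for example `open(path, "w", newline="")`) instead of the in-memory buffer.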

Once I can use Access or Base, the data should be ready to go through BaNJO!

GEO datasets
