I've finally finished checking and cleaning the data using Efrain's code. In total there are 14 different studies with 2908 different samples and 7097 common genes. I'm able to use 2248 samples (1174 Alzheimer's / 1074 controls, 1089 females / 1138 males) after removing those that had missing data or were neither Alzheimer's nor Alzheimer's control samples.
Datasets used:
GSE1297 GPL96
GSE15222 GPL2700
GSE16759 GPL570
GSE23290 GPL5175
GSE26927 GPL6255
GSE28146 GPL570
GSE29378 GPL6947
GSE36980 GPL6244
GSE37263 GPL5175
GSE39420 GPL11532
GSE44772 GPL4372
GSE48350 GPL570
GSE5281 GPL570
GSE84422 GPL570
GSE84422 GPL96
There were 1021 different subjects from those 2248 samples (535 Alzheimer's / 496 control subjects).
contains more details on each of the studies and the samples chosen. The first tab lists the studies from the GEO search that had information on both the age and sex of the sample. The studies in red could not be used for reasons explained in the notes column. The second tab breaks down the number of samples by demographic and clinical variable. The third tab provides the number and types of brain samples within the chosen studies. The fourth tab has the number of genes and probe IDs that were matched, and the proportion of each original dataset that matched up with the other datasets. The fifth and sixth tabs list the dataset and GSE sample names that came from the same subject or were not included in the analysis.
are the resulting text files from Efrain's RClean4.R code. These just match the gene name to the probe ID.
are the text files that resulted from Efrain's RPostClean.R code. This consolidated (averaged) the values of repeated genes (genes with multiple probe IDs), and discretized the gene expression values using the z-scores of the consolidated genes.
contain the Excel files where I discretized the Age, Sex, Brain Region, and Alzheimer status.
is the data file I'll be using for BaNJO.
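For anyone who wants to see the consolidation and discretization steps in miniature, here is a rough sketch in Python. It is not Efrain's RPostClean.R code, only an illustration of the same idea. The probe IDs, gene names, and expression values are made up, the function names are hypothetical, and the three-level coding with a z-score cutoff of 1 is an assumption, since the actual cutoffs aren't stated above. It also assumes z-scores are computed per gene across samples.

```python
from statistics import mean, pstdev


def consolidate(probe_values, probe_to_gene):
    """Average the expression rows of all probes that map to the same gene.

    probe_values: {probe_id: [value per sample]}
    probe_to_gene: {probe_id: gene_name}; unmapped probes are dropped.
    """
    grouped = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe has no matching gene name
        grouped.setdefault(gene, []).append(values)
    # zip(*rows) walks the rows sample by sample, so each mean is per sample
    return {gene: [mean(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}


def discretize(values, cutoff=1.0):
    """Code each value as -1, 0, or 1 from its z-score (assumed cutoff)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0] * len(values)  # constant gene: no sample deviates
    coded = []
    for v in values:
        z = (v - mu) / sigma
        coded.append(-1 if z < -cutoff else 1 if z > cutoff else 0)
    return coded


# Toy data (hypothetical): two probes for GENE_A, one for GENE_B, one unmapped.
probe_values = {
    "probe_1": [2.0, 4.0, 6.0, 8.0],
    "probe_2": [4.0, 6.0, 8.0, 10.0],
    "probe_3": [5.0, 5.0, 5.0, 5.0],
    "probe_4": [1.0, 2.0, 3.0, 4.0],
}
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}

genes = consolidate(probe_values, probe_to_gene)
states = {gene: discretize(values) for gene, values in genes.items()}
print(states)
```

In this toy case the two GENE_A probes are averaged into a single row before the z-scores are taken, so the discretized value reflects the consolidated gene rather than either probe on its own.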