SMLG (Statistical Machine Learning Group) Discussion Forum

by **efrain.gonzalez0** » Tue Jun 13, 2017 3:04 pm

This file includes information on the data that is available for each of the GSE datasets that were specified as having used RNA sequencing. The SRP file seems to be the only one that is common to most GSE datasets. As you will see, there is one GSE dataset that only contains an http link, which instantly starts the download of a 134 GB file. I did not include the Series Matrix, SOFT, or MINiML formats because they only seem to contain clinical information.

by **efrain.gonzalez0** » Wed Jun 14, 2017 1:00 pm

If you click on the ftp link that is associated with the SRA file, you will be redirected to an FTP site that contains the data. This data is in the Sequence Read Archive (SRA) database. You will notice that the files end with ".sra". The submission format for these files could have been any of the following: BAM, SFF, HDF5, and FASTQ, with the first three being the submission formats preferred by NCBI. The NCBI website recommends that we use the SRA Toolkit for downloading this data. From what I have been reading, we can easily use this toolkit to download the files and convert them into FASTQ or SAM format. Below I have included the links that provide us with all of the information necessary for the task at hand.

https://www.ncbi.nlm.nih.gov/geo/info/faq.html
https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/
https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/
https://github.com/ncbi/sra-tools/wiki
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=std

by **efrain.gonzalez0** » Thu Jun 15, 2017 12:25 pm

Hello everyone,

I installed the SRA Toolkit on path-five and was able to produce a ".fastq" version of the ".sra" file provided by NCBI. The commands I used are below, and the links include more information on them. All of this was done from within the ~/sratoolkit.2.8.2-1-ubuntu64/bin directory.
First, check whether the toolkit can find a path to the SRR file:
./srapath SRR828708
Second, bring in the file and convert it to fastq:
./fastq-dump SRR828708
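
For anyone who wants to script the two steps above for several accessions, here is a minimal Python sketch. It only wraps the same two commands (srapath and fastq-dump, each with the accession as its sole argument). The helper names build_commands and fetch_fastq and the dry_run flag are hypothetical and not part of the toolkit, and the directory is the one mentioned above.

```python
import subprocess
from pathlib import Path

# Directory from the post above; adjust to wherever the toolkit is installed.
TOOLKIT_BIN = Path.home() / "sratoolkit.2.8.2-1-ubuntu64" / "bin"


def build_commands(accession, bin_dir=TOOLKIT_BIN):
    """Return the two commands from the post as argument lists (hypothetical helper)."""
    return [
        [str(Path(bin_dir) / "srapath"), accession],     # step 1: resolve the accession
        [str(Path(bin_dir) / "fastq-dump"), accession],  # step 2: download and convert
    ]


def fetch_fastq(accession, bin_dir=TOOLKIT_BIN, dry_run=True):
    """Run both steps in order; with dry_run=True only return the commands."""
    commands = build_commands(accession, bin_dir)
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first step that fails
            subprocess.run(cmd, check=True, cwd=bin_dir)
    return commands
```

Calling fetch_fastq("SRR828708") returns the commands without running anything; pass dry_run=False on a machine where the toolkit is installed.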

These links may also help:

More on Installation of SRA Toolkit https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=std
Some Documentation for Toolkit https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
Downloading Data using Toolkit https://github.com/ncbi/sra-tools/wiki/Download-On-Demand

by **lsand039** » Thu Jun 15, 2017 4:15 pm

I've finally finished checking and cleaning the data using Efrain's code. In total there are 14 different studies with 2908 different samples and 7097 common genes. I'm able to use 2248 samples (1174 Alzheimer's / 1074 controls, 1089 females / 1138 males) after removing those that had missing data or were not Alzheimer's or Alzheimer's control samples.
Datasets used
GSE1297 GPL96
GSE15222 GPL2700
GSE16759 GPL570
GSE23290 GPL5175
GSE26927 GPL6255
GSE28146 GPL570
GSE29378 GPL6947
GSE36980 GPL6244
GSE37263 GPL5175
GSE39420 GPL11532
GSE44772 GPL4372
GSE48350 GPL570
GSE5281 GPL570
GSE84422 GPL570
GSE84422 GPL96

Combined Datasets.xlsx: (51.61 MiB) Downloaded 178 times

There were 1021 different subjects from those 2248 samples (535 Alzheimer's / 496 control subjects).

Selected Datasets.xlsx: (59.93 KiB) Downloaded 172 times

contains more details on each of the studies and the samples chosen. The first tab lists the studies from the GEO search that had information on both the age and sex of the sample. The studies in red could not be used for reasons explained in the notes column. The second tab breaks down the number of samples for the demographic/clinical variables. The third tab provides the number and different types of brain samples within the chosen studies. The fourth tab has the number of genes and probe IDs that were matched and the proportion of the original dataset that matched up with other datasets. The fifth and sixth tabs list the dataset and GSE sample names that came from the same subject or were not included in the analysis.

aftexcel.tar.gz: (395.87 MiB) Downloaded 225 times

are the resulting text files from Efrain's RClean4.R code. These just match the gene name to the probe ID.

Discretized.tar.gz: (19.11 MiB) Downloaded 186 times

are the text files that resulted from Efrain's RPostClean.R code. This consolidated (averaged) the values of repeated genes (genes with multiple probe IDs), and discretized the gene expression values using the z-scores of the consolidated genes.
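
To make the two steps concrete, here is a minimal Python sketch of averaging repeated genes and then discretizing by z-score. It is not a port of RPostClean.R. The three levels (0/1/2), the cutoffs of -1 and +1, and computing z-scores per gene across samples are all assumptions made for illustration, and the values are toy numbers.

```python
from collections import defaultdict
from statistics import mean, pstdev


def consolidate(rows):
    """Average rows that share a gene name.

    rows: list of (gene, [value per sample]) pairs, one per probe ID.
    """
    grouped = defaultdict(list)
    for gene, values in rows:
        grouped[gene].append(values)
    # zip(*probes) walks the samples; mean() averages the probes for each one
    return {gene: [mean(col) for col in zip(*probes)]
            for gene, probes in grouped.items()}


def discretize(values, low=-1.0, high=1.0):
    """Map values to 0/1/2 by z-score (cutoffs are assumed, not from RPostClean.R)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [1] * len(values)  # no variation: everything lands in the middle level
    codes = []
    for v in values:
        z = (v - mu) / sigma
        codes.append(0 if z < low else 2 if z > high else 1)
    return codes


# Toy input: two probes map to the same gene, one gene has a single probe
rows = [
    ("APP", [1.0, 2.0, 3.0, 10.0]),
    ("APP", [3.0, 4.0, 5.0, 12.0]),
    ("TNFRSF1A", [5.0, 5.0, 5.0, 5.0]),
]
genes = consolidate(rows)
discrete = {g: discretize(v) for g, v in genes.items()}
```

Because the codes depend on the mean and standard deviation of the consolidated values, small changes in the inputs can move a value across a cutoff.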

7097 Genes.tar.gz: (44.16 MiB) Downloaded 183 times

contains the Excel files where I discretized the Age, Sex, Brain Region, and Alzheimer status.

7097genes.txt: (30.54 MiB) Downloaded 184 times

is the data file I'll be using for BaNJO.

by **lsand039** » Fri Jun 16, 2017 11:50 am

I've made some slight edits to the Methods since using Efrain's R code. I haven't changed much on the information for BaNJO or Gene Ontology.

Methods.docx: (216.35 KiB) Downloaded 188 times

I'll be posting the Results part after I list the datasets which include genotypes.

by **lsand039** » Fri Jun 16, 2017 1:21 pm

The only microarray datasets that have genotype information are GSE39420 & GSE29652.

GSE39420matchedDPSEN1Mutation.xlsx: (915.34 KiB) Downloaded 181 times

This series was already included among the 14 studies. I've only added the PSEN1 mutation as a variable to this file.

GSE29652aftexcel.xlsx: (5.28 MiB) Downloaded 168 times

This series has not yet been included in the 14 studies. The series matrix file only distinguishes samples as having an APOE e4 allele or not. It also doesn't specify the age or sex of the sample. I checked the published paper of the study, and it looks like the actual APOE genotypes for the samples are listed along with age and sex, but the series matrix file doesn't provide any information that would let me link the information from the paper to the appropriate sample in the series matrix.

by **lsand039** » Mon Jun 19, 2017 4:11 pm

Here are more updated Methods & Results sections.

by **lsand039** » Wed Jun 21, 2017 5:01 pm

Here are files from the BaNJO run of the 15 datasets. I've done 1, 2, 4, and 8 hour runs each in Path 2, 3, and 5.

7097genes.txt is the input data file with all 2254 samples. The settings#.txt files correspond to the settings for the number of hours BaNJO ran.

BaNJO setting files.tar.gz: (3.4 MiB) Downloaded 178 times

This folder contains the original dot text files of the resulting graphs.

Original dot text files.tar.gz: (864.09 KiB) Downloaded 184 times

This folder contains the image results for the full structures. The Markov Blanket genes have been colored pink (1st degree) and orange (2nd degree). The Alzheimer node has been colored yellow.

Full structures.tar.gz: (12.92 MiB) Downloaded 189 times

This folder contains just the 1st and 2nd degree Markov Blanket genes. It follows the same color scheme as the full structures.

MB structures.tar.gz: (15.74 KiB) Downloaded 188 times

This is a list of the Markov Blanket genes for each result. The first tab shows the genes found per BaNJO running time interval for each path and the complete list of 136 Markov Blanket genes found (40 first degree MB genes and 96 second degree MB genes). Only one gene was found more than once: TNFRSF1A. The subsequent tabs show the number each gene corresponds to in the dot text file.
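
As an illustration of how such a list can be pulled from a dot text file, here is a minimal Python sketch. It assumes the file contains edge lines of the form "12 -> 40;". It uses the standard definition of a Markov blanket (parents, children, and the children's other parents), and it treats "second degree" as nodes in the blanket of a first-degree node, which is an assumption about how the term is used here. The toy graph and the choice of node 0 as the target are hypothetical.

```python
import re

# Matches edges such as: 12 -> 40;  or  "12" -> "40";
EDGE = re.compile(r'"?([\w.]+)"?\s*->\s*"?([\w.]+)"?')


def parse_edges(dot_text):
    """Extract (parent, child) pairs from the edge lines of a dot file."""
    return [(m.group(1), m.group(2)) for m in EDGE.finditer(dot_text)]


def markov_blanket(edges, target):
    """Parents, children, and the children's other parents of target."""
    parents = {p for p, c in edges if c == target}
    children = {c for p, c in edges if p == target}
    spouses = {p for p, c in edges if c in children and p != target}
    return parents | children | spouses


def second_degree(edges, target):
    """Nodes in the blanket of a first-degree node, excluding first-degree nodes."""
    first = markov_blanket(edges, target)
    second = set()
    for node in first:
        second |= markov_blanket(edges, node)
    return second - first - {target}


# Toy graph; node 0 stands in for the Alzheimer node
dot_text = """
digraph toy {
  1 -> 0;
  0 -> 2;
  3 -> 2;
  4 -> 1;
  5 -> 6;
}
"""
edges = parse_edges(dot_text)
first = markov_blanket(edges, "0")
second = second_degree(edges, "0")
```

In the toy graph, nodes 1, 2, and 3 are first degree, node 4 is second degree, and nodes 5 and 6 are in neither set.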

MB .xlsx: (15.69 KiB) Downloaded 181 times

This file lists the scores of each structure and the log likelihoods of each graph. The best graph was the one resulting from Path 2 at an 8 hour run as it made up 100% of the total score.
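
A share of 100% is not surprising if the shares are computed by exponentiating log scores, because even a modest gap between two log scores becomes an enormous ratio. The Python sketch below shows the arithmetic. It assumes that is how the shares were obtained, which this thread does not confirm, and the scores are made-up numbers.

```python
import math


def relative_weights(log_scores):
    """Convert log scores to shares of the total, using the log-sum-exp shift."""
    top = max(log_scores.values())
    # Subtracting the maximum keeps exp() from underflowing to zero for every run
    shifted = {name: math.exp(s - top) for name, s in log_scores.items()}
    total = sum(shifted.values())
    return {name: w / total for name, w in shifted.items()}


# Made-up log scores, only to show the scale effect
toy = {"run A": -150000.0, "run B": -150040.0, "run C": -150900.0}
weights = relative_weights(toy)
```

Here run B is only 40 log units behind run A, yet its share is on the order of e^-40, so run A rounds to 100%.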

Scores.xlsx: (4.69 KiB) Downloaded 178 times

by **lsand039** » Fri Jun 23, 2017 12:06 pm

Efrain found that when he automated the code, the decimal places of the raw values were getting slightly cut off, which affected the discretization process. GSE44772 was most affected. I reran BaNJO on the 15 datasets with the same settings as before.

7097genes.txt contains the new input data file with all 2262 samples. There are 8 more samples in this run than in the 1st run, since there were fewer subjects in GSE44772 that were missing data. I'm currently looking into why this is.

7097genes.txt: (30.69 MiB) Downloaded 183 times

This folder contains the original dot text files of the resulting graphs.

Original dot text files.tar.gz: (789.99 KiB) Downloaded 172 times

This folder contains the image results for the full structures. The Markov Blanket genes have been colored pink (1st degree) and orange (2nd degree). The Alzheimer node has been colored yellow.

Full Structures.tar.gz: (7.66 MiB) Downloaded 184 times

This folder contains just the 1st and 2nd degree Markov Blanket genes. It follows the same color scheme as the full structures. The 1 hour runs for Path 2 and 3 did not result in any MB genes.

MB structures.tar.gz: (5.67 KiB) Downloaded 182 times

The tab "Trial 2" is the list of the Markov Blanket genes for each result. Only 42 Markov Blanket genes were found in this run (13 first degree MB genes and 29 second degree MB genes). None of the genes were found more than once. The subsequent tabs show the number each gene corresponds to in the dot text file.

MB7097.xlsx: (25.06 KiB) Downloaded 171 times

The second group of scores in this file lists the scores of each structure and the log likelihoods of each graph. The best graph was the one resulting from Path 2 at an 8 hour run as it made up 100% of the total score.

Scores.xlsx: (5.26 KiB) Downloaded 168 times

by **lsand039** » Fri Jun 23, 2017 5:15 pm

I noticed that the number of subjects that had missing data on GSE44772 was different when using the cleaning code (RCleanDscrete.R) vs. the automated version (RAutoCIDs.R). The discrete file from the automated version missed 8 samples that had missing information on a probe ID/gene expression value, meaning that it listed 0 column NAs. I'm not sure why, since both of the aftexcel files put "null" if there isn't a number for that probe ID/gene.

Samples with missing data not picked up by RAutoCIDs.R:
GSM1090363
GSM1090365
GSM1090586
GSM1090628
GSM1090762
GSM1090792
GSM1090803
GSM1090910
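
One way to cross-check which samples have missing values, independent of either R script, is to count the missing markers directly in the aftexcel text file. The Python sketch below is an assumption-laden illustration, not a diagnosis of RAutoCIDs.R: it assumes a tab-delimited file with the probe/gene ID in the first column and one column per sample, and it treats "null", "NA", "NaN", and empty cells as missing. The sample names in the toy input are placeholders.

```python
import csv
import io

MISSING = {"", "na", "nan", "null"}  # assumed set of missing-value markers


def is_missing(cell):
    """True if the cell holds one of the assumed missing-value markers."""
    return cell.strip().lower() in MISSING


def samples_with_missing(text, delimiter="\t"):
    """Return {sample: count of missing cells} for samples with at least one."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    samples = header[1:]  # first column is assumed to hold the probe/gene ID
    counts = dict.fromkeys(samples, 0)
    for row in reader:
        for sample, cell in zip(samples, row[1:]):
            if is_missing(cell):
                counts[sample] += 1
    return {s: n for s, n in counts.items() if n}


# Toy input with placeholder sample names
toy = (
    "ID\tGSM_A\tGSM_B\tGSM_C\n"
    "probe1\t0.12\tnull\t0.40\n"
    "probe2\tNA\t0.33\t0.21\n"
)
flagged = samples_with_missing(toy)
```

A check that only recognizes NA would flag GSM_A but not GSM_B in this toy input, so comparing the two counts can show whether the "null" strings are being overlooked.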

GSE444772differences.xlsx shows the original probe IDs and values on the series matrix file for GSM1090363 in columns A & B.
Columns D-F are the values on the aftexcel files. Trial 1 uses RCleanDscrete.R and Trial 2 uses RAutoCIDs.R.
Columns H-J are the values for the dscrete files. This is where I'm running into several issues. The NAs found in Trial 1 seem to automatically encode as 2. Also, many genes encoded as 2 in Trial 1 are coded as 0 in Trial 2 and vice versa. I'm not sure why this is or which trial has the correct discretization.

GSE444772differences.xlsx: (1.59 MiB) Downloaded 176 times

Attached are the aftexcel and dscrete files for GSE44772 for Trial 1 and Trial 2. I'm working with Efrain to figure this out. For now, I'm not sure if I should run a third BaNJO trial where I just remove the 8 extra samples that were included in Trial 2.
