SMLG (Statistical Machine Learning Group) Discussion Forum

by **shstyoo** » Mon Mar 07, 2016 5:23 pm

I think I understand what we're looking to accomplish now. I'll work on the script and upload it later tonight if time permits.

by **shstyoo** » Mon Mar 07, 2016 8:21 pm

Quick update: I've gotten the script to work for small datasets. However, files like GSE63060 have more than 50,000 lines that the script has to run through, so I'm working on reducing the runtime.
That said, there are also a couple of bug fixes I have to make, so the code should be done by Tuesday.

I'll write up a tutorial on Git (I assume you are running a Windows machine, correct?) once the script is finished.

by **shstyoo** » Tue Mar 08, 2016 1:49 am

The finalized script is up and running. If you would like to download a local copy of the script, you can do so here:

https://github.com/shstyoo/alzheimer-prediction-model

I'm not sure where to push the script on the GitLab page. I could push it to Summer 2014 Chronic Disease Model or 2014 acheeti code. Let me know which one you want me to push it to.

To use the script, you will have to download it to a local folder.

Put the Probe & Gene ID CSV file in that folder (name it something easy to type at a command line), and put the dataset file in the same folder.

Launch the gene-probe_id.py script by clicking it.

Follow the prompts and type in the file names, which are case sensitive and must include the file extension.

Let me know if there are any issues with the script. If I can find probe IDs for GSE63060, or for any similarly sized file, I'll test it on that. Currently it seems to work for moderately sized files.
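
For readers who want a feel for what such a script does, here is a minimal sketch of the merge, not the actual gene-probe_id.py from the repository. It assumes both files are comma-separated with a header row, that the probe ID is the first column of each file, and that the gene name is the second column of the mapping file. The function names and the position of the new GeneID column are illustrative choices. Loading the mapping into a dictionary and streaming the dataset one row at a time keeps the work roughly proportional to the number of lines, which matters for files with tens of thousands of rows.

```python
import csv


def load_probe_to_gene(mapping_path):
    """Read the probe/gene CSV into a dict for constant-time lookups."""
    probe_to_gene = {}
    with open(mapping_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            if len(row) >= 2:
                probe_to_gene[row[0].strip()] = row[1].strip()
    return probe_to_gene


def add_gene_column(mapping_path, dataset_path, output_path):
    """Copy the dataset row by row, adding a GeneID column after the probe ID."""
    probe_to_gene = load_probe_to_gene(mapping_path)
    with open(dataset_path, newline="") as src, \
            open(output_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow([header[0], "GeneID"] + header[1:])
        for row in reader:
            if not row:
                continue  # ignore blank lines
            gene = probe_to_gene.get(row[0].strip(), "")  # blank if unmapped
            writer.writerow([row[0], gene] + row[1:])
```

A call such as `add_gene_column("Probe IDs and Gene Names.csv", "GSE63060.csv", "GSE63060_with_genes.csv")` would write the merged file; the output file name here is hypothetical.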

by **cwyoo** » Tue Mar 08, 2016 7:05 am

shstyoo wrote:
Just a quick update, my computer is having trouble opening the actual GSE63060_family.soft file (found on the GEO website). It looks like the file is too large for Open Office to handle. Are you able to open it in Microsoft Excel?

cwyoo wrote:
Steve, could you create a script that reads in two text files (in comma-separated (csv) or tab-separated (txt) format) and creates one text file in the same format? You may use the two files that Lauren posted here (Probe IDs and Gene Names.csv is already comma-separated; GSE63060.xlsx should be converted to csv or tab-separated txt format).

So, your script should read in Probe IDs and Gene Names.csv and GSE63060.xlsx (converted to csv or tab-separated txt format; let's call it GSE63060.csv) and produce a result text file that adds to GSE63060.csv a column called GeneID, holding the gene that corresponds to each Probe ID in Probe IDs and Gene Names.csv.

Since Lauren has more files like GSE63060.xlsx, she is planning to use your script to produce the text files needed for further analyses. Please let us know if you have any other questions or comments.

lsand039 wrote:
I wasn't able to open the actual GSE63060_family.soft file, but I did find another file, GSE63060_series_matrix.txt, that contains the probe ID and array information. This file was small enough to open in Excel, and I used it to create GSE63060.xlsx. The file was downloaded from http://www.ncbi.nlm.nih.gov/geo/query/a ... c=GSE63060.

Would you prefer I upload files like GSE63060.xlsx as *.csv files? Also, it would be great if you could let me know how to use Git. Thanks!

I believe the code requires the csv file. There is no need to post it here again, since the Excel file is easily converted. For the Git tutorial, please go to:

Board index ‹ Manuscripts & Documentation ‹ Useful Tools for Implementation ‹ Document Revision Control
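
As an alternative to converting through Excel, the expression table can be pulled out of a series matrix file such as GSE63060_series_matrix.txt with a few lines of Python. This is only a sketch: it assumes the usual GEO layout, where metadata lines start with `!` and the tab-separated table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` marker lines. The function name is illustrative.

```python
import csv


def series_matrix_to_csv(matrix_path, csv_path):
    """Write the expression table of a GEO series matrix file as CSV.

    Returns the number of rows written, header included.
    """
    in_table = False
    written = 0
    with open(matrix_path, newline="") as src, \
            open(csv_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)
        for row in reader:
            if not row:
                continue  # skip blank lines
            if row[0] == "!series_matrix_table_begin":
                in_table = True
            elif row[0] == "!series_matrix_table_end":
                break
            elif in_table:
                writer.writerow(row)
                written += 1
    return written
```

Because the file is read one line at a time, this avoids the size limits that spreadsheet programs run into with large GEO downloads.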

by **cwyoo** » Tue Mar 08, 2016 7:13 am

shstyoo wrote:I'm not sure where to push the script on the Gitlab page. I could push it to Summer 2014 Chronic Disease Model or 2014 acheeti code. Let me know which one you want me to push it to.

You can log in to the SMLG Git server, create a project called "Gene ID & Probe ID", mark it as public, and push the script there.

by **lsand039** » Thu Mar 24, 2016 2:44 pm

Here are the *.csv files for GDS810 and GDS4136. The last column holds each gene's correlation value. The *.xlsx file just has a list of the top genes for each file. I still need to find the top genes that match across the two files, which I'll resume after my interview.

by **lsand039** » Thu Mar 24, 2016 6:07 pm

Attached are 50 genes that have high correlation values in both GDS810 and GDS4136. These were chosen by taking the genes with the top correlation values from GDS810 and, for each one, finding the same gene with the highest correlation value in GDS4136. Many of the genes with the highest correlation values in GDS4136 weren't found in GDS810, which is why I started with GDS810.
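
The selection described above could be automated roughly as follows. This is a sketch under assumptions, not the procedure actually used for the attachment: each dataset is taken to be a list of (gene, correlation) pairs, a gene that appears in several rows is represented by its row with the largest absolute correlation, and ranking is by absolute value. The function names are illustrative.

```python
def best_per_gene(rows):
    """Keep, for each gene, the correlation with the largest absolute value."""
    best = {}
    for gene, corr in rows:
        if gene not in best or abs(corr) > abs(best[gene]):
            best[gene] = corr
    return best


def top_shared_genes(primary_rows, secondary_rows, limit=50):
    """Rank genes by |correlation| in the primary dataset and keep only
    those that also occur in the secondary dataset."""
    primary = best_per_gene(primary_rows)
    secondary = best_per_gene(secondary_rows)
    ranked = sorted(primary, key=lambda gene: abs(primary[gene]), reverse=True)
    shared = [
        (gene, primary[gene], secondary[gene])
        for gene in ranked
        if gene in secondary
    ]
    return shared[:limit]
```

Here GDS810 would play the role of the primary dataset and GDS4136 the secondary one, matching the order in which the genes were looked up.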

Let me know what I should do next so I can add it to my poster board. Also, Dr. Yoo, you mentioned that you would post Steve's past poster so I could use it as an additional guide. I just wanted to follow up on that. Thanks!

by **lsand039** » Fri Mar 25, 2016 1:31 pm

The file GDS810 and GDS4136.xlsx contains the table of z-score values of the genes of interest for each sample (sheet 3) and the list of correlation values used to pick the top genes (sheet 1). Many of the genes of interest appeared in multiple rows, so I chose one row at random for each. Most of the high correlation values were positive, even after taking absolute values into consideration.
The files used to calculate the z-scores for each study are GDS810.xlsx and GDS4136.xlsx.
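
For reference, the two steps mentioned above (picking one row per gene at random, then computing z-scores across samples) might look like this in Python. It is a sketch with assumptions: rows are (gene, list of sample values) pairs, the random choice is seeded so it can be reproduced, and the sample standard deviation is used, which may differ from the formula in the spreadsheet. The function names are illustrative.

```python
import random
import statistics


def pick_one_row_per_gene(rows, seed=0):
    """Choose one row per gene at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    by_gene = {}
    for gene, values in rows:
        by_gene.setdefault(gene, []).append(values)
    return {gene: rng.choice(candidates) for gene, candidates in by_gene.items()}


def z_scores(values):
    """Standardize one gene's values across samples (needs at least two)."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)  # sample standard deviation (assumption)
    if spread == 0:
        return [0.0] * len(values)  # constant row: no variation to scale
    return [(value - mean) / spread for value in values]
```

Recording the seed alongside the results makes it possible to regenerate exactly the same row selection later.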

by **lsand039** » Sat Mar 26, 2016 9:31 pm

I finished fixing the correlation values and the genes of interest. Attached are the revised files.

by **cwyoo** » Mon Mar 28, 2016 7:44 am

lsand039 wrote:I got through fixing the correlation values and the genes of interest. Attached are the revised files

Attached is the Excel file with a discretized sheet added. Please refer to the discretized sheet to see how the data was categorized to run in Banjo (see http://smlg.fiu.edu/phpbb/viewforum.php?f=33). Also attached are the Bayesian network learning results obtained using Banjo and log normalization (see http://smlg.fiu.edu/phpbb/viewtopic.php?f=42&t=19). Please read about these tools and learn how to use them.
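
To give a rough idea of what log normalization and discretization involve, here is a small Python sketch. It is not the categorization used in the attached sheet: the base-2 logarithm, the three levels, and the rank-based (tertile) cut are all assumptions made for illustration, so please rely on the discretized sheet for the actual rules.

```python
import math


def log2_transform(values):
    """Log2-transform expression values (all values must be positive)."""
    return [math.log2(value) for value in values]


def discretize_tertiles(values):
    """Map each value to 0 (low), 1 (middle), or 2 (high) by rank,
    so the three levels hold roughly equal numbers of samples.
    Tied values are split in their original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, index in enumerate(order):
        levels[index] = rank * 3 // len(values)
    return levels
```

Applied to one gene's values across samples, `discretize_tertiles(log2_transform(values))` gives one category per sample, which is the kind of discrete input a Bayesian network learner works with.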

You may download and look into GeNIe (see http://smlg.fiu.edu/phpbb/viewforum.php?f=32) to further process the results.

GEO datasets
