SMLG (Statistical Machine Learning Group) Discussion Forum

Posted: **Sun Oct 22, 2017 5:46 pm**

I have been downloading GEO datasets on heavy metal exposure in zebrafish (Danio rerio).
I began by building a table of candidate studies, from which I selected the ones most relevant to my search.
Once the data have been downloaded, I will continue by cleaning out the unnecessary information.
Best,
Juan Morales, MPH

Posted: **Sun Oct 22, 2017 6:00 pm**

After downloading the GEO platform and series matrix files, the next step is to clean up the data with an R script:
- Open the R script
- Make a folder
- Change your working directory to the folder you just created, which holds all your data from GEO
- Use the code for your records
What does this do?
- It removes data that are not needed for the analysis
- It organizes your samples and genes into separate columns/rows
- It saves time compared with doing the cleanup manually

- The R code will analyze the data and generate:
*After_xcel, Discretized, GPL_orig, together_xcel, and z-scores

This code was modified specifically to run on Danio rerio, since the species has acronyms that are not found in human studies.
Once you have opened your data and cleaned it up further, you can begin the next step: matching the genes across all your datasets.
Hope this helps.
Best,

Juan Morales, MPH

Posted: **Sun Oct 22, 2017 6:36 pm**

Following the R_code cleaning, my next step will be to identify the genes that are common to all the datasets.
This will not only narrow down my search, it will also show which of the most prominent genes are shared between studies.
To do this:
- Use the R script
- Follow the same procedure to select your directory, this time the folder where your newly cleaned datasets are stored
- Use the RMatchGesnes_use. text file provided
- This will bring up a new prompt stating:

------Check2Match()
How many rows of clinical data are their in each data set (separate each number by a comma no spaces)?:

- This is where you need to open the Discretized file in Excel and count the rows from the top of column A (DO NOT count the ID_REF row)
- For example, count all the clinical data rows provided by GEO in column A until you reach gene #1

Do this for ALL datasets, and keep track of how many clinical rows each dataset contains.

For example, with a total of 5 datasets: GSE30482 = 8 clinical, GSE47039 = 3 clinical, GSE50648 = 9 clinical, GSE74038 = 9 clinical, and GSE1010582 = 5 clinical

Again, when R shows this prompt: Check2Match()
How many rows of clinical data are their in each data set (separate each number by a comma no spaces)?:

I typed the counts in the same order in which I collected them from the GSE Discretized files: 8,3,9,9,5 and then pressed ENTER.

The matching and analysis take around 1.5 hours and produce your matched genes.
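For anyone who wants to see the idea behind the matching step, here is a minimal Python sketch. It is not the R script used above. It assumes that column A of each Discretized file starts with the ID_REF header, followed by the clinical rows, followed by the gene IDs, as described in this post. The function names and the toy gene lists are hypothetical.

```python
def parse_counts(answer):
    """Turn a reply such as "8,3,9,9,5" into a list of integers."""
    return [int(part) for part in answer.split(",")]


def gene_ids(column_a, n_clinical):
    """Skip the ID_REF header and the clinical rows; keep the gene IDs."""
    return column_a[1 + n_clinical:]


def common_genes(columns, counts):
    """Return the gene IDs present in every dataset, sorted alphabetically."""
    if len(columns) != len(counts):
        raise ValueError("need one clinical row count per dataset")
    gene_sets = [set(gene_ids(col, n)) for col, n in zip(columns, counts)]
    return sorted(set.intersection(*gene_sets))


# Toy column A contents (hypothetical, not taken from the GEO files).
dataset_1 = ["ID_REF", "age", "sex", "gstp1", "mt2", "hmox1"]  # 2 clinical rows
dataset_2 = ["ID_REF", "dose", "mt2", "hmox1", "cyp1a"]        # 1 clinical row

counts = parse_counts("2,1")
common = common_genes([dataset_1, dataset_2], counts)
print(common)
```

The counts have to be given in the same order as the datasets; otherwise clinical rows would be treated as genes, or genes would be skipped.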

Best,

Juan Morales, MPH

Posted: **Thu Nov 09, 2017 10:16 am**

This week I will begin organizing my data by averaging my sample replicates. The chemicals analyzed in my research will be arsenic, cadmium, cobalt, copper and lead.
Zebrafish are analyzed as adults as well as in their embryonic stages. I will try to generate a text file that contains the control/exposed groups with the data discretized as 0, 1, 2.

As-8hrs
As-24hrs
As-48hrs
As-96hrs
Cd-8hrs
Cd-24hrs
Cd-48hrs
Cd-96hrs
Co-8hrs
Co-24hrs
Co-48hrs
Co-96hrs
Cu-8hrs
Cu-24hrs
Cu-48hrs
Cu-96hrs
Pb-8hrs
Pb-24hrs
Pb-48hrs
Pb-96hrs
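
As a rough illustration of the replicate averaging, here is a small Python sketch for a single gene. The sample names, the expression values and the function name are hypothetical, and it assumes that each sample has already been assigned to one of the condition labels above. The discretization into 0, 1, 2 is left out because the rule for it is not described here.

```python
from statistics import mean


def average_replicates(values_by_sample, condition_of):
    """Average one gene's expression values over the replicates of each condition."""
    grouped = {}
    for sample, value in values_by_sample.items():
        grouped.setdefault(condition_of[sample], []).append(value)
    return {condition: mean(values) for condition, values in grouped.items()}


# Hypothetical expression values for one gene.
values_by_sample = {"s1": 1.0, "s2": 2.0, "s3": 3.0, "s4": 4.0, "s5": 6.0}

# Hypothetical assignment of samples to the condition labels listed above.
condition_of = {
    "s1": "As-8hrs", "s2": "As-8hrs", "s3": "As-8hrs",
    "s4": "As-24hrs", "s5": "As-24hrs",
}

averages = average_replicates(values_by_sample, condition_of)
print(averages)
```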

Posted: **Mon Dec 04, 2017 7:50 pm**

This week I was able to install PuTTY on my workstation, but I still need to master the commands for Ubuntu.
I also ran static simulations in Banjo, 1 hour each.
The chemicals involved were nickel, lead, arsenic and cadmium.
The next step will be to collect my output, since I left the runs going.
I ran into a problem where Banjo did not detect the directory where my observational data was located.
Later I realized that it was a simple syntax error. I also sat down with Lauren and Kaumudi but need more time to evaluate my results.
There were other errors caused by the number of columns not matching the number of variables.

Posted: **Tue Dec 05, 2017 8:37 pm**

I came across a great piece of information this week. It explains the importance of using Dynamic Bayesian Networks with time series data.
Title: "Modelling Gene Expression Data using Dynamic Bayesian Networks"
https://users.cs.duke.edu/~amink/public ... 01.psb.pdf

Enjoy,
Best,

Juan

Posted: **Wed Sep 11, 2019 11:40 am**

Hello everyone,

This post should help anyone who is interested in analyzing and comparing zebrafish samples. I will attach a descriptive table that lists the exposure category, GEO series accession, platform, sample IDs, number of samples, treatment conditions and gender.
All samples were obtained from the NCBI GEO website. The raw data were downloaded individually for each sample. The platform and series matrix files were joined so that sample IDs, gene IDs and expression intensities are matched, using Excel and RStudio.

Brief summary of the experiment design:

Series Title: Transcriptome Kinetics of Arsenic-induced Adaptive Response in Zebrafish Liver.
Series Accession Number: GSE 3048
Platform: GPL 2715
Expression Profile: Microarray
Organism: Zebrafish
CH1: Cy5--> Treated sample RNA
CH2: Cy3-->Reference pooled sample (male and female) liver RNA

The arrays contain 16,416 oligonucleotide probes. The probes were designed by Compugen and synthesized by Sigma-Genosys. The array also contains 172 spots representing the same beta-actin probe as controls. To estimate a suitable concentration for the arsenic experiment, adult zebrafish were treated with different concentrations of arsenic V. Based on this toxicology test, a concentration of 15 ppm As V was used to treat the adult zebrafish. For the microarray study, fish liver samples were taken at 8, 24, 48 and 96 hours and separated into three pooled samples at each time point. During the initial design steps, reference RNA was obtained by pooling equal amounts of male and female total RNA extracted from the liver tissues of wild-type zebrafish.

Please feel free to comment below if you have any questions.
Thanks, and have a great week.

Best,
Juan Carlos Morales, MPH | Environmental Health Sciences
U.S. Department of Energy Doctoral Fellow - Graduate Research Assistant
Systems Biology Group
Robert Stempel College of Public Health & Social Work
Florida International University
11200 S.W. 8th Street, AHC 5 353 . Miami, FL 33199
Tel. 305.348.3994 Mobile 786.282.4458 E: jumorale@fiu.edu

Posted: **Sun Sep 15, 2019 4:26 pm**

Hello everyone,

My next step in analyzing the dataset (GSE 3048 / GPL2715) will be to obtain the Tk median for each gene. Tk median = (R spot median - R background median) / (G spot median - G background median). The raw data include the spot intensity along with the reference sample used to obtain the expression ratio for each sample.
Thank you,

Juan Morales

Posted: **Sun Sep 15, 2019 5:00 pm**

In my previous post, I described how to obtain the median Tk expression values. They are calculated by taking the raw expression value (CH1) for each gene and subtracting the background intensity for each probe. The background intensity is also subtracted from the reference intensity (CH2) values. Tk median = (CH1 median - CH1 background) / (CH2 median - CH2 background).
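
For readers who prefer code to spreadsheet formulas, here is a minimal Python sketch of the background-corrected ratio, (CH1 median - CH1 background) / (CH2 median - CH2 background). The function name and the example intensities are hypothetical. Returning None when a background-corrected value is zero or negative is my own assumption for the sketch, not something taken from the dataset.

```python
def tk_ratio(ch1_median, ch1_background, ch2_median, ch2_background):
    """Background-corrected expression ratio of CH1 (treated) over CH2 (reference)."""
    numerator = ch1_median - ch1_background
    denominator = ch2_median - ch2_background
    # Assumption: spots at or below background are flagged instead of
    # producing a zero, negative or undefined ratio.
    if numerator <= 0 or denominator <= 0:
        return None
    return numerator / denominator


# Hypothetical intensities for a single spot.
ratio = tk_ratio(ch1_median=1200, ch1_background=200,
                 ch2_median=700, ch2_background=200)
print(ratio)  # (1200 - 200) / (700 - 200) = 2.0
```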

The samples used in this analysis (24 in total) were:
CONTROL--> 3 control (8hrs), 3 control (24hrs), 3 control (48hrs) and 3 control (96 hrs)
TREATMENT--> 3 treatment (8hrs), 3 treatment (24hrs), 3 treatment (48 hrs) and 3 treatment (96 hrs).

Sample IDs
GSM67011
GSM67012
GSM67013
GSM67014
GSM67015
GSM67016
GSM67017
GSM67018
GSM67019
GSM67020
GSM67021
GSM67022
GSM67023
GSM67024
GSM67025
GSM67026
GSM67027
GSM67028
GSM67029
GSM67030
GSM67031
GSM67032
GSM67033
GSM67034

Posted: **Sun Sep 15, 2019 7:23 pm**

Good afternoon everyone, I hope everyone was able to follow along with the sample analysis. Here I will attach a table and describe how to transform the data from median Tk expression ratios to log2 expression ratios. Representing the samples as log2(expression ratio) values captures up-regulation and down-regulation in a systematic manner. For example, a 4-fold up-regulation maps to log2(4) = 2 and a 4-fold down-regulation maps to log2(1/4) = -2. From this attachment the reader will be able to identify the differentially regulated genes under any condition.

This transformation has a great advantage: it treats up- and down-regulation equally, and it has a continuous mapping space. The Excel table includes all the formulas needed to convert the Tk medians to log2 values and identify up- and down-regulated genes.
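
The same transformation can be sketched in a few lines of Python for anyone not using the Excel table. The function names are hypothetical. The cutoff of 1.0 (a 2-fold change) used to label a gene as up- or down-regulated is only an illustrative assumption, not the threshold used in the table.

```python
import math


def log2_ratio(ratio):
    """Convert an expression ratio (for example a Tk median) to log2 scale."""
    if ratio <= 0:
        raise ValueError("expression ratio must be positive")
    return math.log2(ratio)


def direction(log2_value, cutoff=1.0):
    """Label a gene using a hypothetical cutoff of 1.0 (2-fold change)."""
    if log2_value >= cutoff:
        return "up"
    if log2_value <= -cutoff:
        return "down"
    return "unchanged"


print(log2_ratio(4))     # 4-fold up-regulation
print(log2_ratio(1 / 4)) # 4-fold down-regulation
print(direction(log2_ratio(4)), direction(log2_ratio(1 / 4)))
```

Equal fold changes in either direction end up the same distance from zero, which is the symmetry described above.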

Please review the table,

Thank you,

Juan Morales

Danio rerio species GEO Dataset (Heavy metal) Microarray
