SMLG (Statistical Machine Learning Group) Discussion Forum

by **cwyoo** » Tue Sep 17, 2019 2:24 pm

Information related to the class projects for Probabilistic Graphical Models, Fall 2019 will be posted here. Every Friday, please post what you have been working on for your class project.

by **Olu1** » Tue Sep 24, 2019 12:46 pm

I'm studying the datasets for the class project and hope to come up with sensible questions.
So far, I have observed that some datasets appear in more than one category. Please see the attached file for these datasets and the categories they belong to.
Dataset GSE84010 in particular, when merged with other datasets, significantly reduces the number of genes in the final merged datasets for Grade, Treatment and Survival. The Case Control category does not include this dataset.
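One possible explanation for the drop in gene count, offered here as an assumption rather than something confirmed in this thread, is that the merge keeps only the genes shared by every dataset, so one dataset with a smaller gene list shrinks the whole result. A minimal sketch with made-up series names and gene lists (not the real GSE contents):

```python
def shared_genes(datasets):
    """Return the genes present in every dataset (what an inner-join merge keeps)."""
    gene_sets = [set(genes) for genes in datasets.values()]
    return set.intersection(*gene_sets) if gene_sets else set()

# Hypothetical toy data: series names and gene lists are illustrative only.
toy = {
    "series_A": ["TP53", "EGFR", "PTEN", "IDH1"],
    "series_B": ["TP53", "EGFR", "PTEN", "MGMT"],
    "series_C": ["TP53", "EGFR"],  # stands in for a dataset with fewer genes
}

without_c = shared_genes({k: v for k, v in toy.items() if k != "series_C"})
with_c = shared_genes(toy)
print(len(without_c), len(with_c))
```

Running the same comparison on the real gene lists, with and without GSE84010, would show whether that dataset is what limits the merged gene count.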

by **Olu1** » Thu Oct 03, 2019 1:55 pm

I tried to make sense of the original data before merging by reviewing them in the NCBI database, and I found that some datasets grouped with GRADE do not have any grade information. Also, some grouped with SURVIVAL do not have any survival information. The Treatment group is okay.

The following were grouped as GRADE: GSE9885, GSE84010, GSE73038, GSE53228, GSE42670, GSE36426, GSE31545, GSE25632, GSE10878. But GSE9885, GSE73038 and GSE42670 do not have any grade information, even though they are tumor samples. GSE9885 does have survival information, although it has not been extracted. It would be better to merge it with the survival group instead.

The following were grouped as SURVIVAL: GSE84010, GSE7696, GSE42670, GSE31545. But GSE42670 and GSE31545 do not have any survival information. I suggest they be removed from the group.
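The regrouping proposed above can be written out as a small sketch. The accession lists are copied from this post; the variable names are only illustrative, and the result is a proposal for review, not a final grouping.

```python
# Groupings as originally assigned (from this post).
grade = ["GSE9885", "GSE84010", "GSE73038", "GSE53228", "GSE42670",
         "GSE36426", "GSE31545", "GSE25632", "GSE10878"]
survival = ["GSE84010", "GSE7696", "GSE42670", "GSE31545"]

# Findings from reviewing the series in the NCBI database (from this post).
no_grade_info = {"GSE9885", "GSE73038", "GSE42670"}
no_survival_info = {"GSE42670", "GSE31545"}
move_to_survival = ["GSE9885"]  # has survival information, not yet extracted

# Proposed groupings: drop series lacking the relevant information,
# and add GSE9885 to the survival group.
revised_grade = [g for g in grade if g not in no_grade_info]
revised_survival = [g for g in survival if g not in no_survival_info] + move_to_survival

print("GRADE:", revised_grade)
print("SURVIVAL:", revised_survival)
```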

Sir, could you please review this and make suggestions?

Thank you.

by **Olu1** » Fri Oct 18, 2019 3:31 pm

I have attached some descriptive statistics I computed for the class project so far.
Your input and comments will be highly appreciated.
I merged some datasets as explained in my earlier post to generate the attached dataset.

by **Olu1** » Tue Oct 22, 2019 1:07 pm

Discretized glioblastoma datasets are combined here into two groups: a case control & grade group and a treatment & survival group.
Please see attached csv files for these new datasets.

Case control and Grade
See attached cc&gmerged.csv file.
The following datasets (originally the case control and grade datasets) were merged into a single dataset I called "CC&G":
GSE7696
GSE6014
GSE41467
GSE36278
GSE25632
GSE10878
GSE9885
GSE73038
GSE53228
GSE42670
GSE36426
GSE31545

Treatment and Survival
See attached t&smerged.csv file.
The following datasets (originally the treatment and survival datasets) were combined to form a single dataset I called "T&S":
GSE84010
GSE7696
GSE7344
GSE42670
Please review for corrections and comments.

by **bernardoj** » Thu Nov 07, 2019 10:52 pm

Using the dataset that Olu posted, I transposed it and cleaned it up a bit. I filled in missing values and corrected a few values. I had to transpose it because otherwise the BN algorithm would treat the observations as variables rather than the other way around. After doing that, I went through and got rid of the remaining issues.

At the start, there were 618 total patients/observations. I removed 32 that did not have a gender specified, which left 586. Then I removed 13 that did not have an age specified, which left a total of 573 observations.

From there I looked at how each glioblastoma case was graded. There were 16 cases where the grade was 0, 86 where the grade was 1, 4 where the grade was 3, 114 where the grade was 4 and 352 where the grade was 11. Since I'm not sure what 11 means, I decided to do two things. In one scenario, I turned anything that was not 0 into 1 to distinguish the people who had glioblastoma from those who did not. I called this dataset CaseControlGrade Binary. After doing a little research, I concluded that 11 might mean Grade II. Therefore, in the second scenario, I changed just the 11s to 2, so that the dataset now has grades 0, 1, 2, 3, 4.

Lastly, I ran the Max-Min Hill-Climbing (MMHC) algorithm as a trial, to see what I would get and how long it would take. It took about 4 hours to finish. You can see the results in the zip file. I will be rerunning MMHC to see what results I get now with a cleaner dataset.
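The cleaning steps described in this post can be sketched roughly as follows. This is not the code that was actually run: the field names (`gender`, `age`, `grade`), the helper names and the toy records are assumptions made for illustration.

```python
def transpose(table):
    """Turn a variables-by-observations table into observations-by-variables."""
    return [list(column) for column in zip(*table)]

def drop_missing(records, field):
    """Keep only the records where `field` is present and non-empty."""
    return [r for r in records if r.get(field) not in (None, "")]

def recode_grade(records, mapping):
    """Replace grade codes found in `mapping`; leave other grades unchanged."""
    return [{**r, "grade": mapping.get(r["grade"], r["grade"])} for r in records]

def binarize_grade(records):
    """Scenario 1: grade 0 stays 0, every other grade becomes 1."""
    return [{**r, "grade": 0 if r["grade"] == 0 else 1} for r in records]

# Hypothetical toy records, not taken from the real dataset.
records = [
    {"id": "s1", "gender": "F", "age": 54, "grade": 11},
    {"id": "s2", "gender": "", "age": 61, "grade": 4},     # no gender: dropped
    {"id": "s3", "gender": "M", "age": None, "grade": 1},  # no age: dropped
    {"id": "s4", "gender": "M", "age": 47, "grade": 0},
]

clean = drop_missing(drop_missing(records, "gender"), "age")
binary = binarize_grade(clean)          # CaseControlGrade Binary scenario
ordinal = recode_grade(clean, {11: 2})  # scenario where 11 is read as Grade II

print([r["grade"] for r in binary], [r["grade"] for r in ordinal])
```

Keeping the two scenarios as separate derived datasets, rather than overwriting the original grade column, makes it easy to redo the recoding if 11 turns out to mean something else.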

by **Olu1** » Sat Nov 09, 2019 2:25 pm

Datasets with survival information on GBM (GSE7696, GSE84010) were downloaded from the NCBI database using GEOparse in Python. The data were then cleaned and merged into a single dataset using Python code we wrote. After cleaning and merging, the data contain a subject-specific ID, demographic information, overall survival time, and 638 genes for 428 subjects. The data are now ready for analysis. A summary of the data is available in the attached updated proposal.
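For anyone who wants to repeat the download step, here is a minimal sketch using GEOparse. It is not the script attached to this post: the function name `fetch_series` is made up, and the use of the `VALUE` column is an assumption about how the expression values are stored in each series.

```python
def fetch_series(accession, destdir="./"):
    """Download one GEO series and return (expression table, sample metadata).

    Requires the third-party GEOparse package and network access.
    """
    import GEOparse  # imported here so the sketch can be read without the package installed

    gse = GEOparse.get_GEO(geo=accession, destdir=destdir, silent=True)
    # Rows are probe IDs, columns are samples (GSM accessions).
    expression = gse.pivot_samples("VALUE")
    # One row per sample with the characteristics recorded in GEO.
    metadata = gse.phenotype_data
    return expression, metadata

# Example usage (downloads data, so it is left commented out):
# for accession in ["GSE7696", "GSE84010"]:
#     expression, metadata = fetch_series(accession)
#     print(accession, expression.shape, metadata.shape)
```

The survival fields still have to be located in the sample metadata of each series by hand, since the column names differ between series.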

Please see attached dataset and proposal.

by **Olu1** » Thu Nov 14, 2019 1:06 pm

Hey Bernardo, for each of your datasets, which GSE series did you combine? Zhengua needs to authenticate your dataset.

by **bernardoj** » Thu Nov 14, 2019 9:29 pm

Olu1 wrote:
Hey Bernardo, for each of your datasets, which GSE series did you combine? Zhengua needs to authenticate your dataset.

I used the datasets that you posted, then manually added the grade column and info.

by **bernardoj** » Thu Nov 14, 2019 9:33 pm

Here are the results from running the Max-Min Hill-Climbing (MMHC) algorithm as well as the Grow-Shrink algorithm for the Bayesian networks. I'm currently trying to run a score-based algorithm, but it is taking far too long to complete.
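One likely contributor to the long run time of a score-based search, stated here as general background rather than a diagnosis of this particular run, is the size of the search space: the number of possible DAGs grows super-exponentially with the number of variables. A short sketch using Robinson's recurrence for counting labeled DAGs (the function name is illustrative):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

for n in range(1, 6):
    print(n, count_dags(n))
# 1 1
# 2 3
# 3 25
# 4 543
# 5 29281
```

Heuristic searches do not enumerate all of these structures, but the growth rate gives a sense of why restricting the candidate parents first, as MMHC does, or reducing the number of genes can make the search far more practical.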
