R code for Cleaning Data

Post by efrain.gonzalez0 » Thu May 25, 2017 10:57 am

Good morning/afternoon/evening,

Here I will be discussing the new R code that I have been working on: http://smlg.fiu.edu/gitlab/efraingonzalez0/cleaning-and-fixing-data-with-r/tree/master. I will also answer questions and help with any problems you run into when using the code. Please use the most recent version of the code, which is currently RCleanDscret.R. I recommend checking the GitLab site frequently for updates to this version. There are still many problems that I will be tackling.
The following are some of the problems that I have noticed:
  1. The code currently does not allow the user to supply a vector of data sets to clean, so the user has to manually select the file for each data set that they want to analyze. I have been working on this issue and am currently testing the automated version of the code.
  2. Because the code is being automated, I am in the process of improving the way it handles errors.
  3. The code currently only works for data sets that are in the series matrix format and are associated with a GPL that has been downloaded using the Download full table option. I believe I have found the solution to this problem, but I am still testing it. The solution is available in the current version of the code.
  4. The handling of NAs should be improved so that we do not see as many warnings from R. Although the warnings can easily be ignored, it would be nice not to have any.
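
For items 1 and 2, here is a minimal sketch of what looping over a vector of files with error handling could look like. It is not the code from the repository: clean_one() is a hypothetical stand-in for the real cleaning routine, and clean_many() is just an illustrative name. The idea is only that tryCatch() lets one bad file be recorded without stopping the rest of the batch.

```r
# Hypothetical stand-in for the real cleaning routine: it only checks that
# the file exists and returns its base name.
clean_one <- function(series_file) {
  if (!file.exists(series_file)) {
    stop("file not found: ", series_file)
  }
  basename(series_file)
}

# Run clean_one() on every file, recording errors instead of stopping.
clean_many <- function(series_files) {
  results <- lapply(series_files, function(f) {
    tryCatch(
      list(file = f, ok = TRUE, value = clean_one(f),
           message = NA_character_),
      error = function(e) {
        list(file = f, ok = FALSE, value = NA_character_,
             message = conditionMessage(e))
      }
    )
  })
  names(results) <- series_files
  results
}

# Example: every series matrix file in the current working directory.
series_files <- list.files(pattern = "series_matrix\\.txt(\\.gz)?$")
batch <- clean_many(series_files)
```

Afterwards, the files that failed can be listed with something like Filter(function(r) !r$ok, batch).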
Last edited by efrain.gonzalez0 on Wed Jun 21, 2017 2:43 pm, edited 9 times in total.
efrain.gonzalez0
 
Posts: 138
Joined: Tue May 02, 2017 12:29 pm

Using the New Code

Post by efrain.gonzalez0 » Thu May 25, 2017 12:36 pm

In this post I describe how to use the current version of the code.
  1. To begin, I recommend storing all GSE files and their corresponding GPL files in the same folder. Then, within your R session, set the working directory to that folder. If you are using R, this can be done by clicking the "File" drop-down menu, choosing "Change dir...", and then choosing your folder. If you are using RStudio, click on "Session" and then on "Choose Directory" within "Set Working Directory." Make sure that the GSE file you download is in the series matrix format. Make sure that you download the GPL file via the Download full table option or in the SOFT format, not the MINiML format. I have not checked the MINiML format. You do not need to extract the file from the zip folder; R can handle this. I am not sure whether extracting the file first makes the cleaning process any quicker.
  2. Near the top of the code there is a set of libraries that are required in order to use the code. If you get an error at this step, then most likely there is a package that you have not installed. To install a package, use install.packages("Package Name Here"). After you have installed a package, you must still use the library function to load it. You may also receive a warning that says "package ‘Package Name Here’ was built under R version 'current R version here'". This warning can be ignored, or you can choose to update to the most recent R version. You may also receive a message like "The following object is masked from ‘package:Package Name Here’:". Do not worry about this message, as I have accounted for it within my code. I have also ordered the library calls so that the masking does not interfere with the code.
  3. Following the above, you will notice a comment that starts with "Necessary Functions." This part is just a set of functions that I made for the task of cleaning the data. There is a small description for each, but unless you have particular questions about the functions there is not much to be said about them. You can just copy all of them into R. I separated them from the rest of the code so that they only have to be copied once per R session.
  4. "Getting the series matrix file:" For this part of the code there will be a pop-up from R that asks you to choose a file. Here you must choose the GSE series matrix file that corresponds to the data set that you want to clean.
  5. "Getting the GPL file:" For this part of the code there will be a pop-up from R that asks you to choose a file. Here you must choose the GPL file that corresponds to the series matrix file that you previously chose. I will be automating this eventually.
  6. The rest of the code can just be copied and pasted into the R session window. Nothing is required of the user beyond this point. When the code is done running, you will be able to see the new clean data files within the directory that you set. There will be two clean versions: one that has the raw data in a clean format and a second one in which the raw data has been discretized. You will be able to tell the two apart by their names. The first ends with "aftexcel.txt" and the second ends with "dscrt.txt".
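
To make steps 1, 2 and 4 a bit more concrete, here is a rough sketch and not the actual code from the repository. The function names load_or_install() and read_series_table() are my own illustrative names. The sketch assumes the series matrix file marks its data table with the lines "!series_matrix_table_begin" and "!series_matrix_table_end", and it reads a gzip-compressed file directly through gzfile(), which also opens uncompressed files.

```r
# Install a package only if it is missing, then attach it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# Read the expression table from a series matrix file (plain or .gz).
# Lines outside the begin/end markers are the "!"-prefixed metadata.
read_series_table <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  lines <- readLines(con)
  begin <- which(lines == "!series_matrix_table_begin")
  end <- which(lines == "!series_matrix_table_end")
  stopifnot(length(begin) == 1, length(end) == 1, end > begin + 1)
  read.delim(text = lines[(begin + 1):(end - 1)],
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Interactive use: pick the file in a pop-up, as in step 4.
if (interactive()) {
  series_path <- file.choose()
  expr <- read_series_table(series_path)
}
```

The file chooser only runs in an interactive session, so the two functions can be sourced safely from a script as well.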
Last edited by efrain.gonzalez0 on Wed Jun 21, 2017 2:48 pm, edited 2 times in total.

Some more info regarding the code

Post by efrain.gonzalez0 » Thu May 25, 2017 1:30 pm

I have included some comments within the code that give you a brief idea of what is happening at each step, but I have decided to be more explicit in my description on this forum.
The following gives a more detailed explanation of each step after the user has chosen the appropriate files:
  1. "Set working directory": This was an attempt to automate setting the working directory, but I forgot to account for different operating systems, so I have commented this section out completely.
  2. "Working with the wordy...": This part filters out some of the information that is unnecessary, such as contact information for the researchers. This part of the code deals only with the clinical information on the different subjects.
  3. "Changing row names...": This part of the code calls the first function that I created in order to change the names of the columns of the data. It does this by using a glossary of strings that I created. If it finds a particular string within the first row of a column, it renames the column based on that string. Columns that were not renamed are deleted, as they do not seem to contain information that we are interested in using for our analysis. Currently I have it set to keep information on Braak stage, age, sex, PMI, and whether the patient has the disease.
  4. "Reorganizing information...": This part of the code cleans up some of the unnecessary words that have been placed into the cells of the data set by calling the second function that I created. It also changes the class of the information within the cell to an appropriate class where necessary. For example, before this function is applied we may have a cell that reads "age: 92", but afterwards the cell will read 92 and will have the integer class.
  5. "Working with Actual...": From this point onward we are dealing with the gene data section of the data set.
  6. "Gene ID to...": This part extracts the necessary information from the GPL file that we need in order to match the gene ID to a gene name.
  7. "Changing the ID...": This part changes the gene ID found in the GSE series matrix file to the gene name found within the GPL file by using the third function that I created. This is the most time-consuming part of the code because it currently has to search one by one through the adjusted GPL file.
  8. "Adjusting the column names...": This part sets the column names of the data set from their initial gene IDs to their gene names. It uses the fourth function that I created to complete this procedure.
  9. "Full Data": In this part I bring together the clinical information and the gene data into one data frame and then write it to the two text files described previously.
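
To illustrate steps 4 and 7, here is a small sketch. These are not the functions from the repository: strip_label() and map_ids() are hypothetical names, and the column names passed to map_ids() are assumptions about how the GPL table is laid out. The first function turns a cell like "age: 92" into the integer 92 without triggering coercion warnings. The second shows how a vectorized match() could replace a one-by-one search, and as a design choice of this sketch it falls back to the original ID when no gene name is found.

```r
# Strip a "label: " prefix and convert to a number when every value allows it.
strip_label <- function(x) {
  value <- trimws(sub("^[^:]*:[[:space:]]*", "", x))
  numeric_value <- suppressWarnings(as.numeric(value))
  # Convert only if every non-missing value parsed as a number.
  if (all(is.na(value) | !is.na(numeric_value))) {
    if (all(is.na(numeric_value) | numeric_value == round(numeric_value))) {
      return(as.integer(numeric_value))
    }
    return(numeric_value)
  }
  value
}

# Map probe IDs to gene names in one vectorized lookup.
# id_col and name_col are assumed column names; adjust them to your GPL.
map_ids <- function(ids, gpl, id_col = "ID", name_col = "Gene Symbol") {
  idx <- match(ids, gpl[[id_col]])
  gene <- as.character(gpl[[name_col]])[idx]
  # Keep the original ID where the GPL has no name for it.
  missing <- is.na(gene) | gene == ""
  gene[missing] <- ids[missing]
  gene
}

# Toy example with made-up probe IDs.
toy_gpl <- data.frame(ID = c("p1", "p2", "p3"),
                      Symbol = c("APP", "", "MAPT"),
                      stringsAsFactors = FALSE)
ages <- strip_label(c("age: 92", "age: 85"))
genes <- map_ids(c("p3", "p1", "p2"), toy_gpl, "ID", "Symbol")
```

The result of map_ids() can then be assigned to the column names of the gene data, as in step 8.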

Make Sure to Post Your Clean GPL files

Post by efrain.gonzalez0 » Wed May 31, 2017 3:14 pm

Hello again everyone,

I am writing simply to say that the clean version of the GPL file is being saved to your directories. The cleaning process can sometimes take a while, and since we often use the same GPL files, I figured that this was a good way to cut down the amount of time it takes for the program to run. For this to work well for everyone, I think we should have one place where we can put all of the clean GPL files so that everyone has access to them. I figured it would be best if all of this were in the same place as the code.
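
The idea is roughly the following. This is only a sketch: get_clean_gpl() is a hypothetical name, the "clean_" file naming is made up for illustration, and clean_fun stands for whatever function does the actual cleaning. If a clean copy already exists it is read back, otherwise the GPL is cleaned once and saved.

```r
# Return the cleaned GPL table, reusing a saved copy when one exists.
get_clean_gpl <- function(gpl_path, clean_fun, cache_dir = ".") {
  # Hypothetical naming scheme for the saved clean file.
  cache_file <- file.path(cache_dir,
                          paste0("clean_", basename(gpl_path), ".txt"))
  if (file.exists(cache_file)) {
    return(read.delim(cache_file, quote = "", check.names = FALSE,
                      stringsAsFactors = FALSE))
  }
  cleaned <- clean_fun(gpl_path)
  write.table(cleaned, cache_file, sep = "\t", quote = FALSE,
              row.names = FALSE)
  cleaned
}
```

If the clean files are shared in one place, copying them into your folder before running the code means the slow cleaning step is skipped.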

Re: R code for Cleaning Data

Post by lsand039 » Wed Jun 07, 2017 12:47 pm

Age and Sex are in the same row under "!Sample_characteristics_ch1" for GSE6774. The script doesn't pick this up.
Last edited by lsand039 on Wed Jun 07, 2017 2:03 pm, edited 1 time in total.
lsand039
 
Posts: 237
Joined: Thu Jan 14, 2016 12:17 pm

Re: R code for Cleaning Data

Post by lsand039 » Wed Jun 07, 2017 1:32 pm

Probe IDs without gene names are listed as "---" for GPL5175.
Also, gene names are under the column named "gene_assignment" for GPL5175.
Last edited by lsand039 on Wed Jun 07, 2017 4:32 pm, edited 3 times in total.

Re: R code for Cleaning Data

Post by lsand039 » Wed Jun 07, 2017 2:04 pm

lsand039 wrote: Age and Sex are in the same row under "!Sample_characteristics_ch1" for GSE6774. The script doesn't pick this up.

Ignore this comment. I don't need to use this GSE file.

Re: R code for Cleaning Data

Post by lsand039 » Wed Jun 07, 2017 3:41 pm

For GSE15222, the age and sex for the Alzheimer's samples aren't getting picked up. They have a slightly different format than the controls' age and sex in the series matrix file.

Re: R code for Cleaning Data

Post by lsand039 » Fri Jun 09, 2017 8:39 am

Can you please post where in the code I can specify a column for the gene name?

Re: R code for Cleaning Data

Post by lsand039 » Fri Jun 09, 2017 9:54 am

The code is having issues recognizing the column for the gene name I want to use in GPL5175, since the names are separated by "//", not "///".
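
In case it helps, here is a rough sketch of how the "gene_assignment" column could be parsed. It is not part of the posted code, and extract_symbols() is a hypothetical name. It assumes that each cell holds one or more records separated by " /// ", that the fields inside a record are separated by " // ", and that the gene symbol is the second field. Cells that contain only "---" are returned as NA. The example strings are made up to show the layout.

```r
# Extract gene symbols from gene_assignment-style strings.
# Assumed layout: "accession // symbol // description // ... /// accession // ..."
extract_symbols <- function(x) {
  vapply(x, function(entry) {
    if (is.na(entry) || entry == "" || entry == "---") {
      return(NA_character_)
    }
    records <- strsplit(entry, " /// ", fixed = TRUE)[[1]]
    symbols <- vapply(strsplit(records, " // ", fixed = TRUE),
                      function(fields) {
                        if (length(fields) >= 2) trimws(fields[2]) else NA_character_
                      },
                      character(1))
    symbols <- unique(symbols[!is.na(symbols) & symbols != "---"])
    if (length(symbols) == 0) NA_character_ else paste(symbols, collapse = "; ")
  }, character(1), USE.NAMES = FALSE)
}

# Made-up cells in the assumed layout.
cells <- c("ACC_1 // APP // description // location // 1 /// ACC_2 // APP // description // location // 1",
           "---",
           "ACC_3 // MAPT // description // location // 2 /// ACC_4 // STH // description // location // 3")
symbols <- extract_symbols(cells)
```

When a probe maps to more than one symbol, this sketch joins them with "; ". Keeping only the first symbol would be an equally reasonable choice.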
