R code for Cleaning Data

Where to specify gene name?

Postby efrain.gonzalez0 » Fri Jun 09, 2017 10:41 am

lsand039 wrote:Can you please post where in the code I can specify a column for the gene name?


This can be done in two different ways.
The first is by changing the appropriate function in the Necessary Functions section near the top of the code. The steps are as follows:
  1. Go to the Necessary Functions section near the top of the code.
  2. In this section, go to the comment that reads #5#Function for adjusting the gene names. Here you will see a function named gcnames.
  3. Inside the parentheses next to the word function there is an argument named usecol. By default it is set to 1, meaning that the code uses the first name in the list of names given for each gene.
  4. Change the default from 1 to the position you want to use by default for the gene names. If a particular gene does not have that many gene names, the code will use the first gene name it reads for that gene.
I do not recommend this first method, because it changes the function itself, so you will have to copy and paste the function into your R session again every time you change it this way.
The second method is the one that I recommend for specifying the gene name that you want to use. The steps are as follows:
  1. Scroll to the bottom of the code.
  2. Go to the comment that reads Adjusting the column names aka the gene names.
  3. Right under the comment you will find the function gcnames with ALZDAT in parentheses.
  4. Inside the parentheses, to the right of ALZDAT, insert the following: ,usecol =
  5. To the right of the = place the number for the position of the gene name that you wish to use.
    Example: If gene ID 12345670 has several gene names, they will usually be displayed as GeneName1///GeneName2///GeneName3. To use the second gene name, which in this case is GeneName2, set usecol to 2, giving gcnames(ALZDAT,usecol = 2). For the third gene name the idea is the same, giving gcnames(ALZDAT,usecol = 3).
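
The usecol behaviour described above can be illustrated with a small sketch. This is not the actual gcnames function, which is not shown in this thread; pick_gene_name is a hypothetical stand-in that assumes the names are separated by "///" and that the first name is used when a gene has fewer names than usecol.

```r
# Hypothetical stand-in for gcnames(); the real function is not shown in this thread.
pick_gene_name <- function(names_field, usecol = 1) {
  # Split each entry such as "GeneName1///GeneName2///GeneName3" into its parts.
  parts <- strsplit(names_field, "///", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) == 0) return(NA_character_)      # empty entry, nothing to pick
    if (usecol <= length(p)) p[usecol] else p[1]   # fall back to the first name
  }, character(1))
}

genes <- c("GeneName1///GeneName2///GeneName3", "OnlyName")
pick_gene_name(genes)              # first name of each entry
pick_gene_name(genes, usecol = 2)  # second name where present, otherwise the first
```

Passing usecol in the call, as in the second method, leaves the function definition untouched, which is why nothing needs to be pasted into the R session again.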

I hope this solves your problem and gives you some more understanding of how the code works.
efrain.gonzalez0
 
Posts: 138
Joined: Tue May 02, 2017 12:29 pm

Re: R code for Cleaning Data

Postby lsand039 » Tue Jun 13, 2017 12:35 pm

In GPL5175, the code leaves a space when it reads the second column for the gene name. It also takes ~3 hours to clean.
lsand039
 
Posts: 237
Joined: Thu Jan 14, 2016 12:17 pm

Re: R code for Cleaning Data

Postby efrain.gonzalez0 » Tue Jun 13, 2017 3:22 pm

lsand039 wrote:In GPL5175, the code leaves a space when it reads the second column for the gene name. It also takes ~3 hours to clean.


Thanks for notifying me of this error. I believe I have fixed the issue with the spaces, and an updated version of the cleaning code is available on the Git server. I will look into the running time. For now, if anyone else is using GPL5175, ask Lauren for a clean version, since she already has one.
efrain.gonzalez0
 
Posts: 138
Joined: Tue May 02, 2017 12:29 pm

Re: R code for Cleaning Data

Postby lsand039 » Wed Jun 14, 2017 12:28 pm

The newest version of the code didn't pick up the AD group for GSE84422 GPL570.
lsand039
 
Posts: 237
Joined: Thu Jan 14, 2016 12:17 pm

Re: R code for Cleaning Data

Postby efrain.gonzalez0 » Fri Jun 16, 2017 11:45 am

lsand039 wrote:The newest version of the code didn't pick up the AD group for GSE84422 GPL570.

With the new code found in RClean4.R, this problem seems to be fixed. It looks like it was just a matter of adding the word "Normal" to the glossary.
efrain.gonzalez0
 
Posts: 138
Joined: Tue May 02, 2017 12:29 pm

Re: R code for Cleaning Data

Postby Hamza » Sat Jun 17, 2017 6:34 pm

I copied the code in the "First version of cleaning data with R" into R and I changed the directory, but it did not work.
It gave me this error for example, "library(dplyr) : there is no package called ‘dplyr’ > library(tidyr)".
What should I do?
Hamza
 
Posts: 34
Joined: Tue Jun 24, 2014 2:47 am

Re: R code for Cleaning Data

Postby efrain.gonzalez0 » Mon Jun 19, 2017 9:56 am

Hamza wrote:I copied the code in the "First version of cleaning data with R" into R and I changed the directory, but it did not work.
It gave me this error for example, "library(dplyr) : there is no package called ‘dplyr’ > library(tidyr)".
What should I do?


Good morning Hamza,

There is actually a more up-to-date version of the code called "RCleanDscret.R". Please use the new version, since there have been quite a few changes since the first version. This version outputs both the raw data in its clean form and a discretized version of the raw data. They are written to separate files: one ends in "aftexcel.txt" and the other ends in "dscrt.txt".

Now to answer your question: it seems that the dplyr library is not installed. You must install each of the libraries if you have never installed them before. To install a library in R, use the function install.packages("library name here"). So to install the dplyr library, type install.packages("dplyr"). I don't know whether you have all of the libraries installed, so I suggest you do this for each of them.
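
As a rough sketch of the advice above, the check and the install can be combined so that only missing libraries are installed. The helper names missing_packages and install_if_missing are hypothetical, and the package list is only the two libraries named in the error message (dplyr and tidyr); the cleaning script may need others.

```r
# Hypothetical helper: report which of the requested packages are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only the packages that are missing.
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(to_install)
}

# Usage (left commented out because it downloads from CRAN):
# install_if_missing(c("dplyr", "tidyr"))
```

Add any other library() names that appear at the top of the cleaning script to the vector before running it.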
efrain.gonzalez0
 
Posts: 138
Joined: Tue May 02, 2017 12:29 pm

NEW! Automated Code

Postby efrain.gonzalez0 » Thu Jun 22, 2017 10:59 am

Good morning/afternoon/evening to you all,

As mentioned in the first post, I have been working on a more automated version of the code. This version is titled "RAutoClDs.R" and is up on gitlab. The commit message currently says "Don't use this code yet", but please ignore that message; I was still testing the last update of the code when I wrote it. Now that my last test is complete, I would like you all to use this new code. Since it is a bit different from its predecessors, I will explain how to use it.
How to use the RAutoClDs.R code:
  1. Make sure that you have installed the libraries that are necessary to run the code. I have discussed this in a previous post. If you have installed the libraries before, there is no need to install them again, unless it has been a while since you first installed them, in which case the libraries may have been updated.
  2. Create a folder with all of the original GSE files and their corresponding GPL files. By original I mean the files that you get directly from GEO. Do not extract any of the files from the gz archives in which they are contained. Also put any clean GPL files that you may have into this folder. All of the clean GPL files start with the phrase "Clean_GPL". I have put all currently cleaned GPL files into a folder on gitlab.
  3. Set your working directory in R to the folder that you just created. I explained how to do this in an earlier post.
  4. Now copy and paste the RAutoClDs.R code into the R window and press enter.
  5. You will notice a question that reads "Do you want to clean all data files in the directory ...?" At this point you must decide whether you are just interested in cleaning a few of the files you put in your folder, all of them, or none of them.
    • If you don't want to clean all the files in your folder, type 2 and press enter. You will see a second prompt that asks you to "Choose the file/files you want to analyze: ." Make sure to follow the instructions on the screen. After you have made your selection, my code will not ask you for any further information, so feel free to go outside and enjoy the sunshine.
    • If you want to clean all the files in your folder, type 1 and press enter. Now you can sit at your computer and wait for a few hours as the program runs, or you can step outside, get some coffee, or catch up on some much-needed sleep. If you are planning on cleaning a large number of files, I suggest running the code on a server. I have had the code clean 15 files, and it took three hours on path-five.
    • If you don't want to run the code at all, type 0 and press enter. A message that says "Nothing done" will appear on the screen, along with an error message.
  6. After the code has finished running, there will be a few extra files in your folder. If you did not have a clean version of a particular GPL file, you should notice that a clean version has been made. Also, there will be two new files for every GSE file that you cleaned. One of the files will end in "aftexcel.txt" and the other will end in "dscrt.txt". The former is a clean version of the raw data and the latter is a discretized version of the clean raw data.
  7. Let's say that you finished running the code, you still have your R window open, and you want to run the code again because "why not?" There is no need to copy and paste the entire code again. Just type THEFT() into your R window and press enter, then follow the steps above starting at step 5.
I try to keep the code as up to date as possible, so please check gitlab for any updates. As always, feel free to post your questions here and I will respond as quickly as possible. Lastly, I will stress that I have yet to improve the error handling in this automated version, so before you run the code make sure you have all of the GPL files that you will need.
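
Because the error handling is still limited, a quick check of the folder before starting a long run may save time. The sketch below is not part of RAutoClDs.R; find_missing_gpl is a hypothetical helper that assumes every GPL file, raw or clean, has its platform ID (for example GPL570) somewhere in its file name, and that you supply the list of platform IDs your GSE files need.

```r
# Hypothetical pre-flight check; not part of RAutoClDs.R.
# Assumes each GPL file name contains its platform ID, e.g. "Clean_GPL570...".
find_missing_gpl <- function(needed, dir = ".") {
  files <- list.files(dir)
  has_file <- vapply(needed, function(id) {
    # The lookahead keeps "GPL57" from matching a file for "GPL570".
    any(grepl(paste0(id, "(?![0-9])"), files, perl = TRUE))
  }, logical(1))
  needed[!has_file]
}

# Usage: list the platform IDs you need; the result names those with no file.
find_missing_gpl(c("GPL570", "GPL5175"))
```

An empty result means every listed platform has at least one matching file in the working directory.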

Good luck,

Efrain Gonzalez
Last edited by efrain.gonzalez0 on Fri Jun 23, 2017 12:55 pm, edited 1 time in total.
efrain.gonzalez0
 
Posts: 138
Joined: Tue May 02, 2017 12:29 pm

Re: R code for Cleaning Data

Postby efrain.gonzalez0 » Thu Jun 22, 2017 12:23 pm

lsand039 wrote:The code is having issues recognizing the column for the gene name I want to use in GPL 5175 since the names are separated by "//", not "///".

I believe I have solved this problem with the new code. I just had to add "|//" to the gcnames function.
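
A minimal sketch of what the "|//" addition does, assuming the separator is used as a regular expression in a call such as strsplit. The helper split_gene_names is hypothetical and only the "///|//" pattern comes from this post. The trimws step is a further assumption about the earlier space problem: it strips the spaces left behind when names are written with spaces around the separator.

```r
# Hypothetical helper; only the "///|//" pattern comes from the post above.
split_gene_names <- function(names_field) {
  # "///" is listed first so that a triple slash is consumed as one separator.
  parts <- strsplit(names_field, "///|//")
  # Strip the spaces left around names in entries such as "A // B".
  lapply(parts, trimws)
}

split_gene_names(c("A///B///C", "A // B // C"))
```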
efrain.gonzalez0
 
Posts: 138
Joined: Tue May 02, 2017 12:29 pm

Re: R code for Cleaning Data

Postby vavec001 » Tue Jul 25, 2017 2:38 pm

Afternoon Efrain,
I am using your code "RAutoClDs.R" to clean my dataset on R version 3.4.1, and I have downloaded all the necessary packages. I copied and pasted the code to run it, but the question "Do you want to clean all data files in the directory ...?" did not pop up. I just want to make sure I am copying the correct code. Do I copy the entire code at once, or do it function by function (e.g. #1#Function for handling the changing of row names and column names, #2#Function for reorganizing information within the columns, etc.)? Thank you.
-Vincent
vavec001
 
Posts: 34
Joined: Thu May 28, 2015 12:49 pm
