SMLG (Statistical Machine Learning Group) Discussion Forum

by **efrain.gonzalez0** » Thu Aug 03, 2017 2:51 pm

Good morning/afternoon/evening,

The RMatchGenes.R code on the GitLab site finds the genes that are common to all of the data sets you are using. It works on the dscrt.txt and aftexcel.txt files produced by the cleaning code mentioned previously, whose function is named THEFT(). I have not yet posted the version that also works on the Z-score files, but if anyone needs it, I believe Lauren and I have a working version that either of us can send you.
How to use RMatchGenes.R:

1. Make sure that you have installed the libraries needed to run the code. I discussed this in a previous post. If you have already installed them there is no need to install them again, unless it has been a while since you first installed them, in which case the libraries may have been updated.
2. Set your working directory in R to the folder that contains the aftexcel.txt and dscrt.txt files. I explained how to do this in an earlier post.
3. Count the number of rows of clinical data in each file that you plan to use. Write these numbers down, and keep track of which file each number belongs to.
4. Copy and paste the RMatchGenes.R code into the R window and press Enter.
5. You will see a prompt that reads "Choose the file/files you want to analyze:" followed by a list of the files found in the directory that you chose. Follow the on-screen directions for entering the file numbers. Make sure to put the file numbers in ascending order.
6. Next you will see a question that reads "How many rows of clinical data are their in each data set ... ?" Use the information that you collected in step 3. Separate the numbers with commas and no spaces. For example, if you have 3 data sets and each data set has 6 rows of clinical data, type 6,6,6. The numbers must be in the same order as the files you chose at the previous prompt.
7. No more user input is required.
8. Once the program finishes you will find files named GSE#####matched.txt in the directory that you specified. These files contain only the clinical data and the information for the genes that all of the files had in common.
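
For anyone curious about the idea behind the matching, here is a minimal sketch. It is not the RMatchGenes.R code itself. The row labels, the clinical-row counts, and the variable names are all made up for illustration, and it assumes that the clinical rows sit at the top of each file with the gene rows below them.

```r
# Hypothetical answer to the clinical-rows prompt: one count per chosen file.
clin_rows <- as.integer(strsplit("2,1,2", ",", fixed = TRUE)[[1]])

# Hypothetical row labels for three files: clinical rows first, then gene rows.
row_labels <- list(
  c("age", "sex", "TP53", "BRCA1", "APOE", "MAPT"),
  c("age", "APOE", "TP53", "PSEN1"),
  c("age", "sex", "MAPT", "APOE", "TP53")
)

# Drop the clinical rows from each file so that only gene names remain.
# Indexing with a logical test also behaves correctly when a count is 0.
gene_sets <- Map(function(labels, n) labels[seq_along(labels) > n],
                 row_labels, clin_rows)

# Intersect across every file in the list, including the last one.
common_genes <- Reduce(intersect, gene_sets)
print(common_genes)
```

Reduce() applies intersect() pairwise down the whole list, so no file is left out of the comparison.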

Good luck,

Efrain Gonzalez

by **efrain.gonzalez0** » Thu Oct 12, 2017 3:08 pm

Good day,

After working with Juan on some problems involving the R code, we noticed that different species seem to have different formats for their GPL files. In Juan's case we were looking at zebrafish GPL files. Because the zebrafish gene IDs contain the # character, we had to use a different method to keep R from treating # as the beginning of a comment. The fix was simple: it only required that the following lines of code be commented out:

Code:
    #Find out if it is a soft GPL file or not
    soft <- strsplit(genena,"[\\|/]") %>% .[[1]] %>% .[length(.)] %>% grepl("soft",.)

One also needed to add in

Code:
    soft <- TRUE

This should be added right after the code that was commented out.

In R a comment is made by adding the "#" character to the front of whatever you want to comment out.
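
As a side note, base R's read.table() also treats # as a comment character by default when it reads a file, which is one way that IDs containing # can get cut off. The sketch below is not the fix that Juan and I used, and the two-row table is made up. It only shows the comment.char argument, which turns that behaviour off.

```r
# Hypothetical tab-separated annotation table whose IDs contain '#'.
gpl_text <- "ID\tGENE_SYMBOL\nzf#001\tshha\nzf#002\ttp53\n"

# comment.char = "" tells read.table() not to treat '#' as a comment marker,
# and quote = "" keeps stray quote characters from being interpreted.
gpl <- read.table(text = gpl_text, header = TRUE, sep = "\t",
                  quote = "", comment.char = "", stringsAsFactors = FALSE)

print(gpl$ID)
```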
The zebrafish GPL files also used GB_ACC as the name of the column that contained the gene names. This required us to add

Code:
    |^GB_ACC$

to the list of possible names.

Edit: Juan and I discovered that GB_ACC was actually an accession name, so we looked for other candidate columns and found that the GPL files were using GENE_SYMBOL and GENE_SYMBOL_LIST. We therefore removed

Code:
    |^GB_ACC$

and added

Code:
    GENE_SYMBOL_LIST|^GENE_SYMBOL$

to the list of possible names. Only one of the GPL files that Juan was using did not contain any gene name information, so we had to remove the GSE that corresponded to that GPL file from his list of studies.
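
To show how a list of possible names like this can be used, here is a small sketch. The pattern and the column names are examples only. The real list in the posted code is longer and may be written differently.

```r
# Hypothetical pattern of acceptable gene-name column headers.
name_pattern <- "^GENE_SYMBOL_LIST$|^GENE_SYMBOL$"

# Hypothetical column headers from two GPL files.
gpl_with_names    <- c("ID", "GB_ACC", "GENE_SYMBOL", "SPOT_ID")
gpl_without_names <- c("ID", "GB_ACC", "SPOT_ID")

# Return the matching column names; an empty result means the GPL file
# has no usable gene-name column.
find_gene_col <- function(columns, pattern) {
  grep(pattern, columns, value = TRUE)
}

found   <- find_gene_col(gpl_with_names, name_pattern)
missing <- find_gene_col(gpl_without_names, name_pattern)

print(found)
print(length(missing))
```

A GPL file that comes back empty, like the second one here, is the kind of file whose GSE had to be dropped from the list of studies.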

Good luck,

Efrain Gonzalez

by **efrain.gonzalez0** » Fri Oct 20, 2017 12:42 pm

vavec001 wrote: Afternoon Efrain,
I am using your code "RAutoCIDs.R" to clean up my dataset on R version 3.4.1, and I have downloaded all the necessary packages. I copied and pasted the code to run it, but the question "Do you want to clean all data files in the directory ...?" did not pop up. I just want to make sure I am copying the correct code. Do I copy the entire code or do it function by function (e.g. #1#Function for handling the changing of row names and column names, #2#Function for reorganizing information within the columns, etc.)? Thank you.
-Vincent

Hello Vincent,

I have answered this question in person, but I wanted to make sure it was answered here in case anyone has the same question in the future. The whole code needs to be copied from GitLab and pasted into the R window.

Respectfully,

Efrain

by **efrain.gonzalez0** » Fri Oct 20, 2017 1:07 pm

Greetings people of the future,

Assuming that my R code for cleaning microarray expression data has not been replaced by better code, you will probably run into an error at some point. Unfortunately, I have not been able to make it error-proof yet. I will try to update this post as much as possible so that it at least covers the common errors and the methods I have used to fix them. Here is a set of steps that you can use to narrow down where an error is occurring:

1. Use the RCleanDscret.R code for this testing process. This code requires you to specify both the GSE file and the GPL file that corresponds to it.
2. Before running the code on a new GSE file, I suggest that you always clear the screen: use Ctrl+L, or in RStudio go to the Edit tab and click on "Clear Console." Also go to the Session tab, click on "Clear Workspace," and check the box that says "Include hidden objects."
3. Now that the screen is clean and all variables are cleared, copy the R code from the very beginning to the line that reads "##Is there a clean version of the GPL file available." Paste the code into the RStudio window and press Enter. This code is not automated, so you will have to provide the GSE file first and then the GPL file that corresponds to the GSE file that you chose. Once the code is done, scroll up and check for any errors. Both errors and warnings appear in red, so make sure that you are looking at an error; R will explicitly tell you whether it is an error or a warning. If there is an error, write down the file name and note that the error occurred in step 3. This means that there is some kind of issue with the data itself.
4. If there were no errors, copy the next part of the code, beginning at the line where the previous step stopped: start at "##Is there a clean version of the GPL file available" and stop when you get to "##Changing the gene ID to gene name." Check for errors as described above. If there was an error, write down the file name and note that it occurred during step 4 of error checking. This most likely means that the error occurred in the handling of the GPL file, and the solution will most likely be adding a new word to the glossary.
5. If there were no errors, copy the next part of the code: start at "##Changing the gene ID to gene name" and stop when you get to "#Now for the discretization part." Check for errors as described above. If there was an error, write down the file name and note that it occurred during step 5 of error checking. This most likely means that the error occurred in the conversion from gene ID to gene name. Errors beyond this point are unlikely.
6. If there were no errors, copy the next part of the code: start at "#Now for the discretization part" and stop when you get to "##Discretized the Data." Check for errors as described above. If there was an error, write down the file name and note that it occurred during step 6 of error checking.
7. If there were no errors, copy the rest of the code: start at "##Discretized the Data" and stop when you get to the end of the R code. Check for errors as described above. If there was an error, write down the file name and note that it occurred during step 7 of error checking.
8. If the code has executed without any errors, look at the files that were produced for anything that seems off. You can open them in Excel, or you can use the commands Fullalzdwr[1:10,1:6], zscraw[1:10,1:6], and Dscrtalzdw[1:10,1:6] to check the aftexcel.txt, zscore.txt, and dscrt.txt files respectively. Write down anything odd that you find in any of these files.
9. Repeat starting from step 2 for a new GSE file.
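
If you would rather have R record which step failed instead of scrolling back through red text, something like the sketch below could help. It is not part of RCleanDscret.R. The helper name run_stage and the three example stages are made up, and you would put your own chunk of code in place of each example expression.

```r
# Run one chunk of code and report whether it finished cleanly,
# raised a warning, or raised an error. Hypothetical helper for illustration.
run_stage <- function(stage, expr) {
  tryCatch({
    force(expr)  # the chunk is evaluated here, inside tryCatch
    list(stage = stage, status = "ok", message = NA_character_)
  },
  warning = function(w) {
    list(stage = stage, status = "warning", message = conditionMessage(w))
  },
  error = function(e) {
    list(stage = stage, status = "error", message = conditionMessage(e))
  })
}

# Made-up stages: one clean, one that warns, one that fails.
res_ok   <- run_stage("step 3", log(10))
res_warn <- run_stage("step 4", as.numeric("abc"))
res_err  <- run_stage("step 5", stop("GPL column not found"))

print(res_ok$status)
print(res_warn$status)
print(res_err$message)
```

Note that a warning stops the chunk early in this sketch, so it is best suited to finding the first problem in each step rather than collecting every message.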

I hope that the above instructions are clear enough. Sometimes finding the errors is a little more complex than this but I think that this is a good method in general.

Good luck,

Efrain Gonzalez

by **efrain.gonzalez0** » Tue Nov 07, 2017 5:23 pm

SEVERE ERROR PLEASE CHECK RESULTS WITH NEW CODE
Good evening all,

With the help of Dr. Park I noticed an error in the matching code. The bottom line is that the last file was being ignored in the matching. I would like all of you who have used the code to rerun your matching with the new code that is posted on GitLab. If you run into any errors, let me know.
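
If you want to double-check your own results, one quick test is whether every matched file ended up with exactly the same set of genes. The sketch below assumes that you have already pulled the gene names out of each matched file into a list. How you do that depends on the file layout, so it is not shown here, and the function name and example data are made up.

```r
# TRUE when every element of gene_lists holds the same set of gene names.
# Hypothetical helper for illustration.
all_share_genes <- function(gene_lists) {
  reference <- sort(unique(gene_lists[[1]]))
  all(vapply(gene_lists,
             function(g) identical(sort(unique(g)), reference),
             logical(1)))
}

# Made-up example of a correct result: all three files agree.
good <- list(c("APOE", "TP53"), c("TP53", "APOE"), c("APOE", "TP53"))

# Made-up example of the bug: the last file was never matched,
# so it still carries a gene that the others do not have.
bad <- list(c("APOE", "TP53"), c("TP53", "APOE"), c("APOE", "TP53", "PSEN1"))

print(all_share_genes(good))
print(all_share_genes(bad))
```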

Respectfully,

Efrain Gonzalez
