GEO RNA-seq Data Cleaning

Postby zgong001 » Mon Feb 05, 2018 11:00 am

To clean up an RNA-seq dataset, download and install the following tools and files, then follow the steps below:

a. HISAT2 (https://ccb.jhu.edu/software/hisat2/manual.shtml), updated to https://daehwankimlab.github.io/hisat2/
b. htseq-count (http://htseq.readthedocs.io/en/release_ ... stall.html)
c. Ensembl GTF annotation (ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/)



1. Download HISAT2 from the HISAT2 website: https://daehwankimlab.github.io/hisat2/. On the right-hand side of the page there is a “Releases” section; download the version you need.

2. Extract it into a directory, for example: ~/HISAT2

3. Download an index from the HISAT2 site: https://ccb.jhu.edu/software/hisat2/manual.shtml. The “Indexes” section lists many indexes; select one (I use “genome”). Then download it and extract it to a local directory, for example: ~/grch38.

4. Go to the folder ~/HISAT2 and execute:
Code: Select all
./hisat2 -x ~/grch38/genome --sra-acc SRRxxxxx -S sample.sam


~/grch38/genome: the index you extracted. Change it to your actual directory.

SRRxxxxx: each patient sample has an SRR number. To find it, go to https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi and enter your dataset in the “GEO accession” box. This takes you to the page for that dataset. Near the bottom there is an “SRA” number; click it to open a page that lists all the patients. Click one, and you will see its SRR number at the bottom, for example: SRR5002298.

sample.sam: the output file name. Change “sample” to whatever you want, but keep the “.sam” extension.

After this step you will have a .sam file.
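
As a rough sketch of step 4, the command can also be assembled in Python so that the accession and the output name always match. The helper name build_hisat2_command, its default paths and the threads parameter are illustrative assumptions; the flags -x, --sra-acc and -S come from the command above, and -p is the thread option used in a later reply in this thread.

```python
import os
import shlex


def build_hisat2_command(accession, index_prefix="~/grch38/genome",
                         hisat2="~/HISAT2/hisat2", threads=1):
    """Return the HISAT2 argument list for one SRA run (hypothetical helper)."""
    # Assumption: run accessions look like "SRR" followed by digits.
    if not accession.startswith("SRR") or not accession[3:].isdigit():
        raise ValueError(f"expected an accession like SRR5002298, got {accession!r}")
    sam_path = f"{accession}.sam"  # keep the .sam extension, as in step 4
    return [
        os.path.expanduser(hisat2),
        "-p", str(threads),
        "-x", os.path.expanduser(index_prefix),
        "--sra-acc", accession,
        "-S", sam_path,
    ]


if __name__ == "__main__":
    cmd = build_hisat2_command("SRR5002298")
    # Print the command so it can be checked before anything is run.
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Printing the command first is a deliberate choice: an alignment can run for a long time, so it is cheaper to catch a wrong path before starting it.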

5. Download and install htseq-count.
Go to the website: http://htseq.readthedocs.io/en/release_0.9.1/install.html.
htseq-count is run on Linux here. If Python is not installed yet, install it first.
Execute:
Code: Select all
sudo apt-get install build-essential python2.7-dev python-numpy python-matplotlib python-pysam python-htseq
or:
Code: Select all
pip install HTSeq


After installing htseq-count, execute:
Code: Select all
python -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_name ~/Desktop/HISAT2/alignedfile.sam ~/Desktop/genome.gtf -o Outputfile.samout > Outputfile_Counts.txt


You need to change the following items:
1. ~/Desktop/HISAT2/alignedfile.sam: the path to your .sam file.
2. Outputfile.samout > Outputfile_Counts.txt: change “Outputfile” to the file name you want.


After the above steps you will have the Outputfile_Counts.txt file. This is the final cleaned file for one sample (one patient). Repeat the steps for each sample; once all the samples in a dataset are cleaned, you can use the cleaned dataset for analysis.
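
To check a counts file before analysis, it can help to separate the per-gene rows from the summary rows that htseq-count appends at the end. The sketch below assumes a tab-separated file with the name in the first column and the count in the last, and assumes the summary rows are the ones whose name starts with "__" (for example __no_feature); the function name read_htseq_counts is made up for illustration.

```python
import csv


def read_htseq_counts(path):
    """Split an htseq-count table into gene counts and summary counters."""
    genes, summary = {}, {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            name, count = row[0], int(row[-1])
            # Assumption: summary rows start with "__" (e.g. __no_feature).
            target = summary if name.startswith("__") else genes
            target[name] = count
    return genes, summary
```

Calling read_htseq_counts("Outputfile_Counts.txt") returns two dictionaries; a very large __no_feature value relative to the gene counts would be a hint that the GTF file or the -i attribute does not match the index used for alignment.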
Last edited by zgong001 on Tue Sep 29, 2020 8:29 am, edited 1 time in total.

Re: GEO RNA-seq Data Cleaning

Postby zgong001 » Mon Feb 05, 2018 11:12 am

In step 5, in the command python -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_name ~/Desktop/HISAT2/alignedfile.sam ~/Desktop/genome.gtf -o Outputfile.samout > Outputfile_Counts.txt,

~/Desktop/genome.gtf is the file Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf.gz (or a newer version of it), which you need to download from ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/.

Re: GEO RNA-seq Data Cleaning

Postby DanielTira » Tue Mar 27, 2018 3:37 pm

An update to the steps above:

1. Follow above.
2. Follow above.
3. Follow above.
4. Parallelize your processing to speed up the creation of your .sam file. Running it in the background, so that you can close your terminal when working on a remote server, is not part of the command below; see the edit note at the end of this post.
Code: Select all
./hisat2 -p 4 -x ./grch38/genome --sra-acc SRR5282132 -S sample5282132.sam


-p # sets the number of threads HISAT2 uses, which lets the alignment run on more CPU cores. Be considerate and try not to use more than 4 CPU cores.
Code: Select all
 -p 4 -x ./grch38/genome


Where do I get my SRR#####?
Go to the RNA-dataset Excel sheet located at http://smlg.fiu.edu/phpbb/viewtopic.php?f=84&t=146
Select a GSE##### series.
Visit https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi
Enter your GSE number in the GEO accession search.
For this example, we used GSE95297 and selected the first patient, whose run is SRR5282132.
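
The lookup above can be shortened by opening the series page directly, since the GEO accession page accepts the series number as an acc query parameter. This is a small sketch; the function name geo_series_url and the format check are assumptions added for illustration.

```python
from urllib.parse import urlencode

GEO_QUERY = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def geo_series_url(gse):
    """Return the GEO page URL for a series number such as GSE95297."""
    # Assumption: series numbers look like "GSE" followed by digits.
    if not (gse.startswith("GSE") and gse[3:].isdigit()):
        raise ValueError(f"expected a series like GSE95297, got {gse!r}")
    return f"{GEO_QUERY}?{urlencode({'acc': gse})}"


if __name__ == "__main__":
    print(geo_series_url("GSE95297"))
```

From that page, the SRA link and the individual SRR numbers still have to be followed by hand as described above.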


Update to fix the syntax of the step 5 command:

Code: Select all
python -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_name ./sample5282132.sam ~/Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf -o Outputfile.sam > Outputfile_Counts.txt


I've attached the cleaned .txt output file so you can see an example, and I will be posting more here.

Edit: I've revised the script changes I made. The acceleration (-p) was not the cause of the bug; nohup is somehow causing it. If I figure out a way to run in the background without breaking HISAT2, I'll let you know.
Attachments
Outputfile_Counts.txt
(617.77 KiB)

Re: GEO RNA-seq Data Cleaning

Postby khasan » Tue Dec 21, 2021 2:39 pm

An update to the steps above, if you want to work with multiple SRR files:

1. Follow above.
2. Follow above.
3. Follow above.
Then create an array of SRR accessions and loop over it:
Code: Select all
# Replace the placeholders with your own SRR accessions, separated by spaces
my_array=(SRRxxxxx SRRyyyyy)
for i in "${my_array[@]}"
do
   # Align one sample, then count its reads per gene
   ~/HISAT2/hisat2 -x ~/grch38/genome --sra-acc ${i} -S ~/${i}_SAM.sam
   python3 -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_id ~/${i}_SAM.sam ~/Homo_sapiens.GRCh38.104.gtf -o ~/${i}_SAM_Outputfile.samout > ~/${i}_SAM_Outputfile_Counts.txt
done


You need to change the following items:
1. ~/${i}_SAM.sam: the path where each .sam file is written.
2. ~/grch38/genome: the index you extracted; change it to the actual directory.
3. ~/...: change every other path to the actual directory.
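
Because a loop over many samples can take a long time, it may be useful to work out which accessions still need processing before restarting it. The sketch below is only an illustration: the function name pending_accessions is made up, the file name pattern is copied from the loop above, and treating any non-empty counts file as finished is an assumption.

```python
from pathlib import Path


def pending_accessions(accessions, out_dir):
    """Return the accessions whose counts file is missing or empty in out_dir."""
    out_dir = Path(out_dir).expanduser()
    todo = []
    for acc in accessions:
        # Same naming pattern as the shell loop's redirected output.
        counts = out_dir / f"{acc}_SAM_Outputfile_Counts.txt"
        # Assumption: a non-empty counts file means the sample finished.
        if not counts.is_file() or counts.stat().st_size == 0:
            todo.append(acc)
    return todo
```

The returned list can then be used in place of the full array when the loop is rerun. The empty-file check matters because the shell redirection creates the counts file as soon as the command starts.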

To merge multiple samples of a data series, use the following R script:

https://github.com/smlgfiuedu/RNAseq-Data-Clean-Up
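
For readers who prefer Python, the general idea of merging can be sketched as below. This is not the R script from the repository and does not claim to reproduce its output; the function names merge_counts and write_table are hypothetical, the input is assumed to be one dictionary of gene counts per sample, and filling genes missing from a sample with 0 is an assumption.

```python
import csv


def merge_counts(sample_counts):
    """Combine {sample: {gene: count}} into a gene-by-sample table."""
    samples = sorted(sample_counts)
    all_genes = sorted(set().union(*(c.keys() for c in sample_counts.values())))
    header = ["gene"] + samples
    # Assumption: a gene absent from a sample is reported as 0.
    rows = [[g] + [sample_counts[s].get(g, 0) for s in samples] for g in all_genes]
    return header, rows


def write_table(path, header, rows):
    """Write the merged table as a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

Sorting both samples and genes keeps the output stable from run to run, which makes it easier to compare merged files.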

Here, I have attached the cleaned output files. I will upload more.
Attachments
GSE139233_gene_info.csv
(3.54 MiB)
GSE134567_gene_info.csv
(4.34 MiB)
GSE132825_gene_info.csv
(2.21 MiB)
GSE132172_gene_info.csv
(24.83 MiB)
GSE139448_gene_info.csv
(2.25 MiB)

