GEO RNA-seq Data Cleaning
Posted: Mon Feb 05, 2018 11:00 am
To clean up an RNA-Seq dataset, follow the steps below. You need to download and install these tools:
a. HISAT2 (https://ccb.jhu.edu/software/hisat2/manual.shtml; the site has since moved to https://daehwankimlab.github.io/hisat2/)
b. htseq-count (http://htseq.readthedocs.io/en/release_ ... stall.html)
c. A human GTF annotation file from Ensembl (ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/)
1. Download HISAT2 from the HISAT2 website: [url]https://daehwankimlab.github.io/hisat2/[/url]. On the right-hand side of the page there is a “Releases” section; download the version you need.
2. Extract it to a directory, for example ~/HISAT2.
3. Download an index from the HISAT2 site: [url]https://ccb.jhu.edu/software/hisat2/manual.shtml[/url]. The “Indexes” section lists many indexes; choose one (I use “genome”), download it, and extract it to a local directory, for example ~/grch38.
4. Go to the ~/HISAT2 folder and run the following (use ./hisat2 if the folder is not on your PATH):
Code:
hisat2 -x ~/grch38/genome --sra-acc SRRxxxxx -S sample.sam
~/grch38/genome: the path to the index you extracted, ending in the index file prefix (“genome”). Change it to your actual directory.
SRRxxxxx: the SRR run number of one sample (patient). To find it, go to [url]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi[/url] and enter your dataset's accession in the “GEO accession” box, which opens the page for that dataset. Near the bottom of that page there is an “SRA” number. Click it to open a page listing all the patients, click one of them, and the SRR number appears at the bottom of the page, for example SRR5002298.
sample.sam: the output file name. Change “sample” to whatever you like, but keep the “.sam” extension, since the output is in SAM format.
After this step you will have a .sam file.
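If you have many samples, typing the step 4 command by hand gets tedious. Below is a minimal Python sketch (not part of the original workflow) that builds the same command for a list of SRR numbers. The function names index_files and build_hisat2_cmd are made up for illustration, and the check assumes the index files are named <prefix>.<n>.ht2. It only uses the hisat2 options shown above and does not run anything by itself.

```python
import glob
import os
import re


def index_files(prefix):
    """Return the index files found for a HISAT2 index prefix such as ~/grch38/genome."""
    expanded = os.path.expanduser(prefix)
    # Assumption: index files are named <prefix>.<n>.ht2 (or .ht2l for large indexes).
    return sorted(glob.glob(glob.escape(expanded) + ".*.ht2*"))


def build_hisat2_cmd(index_prefix, srr, out_dir="."):
    """Build the step 4 command for one SRR number; nothing is executed here."""
    if not re.fullmatch(r"SRR\d+", srr):
        raise ValueError("not an SRR run number: %r" % srr)
    # One .sam file per sample, named after its SRR number.
    sam_path = os.path.join(out_dir, srr + ".sam")
    return ["hisat2", "-x", os.path.expanduser(index_prefix),
            "--sra-acc", srr, "-S", sam_path]


if __name__ == "__main__":
    # Example list; replace it with the SRR numbers of your own dataset.
    for srr in ["SRR5002298"]:
        print(" ".join(build_hisat2_cmd("~/grch38/genome", srr)))
```

To actually run a command, pass the returned list to subprocess.run(cmd, check=True). An empty result from index_files("~/grch38/genome") is a hint that the index path is wrong.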
5. Download and install htseq-count.
Installation instructions: [url]http://htseq.readthedocs.io/en/release_0.9.1/install.html[/url].
The commands below are for Linux (they use apt-get, so a Debian or Ubuntu system is assumed). If Python is not installed, install it first.
Run:
Code:
sudo apt-get install build-essential python2.7-dev python-numpy python-matplotlib python-pysam python-htseq
Code:
pip install HTSeq
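A quick way to confirm the installation worked is to check that the HTSeq module can be found by the interpreter you will use for counting. This is only a small sketch and assumes Python 3; the helper name module_available is made up for illustration.

```python
import importlib.util


def module_available(name):
    """Return True if the named top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run this with the same Python that HTSeq was installed into.
    print("HTSeq importable:", module_available("HTSeq"))
```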
After installing htseq-count, run:
Code:
python -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_name ~/Desktop/HISAT2/alignedfile.sam ~/Desktop/genome.gtf -o Outputfile.samout > Outputfile_Counts.txt
You need to change the following items:
1. ~/Desktop/HISAT2/alignedfile.sam: the path to the .sam file produced in step 4.
2. ~/Desktop/genome.gtf: the path to the GTF annotation file downloaded from the Ensembl link (item c).
3. Outputfile.samout > Outputfile_Counts.txt: change “Outputfile” in both names to whatever you like.
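The htseq-count command above can also be driven from a script, which makes the renaming less error-prone. Here is a rough Python sketch under a few assumptions: the function names build_htseq_cmd and run_htseq are made up for illustration, "python" on your PATH is the interpreter HTSeq was installed into, and the options are copied unchanged from the command above.

```python
import os
import subprocess


def build_htseq_cmd(sam_path, gtf_path, samout_path):
    """Build the htseq-count command used above; nothing is executed here."""
    return ["python", "-m", "HTSeq.scripts.count",
            "-m", "intersection-nonempty", "-s", "no", "-i", "gene_name",
            os.path.expanduser(sam_path), os.path.expanduser(gtf_path),
            "-o", samout_path]


def run_htseq(sam_path, gtf_path, prefix):
    """Run htseq-count and write its standard output to <prefix>_Counts.txt."""
    counts_path = prefix + "_Counts.txt"
    cmd = build_htseq_cmd(sam_path, gtf_path, prefix + ".samout")
    with open(counts_path, "w") as out:
        # stdout=out plays the role of the shell's "> Outputfile_Counts.txt".
        subprocess.run(cmd, stdout=out, check=True)
    return counts_path
```

For example, run_htseq("~/Desktop/HISAT2/alignedfile.sam", "~/Desktop/genome.gtf", "Outputfile") would produce the same two output files as the command line version, provided HTSeq is installed.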
After the above steps you will have the Outputfile_Counts.txt file, which is the final cleaned file for one sample (patient). Repeat the steps for every sample in the dataset; once all samples are cleaned, you can use the cleaned dataset for analysis.
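Most analysis tools want one table with a column per sample rather than one file per sample. The following Python sketch shows one way to combine the per-sample files. It is not part of the original workflow and rests on some assumptions: each file is named <sample>_Counts.txt, each line is a feature name and a count separated by a tab, and the summary lines that htseq-count appends (such as __no_feature) start with two underscores. The function names are made up for illustration.

```python
import csv
import os


def read_counts(path):
    """Read one htseq-count output file into {gene: count}, skipping summary rows."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 2 or row[0].startswith("__"):
                continue  # summary rows such as __no_feature are not genes
            counts[row[0]] = int(row[1])
    return counts


def merge_counts(paths):
    """Combine per-sample files into (sample_names, {gene: [count per sample]})."""
    samples = [os.path.basename(p).replace("_Counts.txt", "") for p in paths]
    per_sample = [read_counts(p) for p in paths]
    genes = sorted(set().union(*per_sample))
    # A gene missing from one file is recorded as 0 for that sample.
    matrix = {g: [c.get(g, 0) for c in per_sample] for g in genes}
    return samples, matrix


def write_matrix(samples, matrix, out_path):
    """Write the merged table as tab-separated text with a header row."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene"] + samples)
        for gene in sorted(matrix):
            writer.writerow([gene] + matrix[gene])
```

Because every sample is counted against the same GTF file, the gene lists should normally be identical across files; the zero fill is only a safeguard.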