SMLG (Statistical Machine Learning Group) Discussion Forum

by **zgong001** » Mon Feb 05, 2018 11:00 am

To clean up an RNA-Seq dataset, follow the steps below. First, download and install these tools and files:

a. HISAT2 (https://ccb.jhu.edu/software/hisat2/manual.shtml; the site has since moved to https://daehwankimlab.github.io/hisat2/)
b. htseq-count (http://htseq.readthedocs.io/en/release_ ... stall.html)
c. The Ensembl human GTF annotation (ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/)

1. Download HISAT2 from the HISAT2 website (https://daehwankimlab.github.io/hisat2/). On the right-hand side of the page there is a “Releases” section where you can download the version you need.

2. Extract it into a directory, for example ~/HISAT2.

3. Download an index from the same site (https://ccb.jhu.edu/software/hisat2/manual.shtml). The “Indexes” section lists many indexes; select one. I use “genome”. Download it and extract it to a local directory, for example ~/grch38.

4. Go to the ~/HISAT2 folder and execute:

Code: Select all: hisat2 -x ~/grch38/genome --sra-acc SRRxxxxx -S sample.sam

~/grch38/genome: the index you extracted (the directory plus the “genome” prefix shared by the index files). Change it to your actual directory.

SRRxxxxx: each patient sample has an SRR number. To find it, go to https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi and enter the dataset's accession in the “GEO accession” box, which takes you to the page for that dataset. Near the bottom of that page there is an “SRA” number. Click it to open a page listing all the patients, click one of them, and you will see the SRR number at the bottom, for example SRR5002298.

sample.sam: the output file name. Change “sample” to whatever you want, but keep the “.sam” extension, because the output is written in SAM format.

After that step you’ll get a .sam file.
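
As a minimal sketch of step 4, the shell function below prints the HISAT2 command for one accession so it can be checked before it is run. The function name hisat2_cmd, the INDEX_PREFIX variable, and naming the output file after the accession are illustrative assumptions, not part of HISAT2; only the -x, --sra-acc, and -S options come from the command above.

```shell
#!/bin/sh
# Assumption: the index was extracted to ~/grch38 with the prefix "genome".
INDEX_PREFIX="${INDEX_PREFIX:-$HOME/grch38/genome}"

# Hypothetical helper: print the HISAT2 command for one SRR run accession.
hisat2_cmd() {
    acc="$1"
    digits="${acc#SRR}"
    # Reject anything that is not "SRR" followed only by digits.
    case "$digits" in
        "$acc"|''|*[!0-9]*)
            echo "not an SRR run accession: $acc" >&2
            return 1 ;;
    esac
    # The output file is named after the accession (an assumed convention).
    echo "hisat2 -x $INDEX_PREFIX --sra-acc $acc -S $acc.sam"
}

hisat2_cmd SRR5002298
```

Once the printed line looks right, it can be pasted into the terminal and run from the HISAT2 folder.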

5. Download and install htseq-count.
Installation instructions are at http://htseq.readthedocs.io/en/release_0.9.1/install.html
The commands below are for Linux. If Python is not installed yet, install it first.
Execute:

Code: Select all: sudo apt-get install build-essential python2.7-dev python-numpy python-matplotlib python-pysam python-htseq

or:

Code: Select all: pip install HTSeq
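
Before moving on, it may help to confirm that the tools can be found. The sketch below is one way to do that; check_tools is a hypothetical helper, and it only finds hisat2 if the HISAT2 folder has been added to PATH (the steps above run it from inside ~/HISAT2 instead, which is also fine).

```shell
#!/bin/sh
# Hypothetical helper: report which of the named commands are not on PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    # Non-zero exit status if at least one tool was not found.
    return "$missing"
}

check_tools hisat2 htseq-count || echo "install the missing tools, or add them to PATH"
```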

After installation of htseq-count, execute:

Code: Select all: python -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_name ~/Desktop/HISAT2/alignedfile.sam ~/Desktop/genome.gtf -o Outputfile.samout > Outputfile_Counts.txt

You need to change the following items:
1. ~/Desktop/HISAT2/alignedfile.sam: the path to your .sam file.
2. Outputfile.samout > Outputfile_Counts.txt: change “Outputfile” to the name you want.

After these steps you will have the Outputfile_Counts.txt file. This is the final cleaned file for one sample (one patient). Repeat the steps for every sample in the dataset; once all the samples are cleaned, you can use the cleaned dataset for analysis.
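
The counts file is a two-column, tab-separated table (feature name, read count). htseq-count also appends summary rows whose names start with two underscores, such as __no_feature and __ambiguous, which are usually dropped before analysis. A minimal sketch of that filtering follows; the file names and the numbers in the stand-in file are made up for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
work=$(mktemp -d)

# Tiny stand-in for Outputfile_Counts.txt (values are invented).
printf 'A1BG\t12\nTP53\t340\n__no_feature\t1500\n__ambiguous\t20\n' > "$work/example_Counts.txt"

# Keep gene rows only; the summary rows all start with two underscores.
grep -v '^__' "$work/example_Counts.txt" > "$work/example_genes.txt"

# Total reads assigned to genes (prints 352 for the stand-in data).
awk -F '\t' '{ total += $2 } END { print total }' "$work/example_genes.txt"
```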

by **zgong001** » Mon Feb 05, 2018 11:12 am

In step 5, in the code: python -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_name ~/Desktop/HISAT2/alignedfile.sam ~/Desktop/genome.gtf -o Outputfile.samout > Outputfile_Counts.txt,

~/Desktop/genome.gtf is the file Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf.gz, or a newer version of it, which you need to download from ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/.
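
The download is gzip-compressed. If you prefer to pass an uncompressed .gtf file to the step 5 command, a sketch like the following could be used; prepare_gtf is a hypothetical helper name, and it keeps the original archive next to the extracted file.

```shell
#!/bin/sh
# Hypothetical helper: decompress a downloaded .gtf.gz next to itself,
# keep the archive, and print the path of the resulting .gtf file.
prepare_gtf() {
    gz="$1"
    gtf="${gz%.gz}"
    if [ ! -f "$gtf" ]; then
        # Write to a temporary name first so a failed run leaves no partial .gtf.
        if gunzip -c "$gz" > "$gtf.tmp"; then
            mv "$gtf.tmp" "$gtf"
        else
            rm -f "$gtf.tmp"
            return 1
        fi
    fi
    echo "$gtf"
}

# Example call (path is an assumption; adjust to where you saved the file):
# prepare_gtf ~/Desktop/Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf.gz
```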

by **DanielTira** » Tue Mar 27, 2018 3:37 pm

An update to the steps above:

1. Follow the step above.
2. Follow the step above.
3. Follow the step above.
4. Parallelize the processing to speed up the creation of your .sam file. (Running it in the background, so that you can close your terminal when working on a remote server, is not working yet; see the edit note at the end of this post.)

Code: Select all: ./hisat2 -p 4 -x ./grch38/genome --sra-acc SRR5282132 -S sample5282132.sam

-p # (where # is a number) sets how many threads HISAT2 uses, so the alignment can run on several CPU cores in parallel. Be considerate on a shared server and try not to use more than 4 CPU cores.

Code: Select all: -p 4 ./grch38
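
One way to stay within that limit on any machine is to cap the thread count at the number of available cores. The sketch below assumes getconf _NPROCESSORS_ONLN is available (it is on common Linux systems); pick_threads is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: choose a value for -p that never exceeds a cap (default 4).
pick_threads() {
    cap="${1:-4}"
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    # Fall back to a single thread if the core count could not be read.
    case "$cores" in ''|*[!0-9]*) cores=1 ;; esac
    if [ "$cores" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$cores"
    fi
}

echo "threads to pass to -p: $(pick_threads 4)"
```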

Where do I get my SRR#####?
Go to the RNA-dataset Excel sheet located here: http://smlg.fiu.edu/phpbb/viewtopic.php?f=84&t=146
Select a GSE##### series.
Visit https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi
Enter your GSE number in the “GEO accession” search box.
For this example, we used GSE95297 and selected the first patient, whose SRR number is SRR5282132.

Update to fix the syntax of the step 5 command:

Code: Select all: python -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_name ./sample5282132.sam ~/Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf -o Outputfile.sam > Outputfile_Counts.txt

I've attached the cleaned .txt output file so you can see an example, and I will be posting more here.

Edit: I've revised the script changes I made. The acceleration (-p) was not the cause of the bug; nohup is somehow causing it. If I figure out a way to run in the background without breaking HISAT2, I'll let you know.

by **khasan** » Tue Dec 21, 2021 2:39 pm

An update to the steps above, if you want to work with multiple SRR files:

1. Follow the step above.
2. Follow the step above.
3. Follow the step above.
4-5. Create an array of SRR numbers and run steps 4 and 5 for each one in a loop:

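Code: Select all:
my_array=(SRR files)   # replace "SRR files" with your SRR numbers, separated by spaces
for i in "${my_array[@]}"; do
    # Step 4: align one run and write its .sam file
    ~/HISAT2/hisat2 -x ~/grch38/genome --sra-acc $i -S ~/${i}_SAM.sam
    # Step 5: count reads per gene for that .sam file
    python3 -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_id ~/${i}_SAM.sam ~/Homo_sapiens.GRCh38.104.gtf -o ~/${i}_SAM_Outputfile.samout > ~/${i}_SAM_Outputfile_Counts.txt
done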

You need to change the following items:
1. ~/${i}_SAM.sam: the path where each .sam file is written.
2. ~/grch38/genome: the index you extracted. Change it to the actual directory.
3. ~/...: change it to the actual directory.
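
If a long batch is interrupted, it can be useful to know which samples still need processing. The sketch below lists the accessions that do not yet have a non-empty counts file, using the ${i}_SAM_Outputfile_Counts.txt naming from the loop above. The helper name pending_accessions and the idea of keeping the accessions in a text file, one per line, are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print accessions from a list file that still lack
# a non-empty counts file in the given output directory.
pending_accessions() {
    list_file="$1"
    out_dir="$2"
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines and comment lines in the list.
        case "$acc" in ''|'#'*) continue ;; esac
        if [ ! -s "$out_dir/${acc}_SAM_Outputfile_Counts.txt" ]; then
            echo "$acc"
        fi
    done < "$list_file"
}

# Example call (both paths are assumptions):
# pending_accessions ~/srr_list.txt ~
```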

To merge multiple samples of a data series, use the following R script:

https://github.com/smlgfiuedu/RNAseq-Data-Clean-Up
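
The linked R script is the recommended route. Purely as an illustration of what merging means for these files, and not a substitute for that script, the sketch below combines several counts files into one table with one column per sample. The function name merge_counts is hypothetical, and it assumes every input is a non-empty, two-column, tab-separated htseq-count output.

```shell
#!/bin/sh
# Hypothetical helper: merge counts files into one tab-separated table.
# Rows are features in first-seen order; columns follow the argument order.
merge_counts() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1 { nfiles++ }            # a new input file starts
        /^__/    { next }                # drop htseq-count summary rows
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
            count[$1, nfiles] = $2
        }
        END {
            for (i = 1; i <= n; i++) {
                line = order[i]
                for (f = 1; f <= nfiles; f++) {
                    # A feature missing from a file is reported as 0.
                    line = line OFS (((order[i], f) in count) ? count[order[i], f] : 0)
                }
                print line
            }
        }' "$@"
}

# Example call (file names are assumptions):
# merge_counts SRR1_SAM_Outputfile_Counts.txt SRR2_SAM_Outputfile_Counts.txt > merged.txt
```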

Here, I have attached the cleaned output files. I will upload more.

GEO RNA-seq Data Cleaning
