SMLG (Statistical Machine Learning Group) Discussion Forum

Posted: **Tue Apr 12, 2016 1:34 pm**

These are 3 samples with RNA sequencing data. You will need to combine all the fastq files for each sample before alignment.
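
A minimal sketch of the combining step, assuming uncompressed single-end FASTQ files that are split by lane. The file names (SampleA_L001_R1_001.fastq and so on) are hypothetical, and the tiny stand-in files are created only so the example runs on its own:

```shell
# Work in a scratch directory so nothing real is overwritten
workdir=$(mktemp -d)
cd "$workdir"

# Tiny stand-in FASTQ files (hypothetical names, one 4-line record each)
printf '@r1\nACGT\n+\nIIII\n' > SampleA_L001_R1_001.fastq
printf '@r2\nTTGA\n+\nIIII\n' > SampleA_L002_R1_001.fastq
printf '@r3\nGGCC\n+\nIIII\n' > SampleB_L001_R1_001.fastq

# Concatenate all lane files of each sample into one file per sample
for sample in SampleA SampleB; do
    cat "${sample}"_L00*_R1_001.fastq > "${sample}_combined.fastq"
done

# Each FASTQ record is 4 lines, so line count / 4 = number of reads
wc -l < SampleA_combined.fastq
```

Gzipped files (.fastq.gz) can be combined with cat in the same way, since concatenated gzip streams still form a valid gzip file.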

Posted: **Thu Apr 21, 2016 2:00 pm**

Feltyq wrote:These are 3 samples with RNA sequencing data. You will need to combine all the fastq files for each sample before alignment.

Here are the tools that have been used to analyze the next-generation sequencing reads (RNA or DNA):

HISAT2 (see https://ccb.jhu.edu/software/hisat2/index.shtml)
htseq-count (module from HTSeq 0.6.1p1 see http://www-huber.embl.de/HTSeq/doc/count.html)
Gene Name (GTF file) downloaded from ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz

Posted: **Mon Apr 25, 2016 10:24 am**

cwyoo wrote:
Feltyq wrote:These are 3 samples with RNA sequencing data. You will need to combine all the fastq files for each sample before alignment.

Here are the tools that have been used to analyze the next-generation sequencing reads (RNA or DNA):

HISAT2 (see https://ccb.jhu.edu/software/hisat2/index.shtml)
htseq-count (module from HTSeq 0.6.1p1 see http://www-huber.embl.de/HTSeq/doc/count.html)
Gene Name (GTF file) downloaded from ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz

Attached are the sorted counts.

Posted: **Tue Apr 26, 2016 11:58 pm**

cwyoo wrote:
cwyoo wrote:
Feltyq wrote:These are 3 samples with RNA sequencing data. You will need to combine all the fastq files for each sample before alignment.

Here are the tools that have been used to analyze the next-generation sequencing reads (RNA or DNA):

HISAT2 (see https://ccb.jhu.edu/software/hisat2/index.shtml)
htseq-count (module from HTSeq 0.6.1p1 see http://www-huber.embl.de/HTSeq/doc/count.html)
Gene Name (GTF file) downloaded from ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz

Attached are the sorted counts.

These are the genes sorted by p-value and fold change between the experimental conditions:

Posted: **Tue Feb 14, 2017 11:55 am**

cwyoo wrote:
Here are the tools that have been used to analyze the next-generation sequencing reads (RNA or DNA):

HISAT2 (see https://ccb.jhu.edu/software/hisat2/index.shtml)
htseq-count (module from HTSeq 0.6.1p1 see http://www-huber.embl.de/HTSeq/doc/count.html)
Gene Name (GTF file) downloaded from ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz

For the new analysis, the following versions have been used:
HISAT2 2.0.5 (release 11/4/2016)
htseq-count HTSeq 0.6.1p1
Gene Name (GTF file) downloaded from ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf.gz

Posted: **Thu Feb 16, 2017 4:58 pm**

cwyoo wrote:
cwyoo wrote:
Here are the tools that have been used to analyze the next-generation sequencing reads (RNA or DNA):

HISAT2 (see https://ccb.jhu.edu/software/hisat2/index.shtml)
htseq-count (module from HTSeq 0.6.1p1 see http://www-huber.embl.de/HTSeq/doc/count.html)
Gene Name (GTF file) downloaded from ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz

For the new analysis, the following versions have been used:
HISAT2 2.0.5 (release 11/4/2016)
htseq-count HTSeq 0.6.1p1
Gene Name (GTF file) downloaded from ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf.gz

These are the DNA sequence analysis results.

Posted: **Sat Mar 11, 2017 9:39 pm**

cwyoo wrote:
cwyoo wrote:
cwyoo wrote:
Here are the tools that have been used to analyze the next-generation sequencing reads (RNA or DNA):

HISAT2 (see https://ccb.jhu.edu/software/hisat2/index.shtml)
htseq-count (module from HTSeq 0.6.1p1 see http://www-huber.embl.de/HTSeq/doc/count.html)
Gene Name (GTF file) downloaded from ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz

For the new analysis, the following versions have been used:
HISAT2 2.0.5 (release 11/4/2016)
htseq-count HTSeq 0.6.1p1
Gene Name (GTF file) downloaded from ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf.gz

Using the above settings, these are the ID3chipseq analysis results.

Posted: **Tue Aug 01, 2017 3:30 pm**

If you saw my posts in the Alzheimer's forum, you know that I have been able to convert .sra files to .fastq using the SRA Toolkit. The commands are as follows:

Check if a path exists to the .sra file in question:
Code: Select all
./srapath SRR######
Read in and convert the .sra file to a .fastq file:
Code: Select all
./fastq-dump SRR######

Here SRR###### stands for the SRA accession number of the file. Recently I have been working with the HISAT2 alignment tool, and I want to make sure that everyone understands the commands that I used, so I am posting them here:

After installing HISAT2, the first step was to set up the index that it requires. You can build an index with the hisat2-build command; an example that builds one from several FASTA files is included below. The files were obtained from the Ensembl website http://useast.ensembl.org/info/data/ftp/index.html. You must also download the GTF file, which you will need later in the process.
- Code: Select all
  ./hisat2-build `ls *fasta | awk '{printf("%s,",$1)}' | sed -e 's/,$//'` HT2_IDX
  
  Here fasta represents the folder that contains the FASTA files that you downloaded from the Ensembl website and then unzipped.
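
As an alternative sketch for building the comma-separated list that hisat2-build takes as its first argument, paste can join the file names directly. The file names below are made-up stand-ins, and the hisat2-build call is left commented out so that the example runs without HISAT2 installed:

```shell
# Scratch directory with stand-in FASTA files (hypothetical names)
workdir=$(mktemp -d)
cd "$workdir"
for chrom in 1 2 X; do
    printf '>%s\nACGT\n' "$chrom" > "chr${chrom}.fasta"
done

# paste -s merges all input lines into one; -d, uses a comma as the separator
fasta_list=$(ls *.fasta | paste -sd, -)
echo "$fasta_list"

# The list would then be handed to hisat2-build, for example:
# ./hisat2-build "$fasta_list" HT2_IDX
```

Unlike the awk/sed version, this leaves no trailing comma to strip.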
However, my reading suggested that it is better to use pre-built indexes. These are linked on the right-hand side of the HISAT2 website https://ccb.jhu.edu/software/hisat2/manual.shtml. I downloaded the one named grch38 and used the make script that came with it to build a fresh copy of the index. Although I found that the grch38_snp_tran index can give a better overall alignment percentage, we cannot build it on any of our servers because it requires 200GB of RAM.
With the index built, I aligned the data using the following command:
Code: Select all
./hisat2 -x ./grch38/genome -U SRR######.fastq -S alignedfile.sam

The command for htseq-count is as follows:

Code: Select all
python -m HTSeq.scripts.count -m intersection-nonempty -s no -i gene_name ~/Desktop/HISAT2/alignedfile.sam ~/Desktop/genome.gtf -o Outputfile.samout > Outputfile_Counts.txt

Posted: **Wed Sep 27, 2017 2:20 pm**

Hello everyone,

I am posting links that describe how to obtain access to controlled data on the GDC Portal.
https://gdc.cancer.gov/access-data/obtaining-access-controlled-data
https://gdc.cancer.gov/access-data/obtaining-access-controlled-data/registering-and-working-era-commons-and-dbgap

Posted: **Fri Oct 20, 2017 11:58 am**

Hello everyone,

I am posting some of the results of the HISAT2 and Bowtie2 analysis of Professor Roy's data. I will also compare the outputs for particular genes to see where the alignment results differ. To do this I will use the HISAT2 commands that I posted earlier, along with the Bowtie2 commands included in this post, and then run htseq-count on both the HISAT2 and Bowtie2 files to assign reads to genes. I will also describe the initial problem that I faced when using Bowtie2 with htseq-count and how to resolve it.

The command to run Bowtie2 is similar to the one for HISAT2. All of the commands were run from within the Bowtie2 directory. The command looked like this:
Code: Select all
./bowtie2 -x ./Bowtie2Index/genome -U ~/Sample-A-Input_S43_L007_R1_001.fastq -S BowTie_Sample_A_Input2.sam
The htseq-count command was the same one that I used for the HISAT2 files. After obtaining the htseq-count outputs for both the Bowtie2 and HISAT2 files, we wanted to compare the outputs for certain genes (NRF1, APOE, etc.). I did so with the following commands:
1. This command extracts the lines for a given gene from any htseq-count SAM output file and writes them to a new file:
  Code: Select all
  grep GENE_NAME htseq_count_output.sam > gene_info.txt
  
  Ex:
  Code: Select all
  grep APOE COUNTDrRoy_Bowtie.sam > DrRoy_APOE_Bowtie.txt
2. This command sorts the file above in ascending numerical order by the starting position of each read (column four of the file) and writes the result to a new file:
  Code: Select all
  sort -t$'\t' -k4,4g gene_info.txt > sorted_gene_info.txt
  
  Ex:
  Code: Select all
  sort -t$'\t' -k4,4g DrRoy_APOE_Bowtie.txt > sort_DrRoy_APOE_Bowtie.txt
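
The two steps above can also be combined into one pipeline. The sketch below assumes that the htseq-count SAM output carries the assigned gene in a trailing XF:Z: tag, and it matches that tag exactly so that a gene whose name merely contains the search string is not picked up. The records and the gene name APOEX are made up for illustration:

```shell
# Scratch directory with a tiny made-up SAM-like file (12 tab-separated columns)
workdir=$(mktemp -d)
cd "$workdir"
printf 'r1\t0\t19\t44906700\t60\t4M\t*\t0\t0\tACGT\tIIII\tXF:Z:APOE\n'   > counts_out.sam
printf 'r2\t0\t19\t44905800\t60\t4M\t*\t0\t0\tACGT\tIIII\tXF:Z:APOE\n'  >> counts_out.sam
printf 'r3\t0\t19\t44916000\t60\t4M\t*\t0\t0\tACGT\tIIII\tXF:Z:APOEX\n' >> counts_out.sam

gene=APOE
tab=$(printf '\t')

# Keep only records whose last field is exactly XF:Z:<gene>,
# then sort numerically on column 4 (the start position)
awk -F'\t' -v tag="XF:Z:${gene}" '$NF == tag' counts_out.sam \
    | sort -t "$tab" -k4,4n > "sorted_${gene}.txt"

# Show read name and start position of the sorted records
cut -f1,4 "sorted_${gene}.txt"
```

A plain grep APOE would also have kept the APOEX record here, which is why the sketch compares the whole tag.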
The problem: htseq-count reported a count of 0 for every gene.
The cause: This often happens when the index used in Bowtie2/HISAT2 does not match the reference annotation used in htseq-count.
The solution: Make sure that the index you use in HISAT2/Bowtie2 comes from the same source as the reference you will use in htseq-count.
Ex: In my case I was using an NCBI index for my Bowtie2 analysis, but my reference for htseq-count was from Ensembl. After switching to an index created by Ensembl, I no longer obtained only zeros in my htseq-count output. You can obtain several different indexes from Illumina's iGenomes collection at the following url: https://support.illumina.com/sequencing/sequencing_software/igenome.html
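
One common way such a mismatch shows up is in the chromosome names (for example chr1 in the index versus 1 in the GTF). A quick check, sketched here under that assumption, is to compare the sequence names in the SAM header with those in column 1 of the GTF. The two files below are made-up stand-ins, not the real data:

```shell
# Scratch directory with stand-in files: "chr"-prefixed SAM header, bare names in GTF
workdir=$(mktemp -d)
cd "$workdir"
printf '@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr19\tLN:1000\n' > aligned.sam
printf '1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "G1";\n'   > genes.gtf
printf '19\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "G2";\n' >> genes.gtf

# Sequence names known to the aligner index (SN: field of the @SQ header lines)
grep '^@SQ' aligned.sam | tr '\t' '\n' | grep '^SN:' | sed 's/^SN://' | sort -u > sam_names.txt

# Sequence names used by the annotation (column 1, skipping comment lines)
grep -v '^#' genes.gtf | cut -f1 | sort -u > gtf_names.txt

# Count names present in both lists; zero means no read can overlap any feature
shared=$(comm -12 sam_names.txt gtf_names.txt | wc -l | tr -d ' ')
echo "shared sequence names: $shared"
```

If the count is zero for real files, the index and the annotation use different naming schemes, and htseq-count cannot assign any read to a gene.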
Here are some references that may help you resolve this issue in the future:
