by jramo033 » Sun Jun 25, 2017 8:05 pm
Here is a brief overview of the process for going from sequence reads to identifying transcription factor target genes (in our case, NRF1 target genes):
ChIP-seq data require sufficient sequencing depth, that is, enough sequence reads. For mammalian transcription factors (TFs), aim for more than 20 million reads.
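If you want to check the depth of a file you already have, a minimal sketch like the following counts the reads in a FASTQ file. It assumes the standard four lines per read, and the function names and the 20 million cutoff (taken from the sentence above) are only illustrative, not part of any particular tool.

```python
import gzip

# Threshold taken from the post above; adjust it to whatever guideline you follow.
MIN_READS = 20_000_000


def count_fastq_reads(path):
    """Count reads in a FASTQ file (plain or .gz), assuming 4 lines per read."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def has_enough_depth(n_reads, minimum=MIN_READS):
    """Return True if the read count meets the chosen minimum depth."""
    return n_reads >= minimum
```

For example, has_enough_depth(count_fastq_reads("my_chip_sample.fastq.gz")) tells you whether a (hypothetical) file passes the cutoff. Note that this counts raw reads, not reads that map to the genome.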
Once you have the GEO accession number, you can download the publicly available ChIP-seq data set (in this case, DNA sequencing data). Sometimes the authors have posted the files in FASTQ format, but sometimes you will find them in Sequence Read Archive (SRA) format and will need to convert them into FASTQ files.
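One way to do the SRA to FASTQ conversion outside Galaxy is the fasterq-dump program from the NCBI SRA Toolkit. The sketch below only builds and runs that command from Python. It assumes the SRA Toolkit is installed and on your PATH, and that you have already looked up the SRA run accession (the SRR number linked from the GEO sample page). The accession, output folder, and function names here are placeholders for illustration.

```python
import shutil
import subprocess


def fasterq_dump_command(run_accession, outdir="fastq", threads=4):
    """Build the fasterq-dump command line for one SRA run accession."""
    return [
        "fasterq-dump",
        run_accession,
        "--split-files",        # write paired-end mates to separate files
        "--outdir", outdir,     # folder where the FASTQ files are written
        "--threads", str(threads),
    ]


def convert_run(run_accession, outdir="fastq", threads=4):
    """Run fasterq-dump; fails early if the SRA Toolkit is not installed."""
    if shutil.which("fasterq-dump") is None:
        raise RuntimeError("fasterq-dump not found on PATH; install the SRA Toolkit first")
    subprocess.run(fasterq_dump_command(run_accession, outdir, threads), check=True)
```

Calling convert_run("SRR0000000") with a real run accession in place of the placeholder should leave one or two .fastq files in the output folder, depending on whether the run is single-end or paired-end.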
Once you have the FASTQ files, you can use different web servers to process and manipulate the data. One of the most widely used is Galaxy, which integrates different tools for ChIP-seq data analysis. If you are using Galaxy, the first step is to upload the data ("Get Data"). The next step is mapping the reads to the reference genome (in our case, the human genome) using a tool such as Bowtie, which is available through Galaxy. After mapping the reads, the next step is "Peak Calling", which predicts the regions of the genome where the protein (in our case, the transcription factor NRF1) is bound by finding regions with a significant enrichment of mapped reads (peaks). MACS is one of the most widely used tools for peak calling and is also available through Galaxy. The final step is "Peak Annotation", whose goal is to associate the ChIP-seq peaks with functionally relevant genomic regions, such as gene promoters, and come up with a list of target genes. For this final step, different tools are available, such as GREAT.
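To make the idea behind peak annotation concrete, here is a simplified sketch that assigns each peak to the gene with the nearest transcription start site (TSS) and keeps it only if the peak center lies within a chosen distance. This is a plain nearest-TSS rule, not the method GREAT uses, and the gene names, coordinates, and the 1000 bp window are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    tss: int  # transcription start site position


@dataclass(frozen=True)
class Peak:
    chrom: str
    start: int
    end: int

    @property
    def center(self):
        # Midpoint of the peak interval (not the summit reported by a peak caller).
        return (self.start + self.end) // 2


def annotate_peaks(peaks, genes, max_distance=1000):
    """Return gene names whose TSS is the nearest one to a peak center
    and lies within max_distance bp of it (order of first appearance)."""
    genes_by_chrom = {}
    for gene in genes:
        genes_by_chrom.setdefault(gene.chrom, []).append(gene)

    targets = []
    for peak in peaks:
        candidates = genes_by_chrom.get(peak.chrom, [])
        if not candidates:
            continue  # no annotated genes on this chromosome
        nearest = min(candidates, key=lambda g: abs(g.tss - peak.center))
        if abs(nearest.tss - peak.center) <= max_distance and nearest.name not in targets:
            targets.append(nearest.name)
    return targets


# Hypothetical toy data, only to show how the function is called.
toy_genes = [Gene("GENE_A", "chr1", 10000), Gene("GENE_B", "chr1", 50000), Gene("GENE_C", "chr2", 3000)]
toy_peaks = [Peak("chr1", 9800, 10200), Peak("chr1", 30000, 30400), Peak("chr2", 3500, 3900)]
print(annotate_peaks(toy_peaks, toy_genes))
```

In a real analysis the peaks would come from the peak caller's output (for example a BED file from MACS) and the TSS positions from a gene annotation for the same genome build. The window size is a judgment call that depends on how far from the promoter you expect the factor to act.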
Galaxy has tutorials to guide you through the whole process.