by cpere117 » Fri Nov 02, 2018 9:00 pm
This week, for my class project, I added an RNA-Seq dataset from GEO to the table of Alzheimer's datasets posted in my proposal. The study's GEO accession number is GSE104704, and it is titled "Dysregulation of the epigenetic landscape of normal aging in Alzheimer's disease [RNA-Seq]." I discovered this dataset recently and was happy to see that it provides raw single-strand RNA FASTQ files that I could load into the Galaxy bioinformatics platform. The dataset has 30 samples divided by age and disease status (AD or no AD): 8 young healthy brains (Young), 10 aged healthy brains (Old), and 12 aged diseased brains (AD Old), with disease status determined by the neuropathological presence of Lewy bodies, amyloid beta plaques, and neurofibrillary tangles.

After uploading the raw files for all post-mortem brain tissue samples into Galaxy, I moved on to the quality-control stage of the RNA-Seq workflow. I ran FastQC to check every sample for contamination and GC bias. FastQC flagged overrepresented sequences, so I used Trim Galore, a program that trims off unwanted sequence, typically adapter or primer sequence left over from the sequencing platform's library preparation (e.g., Illumina). Trimming fixed the problem, and I went on to align the reads with Bowtie2. After alignment, I used featureCounts, a tool available on Galaxy, to compile the read counts detected against the hg38 reference genome. Next, I sorted the 30 samples into their groups by age and disease status and ran RNA-Seq differential gene expression analyses for three contrasts: Young vs. Old, Young vs. AD Old, and Old vs. AD Old.
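For anyone who wants to check the grouping, here is a minimal Python sketch of how the 30 samples split into three groups and which pairwise contrasts follow from them. The group sizes come from the description above; the sample labels (e.g. `Young_1`) and function names are hypothetical placeholders, not the actual GEO sample names or anything Galaxy produces.

```python
from itertools import combinations

# Group sizes as described in the post: 8 Young, 10 Old, 12 AD Old.
group_sizes = {"Young": 8, "Old": 10, "AD_Old": 12}


def build_sample_sheet(sizes):
    """Return (sample_label, group) pairs; the labels are hypothetical."""
    sheet = []
    for group, n in sizes.items():
        for i in range(1, n + 1):
            sheet.append((f"{group}_{i}", group))
    return sheet


def pairwise_contrasts(groups):
    """Return every unordered pair of groups, in the order given."""
    return list(combinations(groups, 2))


sheet = build_sample_sheet(group_sizes)
contrasts = pairwise_contrasts(list(group_sizes))

print(len(sheet), "samples")
for a, b in contrasts:
    print(f"{a} vs {b}")
```

A sample sheet like this is what I would feed as the factor/level assignment when setting up each comparison in the differential expression tools.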
Using both DESeq2 and edgeR so that I could cross-compare the two programs' results, I produced output tables with logFC, p-values, mean normalized counts, and gene identifiers (Entrez ID numbers). I've attached some of my output to this post for your reference. Notice that the volcano plot produced by edgeR shows many more significant genes for Young versus AD Old than for the other group contrasts. I've also downloaded the Allen database RNA-Seq data and color-coded it by gender and disease status (Normal, AD, Dementia, Multiple Etiologies, Vascular Dementia, and other); the color-coded file is attached. Next week I plan to have results for the Allen data, if possible, so that they can be discretized for BANJO, and to make use of the results from the new GSE study described in this post.
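The cross-comparison between the two programs can be sketched roughly as follows. This is only an illustration: the rows are made-up toy values (not taken from my attached results), the Entrez-style IDs are placeholders, and the cutoffs (`p_cutoff=0.05`, `lfc_cutoff=1.0`) are assumptions rather than the settings used in Galaxy.

```python
def significant_genes(rows, p_cutoff=0.05, lfc_cutoff=1.0):
    """Return the set of gene IDs passing both cutoffs.

    rows: iterable of (gene_id, log_fc, p_value) tuples.
    The cutoffs are illustrative assumptions, not the ones used in the post.
    """
    return {
        gene_id
        for gene_id, log_fc, p_value in rows
        if p_value < p_cutoff and abs(log_fc) >= lfc_cutoff
    }


# Toy rows (gene_id, logFC, p-value); values are invented for illustration.
deseq2_rows = [
    ("1001", 2.1, 0.001),
    ("1002", -0.3, 0.40),
    ("1003", -1.5, 0.01),
    ("1004", 1.2, 0.03),
]
edger_rows = [
    ("1001", 1.9, 0.002),
    ("1002", -0.2, 0.55),
    ("1003", -1.4, 0.20),
    ("1004", 1.3, 0.01),
]

deseq2_sig = significant_genes(deseq2_rows)
edger_sig = significant_genes(edger_rows)

shared = deseq2_sig & edger_sig        # called significant by both programs
only_deseq2 = deseq2_sig - edger_sig   # significant in DESeq2 only
only_edger = edger_sig - deseq2_sig    # significant in edgeR only

print("shared:", sorted(shared))
print("DESeq2 only:", sorted(only_deseq2))
print("edgeR only:", sorted(only_edger))
```

Genes that land in the shared set are the ones I would trust most, since they do not depend on which program's statistical model was used.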
Attachments:
- CPEREZ_ProbabilisticGraphical Modeling Proposal (1).docx (26.74 KiB)
- mdplot_ADOld-Young.pdf (196.76 KiB)
- bcvplot.pdf (213.72 KiB)
- EDGE R Table for significant DEG counts.PNG (6.5 KiB)
- Edger young versus AD Old DEG Results.xlsx (1.93 MiB)
- Galaxy310-[DESeq2_plots_on_data_307,_data_305,_and_others].pdf (1.05 MiB): plots showing the PCA analysis, correlation matrix, and p-value frequency histogram for the data
- DESEQ2 Normalized Counts for all samples.xlsx (9.48 MiB)
- DESEQ2 DEG Results for RNA YOUNG VERSUS RNA AD OLD.xlsx (2.33 MiB)