SMLG (Statistical Machine Learning Group) Discussion Forum

by **jramo033** » Wed Dec 09, 2015 12:10 pm

We finished running Banjo eight times: once each for 1, 2, 3 and 6 hours, and then again for the same durations, so we have two results for each run time. The best BDe score came from the second 6-hour run. Attached is a summary of each of the eight results, including the Markov blanket for the triple-negative breast cancer (TNBC) status variable. We used 800 variables, and the substructures differed between runs. In the attachment we highlighted some genes that appeared in the substructure more than once.

by **jramo033** » Thu Dec 10, 2015 5:40 pm

I was able to run log normalization with the following results:
Total score
Log score -299049.7417 is 0 % of total score
Total score
Log score -298978.0077 is 0 % of total score
Total score
Log score -298771.8834 is 0 % of total score
Total score
Log score -298864.8713 is 0 % of total score
Total score
Log score -298643.4597 is 0 % of total score
Total score
Log score -298729.5521 is 0 % of total score
Total score
Log score -298824.8327 is 0 % of total score
Total score
Log score -298302.5644 is 0 % of total score
Total score
Log score is 100 % of total score

Dr. Yoo: Does this mean that the values are so low that the system is not able to normalize them?
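
One possible explanation can be checked with a small sketch. It assumes (this is not confirmed anywhere in the thread) that the scores are natural logarithms and that each percentage is obtained by exponentiating a log score and dividing by the sum. Under that assumption every term underflows to zero in double precision, whereas subtracting the maximum score first (the usual log-sum-exp trick) keeps the computation stable. The function name `normalize_log_scores` is made up for illustration.

```python
import math

# Log scores copied from the output above.
log_scores = [
    -299049.7417, -298978.0077, -298771.8834, -298864.8713,
    -298643.4597, -298729.5521, -298824.8327, -298302.5644,
]

# Naive approach (assumed): exponentiate each log score directly.
# Values this negative underflow to exactly 0.0 in double precision.
naive = [math.exp(s) for s in log_scores]


def normalize_log_scores(scores):
    """Return each score's share of the total, computed without underflow."""
    top = max(scores)
    # Subtracting the maximum makes the largest term exp(0) = 1.
    shifted = [math.exp(s - top) for s in scores]
    total = sum(shifted)
    return [v / total for v in shifted]


shares = normalize_log_scores(log_scores)
for score, share in zip(log_scores, shares):
    print(f"{score:.4f} -> {100 * share:.6f} %")
```

Because the best score is several hundred log units above the others, the stable version assigns it essentially the whole total. That is a property of the scores themselves, not a sign that normalization failed.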

by **jramo033** » Sun Jan 22, 2017 10:33 pm

We started the analysis of the METABRIC data set. METABRIC contains 1,980 breast cancer cases with clinical variables and expression data for approximately 18,000 genes. Our initial goal was to perform survival analysis, for which we stratified the samples into two groups based on estrogen receptor status: ER-positive and ER-negative tumors. We then stratified each group into four subgroups based on tumor stage at the time of diagnosis (stages 1, 2, 3 and 4). Survival analysis will be performed in SAS, and the results will be posted as soon as we finish this initial step.

by **jramo033** » Tue Jan 24, 2017 12:46 pm

Results of the survival analysis for ER-negative tumors are shown in the attached file. This initial analysis compares patient survival across the four tumor stages at the time of diagnosis. The log-rank and Wilcoxon tests show that at least two of the four patient groups differ significantly from each other. As we expected, the group of patients with stage 3 tumors had the poorest survival rate.
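
For readers who want to see what the log-rank test computes, here is a minimal sketch of the two-group case in plain Python. It is not the SAS procedure used above, which compared four groups at once, and the follow-up times below are hypothetical. The function name `logrank_two_groups` is also made up for illustration.

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square, p-value) with 1 df.

    times_*: follow-up time per patient; events_*: 1 = death, 0 = censored.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk = [row for row in data if row[0] >= t]
        n = len(at_risk)                                  # patients still at risk
        n_a = sum(1 for row in at_risk if row[2] == 0)    # of which in group A
        deaths = [row for row in data if row[0] == t and row[1] == 1]
        d = len(deaths)
        d_a = sum(1 for row in deaths if row[2] == 0)
        observed_a += d_a
        expected_a += d * n_a / n                         # expected deaths in A under H0
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, 1 df
    return chi2, p_value


# Hypothetical follow-up times (months) for two small groups.
early_times, early_events = [2, 3, 4, 5, 6], [1, 1, 1, 1, 1]
late_times, late_events = [8, 9, 10, 11, 12], [1, 1, 1, 1, 0]
chi2, p = logrank_two_groups(early_times, early_events, late_times, late_events)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

With more than two groups the same observed-minus-expected idea applies, but the statistic uses a covariance matrix and k - 1 degrees of freedom.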

by **jramo033** » Mon Jan 30, 2017 7:45 pm

Comparing gene expression levels of NRF1 in breast tumor vs normal tissue (METABRIC data set):

We used a t-test to compare mean NRF1 expression in breast tumors (1,980 samples) versus normal tissue (144 samples). The normal tissue data had been downloaded as Illumina probe IDs, with three different probe IDs for the same gene (NRF1), whereas the NRF1 expression data for the cases had been downloaded through cBioPortal as a single value. When we tested cases against the average of the three microarray probes, there was a statistically significant difference in NRF1 expression. However, we investigated and found that the cBioPortal NRF1 expression values for cases had not been calculated as the average of the three probes, but rather by choosing one of the three NRF1 splice variants. When we tested cases versus normal tissue using the same probe ID, there was no significant difference in NRF1 expression. See the attached file for results; we used SAS for the t-test.
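
The effect of mixing probe definitions can be illustrated with a small sketch. All expression values and probe names below are hypothetical, and the Welch (unequal-variance) form of the t statistic is a choice made here; the post does not say which variant was read from the SAS output.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = statistics.variance(sample_a) / n_a  # squared standard error of mean A
    se2_b = statistics.variance(sample_b) / n_b  # squared standard error of mean B
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical data: three probes for one gene in normal tissue,
# and a single-probe value per tumor sample.
normal_probes = {
    "probe_1": [7.9, 8.1, 8.0, 8.2, 7.8],
    "probe_2": [6.0, 6.2, 5.9, 6.1, 6.3],
    "probe_3": [5.1, 5.0, 5.2, 4.9, 5.3],
}
tumor_probe_1 = [8.0, 8.2, 7.9, 8.1, 8.0]

# Mismatched comparison: tumor probe_1 vs the per-sample average of all three probes.
normal_avg = [statistics.mean(vals) for vals in zip(*normal_probes.values())]
t_mixed, df_mixed = welch_t(tumor_probe_1, normal_avg)

# Like-for-like comparison: the same probe on both sides.
t_same, df_same = welch_t(tumor_probe_1, normal_probes["probe_1"])

print(f"mixed probes: t = {t_mixed:.2f}; same probe: t = {t_same:.2f}")
```

In this toy data the apparent difference comes entirely from averaging in probes with lower signal, which is the kind of artifact described above.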

by **jramo033** » Mon Feb 06, 2017 6:31 pm

The TCGA breast cancer data set was downloaded through the Broad Institute website. It comprises 1,100 tumor samples and 112 normal tissue samples, with expression levels of approximately 20,000 genes measured by RNA-Seq, plus clinical annotations. As an initial check of NRF1 expression, we compared the mean of the tumor samples with upregulated NRF1 (expression greater than the mean + 1 standard deviation; 175 samples) against the 112 normal tissue samples. The attached SAS output shows a statistically significant difference between the two groups. See attachment.
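
The "mean + 1 standard deviation" cutoff can be sketched as follows. The expression values are hypothetical, the function name `upregulated` is made up for illustration, and the use of the sample (n - 1) standard deviation is an assumption, since the post does not say which form was used.

```python
import statistics


def upregulated(expression, n_sd=1.0):
    """Return the values above mean + n_sd * sample standard deviation."""
    cutoff = statistics.mean(expression) + n_sd * statistics.stdev(expression)
    return [x for x in expression if x > cutoff]


# Hypothetical NRF1 expression values for eight tumor samples.
tumor_expr = [9.8, 10.1, 10.0, 12.5, 9.9, 10.2, 13.0, 10.0]
high = upregulated(tumor_expr)
print(high)
```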

by **jramo033** » Sun Feb 19, 2017 9:40 pm

We carried out survival analysis of the METABRIC data set using a Cox proportional hazards model with the following covariates: age, estrogen receptor status, breast surgery, menopausal status, NRF1 and PARK2. This has been the best model so far for showing the effects of NRF1 and PARK2. All coefficients except those for PARK2 and the NRF1-PARK2 interaction term are significant at the 0.05 level; however, the p-values of both of these coefficients are close to 0.05. Results are shown in the attached file.
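
For reference, a model of this form can be written as below. This is a sketch of the general structure only: the symbols and the coding of the categorical covariates are assumptions, since the post does not give the exact parameterization.

```latex
h(t \mid x) = h_0(t)\,\exp\bigl(
    \beta_1\,\mathrm{age}
  + \beta_2\,\mathrm{ER}
  + \beta_3\,\mathrm{surgery}
  + \beta_4\,\mathrm{menopause}
  + \beta_5\,\mathrm{NRF1}
  + \beta_6\,\mathrm{PARK2}
  + \beta_7\,(\mathrm{NRF1}\times\mathrm{PARK2})
\bigr)
```

Here $h_0(t)$ is the unspecified baseline hazard. With the interaction term present, the hazard ratio for a one-unit increase in NRF1 is $\exp(\beta_5 + \beta_7\,\mathrm{PARK2})$, so the effect of NRF1 depends on the PARK2 level.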

by **jramo033** » Tue Jun 13, 2017 11:56 am

Identifying NRF1 target genes is essential to elucidate the mechanisms by which NRF1 is involved in breast cancer. Earlier reports put the number of NRF1-regulated genes at 690 (Cam et al., 2004), whereas a more recent study put the number of NRF1 target genes at 2,470 (Satoh et al., 2013). This week we used published ChIP-seq and ChIP microarray data from MCF7, T47D and HCC1954 breast cancer cells, normal human mammary epithelial cells (HMEC) and normal circulating blood monocytes to identify candidate NRF1 target genes. We downloaded the data sets of identified NRF1 binding peaks and used the Genomic Regions Enrichment of Annotations Tool (GREAT) to identify candidate NRF1 target genes for each of the cell types. A summary of the results will be posted shortly.

by **jramo033** » Sun Jun 25, 2017 8:05 pm

Here is a brief overview of the process for going from sequence reads to transcription factor target genes (in our case, NRF1 target genes):
ChIP-seq data require sufficient sequencing depth. For mammalian transcription factors (TFs), the number of reads should be over 20 million.
Once you have the GEO accession number, you can download the publicly available ChIP-seq data set, which in this case is DNA sequencing data. Sometimes the authors have posted the files in FASTQ format; in other cases they are in Sequence Read Archive (SRA) format and you will need to convert them into FASTQ files.
Once you have the FASTQ files, you can use various web servers to process and manipulate the data. One of the most widely used is Galaxy, which integrates different tools for ChIP-seq data analysis. In Galaxy, the first step is to upload the data ("Get Data").
The next step is mapping the reads to the reference genome (in our case, the human genome) using software such as Bowtie, which is available through Galaxy.
After mapping the reads, the next step is "peak calling", which predicts the regions of the genome where the protein (here, the transcription factor NRF1) is bound by finding regions with a significant enrichment of mapped reads. MACS is one of the most widely used peak callers and is also available through Galaxy.
The final step is "peak annotation", whose goal is to associate the ChIP-seq peaks with functionally relevant genomic regions, such as gene promoters, and to produce a list of genes. Several tools are available for this final step, such as GREAT.

Galaxy has tutorials to guide you through the whole process.
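
To make the peak annotation step concrete, here is a toy sketch that assigns each peak to the nearest transcription start site (TSS) within a fixed window. This is a deliberately simpler rule than the regulatory-domain approach GREAT uses, and the gene names, coordinates, the `annotate_peaks` function and the 5 kb window are all hypothetical.

```python
def annotate_peaks(peaks, genes, max_distance=5000):
    """Assign each peak to the gene with the nearest TSS on the same chromosome.

    peaks: list of (chrom, start, end); genes: list of (name, chrom, tss).
    Peaks farther than max_distance bp from every TSS are left unassigned.
    Returns the sorted list of distinct gene names that received a peak.
    """
    targets = set()
    for chrom, start, end in peaks:
        center = (start + end) // 2  # peak midpoint as a stand-in for the summit
        best = None
        for name, gene_chrom, tss in genes:
            if gene_chrom != chrom:
                continue
            dist = abs(center - tss)
            if dist <= max_distance and (best is None or dist < best[1]):
                best = (name, dist)
        if best is not None:
            targets.add(best[0])
    return sorted(targets)


# Hypothetical gene TSS positions and called peaks.
genes = [("GENE_A", "chr1", 10_000), ("GENE_B", "chr1", 50_000), ("GENE_C", "chr2", 10_000)]
peaks = [("chr1", 9_800, 10_100), ("chr1", 30_000, 30_200), ("chr2", 12_000, 12_400)]

candidate_targets = annotate_peaks(peaks, genes)
print(candidate_targets)
```

The middle peak lies about 20 kb from either chr1 gene, so it is not assigned to any gene. How far a peak may be from a gene and still count is the main design choice in this step, and it strongly affects the size of the resulting target list.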

by **jramo033** » Tue Jul 25, 2017 11:20 am

We completed the computational analysis (read mapping, peak calling and gene annotation) for normal human mammary epithelial cells (HMEC) and the breast cancer cell line HCC1954. Raw data (FASTQ files) were obtained from GEO. We identified 12,194 NRF1 target genes in HMEC and 12,136 in HCC1954. We then compared the two lists of NRF1 target genes and found that 10,911 genes (81%) were common to both cell lines.
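
A quick arithmetic check on the overlap, using only the three counts reported above. The 81% figure matches the shared genes as a fraction of the combined (union) list, that is, a Jaccard index. That this is how it was computed is an inference from the numbers, not something stated in the post. The toy gene sets and the `jaccard` helper are hypothetical.

```python
# Counts reported in the post.
hmec_total = 12_194
hcc1954_total = 12_136
shared = 10_911

union = hmec_total + hcc1954_total - shared      # genes found in either cell line
pct_of_union = 100 * shared / union              # Jaccard index as a percentage
pct_of_hmec = 100 * shared / hmec_total          # share of the HMEC list
pct_of_hcc1954 = 100 * shared / hcc1954_total    # share of the HCC1954 list

print(f"union = {union}")
print(f"shared / union   = {pct_of_union:.1f} %")
print(f"shared / HMEC    = {pct_of_hmec:.1f} %")
print(f"shared / HCC1954 = {pct_of_hcc1954:.1f} %")


def jaccard(set_a, set_b):
    """Shared items divided by items present in either set."""
    return len(set_a & set_b) / len(set_a | set_b)


# The same calculation on two small hypothetical gene sets.
toy = jaccard({"G1", "G2", "G3", "G4"}, {"G2", "G3", "G4", "G5"})
```

Relative to either individual list, the shared genes make up close to 90%, so it helps to state which denominator a reported overlap percentage uses.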

NRF1 and NRF1 target genes and aggressive Breast Cancer
