This week, at Dr. Yoo's suggestion, I constructed Bayesian networks to compare probabilistic gene-regulatory structures across three groups in the GSE222494 frontal cortex single-nucleus dataset: Controls, Sporadic Alzheimer’s disease (AD), and PSEN1-E280A familial AD. I began by preprocessing the dataset (33,525 genes × 43,744 nuclei) and harmonizing the metadata containing diagnostic labels. To select differentially expressed genes (DEGs), I first tried the |log₂FC| ≥ 1 threshold that Dr. Yoo recommended, but it returned zero DEGs for the E280A group and only five for Sporadic, likely due to the noisiness of single-nucleus expression and the averaging of fold changes over the very large number of nuclei in each subgroup. To retain enough biological signal for network learning, I relaxed the threshold to the more lenient |log₂FC| ≥ 0.5, which yielded:
• 8 DEGs for E280A vs. Control,
• 22 DEGs for Sporadic vs. Control,
with a union of 25 unique genes when combining both disease groups.
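The thresholding and union step can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `select_degs`, the gene names, and the fold-change values are all hypothetical placeholders rather than results from GSE222494.

```python
def select_degs(log2fc_by_gene, threshold=0.5):
    """Return the set of genes whose absolute log2 fold change meets the threshold."""
    return {gene for gene, lfc in log2fc_by_gene.items() if abs(lfc) >= threshold}


# Placeholder values for illustration only (not taken from the dataset).
e280a_vs_control = {"GENE_A": 0.62, "GENE_B": -0.55, "GENE_C": 0.10}
sporadic_vs_control = {"GENE_A": 0.71, "GENE_C": -0.80, "GENE_D": 0.30}

e280a_degs = select_degs(e280a_vs_control)
sporadic_degs = select_degs(sporadic_vs_control)

# Genes that pass in both contrasts are counted once in the union.
deg_union = e280a_degs | sporadic_degs
print(sorted(deg_union))
```

By the same set logic, a union of 25 genes from lists of 8 and 22 means that 5 genes are shared between the two contrasts (8 + 22 − 25 = 5).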
To ensure biological relevance, I added the four transcription factors central to my dissertation project, NRF1, NFE2L1, NFE2L2, and GABPA, and constructed a final 29-gene feature panel. These genes were extracted from the normalized count matrix and discretized into three quantile-based expression bins (0, 1, 2) for compatibility with discrete Bayesian methods.
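One way the three-bin quantile discretization could be done is sketched below. The function `tercile_bins` is a hypothetical helper, and the rank-based tie-breaking is an assumption on my part: single-nucleus data contain many identical values (especially zeros), so quantile cut points can coincide, and ranking by value and then by position is one simple way to keep all three bins populated.

```python
def tercile_bins(values):
    """Assign each value to bin 0, 1, or 2 by rank so the bins are near-equal in size.

    Ties are broken by original position (an arbitrary choice), so identical
    values such as zeros may be spread over adjacent bins.
    """
    n = len(values)
    # Indices sorted by value, with the index itself as the tie-breaker.
    order = sorted(range(n), key=lambda i: (values[i], i))
    bins = [0] * n
    for rank, idx in enumerate(order):
        bins[idx] = 3 * rank // n  # rank in [0, n) maps to 0, 1, or 2
    return bins


# Toy normalized expression values for one gene across six nuclei.
expression = [0.0, 0.0, 1.2, 0.4, 2.5, 0.0]
print(tercile_bins(expression))  # [0, 0, 2, 1, 2, 1]
```

An alternative that keeps tied values together would be to reserve bin 0 for zeros and split only the nonzero values, at the cost of unequal bin sizes.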
The dataset was then split by diagnosis into Control (n = 11,745), Sporadic (n = 14,448), and E280A (n = 17,551) nuclei. For each group, I learned a Bayesian network structure using Hill-Climb Search with BIC (Bayesian Information Criterion) scoring. BIC rewards networks that explain the data well while penalizing overly complex graphs, so the search favors the simplest structure that fits each group's data.
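To make the scoring concrete, here is a minimal sketch of a BIC score for a discrete Bayesian network, assuming the convention of log-likelihood minus a penalty of 0.5 · log(N) per free parameter (the convention under which scores are negative and higher is better). The functions `local_bic` and `network_bic` and the toy data are hypothetical. The search itself is not shown: a hill-climb repeatedly applies the single-edge addition, removal, or reversal that most improves this score until no change helps.

```python
import math
from collections import Counter


def local_bic(data, node, parents):
    """BIC contribution of one node given its parents (higher is better).

    data: list of dicts mapping variable name -> discrete state.
    """
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[node]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)

    # Maximized log-likelihood: sum of N_jk * log(N_jk / N_j) over observed cells.
    log_lik = sum(c * math.log(c / parent_counts[cfg]) for (cfg, _), c in joint.items())

    # Free parameters: (states of node - 1) for each parent configuration.
    r = len({row[node] for row in data})
    q = math.prod(len({row[p] for row in data}) for p in parents)
    return log_lik - 0.5 * math.log(n) * q * (r - 1)


def network_bic(data, structure):
    """Sum local scores over all nodes; structure maps node -> list of parents."""
    return sum(local_bic(data, node, parents) for node, parents in structure.items())


# Toy data in which B always copies A, so the edge A -> B should be rewarded.
toy = [{"A": a, "B": a} for a in (0, 1, 2)] * 30
with_edge = network_bic(toy, {"A": [], "B": ["A"]})
without_edge = network_bic(toy, {"A": [], "B": []})
print(with_edge > without_edge)  # True
```

Because the log-likelihood term is a sum over rows, the magnitude of the score grows with the number of nuclei, independent of how complex the graph is.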
The final BIC values were:
• Control: −224,441.54
• Sporadic: −207,931.54
• E280A: −342,939.69
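The E280A network has the most negative BIC, but this value should be interpreted cautiously. BIC is built from a log-likelihood summed over nuclei, so its magnitude grows with sample size, and E280A is the largest group (n = 17,551). A more negative score also indicates a poorer penalized fit rather than a more complex graph, so the raw values alone do not demonstrate greater structural complexity. Comparing edge counts or sample-size-normalized scores across the three networks would be needed to test the idea that PSEN1 mutation carriers have broader transcriptomic disorganization.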
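As a rough check on how much of the difference is due to group size, the reported scores can be divided by the number of nuclei in each group. This is only a back-of-the-envelope normalization using the values reported above; the penalty term does not scale linearly with sample size, and each group has its own learned structure.

```python
# Reported BIC scores and group sizes from this report.
bic = {"Control": -224441.54, "Sporadic": -207931.54, "E280A": -342939.69}
n_nuclei = {"Control": 11745, "Sporadic": 14448, "E280A": 17551}

bic_per_nucleus = {group: bic[group] / n_nuclei[group] for group in bic}
for group, value in bic_per_nucleus.items():
    print(f"{group}: {value:.2f}")
# Control: -19.11
# Sporadic: -14.39
# E280A: -19.54
```

On this per-nucleus scale, Control and E280A are close to each other, and Sporadic is the group that stands apart.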
I exported each learned network as a DOT graph and rendered high-resolution PNG visualizations. I then extracted the full CPD (conditional probability distribution) table for every gene, that is, the probability of each expression level of the gene given the states of its parents in the network. These tables were saved as separate text files for the Control, Sporadic, and E280A groups. I am attaching the CPD files and network structures here.
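For readers who want to see what these two outputs contain, the sketch below writes a structure as DOT text and estimates a CPD by maximum likelihood from discretized data. The helpers `to_dot` and `estimate_cpd` are hypothetical, and the NRF1 → GENE_A edge and the toy observations are invented for illustration; they are not learned edges or real data.

```python
from collections import Counter, defaultdict


def to_dot(structure, name="bn"):
    """Render a node -> parents mapping as a DOT digraph string."""
    lines = [f"digraph {name} {{"]
    for child, parents in structure.items():
        lines.append(f'  "{child}";')
        for parent in parents:
            lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


def estimate_cpd(data, node, parents):
    """Maximum-likelihood P(node | parents) from discrete observations."""
    counts = defaultdict(Counter)
    for row in data:
        counts[tuple(row[p] for p in parents)][row[node]] += 1
    return {
        cfg: {state: c / sum(states.values()) for state, c in states.items()}
        for cfg, states in counts.items()
    }


# Invented structure and observations, for illustration only.
structure = {"NRF1": [], "GENE_A": ["NRF1"]}
observations = [
    {"NRF1": 0, "GENE_A": 0},
    {"NRF1": 0, "GENE_A": 0},
    {"NRF1": 0, "GENE_A": 1},
    {"NRF1": 2, "GENE_A": 2},
]

print(to_dot(structure))
cpd = estimate_cpd(observations, "GENE_A", ["NRF1"])
print(cpd)
```

A DOT file written this way can be rendered with the Graphviz command-line tool, for example `dot -Tpng network.dot -o network.png`, where the file names are placeholders.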
