Project 01

约 4483 字大约 15 分钟

2024-10-31

2024 秋学期高通量大作业 | 22230111 郎程闰表观遗传组

基于Snakemake 的 ATAC-seq 足迹识别与差异分析流程开发及应用：以 HCV 相关肝癌为例

1. 前言

1.1 背景

慢性丙型肝炎病毒 ( HCV ) 感染是导致肝硬化和肝细胞癌 ( HCC ) 的主要致病因素。在全球范围内，约有 7100 万慢性 HCV 感染者，其中 10-20% 的患者在其一生中可能出现严重的肝脏并发症。表观遗传机制 ( 如 DNA 甲基化和组蛋白修饰 ) 在多种疾病过程中，尤其是肝病的发生和发展中，发挥了关键作用。HCV 感染能够显著改变这些表观遗传标记，并可能在感染治愈后留下 “表观遗传疤痕” ，从而导致长期的基因调控异常和癌症风险。
近年来，直接作用抗病毒药物 ( DAA ) 的发展为 HCV 的治愈提供了新的希望。然而，临床观察表明，即使在达到持续病毒学应答( SVR ) 后，患者的 HCC 风险仍未完全消除。这提示 HCV 感染可能通过持久的表观遗传损伤影响肝脏细胞功能，为理解这些过程的分子机制提供了研究必要性。

1.2 目标

为了揭示 HCV 感染如何通过表观遗传途径影响肝细胞功能，以及在病毒清除后这些改变是否仍然持续，本研究构建了基于Snakemake 的 ATAC-seq 足迹分析与差异分析流程，对相关数据进行系统性分析。具体目标包括：
1. 构建高效、可重复的 Snakemake 流程，用于 ATAC-seq 数据的标准化处理、足迹识别和差异分析
2. 比较 HCV 感染前后肝细胞系 ( Huh7.5 和 Hu1545 ) 染色质开放性和转录调控的变化
3. 评估 HCV 治愈后，表观遗传标记的改变是否能够恢复正常
4. 阐明 HCV 感染和治愈过程中关键基因失调的分子机制，为治疗策略优化提供数据支持

2. 相关网址

代码仓库： Meteor-oxalis3 / myATAC-seq

3. 数据来源

GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE168178
对应文献：https://pmc.ncbi.nlm.nih.gov/articles/PMC9416882/

Hu1545 was derived from primary human hepatocytes from a de-identified donor without liver disease. Huh7.5 is a derivative of the Huh7 HCC cell line with a RIG-I mutation that leaves the cell line more susceptible to HCV infection

4. 环境工具

流程语言：Snakemake v8.25.5
软件包管理：Micromamba v2.0.4
环境打包分发：Singularity v3.8.6

5. 准备工作

5.1 数据准备

考虑到国内下载 SRA 的速度太慢，我在“赛博活菩萨”—— Cloudflare 那里搭建了一个镜像网站用于加速下载。还编写了一个 shell 小脚本 ( cf-sra-get.sh ) 进一步方便下载数据。

#!/bin/bash
# 检查是否提供了参数
if [ -z "$1" ]; then
    echo "Usage: $0 <SRR_ID>"
    exit 1
fi

# 获取传入的SRR编号
SRR_ID=$1
SRA_NAME="$SRR_ID.sra"

# 下载文件
curl -LS "https://sra.meteor-oxalis3.com/sra/$SRR_ID/$SRR_ID" -o $SRA_NAME

开始下载数据：

chmod +x cf-sra-get.sh
./cf-sra-get.sh SRR13849295
./cf-sra-get.sh SRR13849301
./cf-sra-get.sh SRR13849307
./cf-sra-get.sh SRR13849313

我定义的样本名称与对应的SRR号：
样本名称 SRR号细胞类型
Healthy1 SRR13849295 Hu1545_ATAC
Healthy2 SRR13849301 Hu1545_HCV_ATAC
HCC1 SRR13849307 Huh7.5_ATAC
HCC2 SRR13849313 Huh7.5_HCV_ATAC

样本名称	SRR号	细胞类型
Healthy1	`SRR13849295`	Hu1545_ATAC
Healthy2	`SRR13849301`	Hu1545_HCV_ATAC
HCC1	`SRR13849307`	Huh7.5_ATAC
HCC2	`SRR13849313`	Huh7.5_HCV_ATAC

接下来我使用 fasterq-dump + pigz 的组合进行 .sra 转 .fastq.gz，比 fastq-dump --gzip 快不少。

fasterq-dump -e 30 --split-files SRR13849*.sra
pigz -p 30 *.fastq
mv *.fastq.gz ~/myATAC-seq/data

5.2 环境配置

我采用了 micromamba 与 singularity 相结合的方案进行软件版本控制与分发：通过定义 ATAC_seq_pipeline.def 文件，在 debian:bookworm-slim 基础镜像中调用 micromamba ，使用 env_ATAC_seq.yml 与 env_idr.yml 文件自动化构建 myATAC-seq.sif 镜像。
当然，也可以使用 conda 或者 micromamba 直接构建虚拟环境。
相关文件网址： Meteor-oxalis3 / myATAC-seq · singularity

这里我在我的 Fedora 41 笔记本上构建好 singularity 镜像，再通过 sftp 传输到 .9 服务器进行后续的数据处理。

# 首先克隆我的代码仓库
git clone [email protected]:Meteor-oxalis3/myATAC-seq.git

# 接着进入 singularity 的 .def 文件所在的文件夹
cd myATAC-seq/singularity

# 开始构建 .sif 镜像
singularity build --fakeroot myATAC-seq.sif ATAC_seq_pipeline.def

根据我之前折腾 docker 和 singularity 的经验，软链接相应文件后，挂载自己 $HOME 路径之外的数据可以直接挂载所在用户的家目录，无需在自己的 $HOME 路径新建同名文件夹。

借用一下邵老师已经下载好的相关数据，嘻嘻。

cd ~/myATAC-seq/data
ln -s /public/workspace/shaojf/Course/NGS/Reference/BWA_Index/ .
ln -s /public/workspace/shaojf/Course/NGS/Reference/ENCODE/blacklist.v2/ .
ln -s /public/workspace/shaojf/Course/NGS/Reference/MEME/motif_databases/ .
ln -s /public/workspace/shaojf/Course/NGS/Reference/Ensembl/Homo_sapiens.GRCh38.100.chr.gtf .

流程运行时间较长，所以我在 tmux 里面运行 singularity 。

tmux new -t "myATAC-seq"
mkdir ~/myATAC-seq/pre_and_peakcalling
cd ~/myATAC-seq/pre_and_peakcalling
singularity shell --bind /public/workspace/shaojf:/public/workspace/shaojf ../singularity/myATAC-seq.sif

6. 数据预处理 + Peak Calling

参考这篇文献^[1]推荐的预处理步骤，我使用 snakemake 编写了一个流程，实现了从 ATAC-Seq 双末端测序的原始数据到 Peak Calling 的完整过程。

6.1 代码

在 singularity 里面直接调用 snakemake 流程。

HCC1

snakemake -s ../snakemake/pre_and_peakcalling/Snakefile \
--config prefix="HCC1" cores="10" \
bwa_index="../data/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa" \
genome_index="../data/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai" \
fastq1="../data/SRR13849307_1.fastq.gz" \
fastq2="../data/SRR13849307_2.fastq.gz" \
blacklist="../data/blacklist.v2/hg38-blacklist.v2.bed"

HCC2

snakemake -s ../snakemake/pre_and_peakcalling/Snakefile \
--config prefix="HCC2" cores="10" \
bwa_index="../data/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa" \
genome_index="../data/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai" \
fastq1="../data/SRR13849313_1.fastq.gz" \
fastq2="../data/SRR13849313_2.fastq.gz" \
blacklist="../data/blacklist.v2/hg38-blacklist.v2.bed"

Healthy1

snakemake -s ../snakemake/pre_and_peakcalling/Snakefile \
--config prefix="Healthy1" cores="10" \
bwa_index="../data/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa" \
genome_index="../data/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai" \
fastq1="../data/SRR13849295_1.fastq.gz" \
fastq2="../data/SRR13849295_2.fastq.gz" \
blacklist="../data/blacklist.v2/hg38-blacklist.v2.bed"

Healthy2

snakemake -s ../snakemake/pre_and_peakcalling/Snakefile \
--config prefix="Healthy2" cores="10" \
bwa_index="../data/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa" \
genome_index="../data/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai" \
fastq1="../data/SRR13849301_1.fastq.gz" \
fastq2="../data/SRR13849301_2.fastq.gz" \
blacklist="../data/blacklist.v2/hg38-blacklist.v2.bed"

6.1.1 数据链接

"link.smk"
将原始 fastq.gz 数据文件软链接到特定文件夹，方便后续调用。

6.1.2 第一次质检

"fastqc_raw.smk"
使用 fastqc 对原始下机数据进行第一次质量检查。

6.1.3 接头剪切

"fastp.smk"
使用 fastp 的双末端模式自动识别接头类型并进行剪切。

6.1.4 第二次质检

"fastqc_trimmed.smk"
使用 fastqc 对剪切接头后的数据进行第二次质量检查。

6.1.5 序列比对

"bwa.smk"
使用 bwa 进行序列比对到参考基因组，生成 .sam 文件。

6.1.6 samtools 系列处理

"samtools.smk"
使用 samtools 对输入文件执行统计、格式转换、修复配对、排序、去重和索引等标准化处理步骤。

6.1.7 偏移修正

"alignmentSieve.smk"
使用 deeptools 里面的 alignmentSieve 工具过滤和调整测序比对数据，并且使用 samtools 按照参考基因组的位置进行排序，方便后续的 Diffbind 差异分析。

6.1.8 Peak Calling

"macs3.smk"
使用 MACS3 检测 peaks。

6.1.9 汇总预处理结果

"multiqc.smk"
使用 multiqc 将接头剪切和序列比对的结果进行汇总报告。

7. 足迹识别

这里我参考了 TOBIAS 官方^[2]提供的流程图，自己编写了从上个流程的 4 个样本结果到 BINDetect 步骤的 snakemake 流程。

7.1 代码

mkdir ~/myATAC-seq/bindetect
cd ~/myATAC-seq/bindetect

snakemake -s ../snakemake/bindetect/Snakefile \
--config case_prefix="HCC" control_prefix="Healthy" cores="20" \
case1="../pre_and_peakcalling/HCC1/" \
case2="../pre_and_peakcalling/HCC2/" \
control1="../pre_and_peakcalling/Healthy1/" \
control2="../pre_and_peakcalling/Healthy2/" \
genome="../data/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa" \
genome_index="../data/BWA_Index/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai" \
motifs="../data/motif_databases/JASPAR/JASPAR2022_CORE_vertebrates_non-redundant_v2.meme"

7.1.1 合并 BAM

"samtools_merge_case.smk", "samtools_merge_control.smk"
由于 TOBIAS 不支持生物学重复，所以需要将各组的 bam 文件合并一下。

7.1.2 IDR 分析

"idr_case.smk", "idr_control.smk"
使用 IDR 进行 Peak 过滤和筛选，接着使用 awk 将通过 IDR 阈值的 Peak 提取后转换为 narrowPeak 格式。
1. HCC_idr.txt.png
1. Healthy_idr.txt.png

7.1.3 合并生成多样本共识 Peaks

"bedtools_merge.smk"
合并 case 组和 control 组的 IDR 结果，并通过 bedtools 对合并后的峰值进行排序和筛选，最后去除线粒体 DNA。

7.1.4 ATACorrect 矫正

"tobias_ATACorrect_case.smk", "tobias_ATACorrect_control.smk"
使用 TOBIAS 对 case 组和 control 组的 BAM 文件和合并的 Peaks 进行 ATAC-seq 偏倚校正 Tn5 转座酶矫正。

HCC_atacorrect.pdf

Healthy_atacorrect.pdf

7.1.5 ScoreBigwig 足迹计算

"tobias_ScoreBigwig_case.smk", "tobias_ScoreBigwig_control.smk"
使用 TOBIAS 根据校正后的信号文件和合并的 Peaks 计算足迹。

7.1.6 BINDetect 分析

"tobias_BINDetect.smk"
使用 TOBIAS 在 case 组和 control 组的足迹信号文件上进行 BINDetect 分析：首先在所有开发区域搜索已知的 motif 基序，确定可能的结合位点，再结合足迹分数，推测是否可能有 TF 结合。
- 详细结果文件:
  1. 表格：~/myATAC-seq/bindetect/footprinting/bindetect_results.xlsx
  2. 交互式网页：~/myATAC-seq/bindetect/footprinting/bindetect_HCC_Healthy.html

7.2 单个 motif 分析

这里我以 FOXA1 为例，进行单个 motif 分析，直观展示不同条件下的足迹可视化图谱。

7.2.1 聚合信号曲线图

使用 TOBIAS PlotAggregate 绘制。

mkdir ~/myATAC-seq/bindetect/motif
cd ~/myATAC-seq/bindetect/motif

TOBIAS PlotAggregate \
--TFBS \
../footprinting/FOXA1_MA0148.4/beds/FOXA1_MA0148.4_HCC_bound.bed \
../footprinting/FOXA1_MA0148.4/beds/FOXA1_MA0148.4_HCC_unbound.bed \
../footprinting/FOXA1_MA0148.4/beds/FOXA1_MA0148.4_Healthy_bound.bed \
../footprinting/FOXA1_MA0148.4/beds/FOXA1_MA0148.4_Healthy_unbound.bed \
--TFBS-labels HCC_bound HCC_unbound Healthy_bound HHealthy_unbound \
--signals ../HCC/tobias/HCC_corrected.bw ../HCC/tobias/HCC_uncorrected.bw ../Healthy/tobias/Healthy_corrected.bw ../Healthy/tobias/Healthy_uncorrected.bw \
--signal-labels HCC_corrected HCC_uncorrected Healthy_corrected Healthy_uncorrected \
--output FOXA1_MA0148.4_PlotAggregate.pdf \
--plot-boundaries

FOXA1_MA0148.4_PlotAggregate.pdf

7.2.2 热图

使用 TOBIAS PlotHeatmap 绘制。

TOBIAS PlotHeatmap \
--TFBS \
../footprinting/FOXA1_MA0148.4/beds/FOXA1_MA0148.4_HCC_bound.bed \
../footprinting/FOXA1_MA0148.4/beds/FOXA1_MA0148.4_HCC_unbound.bed \
../footprinting/FOXA1_MA0148.4/beds/FOXA1_MA0148.4_Healthy_bound.bed \
../footprinting/FOXA1_MA0148.4/beds/FOXA1_MA0148.4_Healthy_unbound.bed \
--TFBS-labels HCC_bound HCC_unbound Healthy_bound HHealthy_unbound \
--signals ../HCC/tobias/HCC_corrected.bw ../HCC/tobias/HCC_uncorrected.bw ../Healthy/tobias/Healthy_corrected.bw ../Healthy/tobias/Healthy_uncorrected.bw \
--signal-labels HCC_corrected HCC_uncorrected Healthy_corrected Healthy_uncorrected \
--output FOXA1_MA0148.4_PlotHeatmap.pdf \
--plot_boundaries

FOXA1_MA0148.4_PlotHeatmap.pdf

7.3 筛选位点

我写了个 shell 脚本 filter.sh ，用于筛选前 20 个位点。

#!/bin/bash
# 过滤条件 HCC_bound + Healthy_bound == 2
awk -F '\t' 'NR == 1 { next } $12 + $13 == 2 {print $0}' \
../footprinting/FOXA1_MA0148.4/FOXA1_MA0148.4_overview.txt > filtered.tmp

# 按 Healthy_score 降序排序, 并取前 20 行
sort -t$'\t' -k11,11nr filtered.tmp | head -n 20 > top20.tmp

# 生成 plot_regions 数据
awk -F '\t' '
    { 
        plot_start = $8 - 1000;          # $8 是 peak_start
        plot_end = $9 + 1000;            # $9 是 peak_end
        print $7, plot_start, plot_end;  # 输出 peak_chr, plot_start, plot_end
    }
' OFS='\t' top20.tmp > "plot_regions.bed"

# 清理中间文件
rm filtered.tmp top20.tmp

使用 TOBIAS PlotTracks 绘制 IGV 风格的全基因组信号轨迹，以观察足迹分布。

TOBIAS PlotTracks \
--bigwigs ../../pre_and_peakcalling/HCC1/macs3/HCC1_treat_pileup.bw \
--bigwigs ../../pre_and_peakcalling/HCC2/macs3/HCC2_treat_pileup.bw \
--bigwigs ../../pre_and_peakcalling/Healthy1/macs3/Healthy1_treat_pileup.bw \
--bigwigs ../../pre_and_peakcalling/Healthy2/macs3/Healthy2_treat_pileup.bw \
--bigwigs ../../bindetect/HCC/tobias/HCC_corrected.bw \
--bigwigs ../../bindetect/Healthy/tobias/Healthy_corrected.bw \
--bigwigs ../../bindetect/HCC/tobias/HCC_footprints.bw \
--bigwigs ../../bindetect/Healthy/tobias/Healthy_footprints.bw \
--regions plot_regions.bed \
--sites ../footprinting/FOXA1_MA0148.4/beds/FOXA1_MA0148.4_all.bed \
--highlight ../footprinting/FOXA1_MA0148.4/beds/FOXA1_MA0148.4_all.bed \
--gtf ../../data/Homo_sapiens.GRCh38.100.chr.gtf \
--colors red darkblue red darkblue red darkblue

20 个位点汇总成一个 all_plots.pdf。

8. 差异分析

8.1 准备工作

构建样本元数据表 ( ~/myATAC-seq/data/sheet.csv ):

SampleID	Tissue	Factor	Condition	Treatment	Replicate	bamReads	Peaks	PeakCaller
Healthy1	NA	Healthy_Factor	Healthy_Condition	HCV-	1	/public/workspace/stu22230111/myATAC-seq/pre_and_peakcalling/Healthy1/alignmentSieve/ATACshifted_Healthy1_sorted.bam	/public/workspace/stu22230111/myATAC-seq/pre_and_peakcalling/Healthy1/macs3/Healthy1_peaks.narrowPeak	bed
Healthy2	NA	Healthy_Factor	Healthy_Condition	HCV+	2	/public/workspace/stu22230111/myATAC-seq/pre_and_peakcalling/Healthy2/alignmentSieve/ATACshifted_Healthy2_sorted.bam	/public/workspace/stu22230111/myATAC-seq/pre_and_peakcalling/Healthy2/macs3/Healthy2_peaks.narrowPeak	bed
HCC1	NA	HCC_Factor	HCC_Condition	HCV-	1	/public/workspace/stu22230111/myATAC-seq/pre_and_peakcalling/HCC1/alignmentSieve/ATACshifted_HCC1_sorted.bam	/public/workspace/stu22230111/myATAC-seq/pre_and_peakcalling/HCC1/macs3/HCC1_peaks.narrowPeak	bed
HCC2	NA	HCC_Factor	HCC_Condition	HCV+	2	/public/workspace/stu22230111/myATAC-seq/pre_and_peakcalling/HCC2/alignmentSieve/ATACshifted_HCC2_sorted.bam	/public/workspace/stu22230111/myATAC-seq/pre_and_peakcalling/HCC2/macs3/HCC2_peaks.narrowPeak	bed

8.2 代码

在 singularity 里面直接通过 Rscript 运行：
```
Rscript ~/myATAC-seq/R_diffbind/diffbind.R
```

R 代码详情如下 (点击下方箭头展开查看完整代码) ：

library(DiffBind)
library(tidyverse)
library(corrplot)
library(PCAtools)

# 新建文件夹
output_dir = "~/myATAC-seq/diffbind"
dir.create(output_dir, recursive = TRUE)

# 切换工作目录
setwd(output_dir)

# 加载样本信息
sampleSheet = read.csv(file = "~/myATAC-seq/data/sheet.csv", sep = "\t")
sampleSheet

# 构建 DiffBind 对象
diff_bind_obj = dba(
sampleSheet = sampleSheet,
config = data.frame(doBlacklist = FALSE, doGreylist = FALSE),
bRemoveM = FALSE, bRemoveRandom = FALSE
)

# 计算峰值
diff_bind_obj = dba.count(
diff_bind_obj,
summits = 200, # 峰宽 = 2 * summits + 1
bUseSummarizeOverlaps = TRUE, # 更准，更慢
score = DBA_SCORE_RPKM
)

# 标准化
diff_bind_obj = dba.normalize(
diff_bind_obj,
normalize = DBA_NORM_TMM
)

# 提取 FPKM 数据
peaks_fpkm = diff_bind_obj$binding %>%
as_tibble() %>%
mutate(peak_id = str_c('chr', CHR, ':', START, '_', END)) %>%
column_to_rownames(var = "peak_id") %>%
select(-c('CHR', 'START', 'END'))

# 绘制相关性热图并保存为 PDF
pdf(file = file.path(output_dir, "correlation_plot.pdf"))
corrplot(cor(peaks_fpkm), addCoef.col = "white")
dev.off()

# PCA 分析
pca = pca(peaks_fpkm, metadata = column_to_rownames(sampleSheet, var = "SampleID"))

# 绘制 PCA 二维图并保存为 PDF
pdf(file = file.path(output_dir, "pca_biplot.pdf"))
biplot(pca, colby = "Factor")
dev.off()

# 差异分析
diff_bind_obj = dba.contrast(
diff_bind_obj,
contrast = c("Condition", "Healthy_Condition", "HCC_Condition"),
minMembers = 2
)

diff_bind_obj = dba.analyze(diff_bind_obj)

# 差异分析结果展示
# (这里的韦恩图好像没啥意义...)
# pdf(file = file.path(output_dir, "differential_binding_venn.pdf"))
# dba.plotVenn(diff_bind_obj, contrast = 1, bDB = TRUE, bGain = TRUE, bLoss = TRUE, bAll = FALSE)
# dev.off()

# 火山图
pdf(file = file.path(output_dir, "differential_binding_volcano.pdf"))
dba.plotVolcano(diff_bind_obj, contrast = 1)
dev.off()

# 保存差异结果表
diff_bind_result = dba.report(diff_bind_obj)
diff_bind_table = as.data.frame(diff_bind_result)
write.csv(diff_bind_table, file = file.path(output_dir, "diff_bind_results.csv"), row.names = FALSE)

# 打印差异分析对比结果
print(dba.show(diff_bind_obj, bContrasts = TRUE))

8.3 结果

三张图。
1. correlation_plot.pdf
1. pca_biplot.pdf
1. differential_binding_volcano.pdf

最后来一张深度为 2 的 tree 作为收尾。

bindetect
- footprinting
  - …
- HCC
  - …
- Healthy
  - …
- motif
  - …
- peak_merge
  - …
data
- blacklist.v2-> /public/workspace/shaojf/Course/NGS/Reference/ENCODE/blacklist.v2/
  - …
- BWA_Index-> /public/workspace/shaojf/Course/NGS/Reference/BWA_Index/
  - …
- Homo_sapiens.GRCh38.100.chr.gtf-> /public/workspace/shaojf/Course/NGS/Reference/Ensembl/Homo_sapiens.GRCh38.100.chr.gtf
- motif_databases-> /public/workspace/shaojf/Course/NGS/Reference/MEME/motif_databases/
- sheet.csv
- SRR13849295_1.fastq.gz
- SRR13849295_2.fastq.gz
- SRR13849301_1.fastq.gz
- SRR13849301_2.fastq.gz
- SRR13849307_1.fastq.gz
- SRR13849307_2.fastq.gz
- SRR13849313_1.fastq.gz
- SRR13849313_2.fastq.gz
diffbind
- correlation_plot.pdf
- diff_bind_results.csv
- differential_binding_volcano.pdf
- pca_biplot.pdf
LICENSE
pre_and_peakcalling
- HCC1
  - …
- HCC2
  - …
- Healthy1
  - …
- Healthy2
  - …
R_diffbind
- diffbind.R
README.md
singularity
- ATAC_seq_pipeline.def
- bin
  - micromamba
- debian.sources
- env_ATAC_seq.yml
- env_idr.yml
- myATAC-seq.sif
snakemake
- bindetect
  - …
- pre_and_peakcalling
  - …

9. 总结

9.1 实验结果

在本次实验中, 我发现了FOX、HNF等转录因子家族对于 HCV 相关肝癌具有潜在调控作用。
通过查阅相关文献资料^[3], 也进一步佐证了我的发现: FOXA 家族成员最初存在于肝细胞核提取物中，研究表明肝脏发育的开始取决于 FOXA 家族成员^[4] ^[5]。FOXA1 和 FOXA2 激活前肠内胚层内的发育基因，以建立对器官特异性信号的反应能力并诱导肝脏特异性基因的表达 ^[5:1]。
因此，当 FOXA1 和 FOXA2 缺失时，肝脏基因调控网络被破坏，导致肝脏关键基因的表达迅速而大幅降低，从而严重影响肝脏发育和功能 ^[4:1]。
来自 HNF 转录因子家族的 HNF4α 在维持肝脏稳态中也起着至关重要的作用。然而, HNF4α 的突变或其靶向基因启动子区域的表观遗传修饰会导致其功能特性的丧失。值得注意的是, HNF4α 表达升高与肝细胞癌 HCC 患者的良好预后相关。^[6]

9.2 小技巧

9.2.1 快速查看服务器上的文件

想要查看结果文件 ( PDF、HTML、PNG等 ), 可以先 cd 到相应路径，然后使用 python3 中的 http.server 模块将文件夹临时暴露成类似于 nginx 那样的静态文件服务（ 11111 是端口号，可以随便设置，建议范围在 10000 - 60000 之间 ):
```
python3 -m http.server 11111
```
在本地电脑上的浏览器输入服务器对应的 ip地址:端口 ，即可访问：

9.2.2 命令行下 `pdf` 转 `png`

使用 imagemagick 软件可以通过命令行快速进行常见图片格式的转换操作。需要使用 conda 或 micromamba 安装 imagemagick 软件后使用。

# 假如你用的是 anaconda/miniconda
conda install -c conda-forge imagemagick
# 假如你用的是 micromamba
micromamba install -c conda-forge imagemagick

# pdf转png (density是分辨率，ppi)
magick -density 300 foo.pdf foo.png

9.3 注意

peak_merge 的结果需要手动去除线粒体 DNA ，我使用的是 grep -Ev '(chrM|chrMT|M|MT|Mito)' 。
DiffBind 输入只能为 bam 或 bed 或 bed.gz，alignmentSieve 那步输出不可为 bedpe，而且还需要再次 samtools 排序。
通过 conda-forge 源安装的 singularity 在某些系统上可能会出现 Failed to create user namespace: user namespace disabled: exit status 1 ，可以联系服务器管理员临时调整最大用户命名空间：sudo sysctl user.max_user_namespaces=65536，或永久调整：echo "user.max_user_namespaces=65536" >> /etc/sysctl.conf
使用 Diffbind 中的 dba.count 函数时，所需的 .bam 文件需要按照参考基因组的位置进行排序，我在 pre_and_peakcalling 的 “偏移修正” 步骤里面已经使用 samtools 排序了。

10. 参考文献

Grandi, F.C., Modi, H., Kampman, L. et al. Chromatin accessibility profiling by ATAC-seq. Nat Protoc 17, 1518–1552 (2022). https://doi.org/10.1038/s41596-022-00692-9 ↩︎
Bentsen, M., Goymann, P., Schultheis, H. et al. ATAC-seq footprinting unravels kinetics of transcription factor binding during zygotic genome activation. Nat Commun 11, 4267 (2020). https://doi.org/10.1038/s41467-020-18035-1 ↩︎
Liu, N., Wang, A., Xue, M. et al. FOXA1 and FOXA2: the regulatory mechanisms and therapeutic implications in cancer. Cell Death Discov. 10, 172 (2024). https://doi.org/10.1038/s41420-024-01936-1 ↩︎
Reizel, Y., Morgan, A., Gao, L., Lan, Y., Manduchi, E., Waite, E. L., Wang, A. W., Wells, A., & Kaestner, K. H. (2020). Collapse of the hepatic gene regulatory network in the absence of FoxA factors. Genes & development, 34(15-16), 1039–1050. https://doi.org/10.1101/gad.337691.120 ↩︎ ↩︎
Lee, C., Friedman, J., Fulmer, J. et al. The initiation of liver development is dependent on Foxa transcription factors. Nature 435, 944–947 (2005). https://doi.org/10.1038/nature03649 ↩︎ ↩︎
Qu, N., Luan, T., Liu, N., Kong, C., Xu, L., Yu, H., Kang, Y., & Han, Y. (2023). Hepatocyte nuclear factor 4 a (HNF4α): A perspective in cancer. Biomedicine & pharmacotherapy = Biomedecine & pharmacotherapie, 169, 115923. https://doi.org/10.1016/j.biopha.2023.115923 ↩︎