AlphaGenome Notes
AlphaGenome comes from Google DeepMind, the organization that previously released AlphaFold, work that was recognized with the Nobel Prize in Chemistry in 2024.
I think reading this paper is a great learning opportunity because the model is quite versatile, and a single paper covers numerous mainstream benchmarks as well as a large number of genomics concepts. The paper is 103 pages long, with the first 17 pages being the key content, and the rest mostly covering engineering details.
However, my time is limited and I haven't finished reading it yet; there are quite a few concepts I don't understand.
What can it do
AlphaGenome achieved SOTA performance on 22 out of 24 genome track prediction tasks and 24 out of 26 variant effect prediction tasks.
Training
It is trained on TPUs.
It uses common training methods, but applied to a new domain:
- U-Net: mainstream in segmentation tasks
- Teacher-Student Model
- 4-fold cross validation
- …
It's quite lightweight for inference,
taking less than one second on an NVIDIA H100 GPU.
Features
It covers non-coding sequences, which account for about 98% of the genome and are a current research focus.
It supports contexts of up to 1 Mb, far longer than previous models such as Enformer.
SEQUENCE_LENGTH_2KB = 2**11  # 2_048
SEQUENCE_LENGTH_16KB = 2**14  # 16_384
SEQUENCE_LENGTH_100KB = 2**17  # 131_072
SEQUENCE_LENGTH_500KB = 2**19  # 524_288
SEQUENCE_LENGTH_1MB = 2**20  # 1_048_576
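Assuming inputs have to match one of these lengths, a small helper that picks the smallest length able to hold a raw sequence could look like the sketch below. Only the length values come from the source; `SUPPORTED_LENGTHS` and `smallest_supported_length` are hypothetical names I made up for illustration.

```python
# Input lengths copied from the constants above.
SUPPORTED_LENGTHS = (2**11, 2**14, 2**17, 2**19, 2**20)


def smallest_supported_length(n_bases: int) -> int:
    """Return the smallest supported length that can hold n_bases.

    Hypothetical helper, not part of the SDK.
    """
    if n_bases <= 0:
        raise ValueError("sequence must contain at least one base")
    for length in SUPPORTED_LENGTHS:
        if n_bases <= length:
            return length
    raise ValueError(
        f"{n_bases} bases exceed the maximum of {SUPPORTED_LENGTHS[-1]}"
    )
```

For example, a 3,000-base sequence would be assigned the 16 KB window, since it no longer fits in 2,048 bases.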
output_types
# Epigenetics
ATAC = dna_model_pb2.OUTPUT_TYPE_ATAC  # ATAC-seq data, reflects open chromatin regions
DNASE = dna_model_pb2.OUTPUT_TYPE_DNASE  # DNase I hypersensitive site data, also captures open chromatin regions
CHIP_HISTONE = dna_model_pb2.OUTPUT_TYPE_CHIP_HISTONE  # Detects histone modifications
CHIP_TF = dna_model_pb2.OUTPUT_TYPE_CHIP_TF  # Detects transcription factor binding sites

# Transcription
RNA_SEQ = dna_model_pb2.OUTPUT_TYPE_RNA_SEQ  # Quantifies gene expression levels
CAGE = dna_model_pb2.OUTPUT_TYPE_CAGE  # Cap Analysis of Gene Expression, reflects activity at transcription start sites
SPLICE_SITES = dna_model_pb2.OUTPUT_TYPE_SPLICE_SITES  # Splice sites: donor and acceptor sites of RNA splicing
SPLICE_SITE_USAGE = dna_model_pb2.OUTPUT_TYPE_SPLICE_SITE_USAGE  # Splice site usage: how often each splice site is used
SPLICE_JUNCTIONS = dna_model_pb2.OUTPUT_TYPE_SPLICE_JUNCTIONS  # Splice junction data, computed from RNA-seq split reads
PROCAP = dna_model_pb2.OUTPUT_TYPE_PROCAP  # Precision Run-On sequencing and capping, for precise measurement of transcription initiation

# Spatial
CONTACT_MAPS = dna_model_pb2.OUTPUT_TYPE_CONTACT_MAPS  # DNA 3D contact maps, reflect DNA-DNA spatial contact probability
I haven't really mastered these, so I'll leave them to specialized researchers. I only care about the code and implementation (the engineering side).
Read the source code
They use gRPC (of course, it's Google), so we can rewrite the SDK in other languages.
Why rewrite it (if I ever do)?
- I want to practice TypeScript. It's essential for frontend development, and Node is also quite common as a backend in startups abroad.
- It’s easier to maintain compared to Python.
- I’ve already started building the backend for an agent application, so I can embed this as a module without the hassle of calling Python scripts.
.protos
Compilation: you need to download protobuf and protoc; see the official gRPC documentation. By default, installation is done via pip (pip → pyproject.toml → hatch_build.py). For manual compilation, you need to adjust the import paths.
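For manual compilation from Python, one common route is the `grpc_tools.protoc` module shipped with the `grpcio-tools` package. The sketch below only builds the command line; `protoc_command` is a hypothetical helper, and the directory and file names you pass in are placeholders rather than paths from the repository.

```python
import sys


def protoc_command(proto_files: list[str], proto_dir: str, out_dir: str) -> list[str]:
    """Build a grpc_tools.protoc invocation (hypothetical helper).

    proto_dir is passed as the include path; out_dir receives both the
    message modules (*_pb2.py) and the service stubs (*_pb2_grpc.py).
    """
    if not proto_files:
        raise ValueError("at least one .proto file is required")
    return [
        sys.executable, "-m", "grpc_tools.protoc",
        f"-I{proto_dir}",
        f"--python_out={out_dir}",
        f"--grpc_python_out={out_dir}",
        *proto_files,
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once `grpcio-tools` is installed. The generated modules import each other by the path given relative to `-I`, which is likely where the import-path adjustment comes from.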
Look up the parameters in the .proto files; here is an example:
// Human (Homo sapiens).
ORGANISM_HOMO_SAPIENS = 9606;
// Mouse (Mus musculus).
ORGANISM_MUS_MUSCULUS = 10090;
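The enum values are the NCBI Taxonomy IDs of the two species. If you rewrite the SDK, you could mirror them in a plain enum so callers do not need to touch the generated protobuf module. This is only a sketch: `Organism` and `organism_from_id` are hypothetical names, and only the two numeric values come from the .proto excerpt.

```python
from enum import IntEnum


class Organism(IntEnum):
    """Hypothetical mirror of the proto enum; values match the excerpt above."""

    HOMO_SAPIENS = 9606
    MUS_MUSCULUS = 10090


def organism_from_id(organism_id: int) -> Organism:
    """Look up an organism by numeric ID, with a clearer error for unknown IDs."""
    try:
        return Organism(organism_id)
    except ValueError:
        raise ValueError(f"unsupported organism ID: {organism_id}") from None
```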
Use N to represent an unknown base:
_VALID_SEQUENCE_CHARACTERS = frozenset('ACGTN')
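A minimal sketch of how this set might be used: reject sequences with characters outside the alphabet, then pad a short sequence with N up to a target length. `validate_sequence` and `pad_with_n` are hypothetical helpers, not SDK functions, and centering the sequence in the window is my own choice here.

```python
_VALID_SEQUENCE_CHARACTERS = frozenset('ACGTN')


def validate_sequence(sequence: str) -> None:
    """Raise ValueError if the sequence has characters outside ACGTN.

    The check is case-sensitive, so lowercase bases are rejected.
    """
    invalid = set(sequence) - _VALID_SEQUENCE_CHARACTERS
    if invalid:
        raise ValueError(f"invalid characters: {sorted(invalid)}")


def pad_with_n(sequence: str, target_length: int) -> str:
    """Center the sequence in a window of target_length, filling with N."""
    validate_sequence(sequence)
    if len(sequence) > target_length:
        raise ValueError("sequence is longer than the target length")
    return sequence.center(target_length, 'N')
```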
Specify the exported modules in __init__.py, just like index.js in JavaScript:
__all__ = ["dna_model_pb2", "dna_model_service_pb2", "dna_model_service_pb2_grpc", "tensor_pb2"]
These four files are the compiled outputs; you can also generate them for other languages. They are the core code, and depending on them alone is sufficient.
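Startup flow: create a gRPC channel that carries the target server address and the API type and content.
- The connection is kept open, so it doesn't need to be re-established for every call
- It can be called from multiple threads
- Load balancing can be configured

The source code defaults to a maximum of 5 threads; you can change this yourself.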
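The multi-threaded calling pattern can be sketched with the standard library alone. Here `rpc` stands in for whatever function sends one request over the shared channel; `call_in_parallel` and `MAX_WORKERS` are hypothetical names, and the value 5 simply mirrors the default noted above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

Request = TypeVar("Request")
Response = TypeVar("Response")

MAX_WORKERS = 5  # mirrors the default noted above; adjust as needed


def call_in_parallel(
    rpc: Callable[[Request], Response],
    requests: Iterable[Request],
    max_workers: int = MAX_WORKERS,
) -> list[Response]:
    """Run rpc over all requests on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(rpc, requests))
```

Because the threads share one channel, there is no per-call connection setup; the pool size only limits how many requests are in flight at once.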
Decorator: wraps RPC calls with retry logic (retryRPC).
Return types are annotated, e.g. -> Iterable.
Confusing: plotting is part of the API.
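To make the retry decorator idea concrete, here is a minimal sketch of one. It is not the SDK's implementation: `retry`, its parameters, and the fixed delay are all assumptions for illustration.

```python
import functools
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
    exceptions: tuple[type[BaseException], ...] = (Exception,),
):
    """Retry the wrapped call up to max_attempts times (hypothetical sketch)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> T:
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay_seconds)
            raise RuntimeError("unreachable")

        return wrapper

    return decorator
```

One caveat worth keeping in mind: if the wrapped function is a generator (which a -> Iterable annotation may hint at), calling it returns immediately and errors only surface during iteration, so a wrapper like this would not catch them.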
Questions
Artificially mutate some bases at random, like CRISPR (but without the experiment):
- calculate the output tracks
- see which parts cause real variation
- predict what could happen due to that variant

It can't directly predict traits, only the effect of a SNP on:
- gene expression
- chromatin accessibility
- methylation states
- binding of that transcription …
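The mutate-and-compare idea can be sketched as in-silico mutagenesis. Note that this version tries every single-base substitution exhaustively instead of sampling at random, and that `score` is a stand-in for a model call that reduces the output tracks to one number. `in_silico_mutagenesis` and `gc_fraction` are hypothetical names, and `gc_fraction` is a toy scorer, not a real predictor.

```python
from typing import Callable

BASES = "ACGT"


def in_silico_mutagenesis(
    sequence: str,
    score: Callable[[str], float],
) -> dict[tuple[int, str], float]:
    """Score every single-base substitution relative to the reference.

    Returns {(position, alt_base): score(mutant) - score(reference)}.
    """
    reference_score = score(sequence)
    effects: dict[tuple[int, str], float] = {}
    for position, ref_base in enumerate(sequence):
        for alt_base in BASES:
            if alt_base == ref_base:
                continue  # skip the no-op substitution
            mutant = sequence[:position] + alt_base + sequence[position + 1:]
            effects[(position, alt_base)] = score(mutant) - reference_score
    return effects


def gc_fraction(sequence: str) -> float:
    """Toy stand-in for a model: the fraction of G/C bases."""
    return sum(base in "GC" for base in sequence) / len(sequence)
```

Positions whose substitutions produce the largest differences are the candidates for "parts that cause real variation"; with a real model, each `score` call would be one prediction request.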
It is benchmarked on data from a single cell type => what about across multiple cell types?
The model seems to contextualize which cell types you can use based on the ontologies.
Do we know anything about how many variants, and their interplay, can be considered at a time while maintaining optimal performance?
The ability to do cell-specific predictions:
- train on tissue-specific gene expression and predict different tissues
- pass an ontology term to give it context about what you are trying to predict
Is there any imputation of the unobserved parts of the ENCODE tissue-by-experiment matrix?
If it can predict something that we haven't tested, how do we evaluate it?
Relevant
AlphaGenome Enhances Personal Gene Expression Prediction but Retains Key Limitations
They only ran a comparison against a random forest and wrote it up as a paper.
AlphaGenome and Random Forest made distinct predictions: while both successfully classified individuals into high- and low-expression groups, their predictions within each group showed minimal correlation, suggesting that the two models capture different underlying patterns.
https://www.science.org/doi/10.1126/science.1259037
The study validated in the original paper:
Using AlphaGenome, we predicted that the mutations would activate a nearby gene called TAL1 by introducing a MYB DNA binding motif, which replicated the known disease mechanism and highlighted AlphaGenome’s ability to link specific non-coding variants to disease genes.
graph TD
    A[Somatic mutation] --> B[Introduces MYB binding motif]
    B --> C[MYB transcription factor binds]
    C --> D[Recruits CBP<br/>H3K27 acetyltransferase]
    C --> E[Recruits transcription complex]
    E --> F[RUNX1]
    E --> G[GATA-3]
    E --> H[TAL1]
    D --> I[H3K27ac modification]
    I --> J[Super-enhancer formation]
    F --> J
    G --> J
    H --> J
    J --> K[TAL1 oncogene activation]
    K --> L[T-ALL onset]

    M[Endogenous super-enhancer] --> N[Occupied by MYB and CBP]
    N --> O[MYB broadly regulates<br/>super-enhancers]

    style A fill:#ffcccc
    style K fill:#ff6666
    style L fill:#ff3333
    style J fill:#66ccff
    style I fill:#99ff99
Miscellaneous
Which is better to use: uv or conda?