AlphaGenome Notes

AlphaGenome comes from Google DeepMind, the organization that previously released AlphaFold; the AlphaFold work earned its authors a share of the 2024 Nobel Prize in Chemistry.

I think reading this paper is a great learning opportunity because the model is quite versatile, and a single paper covers numerous mainstream benchmarks as well as a large number of genomics concepts. The paper is 103 pages long, with the first 17 pages being the key content, and the rest mostly covering engineering details.

However, with limited time I haven't finished reading it yet, and there are quite a few concepts I don't understand.

What can it do

AlphaGenome achieved SOTA performance on 22 out of 24 genome track prediction tasks and 24 out of 26 variant effect prediction tasks.

Training

Trained on TPUs

The training methods are common ones, but applied to a new domain:

U-Net: mainstream in segmentation tasks
Teacher-Student Model
4-fold cross validation
…

It's quite lightweight for inference

A prediction takes less than one second on an NVIDIA H100 GPU

Features

Covers non-coding sequences, which make up roughly 98% of the genome and are also a current research trend

Context of up to 1 Mb, far beyond previous models such as Enformer (about 200 kb)

SEQUENCE_LENGTH_2KB = 2**11  # 2_048
SEQUENCE_LENGTH_16KB = 2**14  # 16_384
SEQUENCE_LENGTH_100KB = 2**17  # 131_072
SEQUENCE_LENGTH_500KB = 2**19  # 524_288
SEQUENCE_LENGTH_1MB = 2**20  # 1_048_576
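
As a rough illustration of how these constants might be used, the sketch below picks the smallest supported length that can hold a raw sequence and center-pads it with N. The helper names `pick_sequence_length` and `center_pad` are hypothetical (they are not part of the AlphaGenome SDK), and padding with N is my assumption about one way to reach a supported length.

```python
# Same values as the SEQUENCE_LENGTH_* constants above.
SUPPORTED_LENGTHS = (2**11, 2**14, 2**17, 2**19, 2**20)


def pick_sequence_length(n_bases: int) -> int:
    """Return the smallest supported length that can hold n_bases (hypothetical helper)."""
    for length in SUPPORTED_LENGTHS:
        if n_bases <= length:
            return length
    raise ValueError(f"{n_bases} bases exceeds the 1 Mb maximum context")


def center_pad(sequence: str, pad_char: str = "N") -> str:
    """Pad both sides of a sequence so it reaches a supported length (assumed strategy)."""
    target = pick_sequence_length(len(sequence))
    missing = target - len(sequence)
    left = missing // 2
    return pad_char * left + sequence + pad_char * (missing - left)


# 2_400 bases do not fit in 2_048, so the next size up (16_384) is chosen.
padded = center_pad("ACGT" * 600)
print(len(padded))
```

Note that the "100KB" and "500KB" names are approximate labels: the actual lengths are powers of two, as the comments on the constants show.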

output_types

# Epigenetics
ATAC = dna_model_pb2.OUTPUT_TYPE_ATAC # ATAC-seq data; reflects open chromatin regions
DNASE = dna_model_pb2.OUTPUT_TYPE_DNASE # DNase I hypersensitive site data; also captures open chromatin regions
CHIP_HISTONE = dna_model_pb2.OUTPUT_TYPE_CHIP_HISTONE # detects histone modifications
CHIP_TF = dna_model_pb2.OUTPUT_TYPE_CHIP_TF # detects transcription factor binding sites

# Transcription
RNA_SEQ = dna_model_pb2.OUTPUT_TYPE_RNA_SEQ # quantifies gene expression levels
CAGE = dna_model_pb2.OUTPUT_TYPE_CAGE # Cap Analysis of Gene Expression; reflects activity at transcription start sites
SPLICE_SITES = dna_model_pb2.OUTPUT_TYPE_SPLICE_SITES # splice sites: donor and acceptor sites for RNA splicing
SPLICE_SITE_USAGE = dna_model_pb2.OUTPUT_TYPE_SPLICE_SITE_USAGE # splice site usage: how often each splice site is used
SPLICE_JUNCTIONS = dna_model_pb2.OUTPUT_TYPE_SPLICE_JUNCTIONS # splice junctions, computed from RNA-seq split-read counts
PROCAP = dna_model_pb2.OUTPUT_TYPE_PROCAP # Precision Run-On sequencing with capping (PRO-cap); maps transcription initiation precisely

# Spatial
CONTACT_MAPS = dna_model_pb2.OUTPUT_TYPE_CONTACT_MAPS # 3D DNA contact maps; reflect the probability of spatial DNA-DNA contact

I haven’t really mastered these, so I’ll leave them to specialized researchers. I only care about the code and implementation (the engineering side).

Read the source code

They use gRPC (of course, it’s Google), so the SDK can be rewritten in other languages.

Why rewrite it (if I ever do)?

I want to practice TypeScript. It’s essential for frontend development, and Node is also quite common as a backend in startups abroad.
For me, it’s easier to maintain than Python.
I’ve already started building the backend for an agent application, so I can embed this as a module without the hassle of calling Python scripts.

.proto compilation: you need to install protobuf and protoc; see the official gRPC documentation. By default, compilation happens during installation via pip (pip → pyproject.toml → hatch_build.py). To compile manually, you need to adjust the import paths.

Look up the parameter values in the .proto files; for example:

// Human (Homo sapiens).
ORGANISM_HOMO_SAPIENS = 9606;

// Mouse (Mus musculus).
ORGANISM_MUS_MUSCULUS = 10090;

N represents an unknown base:

_VALID_SEQUENCE_CHARACTERS = frozenset('ACGTN')
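
A minimal sketch of how a check against this set might look before sending a sequence to the server. The function names `invalid_characters` and `validate_sequence` are hypothetical, not taken from the SDK, and this version assumes upper-case input only (lower-case bases are rejected because the set contains only upper-case letters).

```python
_VALID_SEQUENCE_CHARACTERS = frozenset("ACGTN")


def invalid_characters(sequence: str) -> set[str]:
    """Return the characters in the sequence that are not A, C, G, T or N."""
    return set(sequence) - _VALID_SEQUENCE_CHARACTERS


def validate_sequence(sequence: str) -> str:
    """Return the sequence unchanged, or raise if it contains invalid characters."""
    bad = invalid_characters(sequence)
    if bad:
        raise ValueError(f"invalid characters in sequence: {sorted(bad)}")
    return sequence


checked = validate_sequence("ACGTNNACGT")  # passes: N marks unknown bases
```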

Specify the exported modules in __init__.py, much like index.js in JavaScript:

__all__ = ["dna_model_pb2", "dna_model_service_pb2", "dna_model_service_pb2_grpc", "tensor_pb2"]

These four files are the compiled outputs; the same .proto files can also be compiled for other languages. They are the core code, and depending on them alone is sufficient.

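Startup flow: create a gRPC channel that carries the target server address and the API type and content. The connection is kept open, so it does not have to be re-established for every call; it can be called from multiple threads, and load balancing is optional.

The source code defaults to a maximum of 5 threads; you can change this yourself.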
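
To make the bounded-thread idea concrete, here is a small sketch using the standard library's `ThreadPoolExecutor` with a cap of 5 workers. `fake_predict` is a hypothetical stand-in for one RPC over the shared channel; it is not the SDK's actual call, and the real client may organize its threads differently.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 5  # mirrors the default of 5 threads mentioned above


def fake_predict(sequence: str) -> int:
    """Hypothetical stand-in for one RPC; here it just counts G and C bases."""
    return sum(base in "GC" for base in sequence)


def predict_many(sequences: list[str], max_workers: int = MAX_WORKERS) -> list[int]:
    """Run fake_predict over all sequences on a bounded thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order even though calls run concurrently.
        return list(pool.map(fake_predict, sequences))


results = predict_many(["ACGT", "GGCC", "ATAT"])
print(results)
```

Threads suit this workload because each call mostly waits on the network, so a small pool already overlaps the waiting time.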

decorator: wraps retryRPC
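
I have not reproduced the SDK's decorator here; the following is only a generic sketch of what a retry decorator with exponential backoff can look like. The name `retry_rpc`, its parameters, and the use of `ConnectionError` are all assumptions for illustration (a real gRPC client would presumably retry on `grpc.RpcError` instead).

```python
import functools
import time


def retry_rpc(max_attempts: int = 3, base_delay: float = 0.01, retry_on=(ConnectionError,)):
    """Retry the wrapped call with exponential backoff (hypothetical helper)."""

    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator


calls = {"count": 0}


@retry_rpc(max_attempts=3)
def flaky_call() -> str:
    """Fails twice with ConnectionError, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = flaky_call()
```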

define return type -> Iterable

confusing: plotting is part of the API

Questions

artificially mutate bases at random, like CRISPR but in silico (without an experiment), compute the output tracks, and see which positions cause real changes

predict what could happen due to that variant
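
The idea in the two notes above is usually called in silico mutagenesis. A toy sketch follows, with two simplifications: it substitutes every base exhaustively rather than at random, and `toy_score` is a hypothetical stand-in for a model output track (it only counts a TATA motif), not anything AlphaGenome computes.

```python
BASES = "ACGT"


def toy_score(sequence: str) -> float:
    """Hypothetical stand-in for a model output: counts occurrences of 'TATA'."""
    return float(sum(sequence[i:i + 4] == "TATA" for i in range(len(sequence) - 3)))


def in_silico_mutagenesis(sequence: str, score=toy_score) -> dict[tuple[int, str], float]:
    """Map (position, alternate base) to score(mutant) - score(reference)."""
    reference = score(sequence)
    effects = {}
    for pos, ref_base in enumerate(sequence):
        for alt in BASES:
            if alt == ref_base:
                continue  # skip the unchanged base
            mutant = sequence[:pos] + alt + sequence[pos + 1:]
            effects[(pos, alt)] = score(mutant) - reference
    return effects


effects = in_silico_mutagenesis("GGTATAGG")
# Positions where at least one substitution changes the score.
sensitive = sorted({pos for (pos, _), delta in effects.items() if delta != 0})
print(sensitive)
```

With a real model, `score` would be replaced by a call that returns a summary of the predicted track, and the positions with large deltas are the ones the model considers functionally important.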

can’t directly predict traits, only the effect of a SNP on gene expression, chromatin accessibility, methylation state, binding of that transcription …

benchmarked on data from a single cell type => what about across multiple cell types?

the model seems to contextualize which cell types you can use based on the ontologies

Do we know anything about how many variants, and their interplay, can be considered at a time while maintaining optimal performance?

the ability to do cell-specific predictions: train on tissue-specific gene expression and predict different tissues; pass an ontology term to give the model context about what you are trying to predict

isn’t there any imputation of the unobserved parts of the ENCODE tissue-by-experiment matrix?

if it can predict something that we haven’t tested, how do we evaluate it?

Relevant

AlphaGenome Enhances Personal Gene Expression Prediction but Retains Key Limitations

They only ran a comparison against a random forest and wrote it up as a paper.

AlphaGenome and Random Forest made distinct predictions: while both successfully classified individuals into high- and low-expression groups, their predictions within each group showed minimal correlation, suggesting that the two models capture different underlying patterns.

https://www.science.org/doi/10.1126/science.1259037

The study validated in the original paper

Using AlphaGenome, we predicted that the mutations would activate a nearby gene called TAL1 by introducing a MYB DNA binding motif, which replicated the known disease mechanism and highlighted AlphaGenome’s ability to link specific non-coding variants to disease genes.

graph TD
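    A[Somatic mutation] --> B[Introduces MYB binding motif]
    B --> C[MYB transcription factor binds]
    C --> D[Recruits CBP<br/>H3K27 acetyltransferase]
    C --> E[Recruits transcription complex]
    E --> F[RUNX1]
    E --> G[GATA-3]
    E --> H[TAL1]
    D --> I[H3K27ac modification]
    I --> J[Super-enhancer formation]
    F --> J
    G --> J
    H --> J
    J --> K[TAL1 oncogene activation]
    K --> L[T-ALL onset]


    M[Endogenous super-enhancers] --> N[Occupied by MYB and CBP]
    N --> O[MYB broadly regulates<br/>super-enhancers]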

    style A fill:#ffcccc
    style K fill:#ff6666
    style L fill:#ff3333
    style J fill:#66ccff
    style I fill:#99ff99

Miscellaneous

Which is better to use, uv or conda?