Nonprofit · Open Source · catalyst.health

The infrastructure biological AI has been missing

The Catalyst Project is building the shared data infrastructure layer that makes AI-assisted biological discovery accessible to every researcher — regardless of institution size, technical background, or computational resources.

The Problem

The data exists. The knowledge does not.

Twelve researchers. Four institutions. Three to four years of work. One analytical pass over the data. The second question was never asked, not because the team lacked capability, but because the infrastructure didn't exist.

The biomedical research community has reached a consensus on the problem. The data is there. The researchers are there. What is missing is the infrastructure layer that makes the data computable — across labs, across data types, across institutions, at scale.

No shared normalization standard. Every lab processes data differently. Transcriptomic data from one institution cannot speak to proteomic data from another — even when both describe the same biological system.

AI-assisted research requires resources most labs don't have. The cost and expertise required to deploy AI on biological data at scale means it remains accessible to only a small fraction of the research community. The rest are left out.

Knowledge stays siloed. The relationships discoverable by AI — hiding in decades of accumulated biological data across institutions — remain invisible because nothing connects the dots.

The Platform

Three interconnected platforms. One plugin interface standard.

Built around standardized interface contracts — every component conforms, every component is swappable. We build the grid. Researchers build what they need — not what we decided they should want.

Normalization Infrastructure

The foundation of everything. A hosting platform for swappable, domain-specific normalization plugins — one per data type, built by the researchers who know that data best. Lossless translation throughout. Full human-in-the-loop validation, complete provenance, every step.

Platform 01
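The idea of swappable normalization plugins behind one interface contract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's published standard: `NormalizationPlugin`, `NormalizedRecord`, `PluginRegistry`, and the toy `UppercaseGeneSymbols` plugin are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass(frozen=True)
class NormalizedRecord:
    """Hypothetical output container: normalized values plus the untouched raw input."""
    data_type: str
    values: dict[str, Any]
    raw: dict[str, Any]   # raw input kept verbatim alongside the translation
    produced_by: str      # "plugin-name@version", for provenance


@runtime_checkable
class NormalizationPlugin(Protocol):
    """Hypothetical interface contract, not the published Catalyst standard."""
    name: str
    version: str
    data_type: str

    def normalize(self, raw: dict[str, Any]) -> NormalizedRecord: ...


class PluginRegistry:
    """Holds one plugin per data type; registering again swaps the old one out."""

    def __init__(self) -> None:
        self._plugins: dict[str, NormalizationPlugin] = {}

    def register(self, plugin: NormalizationPlugin) -> None:
        # isinstance() on a runtime_checkable Protocol only checks that the
        # required attributes and methods exist, not their types.
        if not isinstance(plugin, NormalizationPlugin):
            raise TypeError("plugin does not satisfy the NormalizationPlugin contract")
        self._plugins[plugin.data_type] = plugin

    def normalize(self, data_type: str, raw: dict[str, Any]) -> NormalizedRecord:
        return self._plugins[data_type].normalize(raw)


class UppercaseGeneSymbols:
    """Toy transcriptomics plugin: uppercases gene symbols, keeps the raw input."""
    name = "uppercase-gene-symbols"
    version = "0.1.0"
    data_type = "transcriptomics"

    def normalize(self, raw: dict[str, Any]) -> NormalizedRecord:
        values = {gene.upper(): count for gene, count in raw.items()}
        return NormalizedRecord(self.data_type, values, dict(raw), f"{self.name}@{self.version}")


registry = PluginRegistry()
registry.register(UppercaseGeneSymbols())
record = registry.normalize("transcriptomics", {"opa1": 12, "Sncg": 40})
print(record.values)  # {'OPA1': 12, 'SNCG': 40}
```

Because the registry depends only on the contract, a lab could register a different transcriptomics plugin under the same data type without touching the code that calls `normalize`.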

Biological Modeling Infrastructure

A hosting and plugin layer for Biological Foundation Models (BFMs). The architecture combines auto-encoders with a joint-embedding predictive architecture (JEPA): small specialized translation layers feed large-scale reasoning models. The BFM is the reasoning engine. The knowledge graph is the permanent memory. Independently built models interoperate through standardized interface contracts.

Platform 02

Application & Federation Infrastructure

Universal open API. Jupyter-native SDK. Natural language interface. Multi-institutional data federation under rigorous governance and provenance controls. Commercial products built on the API are explicitly welcome.

Platform 03

Core Principles

Plugin-based and swappable

Standardized interface contracts throughout. Researchers build components for their methodology. The platform runs all of them.

Lossless normalization

Translate, never compress. Every normalization step is documented, versioned, and fully traceable back to the raw source.
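One way to make every step traceable back to the raw source is to record, for each step, its name, its version, and a fingerprint of its input and output. The sketch below is an assumption about how such a trail could look, not a description of the Catalyst pipeline: `fingerprint`, `StepRecord`, `run_pipeline`, `traces_to_raw`, and the two toy steps are hypothetical. It checks traceability only and does not prove that a step is lossless.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable


def fingerprint(payload: Any) -> str:
    """SHA-256 of a canonical JSON encoding, so equal content gives equal hashes."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class StepRecord:
    """Hypothetical provenance entry for one normalization step."""
    step: str
    version: str
    input_hash: str
    output_hash: str


Step = tuple[str, str, Callable[[dict], dict]]  # (name, version, function)


def run_pipeline(raw: dict, steps: list[Step]) -> tuple[dict, list[StepRecord]]:
    """Apply each step in order and record what went in and what came out."""
    current, trail = raw, []
    for name, version, fn in steps:
        result = fn(current)
        trail.append(StepRecord(name, version, fingerprint(current), fingerprint(result)))
        current = result
    return current, trail


def traces_to_raw(raw: dict, trail: list[StepRecord]) -> bool:
    """True if the trail forms an unbroken chain starting at this raw input."""
    expected = fingerprint(raw)
    for record in trail:
        if record.input_hash != expected:
            return False
        expected = record.output_hash
    return True


# Toy steps: both rename keys and never drop a value.
def strip_key_whitespace(record: dict) -> dict:
    return {key.strip(): value for key, value in record.items()}


def rename_gene_column(record: dict) -> dict:
    return {("gene_symbol" if key == "gene" else key): value for key, value in record.items()}


steps = [
    ("strip-key-whitespace", "1.0.0", strip_key_whitespace),
    ("rename-gene-column", "1.0.0", rename_gene_column),
]
raw = {" gene ": "OPA1", "count": 12}
normalized, trail = run_pipeline(raw, steps)
print(normalized)                 # {'gene_symbol': 'OPA1', 'count': 12}
print(traces_to_raw(raw, trail))  # True
```

Each step returns a new dictionary instead of modifying its input, so the raw record stays available for comparison at any later point.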

Human-in-the-loop

AI accelerates. Humans decide. Active learning at every step — every human correction generates a permanent rule.
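The idea that a human correction becomes a permanent rule can be illustrated with a very small rule store. This is a deliberately simplified sketch and not an active-learning system: `CorrectionRule` and `RuleBook` are hypothetical names, and the exact-match rule is an assumption chosen to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CorrectionRule:
    """Hypothetical rule created from one reviewer correction."""
    field_name: str
    wrong_value: str
    corrected_value: str
    reviewer: str  # who made the decision, kept for provenance


@dataclass
class RuleBook:
    """Stores corrections and replays them on later records."""
    rules: list[CorrectionRule] = field(default_factory=list)

    def record_correction(self, field_name: str, wrong_value: str,
                          corrected_value: str, reviewer: str) -> CorrectionRule:
        rule = CorrectionRule(field_name, wrong_value, corrected_value, reviewer)
        self.rules.append(rule)
        return rule

    def apply(self, record: dict[str, str]) -> dict[str, str]:
        # Work on a copy so the incoming record is never modified.
        fixed = dict(record)
        for rule in self.rules:
            if fixed.get(rule.field_name) == rule.wrong_value:
                fixed[rule.field_name] = rule.corrected_value
        return fixed


book = RuleBook()
# A reviewer decides that the label "RGC" should be spelled out.
book.record_correction("cell_type", "RGC", "retinal ganglion cell", "reviewer-1")
print(book.apply({"cell_type": "RGC", "species": "mouse"}))
# {'cell_type': 'retinal ganglion cell', 'species': 'mouse'}
```

The reviewer makes the decision once, and the stored rule applies it to every later record with the same value.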

Versioned everything

Every output carries a permanent version ID. Reproducibility is not a feature — it is a requirement.

Methodology-neutral by design

Competing scientific approaches can build, deploy, and be empirically tested within shared infrastructure. The platform does not pick winners. The science does.

Proof of Concept Pilot

The first pilot — retinal ganglion cell biology, finally integrated

Retinal ganglion cell (RGC) research has generated decades of rich, heterogeneous biological data across institutions worldwide: transcriptomics, proteomics, imaging, electrophysiology, and more. It has never been integrated. Different labs, different formats, different conventions. The data exists. The connections between it do not. The pilot changes that by running this data through the full Catalyst pipeline for the first time, making it possible to ask questions that could not be asked before.

Scientific target: The molecular cascade from initial protein dysregulation to mitochondrial failure and retinal ganglion cell death in glaucoma — illuminated by normalizing and integrating data that has always contained the answer.

Deliverable 01 · Months 1–4

Heterogeneous Data Pool

Purpose-built normalization plugins for each data type in the pilot dataset. Islands connected. Decades of data speaking the same language for the first time.

Deliverable 02 · Months 3–6

RGC Biological Foundation Model

Lightweight BFM trained on the normalized data pool. Auto-encoder/JEPA architecture. Purpose-built for the RGC cascade question.

Deliverable 03 · Months 5–8

RGC Knowledge Graph Seed

Relationships discovered from the data — not pre-programmed. The first integrated, data-driven map of the RGC death cascade.

Deliverable 04 · Months 7–10

Open API — The Grid Goes Live

Any researcher, any institution, any question. The knowledge graph queryable by the field. The infrastructure becomes a public good.

Who We Are

Built at the intersection of biology, AI, and infrastructure

The Catalyst Project is led by Cynthia Steel, PhD, MBA, a research scientist with deep domain expertise in glaucoma biology, and Mike Steel, whose background spans systems architecture and organizational development. The project has been developed in close consultation with researchers and engineers working in computational biology, foundation model development, and large-scale data infrastructure.

BIOLOGY

Domain-anchored science

Glaucoma biology. Retinal ganglion cell research. Multimodal ophthalmic data. The scientific questions drive every architectural decision.

AI + ML

Frontier model expertise

Foundation model architecture. Auto-encoder and JEPA approaches. Large-scale biological AI training. The engineering decisions are grounded in what actually works.

INFRASTRUCTURE

Built-it-before experience

Large-scale ophthalmic data platforms. Clinical data governance. Federated data systems at institutional scale. We know what the problems are because we've hit them.

Open source. Nonprofit. No exceptions.

The grid is being built. Get involved.