Nonprofit · Open Source · catalyst.health

The infrastructure biological AI has been missing

The Catalyst Project is building the shared data infrastructure layer that makes AI-assisted biological discovery accessible to every researcher — regardless of institution size, technical background, or computational resources.

Our commitment

Open source. Nonprofit. No exceptions.

No licensing fees. No investor interests shaping what gets built. Community-owned standards, permanently — because infrastructure built for the scientific community belongs to the scientific community.

~40M People living with glaucoma worldwide
3–4 Years to extract one analytical pass from a single dataset
100% Open source. No licensing fees. No commercial strings.
0 Commercial conflicts. Ever.
The Problem

The data exists. The knowledge does not.

Twelve researchers. Four institutions. Three to four years of work. One analytical pass extracted. The second question was never asked, not because the team lacked capability, but because the infrastructure didn't exist.

The biomedical research community has reached a consensus on the problem. The data is there. The researchers are there. What is missing is the infrastructure layer that makes the data computable — across labs, across data types, across institutions, at scale.

No shared normalization standard. Every lab processes data differently. Transcriptomic data from one institution cannot speak to proteomic data from another — even when both describe the same biological system.

AI-assisted research requires resources most labs don't have. The cost and expertise required to deploy AI on biological data at scale means it remains accessible to only a small fraction of the research community. The rest are left out.

Knowledge stays siloed. The relationships discoverable by AI — hiding in decades of accumulated biological data across institutions — remain invisible because nothing connects the dots.

The Platform

Three interconnected platforms. One plugin interface standard.

Built around standardized interface contracts — every component conforms, every component is swappable. We build the grid. Researchers build what they need — not what we decided they should want.
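As a rough illustration of what a standardized interface contract could look like, here is a minimal Python sketch. Every name in it (`NormalizationPlugin`, `PluginRegistry`, `GeneSymbolCanonicalizer`, the `data_type` and `version` fields) is a hypothetical stand-in chosen for this example, not the project's published interface, and the sample record is made up.

```python
from typing import Any, Protocol, runtime_checkable

Record = dict[str, Any]


@runtime_checkable
class NormalizationPlugin(Protocol):
    """Hypothetical contract: what any plugin must expose to be hosted."""

    data_type: str  # e.g. "transcriptomics"
    version: str    # version of the plugin itself

    def normalize(self, raw: Record) -> Record:
        ...


class PluginRegistry:
    """Holds one plugin per data type; registering again swaps it out."""

    def __init__(self) -> None:
        self._plugins: dict[str, NormalizationPlugin] = {}

    def register(self, plugin: NormalizationPlugin) -> None:
        # Structural check: the object only needs the right attributes
        # and a normalize method, not a particular base class.
        if not isinstance(plugin, NormalizationPlugin):
            raise TypeError("plugin does not satisfy the interface contract")
        self._plugins[plugin.data_type] = plugin

    def normalize(self, data_type: str, raw: Record) -> Record:
        return self._plugins[data_type].normalize(raw)


class GeneSymbolCanonicalizer:
    """Example plugin: adds a canonical symbol and keeps the raw field."""

    data_type = "transcriptomics"
    version = "0.1.0"

    def normalize(self, raw: Record) -> Record:
        return {**raw, "gene_canonical": raw["gene"].upper()}


registry = PluginRegistry()
registry.register(GeneSymbolCanonicalizer())
result = registry.normalize("transcriptomics", {"gene": "optn", "count": 12})
```

Because the registry checks only the shape of a plugin, a lab could replace `GeneSymbolCanonicalizer` with its own implementation for the same data type without touching the hosting code, which is the swappability the contract is meant to provide.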

01
Normalization Infrastructure
The foundation of everything. A hosting platform for swappable, domain-specific normalization plugins — one per data type, built by the researchers who know that data best. Lossless translation throughout. Full human-in-the-loop validation, complete provenance, every step.
02
Biological Modeling Infrastructure
A hosting and plugin layer for Biological Foundation Models (BFMs). The architecture combines auto-encoders with JEPA (Joint-Embedding Predictive Architecture): small specialized translation layers feed large-scale reasoning models. The BFM is the reasoning engine. The knowledge graph is the permanent memory. Independently built models interoperate through standardized interface contracts.
03
Application & Federation Infrastructure
Universal open API. Jupyter-native SDK. Natural language interface. Multi-institutional data federation under rigorous governance and provenance controls. Commercial products built on the API are explicitly welcome.
Core Principles
Plugin-based and swappable
Standardized interface contracts throughout. Researchers build components for their methodology. The platform runs all of them.
Lossless normalization
Translate, never compress. Every normalization step is documented, versioned, and fully traceable back to the raw source.
Human-in-the-loop
AI accelerates. Humans decide. Active learning at every step — every human correction generates a permanent rule.
Versioned everything
Every output carries a permanent version ID. Reproducibility is not a feature — it is a requirement.
Methodology-neutral by design
Competing scientific approaches can build, deploy, and be empirically tested within shared infrastructure. The platform does not pick winners. The science does.
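The lossless-normalization and versioning principles above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: `content_id`, `ProvenanceStep`, `run_step`, and `version_id` are hypothetical names, records are assumed to be JSON-serializable dictionaries, and "lossless" is approximated here as "every raw field survives each step unchanged".

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable

Record = dict[str, Any]


def content_id(obj: Any) -> str:
    """Deterministic SHA-256 hex digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ProvenanceStep:
    """One documented step: what ran, on which input, giving which output."""

    step_name: str
    step_version: str
    input_id: str
    output_id: str


def run_step(
    record: Record,
    name: str,
    version: str,
    fn: Callable[[Record], Record],
    trail: list[ProvenanceStep],
) -> Record:
    """Apply one normalization step and append its provenance to the trail."""
    output = fn(record)
    # Assumed lossless rule for this sketch: a step may add fields,
    # but may not drop or alter any field it received.
    for key, value in record.items():
        if key not in output or output[key] != value:
            raise ValueError(f"step {name!r} dropped or altered field {key!r}")
    trail.append(
        ProvenanceStep(name, version, content_id(record), content_id(output))
    )
    return output


def version_id(trail: list[ProvenanceStep]) -> str:
    """Version ID derived from the whole trail, so identical runs match."""
    return content_id([asdict(step) for step in trail])


raw = {"gene": "optn", "count": 12}  # made-up example record
trail: list[ProvenanceStep] = []
normalized = run_step(
    raw,
    "add_canonical_symbol",
    "0.1.0",
    lambda r: {**r, "gene_canonical": r["gene"].upper()},
    trail,
)
output_version = version_id(trail)
```

Because the first step's `input_id` is the hash of the raw record, any output can be traced back to its source, and rerunning the same steps on the same data yields the same version ID.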
Proof of Concept Pilot

The first pilot — retinal ganglion cell biology, finally integrated

Retinal ganglion cell research has generated decades of rich, heterogeneous biological data across institutions worldwide — transcriptomics, proteomics, imaging, electrophysiology, and more. It has never been integrated. Different labs, different formats, different conventions. The data exists. The connections between it do not. The pilot changes that — running this data through the full Catalyst pipeline for the first time, to ask questions that have never been askable.

Scientific target: The molecular cascade from initial protein dysregulation to mitochondrial failure and retinal ganglion cell death in glaucoma — illuminated by normalizing and integrating data that has always contained the answer.
Deliverable 01 · Months 1–4
Heterogeneous Data Pool
Purpose-built normalization plugins for each data type in the pilot dataset. Islands connected. Decades of data speaking the same language for the first time.
Deliverable 02 · Months 3–6
RGC Biological Foundation Model
Lightweight BFM trained on the normalized data pool. Auto-encoder/JEPA architecture. Purpose-built for the RGC cascade question.
Deliverable 03 · Months 5–8
RGC Knowledge Graph Seed
Relationships discovered from the data — not pre-programmed. The first integrated, data-driven map of the RGC death cascade.
Deliverable 04 · Months 7–10
Open API — The Grid Goes Live
Any researcher, any institution, any question. The knowledge graph queryable by the field. The infrastructure becomes a public good.
Who We Are

Built at the intersection of biology, AI, and infrastructure

The Catalyst Project is led by Cynthia Steel, PhD, MBA, a research scientist with deep domain expertise in glaucoma biology, and Mike Steel, whose background spans systems architecture and organizational development. The project has been developed in close consultation with researchers and engineers at the frontier of computational biology, foundation model development, and large-scale data infrastructure.

BIOLOGY
Domain-anchored science
Glaucoma biology. Retinal ganglion cell research. Multimodal ophthalmic data. The scientific questions drive every architectural decision.
AI + ML
Frontier model expertise
Foundation model architecture. Auto-encoder and JEPA approaches. Large-scale biological AI training. The engineering decisions are grounded in what actually works.
INFRASTRUCTURE
Built-it-before experience
Large-scale ophthalmic data platforms. Clinical data governance. Federated data systems at institutional scale. We know what the problems are because we've hit them.

The grid is being built. Get involved.

Whether you are a researcher, an institution, a funder, or someone who has run into this infrastructure wall yourself — we want to hear from you.

Contact us → Learn more about the platform