
Bio AI Tools Playbook

Research Doc #9 — April 2026

GPT-Rosalind, Amazon Bio Discovery, and Anthropic's Coefficient Bio acquisition — what they are, how they work, and exactly how to use them to accelerate Open Enzyme.

Open Enzyme Project • Brian Abent • April 21, 2026


Part 01 — The Bio AI Tools Landscape

Three major AI biology announcements landed in a two-week window in April 2026. This is not coincidence — it's the convergence of transformer architectures, massive protein/genomic datasets, and the realization by every major AI lab that biology is the highest-value frontier. Here's what each tool is and what it means for you.

| Metric | Value |
|---|---|
| Major bio AI launches in 14 days | 3 |
| AI biology models on Bio Discovery | 40+ |
| Drug candidates generated by Rosalind at MSK | ~300K |
| Anthropic paid for Coefficient Bio | $400M |

GPT-Rosalind (OpenAI)

OpenAI • Launched April 17, 2026

Named after Rosalind Franklin, this is OpenAI's first domain-specific reasoning model — a frontier-class model that was trained directly on biological data including proteins, genes, chemical reactions, molecular pathways, and disease biology. It doesn't just answer biology questions better than GPT-5.4; it reasons differently because its training incorporated the structure of biological knowledge at a fundamental level.

Think of it this way: regular GPT-5.4 has read every biology paper and can discuss biology. Rosalind was trained to think in biological primitives — amino acid sequences, gene regulatory networks, metabolic pathways, protein-protein interactions — the way you think in code architecture. It reasons across biological layers simultaneously: from DNA sequence to protein structure to metabolic function to phenotypic outcome.

What It Can Do: Protein structure prediction, codon optimization, gene expression analysis, molecular interaction prediction, drug candidate generation, sequence-to-function interpretation, experimental planning, multi-step literature review with database cross-referencing

Trained On: 50 of the most common biological workflows, with native access to major public biological databases (PDB, UniProt, NCBI, KEGG, ChEMBL, etc.)

Access Model: Trusted Access Program — limited to qualified US enterprise customers. Requires: legitimate research with clear public benefit, governance and misuse-prevention controls, approved users in secure environments. Currently a research preview.

Pricing: Free during research preview — no token deduction against existing API credits. Available in ChatGPT, Codex, and the API for approved customers. Broader pricing TBD.

Note: Key detail for Open Enzyme: OpenAI also shipped a free Codex plugin the same day that connects to 50+ biology databases and works with the GPT-5.4 model everyone already has access to. This means even without Rosalind access, you can use GPT-5.4 + the Codex Life Sciences plugin for most of the prompts in this document. The plugin adds database connectivity; Rosalind adds deeper biological reasoning.

MSK Case Study: What "Reasoning About Biology" Looks Like

Memorial Sloan Kettering Cancer Center used GPT-Rosalind to generate nearly 300,000 novel antibody molecules for pediatric cancer, then used the model's ranking capabilities to narrow that to the top ~100,000 candidates sent to Twist Bioscience for physical synthesis and testing. The process that traditionally takes up to a year was completed in weeks. This is the same pattern Open Enzyme will use: AI-generated candidates → ranked by predicted performance → synthesized and tested.

How to Get Access

The Trusted Access Program requires organizations to demonstrate legitimate research with clear public benefit. Launch partners include Amgen, Moderna, Thermo Fisher Scientific, the Allen Institute, Oracle Health and Life Sciences, NVIDIA, Benchling, and UCSF School of Pharmacy. For Open Enzyme, the open source + therapeutic enzyme + citizen science angle is a genuinely compelling public-benefit argument. Apply at openai.com — the research preview is free, so the barrier is qualification, not cost.

Note: Immediate action: Even without Rosalind access, the free Codex Life Sciences plugin gives you database-connected biology reasoning today. Install it and start running the prompts in Part 3 with GPT-5.4. When Rosalind access comes through, re-run them for deeper biological reasoning and compare the results.


Amazon Bio Discovery (AWS)

AWS • Launched April 14, 2026

Amazon Bio Discovery is fundamentally different from Rosalind. Where Rosalind is a single reasoning model that you prompt in natural language, Bio Discovery is a platform of 40+ specialized AI models wrapped in an agentic interface. You describe your research goal, and an AI agent selects the right models, chains them together, optimizes inputs, evaluates outputs, and can even route candidates to physical lab partners for synthesis and testing.

Think of it as the AWS for biology: a managed service where each model is a specialist (one predicts binding affinity, another assesses developability, another generates sequences) and the agent orchestrates them into a coherent pipeline. You bring the question; the platform assembles the workflow.

Models Included: 40+ specialized models from partners including Apheris and Boltz, with Biohub and Profluent coming. Covers protein language models, molecular generation, binding affinity prediction, developability assessment, and genomics analysis. Users can also upload proprietary models.

Agent Interface: AI agent helps select models for research goals, optimizes inputs, evaluates candidates, and routes results back for iteration. No-code interface for scientists — describe what you want, the agent assembles the pipeline.

Lab-in-the-Loop: Integrated lab partners: Twist Bioscience, Ginkgo Bioworks (A-Alpha Bio coming soon). Send candidates directly for physical synthesis and testing; results route back for AI model refinement. Transparent pricing and turnaround times.

Pricing: Pro plan: $486/month (early access rate) includes 15 Experiment Units. Additional experiments at $32.40/EU. Free trial of 5 experiments to start. Consumption depends on candidates, sequence length, and model selection.
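As a rough budgeting aid, the pricing above can be turned into a small calculator. This is a sketch that assumes only the quoted early-access figures ($486/month including 15 Experiment Units, $32.40 per additional EU). It ignores the free trial, taxes, and any lab-partner synthesis fees, and the function name is illustrative rather than part of any AWS API.

```python
# Sketch of the Pro-plan arithmetic quoted above (early-access rates).
# Amounts are kept in integer cents to avoid floating-point rounding.
PRO_BASE_CENTS = 48_600   # $486.00 per month
INCLUDED_EUS = 15         # Experiment Units included in the base fee
EXTRA_EU_CENTS = 3_240    # $32.40 per additional Experiment Unit


def monthly_cost_usd(experiment_units: int) -> float:
    """Return the estimated monthly Pro-plan bill in dollars."""
    if experiment_units < 0:
        raise ValueError("experiment_units must be non-negative")
    extra = max(0, experiment_units - INCLUDED_EUS)
    return (PRO_BASE_CENTS + extra * EXTRA_EU_CENTS) / 100


if __name__ == "__main__":
    for eus in (5, 15, 20, 30):
        print(f"{eus:>3} EUs -> ${monthly_cost_usd(eus):,.2f}")
```

Because actual EU consumption depends on candidate count, sequence length, and model selection, the number of EUs per experiment is the input you would need to estimate from the platform itself.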

Note: Open Enzyme relevance: Bio Discovery's strength is the lab-in-the-loop pipeline. Once you've designed your uricase variant or optimized your codon sequence using Rosalind or Bio Discovery's models, you can route directly to Twist for gene synthesis and Ginkgo for organism engineering — all from one platform. The $486/month price point is within reach for a project spending $1,200 total on its first experimental round. But the free trial of 5 experiments may be enough to validate your first construct design.

Rosalind vs. Bio Discovery: When to Use Which

| Dimension | GPT-Rosalind | Amazon Bio Discovery |
|---|---|---|
| Architecture | Single frontier reasoning model | Platform of 40+ specialized models with agent orchestration |
| Interface | Natural language prompts (ChatGPT, API, Codex) | No-code agent interface + model catalog |
| Best For | Open-ended reasoning, literature synthesis, experimental design, "what if" exploration, cross-domain biological reasoning | Structured workflows, model benchmarking, candidate screening, lab integration |
| Lab Connection | None (outputs are text/analysis) | Direct integration with Twist, Ginkgo, A-Alpha Bio |
| Pricing | Free during research preview | $486/mo or free trial (5 experiments) |
| Open Enzyme Play | Use for design decisions, variant selection, mutation suggestions, risk assessment, literature validation | Use for candidate screening, gene synthesis ordering, and when you need multiple models benchmarked against each other |

Anthropic / Coefficient Bio

Acquired April 3, 2026 • ~$400M (stock deal)

Coefficient Bio was a stealth biotech AI startup founded in 2025 with around 10 people, but very specific people: CEO Aris Theologis, CTO Nathan Frey (formerly principal scientist at Biogen), and co-founder Joyce Hong (5 years as a principal at Roivant Sciences). The team brings deep computational biology expertise, particularly in protein design and biomolecule modeling.

Anthropic paid $400M in stock — roughly $40M per person — because this team brings the domain-specific knowledge to build specialized life sciences agents inside Claude. The Coefficient team is joining Anthropic's health and life science division.

Anthropic's Bio Timeline:

- Oct 2025: Claude for Life Sciences launched, with connectors to Benchling, PubMed, BioRender, Synapse.org
- Jan 2026: Claude for Healthcare (HIPAA-ready, ICD-10, CMS database integration)
- Apr 2026: Coefficient Bio acquired; protein design and biomolecule modeling expertise incoming

What This Means for Claude: Expect Claude to gain specialized biological reasoning capabilities — particularly around protein design, drug discovery, and clinical trial planning. The Coefficient team's Biogen/Roivant experience means Claude's bio capabilities will be grounded in real pharma workflows, not just academic benchmarks.

Note: Timeline speculation: Based on the pattern (Life Sciences in Oct, Healthcare in Jan, Coefficient acquired in Apr), expect Claude-native biological reasoning features by Q3-Q4 2026. This could mean protein structure tools built directly into Claude, similar to how Rosalind works but integrated into the Claude ecosystem you're already using for the Open Enzyme project. For now, Claude with the Life Sciences connectors (Benchling, PubMed) is your best Anthropic-side tool.


Hugging Science (Hugging Face AI-for-science index)

Hugging Face • resource index surfaced 2026-05-19

Hugging Science is not one model. It is a curated, machine-readable index of open AI-for-science datasets, models, benchmarks, and blog posts across biology, chemistry, medicine, genomics, scientific reasoning, materials, physics, and related fields. The operationally useful feature is that the site exposes topic markdown files and an /llms.txt index, so an agent can query the catalog directly instead of hand-browsing the Hugging Face Hub.
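To make the "agent can query the catalog directly" point concrete, here is a minimal sketch of parsing an llms.txt-style index once its text has been fetched. It assumes the common llms.txt layout (markdown headings plus `- [title](url): note` link lines). The sample text, URLs, and function names below are hypothetical stand-ins, not the real Hugging Science catalog.

```python
import re
from typing import NamedTuple


class CatalogLink(NamedTuple):
    section: str   # nearest heading above the link
    title: str
    url: str
    note: str      # free-text description after the colon, if any


# Matches markdown list items of the form "- [title](url): optional note".
LINK_RE = re.compile(
    r"^\s*[-*]\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text: str) -> list[CatalogLink]:
    """Collect every link line, remembering which heading it sits under."""
    section = ""
    links = []
    for line in text.splitlines():
        if line.startswith("#"):
            section = line.lstrip("#").strip()
            continue
        match = LINK_RE.match(line)
        if match:
            note = (match["note"] or "").strip()
            links.append(CatalogLink(section, match["title"], match["url"], note))
    return links


def find_topics(links: list[CatalogLink], keyword: str) -> list[CatalogLink]:
    """Case-insensitive keyword search over titles and notes."""
    kw = keyword.lower()
    return [l for l in links if kw in l.title.lower() or kw in l.note.lower()]


# Hypothetical sample; the real index would be downloaded from the site.
SAMPLE = """\
# Example Science Index

## Topics
- [Protein engineering](https://example.org/topics/protein.md): stability and design models
- [ADMET](https://example.org/topics/admet.md): toxicity and CYP inhibition datasets
"""

if __name__ == "__main__":
    for hit in find_topics(parse_llms_txt(SAMPLE), "cyp"):
        print(hit.section, "|", hit.title, "|", hit.url)
```

Hits from a search like this are leads to read, in line with the caution that catalog entries are not validated evidence.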

Open Enzyme relevance: this is a capability-discovery layer. Use it when an open question needs "what open model or dataset exists for this axis?" before launching a comp-NNN or wet-lab spend. Treat catalog entries as leads, not validated OE evidence.

High-priority OE mappings:

| OE question class | Hugging Science capability to inspect first | Why it matters |
|---|---|---|
| FDA-approved repurposing / target overlap | Eve Bio drug-target activity, TxGemma, AQAffinity, SAIR, CoLiPRI | Second-pass screen for compounding-pharmacy candidates, ABCG2 Q141K chaperones, C5aR1 / NLRP3 small-molecule gaps. |
| ADMET / drug-drug interaction risk | OpenADMET challenge data, CYP inhibition model, PXR activation model, MIST toxicity / side-effect / BBB models | Fast safety triage for disulfiram, zileuton, supplement-stack compounds, and any newly surfaced repurposing hit. |
| Host-cell expression perturbation | Ginkgo DRUG-seq, Tahoe-100M / Tahoe-x1, X-Atlas, Perturb-Sapiens, STACK / TEDDY | Could address expression-modulation questions such as ABCG2 regulation, complement-regulator upregulation, and compound-driven transcriptomic effects before ordering assays. |
| Cassette / promoter / sequence design | Evo-2, Nucleotide Transformer, AlphaGenome, PromoterGPT, ChatNT | Candidate second-opinion layer for promoter / 5' UTR / coding-sequence design. Stronger for human and broad genomic contexts than for A. oryzae-specific expression until benchmarked. |
| Protein / enzyme engineering | ESM-2, OpenFold3, Boltz / AQAffinity, ThermoGFN-IF, RFdiffusion / ProteinMPNN-related guides | Follow-up layer for uricase acid stability, lactoferrin linker redesign, DAF SCR1-4 folding, and enzyme thermostability hypotheses. |
| Agent benchmark discipline | FutureHouse Lab-Bench / BixBench, BioMysteryBench, ChemBench | Useful for evaluating whether an agent workflow is competent at lab-protocol and scientific-reasoning tasks before trusting it with comp-NNN authoring. |

Immediate triage queue (no new wiki page yet):

  1. OpenADMET / CYP / PXR pass on compounding candidates. Run disulfiram, zileuton, colchicine custom-dose candidates, and top supplement-stack compounds through the open ADMET models as a safety-side screen. Output belongs in compounding-pharmacy-track.md or the relevant compound pages, not here.
  2. SAIR / AQAffinity / CoLiPRI second opinion for comp-032. Re-score the ABCG2 Q141K pharmacological-chaperone shortlist against protein-ligand structure / affinity models before wet-lab trafficking assay spend.
  3. Perturbation-atlas query for ABCG2 and complement-regulator expression. Search Ginkgo / Tahoe / X-Atlas style resources for compounds that upregulate ABCG2, DAF/CD55, Factor H, CD59, clusterin, or CR1 in epithelial / immune-relevant contexts.
  4. ThermoGFN-IF / ESM-2 follow-up on uricase stability. Compare whether the existing OPT-1 / SB-1 uricase engineering set is supported by newer thermostability-directed protein design priors.
  5. Hugging Science sweep before new comp-NNN briefs. Add a brief step: query /llms.txt and the relevant topic file for open datasets/models before deciding that a question needs a bespoke analysis.

Limitations: Hugging Science is a curated index, not a validation authority. Every resource still needs license review, benchmark fit review, local reproducibility check, and the normal Open Enzyme pre-commit grep-verify gate before a model output becomes load-bearing evidence.


Part 01b — Open Source Protein AI Tools (Available Now)

The commercial tools above are impressive, but Rosalind requires institutional access and Bio Discovery costs $486/month. The good news: there's a mature ecosystem of open source protein AI tools that you can run today, for free, on Google Colab or a local GPU. Several of these are the actual models that power platforms like Bio Discovery under the hood. For a citizen science project on a $1,200 budget, these are not fallbacks — they're your primary computational toolkit.

Tier 1: Structure Prediction (Know What Your Enzyme Looks Like)

AlphaFold 2 / ColabFold

What it does: Predicts 3D protein structure from amino acid sequence with near-experimental accuracy. ColabFold is the community wrapper that lets you run AlphaFold 2 on Google Colab for free — no local GPU required.

Why you need it: Before you can reason about acid stability, protease cleavage sites, or active site mutations, you need the 3D structure of your uricase variant. If there's no crystal structure in PDB, AlphaFold gives you a reliable predicted structure to work from.

Open Enzyme use: Generate structures for each uricase variant candidate (A. flavus, C. utilis, Arthrobacter, etc.). Use the predicted structures as input for stability analysis, docking, and mutation design. Compare predicted structures to known crystal structures (PDB: 1R51 for A. flavus uricase) as a sanity check.

Access: ColabFold notebook — free, runs in browser. Local install also available via pip install colabfold.

Hardware: Google Colab free tier is sufficient for single-chain predictions up to ~1000 residues. Uricase monomers are ~300 residues — well within range.

Limitation: AF2 predicts static structures. It won't tell you how the protein moves or unfolds at low pH. For that, you need molecular dynamics (see below) or stability prediction tools.


Boltz-2 (MIT License)

What it does: Predicts 3D structures of biomolecular complexes — protein-protein, protein-ligand, protein-DNA/RNA. The first open source model to approach AlphaFold 3 accuracy, and the first deep learning model to approach physics-based free-energy perturbation (FEP) accuracy for binding affinity prediction while running 1000× faster.

Why you need it: AlphaFold 2 predicts single proteins well but struggles with complexes. Boltz-2 handles the cases AF2 can't: uricase binding uric acid (protein-ligand), uricase tetramer assembly (protein-protein), and pepsin docking against your engineered variant (protease-substrate complex).

Open Enzyme use:

- Predict uricase tetramer assembly (the active form is a homotetramer; does your engineered variant still assemble correctly?)
- Model uric acid binding in the active site after mutations
- Predict NLRP3 inhibitor binding to the NACHT domain for compound screening (Prompt 7)
- Assess whether mutations near the subunit interface disrupt quaternary structure

Access: GitHub (MIT License) — fully open, commercially usable. Weights freely downloadable.

Hardware: Needs a GPU with ≥16GB VRAM for complex predictions. Google Colab Pro ($10/month) works. A local RTX 3090/4090 is ideal.


ESMFold (Meta, Open Source)

What it does: Predicts protein structure from sequence alone — like AlphaFold but ~60× faster because it skips the multiple sequence alignment (MSA) step. Trades a small amount of accuracy for massive speed.

Why you need it: When you're screening dozens of uricase variants or mutation combinations, running full AlphaFold on each one takes hours. ESMFold gives you a structure in seconds. Use it for rapid triage, then run AlphaFold on your top candidates for higher accuracy.

Open Enzyme use: Rapid structural screening of engineered variants. Run 50 mutation combinations through ESMFold in an afternoon, identify the 5 that look structurally sound, then validate those 5 with AlphaFold/Boltz.
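The "screen 50, keep 5" funnel can be sketched as a ranking by mean pLDDT (the per-residue confidence score). This assumes each predicted structure is a PDB-format string with pLDDT stored in the B-factor column, as ColabFold/AlphaFold outputs do; check the scale and location for whichever ESMFold interface you use. Function names are illustrative, and mean pLDDT is only one possible triage criterion.

```python
from statistics import mean


def mean_plddt(pdb_text: str) -> float:
    """Average the B-factor column over C-alpha atoms of a PDB string."""
    scores = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name in 13-16, B-factor in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha atoms found")
    return mean(scores)


def shortlist(structures: dict[str, str], keep: int = 5) -> list[tuple[str, float]]:
    """Rank variants by mean pLDDT (highest first) and keep the top few.

    `structures` maps a variant name to its predicted PDB text.
    """
    ranked = sorted(
        ((name, mean_plddt(pdb)) for name, pdb in structures.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:keep]
```

A high mean pLDDT says the model is confident in the fold, not that the variant is active or acid-stable, so the shortlist is a filter for the slower AlphaFold/Boltz validation step rather than a final answer.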

Access: Hugging Face or GitHub. Also available as a free web server at ESM Metagenomic Atlas.

Hardware: Runs on Google Colab free tier for proteins under ~400 residues.


Protenix-v1 (Open Source, ByteDance)

What it does: The first fully open source structure prediction model to outperform AlphaFold 3, strictly matching AF3's training data cutoff and model size. Protenix-v1 handles proteins, nucleic acids, ligands, and their complexes.

Open Enzyme use: Alternative to Boltz-2 for complex predictions. Particularly useful if you want a second opinion on protein-ligand docking poses or multimer assembly.

Access: GitHub — open source with weights.


Tier 2: Protein Language Models (Understand What Mutations Do)

ESM-2 (Meta, Open Source)

What it does: The workhorse protein language model. Trained on 250 million protein sequences, ESM-2 learns the "grammar" of proteins — which amino acids belong where, how mutations affect fold stability, and which residues are evolutionarily conserved (and therefore probably important). Available in sizes from 8M to 15B parameters.

Why you need it: ESM-2 embeddings are the foundation for almost everything else — variant effect prediction, stability estimation, function annotation, and more. Think of it as the GPT-4 of proteins: a general-purpose representation that downstream tools build on.

Open Enzyme use:

- Variant effect prediction: Score every possible single amino acid substitution in your uricase and rank by predicted deleteriousness. Mutations at highly conserved positions (high ESM-2 log-likelihood) are risky; mutations at variable positions are safer.
- Zero-shot fitness prediction: ESM-2's log-likelihood scores correlate with experimental fitness measurements. Use this to pre-screen mutation libraries before ordering gene synthesis.
- Embeddings for downstream models: Feed ESM-2 embeddings into stability predictors (SPURS, RaSP) and function predictors.

Access: Hugging Face — the 650M parameter model is the sweet spot of accuracy vs. compute. Full model family on GitHub.

Hardware: The 650M model runs on Google Colab free tier. The 15B model needs multi-GPU.

Key capability for Open Enzyme: Run esm.pretrained.esm2_t33_650M_UR50D() → feed in your uricase sequence → extract per-residue log-likelihoods → every position where the wild-type amino acid has low likelihood is a candidate for improvement. Every position where it has high likelihood is one you should not touch.
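The scoring step in that recipe can be sketched without the model itself. The code below assumes you already have a per-residue logits matrix, one row per residue and one column per standard amino acid. In practice this would be sliced from the ESM-2 output after dropping special tokens and mapping columns through the model's alphabet; that extraction, the column order in `AMINO_ACIDS`, and the function names are assumptions of this sketch. Scoring the unmasked sequence in one pass is also only one of several strategies.

```python
import math

# Assumed column order for the logits matrix (20 standard amino acids).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def wildtype_log_likelihoods(sequence, logits):
    """Log-probability the model assigns to the wild-type residue at each position."""
    if len(sequence) != len(logits):
        raise ValueError("need exactly one logits row per residue")
    return [
        log_softmax(row)[AMINO_ACIDS.index(residue)]
        for residue, row in zip(sequence, logits)
    ]


def rank_positions(sequence, logits):
    """Return (1-based position, residue, score), least likely wild-type first."""
    scores = wildtype_log_likelihoods(sequence, logits)
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return [(i + 1, sequence[i], scores[i]) for i in order]
```

Positions at the top of `rank_positions` are the "candidate for improvement" sites described above; positions at the bottom are the ones the model considers strongly constrained.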


ESM-C (EvolutionaryScale, API + Open Weights for Small Model)

What it does: Drop-in replacement for ESM-2 with better performance at lower compute cost. The 300M parameter ESM-C matches ESM-2 650M performance. Primarily available via API, but the small model weights are open.

Open Enzyme use: Same as ESM-2, but faster and cheaper if you're running many sequences. The API has a free tier for academic research.

Access: EvolutionaryScale API with free academic tier. Open weights for ESM-C 300M on GitHub.


ESM-3 (EvolutionaryScale, Open Small Model)

What it does: The next generation — a multimodal generative model that reasons over sequence, structure, and function simultaneously. ESM-3 can generate novel proteins conditioned on desired properties. The landmark result: ESM-3 generated a novel green fluorescent protein (esmGFP) with only 58% sequence identity to any known fluorescent protein — equivalent to simulating ~500 million years of evolution.

Why you care: ESM-3 can generate protein sequences conditioned on structural and functional constraints. In principle, you could ask it to generate uricase variants that maintain the active site geometry while optimizing for acid stability — the multi-objective optimization problem from Prompt 4.

Open Enzyme use: Experimental — use ESM-3-open to explore the sequence space around your uricase, conditioned on maintaining the active site configuration. This is more speculative than ESM-2 variant scoring but could surface non-obvious solutions that rational design misses.

Access: ESM-3-open (small model) on GitHub under non-commercial license. Larger models available via API with academic free tier.

Caveat: ESM-3 is powerful but newer and less battle-tested than ESM-2 for practical enzyme engineering. Use ESM-2 for scoring and triage; use ESM-3 for exploratory generation.


Tier 3: Protein Design & Engineering (Build Better Enzymes)

RFdiffusion2 (Baker Lab, Open Source)

What it does: The most powerful open source tool for de novo enzyme design. RFdiffusion2 generates entirely new protein structures using diffusion (the same approach as image-generation models like Stable Diffusion, but in 3D protein space). The breakthrough feature: you can describe an enzyme active site — just the arrangement of catalytic residues and substrate — and RFdiffusion2 will design a complete protein scaffold around it.

Why this is a big deal for Open Enzyme: Instead of starting from a natural uricase and trying to make it acid-stable (which might be fighting against the protein's fundamental fold), you could in principle design a completely new protein scaffold that houses the uricase active site geometry but is inherently acid-stable. This is the de novo enzyme design approach.

Experimental validation: From 96 designs tested for a benchmark enzyme, the best had catalytic efficiency of 53,000 M⁻¹s⁻¹. Published in Nature Methods (2025).

Open Enzyme use:

- Conservative: Use RFdiffusion2 to explore alternative scaffolds for the uricase active site, that is, proteins that might be inherently more acid-stable than the A. flavus parent
- Aggressive: Design a completely novel uricase with the active site geometry of A. flavus but a new fold optimized for GI survival
- Practical first step: Use the Metallohydrolase Design tutorial on GitHub as a template; it includes Jupyter notebooks walking through the entire enzyme design pipeline

Access: GitHub (BSD License) — fully open, commercially usable.

Hardware: Needs a decent GPU (≥16GB VRAM). Google Colab Pro works for small designs.

Honest assessment: De novo enzyme design is cutting-edge and high-risk. For Open Enzyme Phase 1, the safer bet is engineering a natural uricase (Prompts 1-4). But RFdiffusion2 is the tool for Phase 2 if natural variants don't perform well enough.


ProteinMPNN (Baker Lab, Open Source)

What it does: Given a protein backbone structure, ProteinMPNN designs amino acid sequences that will fold into that structure. Think of it as the inverse of structure prediction: instead of "sequence → structure," it's "structure → sequence."

Why you need it: If you modify a uricase structure (adding disulfide bonds, repacking the core for acid stability), you need to know what sequence will actually fold into your modified structure. ProteinMPNN solves this. It's also the standard tool for validating RFdiffusion2 outputs — after generating a new scaffold, you use ProteinMPNN to design a sequence, then AlphaFold to predict whether that sequence actually folds as intended.

Open Enzyme use:

- Design sequences for structurally modified uricase variants
- After using RFdiffusion2 to generate scaffolds, use ProteinMPNN to design the actual sequences you'd order for synthesis
- Explore sequence diversity: ProteinMPNN can generate multiple different sequences for the same target structure, giving you backup candidates

Access: GitHub (MIT License). Also available as a Colab notebook.


Tier 4: Stability & Variant Effect Prediction (Will It Survive the Gut?)

SPURS (Open Source, 2025)

What it does: Predicts protein stability changes (ΔΔG) upon mutation by rewiring pre-trained protein generative models. Benchmarked across 12 diverse datasets and consistently outperforms previous state-of-the-art methods for both thermostability and melting temperature prediction.

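Open Enzyme use: This is directly relevant to Prompt 4 (Protein Engineering for Oral Delivery). For every candidate mutation, SPURS predicts the stability change. Under the convention used in this document, positive ΔΔG = destabilizing (bad) and negative ΔΔG = stabilizing (good); sign conventions differ between tools and datasets, so confirm which one a tool reports before ranking. Use it to rank your engineered variants before ordering synthesis.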

Access: GitHub — open source. Also available via Nature Communications paper.


RaSP (Rapid Stability Prediction)

What it does: Predicts stability changes from mutations using deep learning representations. Performs on par with biophysics-based methods like FoldX but runs in less than a second per residue — enabling full saturation mutagenesis stability scans in minutes.

Open Enzyme use: Run a complete saturation mutagenesis scan on your uricase: predict the stability effect of every possible single amino acid change at every position. This gives you a complete stability landscape — find all the positions where mutations improve stability, cross-reference with the protease cleavage sites from Prompt 3, and identify the mutations that improve stability without touching the active site.
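The input to such a scan is simply the list of every single substitution. A minimal sketch of generating it, with an option to exclude positions you do not want touched (for example active-site residues), is shown below. The `A12G`-style labels and the `protected` parameter are conventions chosen for this sketch, not a format required by RaSP.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def saturation_mutants(sequence, protected=frozenset()):
    """Yield every single substitution as a label like 'M1A'.

    `protected` holds 1-based positions to skip entirely.
    """
    for position, wild_type in enumerate(sequence, start=1):
        if position in protected:
            continue
        for replacement in AMINO_ACIDS:
            if replacement != wild_type:
                yield f"{wild_type}{position}{replacement}"


if __name__ == "__main__":
    toy = "MKWVTF"  # toy fragment, not a real uricase sequence
    library = list(saturation_mutants(toy, protected={1}))
    print(len(library), "single mutants, e.g.", library[:3])
```

For a monomer of roughly 300 residues this comes to about 300 × 19 = 5,700 single mutants, which is why per-residue prediction speed matters.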

Access: GitHub — open source.


FoldX (Free for Academics)

What it does: Physics-based protein stability and interaction energy calculations. Predicts ΔΔG for point mutations, models protein-protein interactions, and evaluates the energetic effects of engineering changes. The industry standard for stability engineering for over a decade.

Open Enzyme use: Cross-validate AI predictions. If both SPURS and FoldX agree that a mutation is stabilizing, your confidence increases. FoldX also models pH-dependent stability — directly relevant to predicting acid survival in the stomach.
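The "both tools agree" rule can be written as a small filter. This sketch assumes both predictors report ΔΔG in the same units and with the same sign convention (negative = stabilizing, as used in this document). The cutoff value and the example numbers are placeholders rather than real SPURS or FoldX output.

```python
def consensus_stabilizing(pred_a, pred_b, threshold=-0.5):
    """Return mutations that both predictors score at or below `threshold`.

    pred_a, pred_b: dicts mapping a mutation label to a predicted ddG.
    Results are sorted most stabilizing first by the summed score.
    """
    shared = pred_a.keys() & pred_b.keys()
    hits = [
        (mutation, pred_a[mutation], pred_b[mutation])
        for mutation in shared
        if pred_a[mutation] <= threshold and pred_b[mutation] <= threshold
    ]
    return sorted(hits, key=lambda hit: hit[1] + hit[2])


if __name__ == "__main__":
    # Placeholder values for illustration only.
    tool_one = {"A10V": -1.2, "G25P": -0.8, "K40E": 0.9}
    tool_two = {"A10V": -0.9, "G25P": 0.3, "K40E": 1.1}
    print(consensus_stabilizing(tool_one, tool_two))
```

Mutations where the two tools disagree are not necessarily bad candidates; they are the ones that deserve a closer look before any synthesis spend.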

Access: Free for academic use — requires registration. Not technically "open source" but freely available for research.


DDGemb (Open Source, 2025)

What it does: Combines protein language model embeddings with transformer architectures to predict ΔΔG upon both single- and multi-point mutations. The multi-point capability is key — most tools only handle single mutations, but you need to predict the effect of combining 3-5 mutations simultaneously.

Open Enzyme use: After identifying individual stabilizing mutations with SPURS/RaSP, use DDGemb to predict the combined effect of your top 3-5 mutations together. Epistatic effects mean the combined effect is often not additive — DDGemb handles this.
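Non-additivity can be quantified as the gap between a combined prediction and the sum of the single-mutation predictions. The sketch below assumes all values share one sign convention and unit. The function name and the numbers in the example are illustrative and do not come from DDGemb.

```python
def epistasis(single_ddgs, combined_ddg):
    """Return (additive_expectation, deviation) for a multi-point variant.

    deviation = combined_ddg - sum(single_ddgs); zero means purely additive.
    """
    additive = sum(single_ddgs)
    return additive, combined_ddg - additive


if __name__ == "__main__":
    # Three individually stabilizing mutations (placeholder values).
    singles = [-1.0, -0.5, -0.5]
    additive, deviation = epistasis(singles, combined_ddg=-1.2)
    print(f"additive expectation {additive:+.2f}, deviation {deviation:+.2f}")
```

With negative values meaning stabilizing, a positive deviation indicates the combination is predicted to be less stabilizing than the individual effects would suggest.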

Access: GitHub — open source.


Tier 5: Molecular Docking (Will It Bind? Will It Get Cleaved?)

DiffDock (MIT, Open Source)

What it does: AI-powered molecular docking using diffusion models. Given a protein structure and a small molecule, predicts the binding pose (where and how the molecule binds). Outperforms traditional search-based docking tools of the AutoDock Vina family on pose accuracy: the DiffDock paper reports a 38% top-1 success rate at RMSD < 2Å on PDBBind, versus 23% for the best traditional baseline in that comparison.

Open Enzyme use:

- Dock uric acid into your engineered uricase variants: does the substrate still bind correctly after mutations?
- Screen NLRP3 inhibitor candidates (Prompt 7) against the NACHT domain structure
- Dock food-derived polyphenols (quercetin, EGCG, curcumin) against NLRP3 to prioritize candidates for microbial production

Access: GitHub (MIT License).

Limitation: DiffDock predicts binding poses but not binding affinities. Use Boltz-2 for affinity prediction, or combine DiffDock poses with physics-based rescoring.


OpenDock (Open Source, 2024)

What it does: PyTorch-based open source framework for protein-ligand docking that integrates multiple docking methods. More flexible than DiffDock for custom workflows.

Open Enzyme use: Alternative to DiffDock, particularly useful for protease susceptibility analysis — dock pepsin, trypsin, and chymotrypsin against your uricase surface to identify accessible cleavage sites.

Access: GitHub — open source.


Tier 6: Codon Optimization (Get the DNA Right)

CodonTransformer (Open Source, 1M+ Downloads)

What it does: A multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms. Goes beyond simple codon frequency tables — it learns context-dependent codon preferences, mRNA secondary structure effects, and organism-specific regulatory signals.

Why it matters: This is the open source answer to the codon optimization question in Prompt 2. Commercial tools (IDT, GenScript) use simpler frequency-based approaches. CodonTransformer considers the same factors that Rosalind would — context effects, secondary structure, ribosome dynamics.

Open Enzyme use: Optimize the uaZ gene for both S. cerevisiae and A. oryzae. Compare outputs against IDT/GenScript optimizations. Use as a second opinion or primary optimizer.

Access: GitHub — open source, also available as a Google Colab notebook. Supports S. cerevisiae and A. oryzae directly.


Free Commercial Tools (Not Open Source, But Free)

These are worth using alongside CodonTransformer for cross-validation:

Recommendation: Run your sequence through all four (CodonTransformer + three commercial tools). Where they agree, you have high confidence. Where they disagree, investigate the specific positions — those are the codons where context effects matter and the AI model may outperform simple frequency optimization.
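Finding where the optimizers agree and disagree is a codon-by-codon comparison. The sketch below assumes every design encodes the same protein and is an in-frame coding sequence of equal length; it does not check translation. Tool names and sequences in the example are placeholders.

```python
def split_codons(dna):
    """Upper-case the sequence and cut it into triplets."""
    dna = dna.upper().replace("U", "T")
    if len(dna) % 3:
        raise ValueError("sequence length is not a multiple of 3")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def disagreement_positions(designs):
    """Map 1-based codon index -> {tool: codon} wherever the tools differ.

    `designs` maps a tool name to its optimized DNA sequence.
    """
    codons = {tool: split_codons(seq) for tool, seq in designs.items()}
    lengths = {len(c) for c in codons.values()}
    if len(lengths) != 1:
        raise ValueError("designs must all contain the same number of codons")
    differing = {}
    for index in range(lengths.pop()):
        column = {tool: c[index] for tool, c in codons.items()}
        if len(set(column.values())) > 1:
            differing[index + 1] = column
    return differing


if __name__ == "__main__":
    # Placeholder designs for a three-residue peptide (M-K-W).
    toy = {"tool_a": "ATGAAATGG", "tool_b": "ATGAAGTGG", "tool_c": "ATGAAATGG"}
    print(disagreement_positions(toy))
```

The codon indices returned are the positions to inspect by hand, as the recommendation above suggests.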


How Open Source Tools Map to Open Enzyme Prompts

| Prompt | What You Need | Open Source Tool | Commercial Alternative |
|---|---|---|---|
| 1. Variant Selection | Structure comparison, stability prediction | AlphaFold/ColabFold + ESM-2 log-likelihoods | Rosalind |
| 2. Codon Optimization | Organism-specific sequence optimization | CodonTransformer + GenSmart/IDT (free) | Rosalind |
| 3. GI Survival | Protease cleavage site mapping, acid stability | DiffDock (docking) + SPURS/RaSP (stability at pH) + FoldX (pH-dependent) | Bio Discovery |
| 4. Protein Engineering | Mutation design, stability prediction, epistasis | ESM-2 (variant scoring) + SPURS/DDGemb (ΔΔG) + ProteinMPNN (sequence design) | Rosalind |
| 5. Expression Cassette | Expression level prediction | Limited open source; use Claude + literature | Rosalind / Bio Discovery |
| 6. Koji Construct | A. oryzae-specific optimization | CodonTransformer (A. oryzae codon tables) + ColabFold | Rosalind |
| 7. NLRP3 Screening | Compound docking, binding prediction | DiffDock + Boltz-2 (affinity) | Bio Discovery |
| 8. Digestive Enzymes | Enzyme characterization, strain comparison | ESM-2 (function annotation) + literature mining | Rosalind |
| 9. Cross-Checking | Validation, stress-testing | All of the above as cross-validation | Multiple models |

Practical Setup: Your Open Source Bio AI Workstation

Minimum viable setup (free):

  1. Google Colab account (free tier)
  2. ColabFold for structure prediction
  3. ESM-2 (650M) via Hugging Face for variant scoring
  4. CodonTransformer notebook for codon optimization
  5. SPURS/RaSP for stability prediction

Better setup ($10-20/month):

- Google Colab Pro ($10/month): access to better GPUs (A100, V100)
- Adds: Boltz-2 for complex prediction, RFdiffusion2 for design, DiffDock for docking
- All tools above run comfortably on Colab Pro

Best setup (one-time hardware investment):
- Local workstation with RTX 4090 (24GB VRAM), ~$1,600
- Run everything locally with no time limits or queue waits
- Enables large-scale screening (thousands of variants)
- Long-term cheaper than Colab Pro if you're using it regularly

Recommendation for Open Enzyme Phase 1: Start with the free Colab setup. You can do variant selection, codon optimization, and initial stability screening without spending a dollar. If you need to screen hundreds of variants or run RFdiffusion2 designs, upgrade to Colab Pro for a month ($10). The hardware investment only makes sense if you're running computational biology daily.


Part 02 — How Bio AI Models Work

You don't need a bioinformatics degree to use these tools, but understanding the core concepts will help you write better prompts, evaluate outputs critically, and know when the model is giving you gold vs. hallucinating. Here are the six key concepts, explained for an engineer.

Protein Language Models

The insight: An amino acid sequence (like MKWVTFISLLFLFSSAYS...) is structurally analogous to a sentence in natural language. Each "word" (amino acid) has meaning in context, and the order determines the 3D structure and function — just like word order determines meaning in a sentence.

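How it works: Protein language models like ESM-2 are trained on millions of protein sequences in much the same way GPT is trained on text, except that ESM-2 learns by filling in masked residues rather than predicting the next token. AlphaFold is a structure predictor rather than a language model: it was trained on experimentally solved structures and uses alignments of related sequences, but it exploits the same evolutionary signal. Both kinds of model learn the "grammar" of proteins: which amino acids tend to appear together, how local sequences predict global folds, and how mutations at one position ripple through the structure. When you give Rosalind a protein sequence, it's "reading" it the way GPT reads English, predicting structure, function, stability, and interactions from the sequence alone.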
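Variant scoring with a protein language model usually comes down to a log-likelihood ratio: how probable the model finds the mutant residue at a position compared with the wild-type residue. The sketch below shows only that arithmetic. The probabilities are invented for illustration; in practice they would come from a model such as ESM-2 with the position masked.

```python
import math


def mutation_score(position_probs: dict[str, float], wt: str, mut: str) -> float:
    """Log-likelihood ratio log p(mut) - log p(wt) at one position.

    Negative values mean the model finds the mutant less plausible
    than the wild-type residue in this sequence context.
    """
    return math.log(position_probs[mut]) - math.log(position_probs[wt])


# Invented probabilities for one masked position (NOT real model output).
probs_at_position = {"L": 0.60, "I": 0.25, "V": 0.10, "D": 0.05}

conservative = mutation_score(probs_at_position, wt="L", mut="I")
disruptive = mutation_score(probs_at_position, wt="L", mut="D")
print(f"L->I: {conservative:.2f}, L->D: {disruptive:.2f}")
```

A swap between similar hydrophobic residues scores closer to zero than a swap to a charged residue, which is the kind of ranking the variant-scoring step relies on.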

Why this matters for you: You can ask "what will this uricase variant look like structurally?" and get a useful answer without running a physical crystallography experiment.

Open Enzyme use: Comparing uricase variants, predicting fold stability, identifying vulnerable residues

Codon Optimization

The insight: DNA encodes proteins through three-letter "codons," and the genetic code is degenerate — most amino acids can be encoded by 2-6 different codons. The protein is the same regardless of which synonym you pick, but the expression level varies dramatically depending on the host organism.

How it works: Every organism has a "codon usage bias" — it has more tRNAs for some codons than others. If you try to express a gene from Aspergillus flavus in S. cerevisiae using the original codons, yeast's translation machinery will stall on codons that are rare in yeast. Codon optimization rewrites the DNA sequence to use yeast-preferred codons while encoding the exact same protein. AI models go further: they consider not just codon frequency but mRNA secondary structure, GC content, ribosome pausing, and interactions between consecutive codons.
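For reference, the "simple frequency optimization" baseline looks roughly like the sketch below: for every residue, pick the codon the host uses most. The usage table here is a tiny placeholder with invented numbers, and `naive_optimize` is a hypothetical helper name, so swap in a real host codon usage table before using the output for anything.

```python
# Placeholder usage table: the numbers are invented for illustration only.
# Replace with a real S. cerevisiae codon usage table before any real use.
PLACEHOLDER_USAGE = {
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.40, "AAG": 0.60},
    "W": {"TGG": 1.00},
    "V": {"GTT": 0.40, "GTC": 0.20, "GTA": 0.20, "GTG": 0.20},
}


def naive_optimize(protein: str, usage: dict[str, dict[str, float]]) -> str:
    """Back-translate a protein by always choosing the most frequent codon."""
    codons = []
    for residue in protein:
        if residue not in usage:
            raise KeyError(f"no codon usage entry for residue {residue!r}")
        options = usage[residue]
        codons.append(max(options, key=options.get))
    return "".join(codons)


optimized = naive_optimize("MKWV", PLACEHOLDER_USAGE)
print(optimized)  # ATGAAGTGGGTT with the placeholder table above
```

Because this approach always emits the same codon for a given amino acid, it ignores mRNA structure, GC balance, and codon-pair effects, which is the gap the AI-based optimizers aim to close.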

Why this matters for you: The difference between a well-optimized and poorly-optimized sequence can be 10-100x in expression level. This is the difference between a yeast strain that produces therapeutic amounts of uricase and one that produces undetectable amounts.

Open Enzyme use: Optimizing uaZ (A. flavus uricase gene) for S. cerevisiae, optimizing the same for A. oryzae

Protein Stability Prediction

The insight: A protein that folds correctly in a test tube at pH 7 and 37°C may unfold completely in the stomach at pH 2. Stability prediction models estimate the thermodynamic stability of a protein under different conditions — temperature, pH, ionic strength, presence of denaturants.

How it works: Models compute the free energy of folding (ΔG) and predict how it changes under different conditions. They identify regions of the protein that are most susceptible to unfolding and can predict the effect of point mutations on stability (ΔΔG). Some tools (such as FoldX and Rosetta's ΔΔG protocols) can estimate the stability change from a single amino acid substitution in seconds to minutes.
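To make ΔG concrete, the sketch below converts an unfolding free energy into the fraction of molecules that are folded at equilibrium. It assumes a simple two-state (folded/unfolded) model, which real proteins only approximate, and it uses the convention that a positive ΔG of unfolding means a stable protein. Stability tools differ in their sign conventions, so check each tool's documentation before plugging its numbers in.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def fraction_folded(dg_unfold_kcal: float, temp_k: float = 310.15) -> float:
    """Two-state estimate of the folded fraction.

    dg_unfold_kcal: free energy of unfolding in kcal/mol (positive = stable).
    temp_k: temperature in kelvin (default is 37 degrees C).
    """
    return 1.0 / (1.0 + math.exp(-dg_unfold_kcal / (R_KCAL_PER_MOL_K * temp_k)))


# Hypothetical values: a marginally stable protein, then the same protein
# after a mutation predicted to add 1.5 kcal/mol of stability.
before = fraction_folded(0.5)
after = fraction_folded(0.5 + 1.5)
print(f"folded before: {before:.2f}, after: {after:.2f}")
```

Because the relationship is exponential, a stabilizing change of 1-2 kcal/mol matters most when the protein is near the edge of unfolding, which is the regime an acid-stressed enzyme is likely to be in.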

Why this matters for you: Oral delivery means your uricase must survive pH 2 in the stomach, pepsin attack, then function at pH 6-7 in the small intestine. Stability prediction tells you which residues are the weak links and which mutations could fix them before you order a single gene synthesis.

Open Enzyme use: Predicting GI tract survival, identifying acid-labile residues, designing stability mutations

Molecular Docking / Interaction Prediction

The insight: Proteins don't exist in isolation. They interact with substrates (uric acid, for uricase), with other proteins (digestive proteases that might cleave them), and with small molecules (potential NLRP3 inhibitors). Docking simulates how two molecules fit together physically.

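How it works: Given two molecular structures, docking algorithms explore millions of possible orientations to find the lowest-energy binding pose. AI-enhanced docking (like DiffDock) uses learned priors to dramatically speed up the search. The output is a 3D pose showing how the molecules interact, plus a score. Classical docking programs report an estimated binding energy (ΔG_binding); DiffDock reports a confidence score for each pose rather than an affinity, which is why this playbook pairs it with an affinity predictor such as Boltz-2.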
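A predicted binding energy is easier to compare with literature values once it is converted to a dissociation constant via ΔG = RT·ln(Kd). The sketch below does that conversion in both directions, assuming a 1 M standard state and 25 °C. The example energy is hypothetical, and docking scores are rough estimates rather than measured affinities.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_from_dg(dg_bind_kcal: float, temp_k: float = 298.15) -> float:
    """Dissociation constant in mol/L from a binding free energy in kcal/mol."""
    return math.exp(dg_bind_kcal / (R_KCAL_PER_MOL_K * temp_k))


def dg_from_kd(kd_molar: float, temp_k: float = 298.15) -> float:
    """Binding free energy in kcal/mol from a dissociation constant in mol/L."""
    return R_KCAL_PER_MOL_K * temp_k * math.log(kd_molar)


# Hypothetical docking result of -10 kcal/mol.
kd = kd_from_dg(-10.0)
print(f"Kd is roughly {kd * 1e9:.0f} nM")
```

Each 1.4 kcal/mol or so corresponds to about a tenfold change in Kd at room temperature, so small errors in a predicted ΔG translate into large errors in predicted affinity.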

Why this matters for you: You can ask "will pepsin cleave my engineered uricase?" by docking pepsin against your variant and looking for accessible cleavage sites. You can also screen food-derived compounds against the NLRP3 NACHT domain to find gout-relevant inflammasome inhibitors.

Open Enzyme use: Protease susceptibility analysis, NLRP3 inhibitor screening, uric acid binding assessment

In Silico Directed Evolution

The insight: Traditional directed evolution is a lab process: mutate a gene randomly, screen thousands of variants for improved function, repeat. It works, but it's slow and expensive. In silico directed evolution uses AI to predict which mutations will improve a property without physical screening.

How it works: The model takes your starting protein, a target property (e.g., "acid stability" or "catalytic rate"), and generates a ranked list of mutations predicted to improve that property while maintaining others. Advanced approaches (like those in Rosalind) consider epistatic effects — how mutations interact with each other — to suggest combinatorial variants that a random screen would never find.
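Once a model returns a ranked list of single mutations, the combinatorial variants can be enumerated directly. The sketch below builds every combination up to a chosen order and skips combinations that hit the same position twice. The mutation list is hypothetical, and the enumeration says nothing about epistasis; it only shows how quickly the candidate space grows.

```python
from itertools import combinations


def combinatorial_variants(mutations: list[str], max_order: int) -> list[tuple[str, ...]]:
    """All combinations of 1..max_order mutations with no repeated position.

    Mutations use the usual notation, e.g. "F42Y" = Phe at position 42 to Tyr.
    """
    variants = []
    for order in range(1, max_order + 1):
        for combo in combinations(mutations, order):
            positions = [m[1:-1] for m in combo]  # the digits between the two letters
            if len(set(positions)) == len(positions):
                variants.append(combo)
    return variants


# Hypothetical ranked suggestions; K88R and K88Q compete for the same site.
ranked = ["F42Y", "K88R", "K88Q", "D150E"]
variants = combinatorial_variants(ranked, max_order=2)
print(len(variants), "variants, e.g.", variants[-1])
```

With a budget of 5-10 synthesized genes, a list like this is something to prune with stability and plausibility scores, not something to order wholesale.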

Why this matters for you: Instead of ordering and testing 1,000 random uricase mutants ($$$), you order the top 5-10 AI-predicted variants (~$1,200 in gene synthesis) and test those. If the AI's predictions are even partially accurate, you've compressed years of directed evolution into one synthesis round.

Open Enzyme use: Engineering acid-stable uricase, improving catalytic efficiency, multi-objective optimization

Gene Expression Prediction

The insight: Getting a gene into an organism is step one. Getting it to produce useful amounts of protein is a separate engineering challenge. Expression level depends on promoter strength, signal peptide choice, codon optimization, mRNA stability, and dozens of other factors.

How it works: Models predict expression levels by analyzing the full expression cassette: promoter sequence, 5' UTR, signal peptide, coding sequence, terminator. They've been trained on thousands of experimentally measured expression levels across different host organisms. Some models can predict not just "how much" but "where" — will the protein be secreted or stay intracellular?

Why this matters for you: Choosing TEF1 vs. GPD vs. GAL1 as your promoter in S. cerevisiae will determine whether your yeast makes 1 mg/L or 100 mg/L of uricase. These models let you test cassette designs computationally before committing to a plasmid build.

Open Enzyme use: Promoter selection, signal peptide optimization, expression cassette design for both yeast and koji


Part 03 — Specific Prompts for Open Enzyme

These prompts are designed for GPT-Rosalind but will also work with GPT-5.4 + the Codex Life Sciences plugin (with less sophisticated biological reasoning). Each prompt includes context on what you're asking, why it matters, what to expect back, and how to evaluate the quality of the response.

Pro tip: Always open with a system-level context block that tells the model about the Open Enzyme project. This grounds every subsequent prompt. Use the "Project Context" prompt below as a preamble to any session.


Project Context Preamble

Use This First — Paste this at the start of every Rosalind / GPT-5.4 session to ground the model in your project's constraints and goals.

I'm working on Open Enzyme, an open source project engineering food-safe organisms to produce therapeutic enzymes for home use.

Key constraints:
- Host organisms: Saccharomyces cerevisiae (brewer's yeast) and Aspergillus oryzae (koji mold) — both GRAS
- Primary target: uricase (urate oxidase) for gout — humans lack this enzyme
- Secondary target: digestive enzyme optimization from A. oryzae (lipase, protease, amylase)
- Tertiary target: NLRP3 inflammasome inhibitors producible by engineered organisms
- Delivery: oral, as food/fermented product — must survive GI transit
- Safety: everything must remain food-safe, no antibiotic resistance markers in final strains
- Budget: ~$1,200 for first experimental round (gene synthesis + basic molecular biology)
- Team: Currently just me (engineer). Recruiting 3 potential PhD collaborators (gut microbiome, NF-κB signaling, innate immunity)

This is citizen science with real scientific rigor. I need answers that are specific, actionable, and grounded in current literature. Flag uncertainty honestly.

1. Uricase Variant Selection

Why This Matters: Not all uricases are equal. The enzyme from different organisms varies dramatically in acid stability, protease resistance, specific activity, and immunogenicity. Picking the right starting variant is the single highest-leverage decision in the project — it determines whether the enzyme survives the GI tract, how much you need to produce, and whether it causes immune reactions.

What to expect back: A comparative table with quantitative data where available. The model should cite specific papers for each claim. If it gives you specific activity numbers, verify them against the cited sources.

How to evaluate: Check that it cites real enzymes with real UniProt/PDB accession numbers. Look for specific activity values in the literature. Be skeptical of claims about "oral delivery" performance unless backed by in vivo data.

Compare the following uricase (urate oxidase) variants for oral therapeutic use in the context of the Open Enzyme project:

1. Aspergillus flavus uricase (rasburicase parent)
2. Candida utilis uricase
3. Bacillus fastidiosus uricase
4. Arthrobacter globiformis uricase
5. Soybean (Glycine max) uricase
6. Any other variants you'd recommend I consider

For each, analyze:

a) GI STABILITY: Predicted stability at pH 2.0 (stomach, 1-2 hours), pH 6.0-7.0 (small intestine), pH 7.0-8.0 (colon). Which residues are acid-labile? Does the enzyme have disulfide bonds that contribute to acid stability?

b) PROTEASE RESISTANCE: Susceptibility to pepsin (stomach), trypsin and chymotrypsin (small intestine). Are there exposed cleavage sites? How does the quaternary structure (tetrameric vs. monomeric) affect protease accessibility?

c) EXPRESSION IN S. CEREVISIAE: Known expression levels if available. Predicted expression difficulty based on protein size, glycosylation requirements, folding complexity, and disulfide bond requirements.

d) SPECIFIC ACTIVITY: Km and kcat for uric acid. Turnover number. How much enzyme would be needed to degrade a clinically meaningful amount of uric acid (targeting ~200 mg UA/day from gut lumen)?

e) IMMUNOGENICITY: In the context of oral delivery (gut lumen, not systemic), what are the immunogenic risks? Does the enzyme share epitopes with human proteins? Is there precedent for oral tolerance?

f) ENGINEERING POTENTIAL: How amenable is each variant to stability/activity improvements through rational design or directed evolution?

Provide your recommendation ranked by suitability for oral delivery from engineered S. cerevisiae, with a clear rationale.

2. Codon Optimization for S. cerevisiae

Why This Matters: Once you've selected your uricase variant, the wild-type gene sequence from the source organism will be poorly expressed in yeast. Codon optimization is not optional — it's the difference between the project working and not working. AI-optimized sequences consider factors that commercial tools like IDT's Codon Optimization Tool often miss: mRNA secondary structure, ribosome pausing, and codon context effects.

What to expect back: A full optimized DNA sequence, analysis of the changes, predicted CAI (Codon Adaptation Index), and comparison to what a naive optimization would produce.

How to evaluate: Cross-reference the output against the S. cerevisiae codon usage table. Check that the protein sequence is unchanged. Run the optimized sequence through a secondary tool (GenScript, IDT) for comparison.
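Several of these checks are easy to automate before ordering synthesis. The sketch below is a minimal example using only the standard genetic code and the recognition sequences of the four enzymes named in the prompt; `qc_report` is a hypothetical helper, and the example sequences are toy inputs rather than real uricase DNA.

```python
import re
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order ("*" marks stop codons).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# All four sites are palindromic, so scanning one strand is enough.
RESTRICTION_SITES = {"BamHI": "GGATCC", "EcoRI": "GAATTC",
                     "XhoI": "CTCGAG", "NotI": "GCGGCCGC"}


def translate(dna: str) -> str:
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def gc_content(dna: str) -> float:
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)


def qc_report(original: str, optimized: str, max_run: int = 5) -> dict:
    """Check an optimized sequence against the constraints listed in the prompt."""
    opt = optimized.upper()
    long_run = r"A{%d,}|T{%d,}" % (max_run + 1, max_run + 1)
    return {
        "protein_unchanged": translate(original) == translate(opt),
        "gc_content": round(gc_content(opt), 3),
        "restriction_sites": [n for n, s in RESTRICTION_SITES.items() if s in opt],
        "long_at_runs": re.findall(long_run, opt),
    }


report = qc_report("ATGAAATGGGTC", "ATGAAGTGGGTT")  # toy sequences for M K W V
print(report)
```

A sequence that fails `protein_unchanged` should never be ordered, whatever its other scores look like.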

I need to optimize the uaZ gene (Aspergillus flavus urate oxidase, UniProt Q00511) for expression in Saccharomyces cerevisiae.

Please provide:

1. OPTIMIZED SEQUENCE: Full codon-optimized DNA sequence for S. cerevisiae expression. Consider:
   - S. cerevisiae codon usage bias (CAI optimization)
   - Elimination of rare codons (frequency <10% usage in yeast)
   - mRNA secondary structure minimization (especially near start codon)
   - GC content optimization for yeast (target 38-42%)
   - Avoidance of internal restriction sites (BamHI, EcoRI, XhoI, NotI)
   - Avoidance of poly-A/poly-T runs >5bp (premature termination risk)

2. ANALYSIS: For the optimized sequence, provide:
   - CAI score (before and after optimization)
   - GC content (before and after)
   - Number of rare codons eliminated
   - Predicted mRNA minimum free energy change
   - Any positions where you made a non-obvious choice and why

3. COMPARISON: How does your optimization differ from what IDT's Codon Optimization Tool would produce? What do you do that a simple frequency-based optimizer misses?

3. GI Tract Survival Prediction

Why This Matters: The entire Open Enzyme thesis depends on uricase surviving the GI tract long enough to degrade uric acid in the gut lumen. This is the highest-risk assumption in the project. Understanding exactly which residues are vulnerable — and what mutations might fix them — lets you design a better enzyme before spending money on synthesis.

What to expect back: A transit-stage-by-transit-stage analysis with specific residue numbers. The model should identify cleavage sites by position, predict stability curves, and suggest protective strategies.

How to evaluate: Cross-reference predicted cleavage sites against known pepsin/trypsin specificity rules. Check cited stability data against published thermal/pH denaturation studies for uricases.
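A quick way to do that cross-reference is a sequence-only scan with simplified specificity rules. The sketch below assumes trypsin cuts after Lys/Arg and chymotrypsin after Phe/Tyr/Trp (both blocked by a following Pro), and pepsin after Phe/Tyr/Trp/Leu. These are textbook simplifications that ignore 3D accessibility and pH effects, so treat the output as a list of candidate sites, not a prediction of actual cleavage.

```python
# Simplified P1 specificity rules (assumptions, not a full protease model).
RULES = {
    "trypsin": {"p1": set("KR"), "blocked_by_next": set("P")},
    "chymotrypsin": {"p1": set("FYW"), "blocked_by_next": set("P")},
    "pepsin": {"p1": set("FYWL"), "blocked_by_next": set()},
}


def cleavage_sites(sequence: str, protease: str) -> list[int]:
    """Return 1-based positions of P1 residues (cleavage occurs after each)."""
    rule = RULES[protease]
    sequence = sequence.upper()
    sites = []
    # The last residue has nothing after it, so it cannot be a P1 site.
    for i, residue in enumerate(sequence[:-1]):
        if residue in rule["p1"] and sequence[i + 1] not in rule["blocked_by_next"]:
            sites.append(i + 1)
    return sites


# Toy fragment (the example sequence used earlier in this playbook).
fragment = "MKWVTFISLLFLFSSAYS"
for name in RULES:
    print(name, cleavage_sites(fragment, name))
```

If the model reports a cleavage site at a position this scan does not flag, check its residue numbering before trusting the rest of its analysis.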

Analyze the GI tract survival profile of [SELECTED URICASE VARIANT] for oral delivery as a food-grade therapeutic.

Model the transit through each GI compartment:

STOMACH (pH 1.5-3.0, 30-120 min exposure):
- Predict acid-induced unfolding: which domains unfold first? At what pH?
- Map pepsin cleavage sites on the 3D structure — which are surface-exposed?
- Estimate remaining activity after 30 min at pH 2.0, 60 min at pH 2.0, 120 min at pH 2.0
- Consider: does the yeast cell matrix provide any protection if the enzyme is intracellular?

DUODENUM (pH 5.5-6.5, 15-30 min):
- Recovery from acid denaturation: is the unfolding reversible?
- Trypsin and chymotrypsin cleavage site analysis
- Bile salt interactions — do they help or hurt enzyme stability?

JEJUNUM/ILEUM (pH 6.5-7.5, 2-4 hours):
- This is the primary activity window for gut-lumen UA degradation
- Predicted specific activity at pH 6.5-7.5
- Substrate availability: what uric acid concentration is expected from ABCG2 secretion into the gut lumen?
- Remaining active enzyme fraction after stomach transit

COLON (pH 5.5-7.0, 12-36 hours):
- Microbiome protease exposure
- Is there residual activity? Is this therapeutically useful?

OVERALL ASSESSMENT:
- What % of ingested enzyme reaches the small intestine in active form?
- What daily dose (in mg enzyme) would be needed to degrade 200 mg uric acid?
- What are the top 3 residues/regions to engineer for improved GI survival?
- Would enteric coating, encapsulation, or expression as inclusion bodies help?
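The daily-dose question in the prompt above can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes the enzyme works at its full specific activity for the whole active window (1 U = 1 µmol substrate per minute), which is optimistic when substrate concentration is low. All input values are hypothetical placeholders, so substitute figures from the literature or from the model's cited sources.

```python
URIC_ACID_G_PER_MOL = 168.11  # molar mass of uric acid


def enzyme_mg_needed(ua_mg: float, specific_activity_u_per_mg: float,
                     active_minutes: float, surviving_fraction: float) -> float:
    """Milligrams of ingested enzyme needed to degrade ua_mg of uric acid."""
    if min(specific_activity_u_per_mg, active_minutes, surviving_fraction) <= 0:
        raise ValueError("activity, time, and surviving fraction must be positive")
    ua_umol = ua_mg / URIC_ACID_G_PER_MOL * 1000.0  # mg -> mmol -> umol
    umol_per_mg_ingested = (specific_activity_u_per_mg
                            * active_minutes
                            * surviving_fraction)
    return ua_umol / umol_per_mg_ingested


# Hypothetical inputs: 10 U/mg, a 3-hour activity window, 10% GI survival.
dose_mg = enzyme_mg_needed(200.0, specific_activity_u_per_mg=10.0,
                           active_minutes=180.0, surviving_fraction=0.10)
print(f"about {dose_mg:.1f} mg enzyme per day under these assumptions")
```

The surviving fraction sits in the denominator, so halving GI survival doubles the required dose, which is why the survival estimate deserves the most scrutiny.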

4. Protein Engineering for Oral Delivery

Why This Matters: This is where AI bio tools earn their keep. Instead of ordering hundreds of random mutants, you ask the model to reason about which specific mutations would improve acid stability and protease resistance while maintaining catalytic activity — a multi-objective optimization that would take a wet lab years of directed evolution.

What to expect back: Specific mutation suggestions with predicted ΔΔG values and rationale. The best responses will suggest combinatorial variants and flag potential pitfalls (e.g., mutations that improve stability but kill activity).

How to evaluate: Check that suggested mutations are physically plausible (right amino acid type, correct residue numbering). Look up whether any of the suggested mutations have been tested experimentally. Be especially skeptical of predicted fold-improvement numbers.
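The residue-numbering check is mechanical enough to script. The sketch below validates a mutation written as wild-type letter, position, new letter (for example `K2R`) against your reference sequence; `check_mutation` is a hypothetical helper, and the sequence is a toy fragment rather than a real uricase.

```python
import re

MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_mutation(sequence: str, mutation: str) -> str:
    """Return "ok" or a short description of what is wrong with the mutation."""
    match = MUTATION_PATTERN.match(mutation.strip().upper())
    if not match:
        return "unparseable"
    wt, position, new = match.group(1), int(match.group(2)), match.group(3)
    if wt not in AMINO_ACIDS or new not in AMINO_ACIDS:
        return "invalid amino acid code"
    if not 1 <= position <= len(sequence):
        return "position out of range"
    actual = sequence[position - 1].upper()
    if actual != wt:
        return f"wild-type mismatch: sequence has {actual} at {position}"
    if wt == new:
        return "no change"
    return "ok"


reference = "MKWVTFISLLFLFSSAYS"  # toy fragment, not a real uricase
for suggestion in ["K2R", "F5Y", "A99G", "B2R"]:
    print(suggestion, "->", check_mutation(reference, suggestion))
```

A wild-type mismatch often means the model numbered residues from a different isoform or included a signal peptide, so resolve the offset before evaluating the mutation itself.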

Design a protein engineering strategy for [SELECTED URICASE VARIANT] optimized for oral delivery from S. cerevisiae. I need mutations that simultaneously improve:

1. ACID STABILITY (primary objective):
   - Improve stability at pH 2.0-3.0 (stomach survival)
   - Strategies: add salt bridges, improve hydrophobic core packing, introduce disulfide bonds at surface-exposed positions
   - Predict ΔΔG for each suggested mutation
   - Do NOT sacrifice stability at pH 6.5-7.5 (this is where it needs to work)

2. PROTEASE RESISTANCE (secondary objective):
   - Modify or bury surface-exposed pepsin cleavage sites (Phe/Tyr/Trp/Leu at P1)
   - Modify trypsin sites (Lys/Arg at P1) that are surface-accessible
   - Consider proline substitutions near cleavage sites (prolines inhibit protease access)
   - Do NOT introduce mutations in the active site or substrate binding channel

3. CATALYTIC EFFICIENCY (maintain or improve):
   - kcat/Km must remain at least 80% of wild-type
   - Consider mutations that improve substrate access or product release
   - Flag any suggested stability mutations that are within 8Å of the active site

4. FOOD SAFETY (hard constraint):
   - No gain of toxic function — no new binding sites for human proteins
   - No introduction of known allergen epitopes
   - Protein must remain safe for oral consumption as a food additive

Provide your suggestions as:
a) Individual mutations ranked by predicted impact, with rationale for each
b) A recommended combinatorial variant (top 3-5 mutations combined)
c) A "safe bet" minimal variant (1-2 highest-confidence mutations only)
d) Predicted overall improvement in GI survival for each variant tier
e) Any mutations you considered and rejected, with explanation

5. Expression Cassette Design

Why This Matters: The expression cassette — promoter, signal peptide, coding sequence, terminator — determines how much protein your yeast produces and where it ends up. A strong constitutive promoter means continuous production; an inducible promoter lets you control timing. Signal peptide choice determines whether uricase is secreted (accessible to gut-lumen uric acid) or stays intracellular (protected but needs cell lysis).

What to expect back: A ranked comparison of promoter and signal peptide options with predicted expression levels. The model should reason about the tradeoff between secretion (better substrate access) and intracellular expression (better protection).

Design the optimal expression cassette for producing [SELECTED URICASE VARIANT] in Saccharomyces cerevisiae for oral therapeutic use.

PROMOTER SELECTION — Compare and recommend:
- TEF1p (constitutive, strong)
- PGK1p (constitutive, strong)
- GPDp/TDH3p (constitutive, very strong)
- GAL1p (galactose-inducible, very strong but requires galactose)
- ADH1p (constitutive, moderate)
- CYC1p (constitutive, weak — useful for titrating expression)

For each: predicted relative expression level, advantages/disadvantages for a food-grade product grown at home, metabolic burden on the cell, genetic stability over many generations (important for a strain the user maintains).

SIGNAL PEPTIDE — Evaluate the secretion vs. intracellular tradeoff:
Secretion candidates:
- α-factor prepro (MFα1) — most common yeast secretion signal
- Ost1 signal peptide
- SUC2 (invertase) signal peptide
- PHO5 signal peptide

For each: predicted secretion efficiency for a homotetrameric protein (~34 kDa per subunit), risk of misfolding during secretion, glycosylation concerns (N-linked glycosylation in the secretory pathway may alter enzyme activity)

Intracellular option:
- No signal peptide — enzyme stays in cytoplasm
- What is the expected intracellular yield?
- Does the user need to lyse cells to release the enzyme, or will natural cell death during digestion be sufficient?

TERMINATOR:
- CYC1t vs. ADH1t — any meaningful difference for this application?

FULL CASSETTE:
Provide your recommended full cassette design as a genetic diagram:
[Promoter] — [Signal peptide (if any)] — [Codon-optimized uricase CDS] — [Terminator]

With predicted expression level in mg/L and reasoning for each choice.

6. Koji (A. oryzae) Construct Design

Why This Matters: The koji track runs in parallel with the yeast track. A. oryzae (koji mold) is the original fermentation organism for miso, sake, and soy sauce — it's GRAS by centuries of tradition. It also naturally produces high levels of digestive enzymes, making it the ideal chassis for Lynn's digestive enzyme therapy and potentially a second uricase production platform.

What to expect back: A. oryzae-specific construct design addressing the unique biology of filamentous fungi vs. yeast. Different codon bias, different promoter systems, different secretion machinery.

Design a gene construct for expressing [SELECTED URICASE VARIANT] in Aspergillus oryzae (koji mold) for oral therapeutic use as a fermented food product.

PROMOTER COMPARISON:
- amyB promoter (α-amylase, starch-inducible, very strong in A. oryzae)
- glaA promoter (glucoamylase, strong, starch-inducible)
- TEF1 promoter (A. oryzae version, constitutive)
- Compare: expression levels, induction requirements, stability, suitability for home fermentation on rice/grain substrates

CODON OPTIMIZATION:
- How does A. oryzae codon usage differ from S. cerevisiae?
- Generate a codon-optimized uricase sequence specifically for A. oryzae
- Key differences from the yeast-optimized version

SECRETION:
- glaA signal peptide (native glucoamylase signal) for secretion
- amyB signal peptide
- Predicted secretion efficiency for uricase in the A. oryzae secretory pathway
- Glycosylation considerations (A. oryzae tends to hyperglycosylate — will this affect uricase activity?)

PREDICTED YIELD:
- Expected mg/g of fermented substrate (rice koji)
- How does this compare to the yeast platform?
- Which platform (yeast vs. koji) do you predict will produce more active uricase per gram of food consumed?

DUAL-USE POTENTIAL:
- Can we co-express uricase alongside the native digestive enzymes (amylase, protease, lipase)?
- Will uricase expression reduce native enzyme production?
- Optimal construct design for maintaining both functions

7. NLRP3 Inhibitor Discovery

Why This Matters: Gout flares are driven by the NLRP3 inflammasome responding to MSU crystals. Even if you degrade uric acid perfectly, suppressing NLRP3-driven inflammation is the other half of the therapeutic equation. If you can find food-derived NLRP3 inhibitors that could be produced by engineered yeast or koji, you've got a multi-mechanism oral therapy.

What to expect back: A screen of food-safe compounds with predicted binding affinities. Be skeptical of binding affinity predictions — they're notoriously unreliable without experimental validation. Focus on compounds where there's already experimental evidence of NLRP3 inhibition.

Screen for food-derived or naturally occurring compounds that inhibit the NLRP3 inflammasome, with a focus on compounds that could be produced by engineered S. cerevisiae or A. oryzae.

KNOWN NLRP3 INHIBITORS (benchmarks):
- MCC950 (CRID3) — IC50 ~7.5 nM, binds NACHT domain, gold standard
- Oridonin — covalent modifier of NLRP3 Cys279
- Tranilast — binds NACHT domain, approved drug in Japan
- OLT1177 (dapansutrile) — clinical trial for gout flares

FOOD-DERIVED CANDIDATES TO EVALUATE:
Screen across these categories:
- Polyphenols and related phytochemicals (resveratrol, quercetin, EGCG, curcumin, sulforaphane)
- Terpenoids (β-caryophyllene, limonene, ursolic acid)
- Fatty acids (omega-3 DHA/EPA metabolites, short-chain fatty acids)
- Amino acid derivatives (taurine, carnosine)
- Fungal metabolites native to A. oryzae or S. cerevisiae
- Any other food-safe compounds with NLRP3 evidence

For each promising candidate:
a) Mechanism of NLRP3 inhibition (direct binding? upstream signaling? post-translational?)
b) Predicted binding affinity to NLRP3 NACHT domain (if direct binder)
c) Evidence level (in vitro only? animal models? human trials?)
d) Could it be biosynthesized by engineered yeast or koji? What pathway genes would be needed?
e) Effective dose vs. achievable production levels — is it realistic?

RANK the top 5 candidates by: strength of NLRP3 evidence × feasibility of microbial production × food safety

Also flag: Are there any NLRP3 inhibitors that A. oryzae already produces naturally?

8. Digestive Enzyme Optimization (Lynn's Track)

Why This Matters: Lynn's digestive enzyme insufficiency is the second therapeutic target. A. oryzae naturally produces amylase, protease, and lipase — the question is which strain and which fermentation conditions maximize the enzyme profile she needs. This is more about optimizing what nature already provides than engineering new capabilities.

What to expect back: Strain comparisons with specific enzyme activity data, plus fermentation optimization parameters. Much of this data exists in the sake/miso fermentation literature.

Optimize Aspergillus oryzae strain selection and fermentation conditions for maximum production of digestive enzymes relevant to exocrine pancreatic insufficiency (EPI).

TARGET ENZYME PROFILE (in order of importance for EPI):
1. Lipase — the most critical deficiency in EPI; fat malabsorption causes most symptoms
2. Protease — protein digestion, includes acid-stable and alkaline proteases
3. Amylase — starch digestion, less critical but part of the standard enzyme replacement

STRAIN COMPARISON — Evaluate these A. oryzae strains:
- RIB40 (reference genome strain)
- ATCC 11866 (commonly used industrial strain)
- Starter cultures available from fermentation suppliers (GEM Cultures, Cultures for Health)
- Any other strains known for exceptional enzyme production

For each strain, provide (with citations where possible):
a) Lipase activity (U/g koji)
b) Acid protease activity (U/g koji)
c) Alkaline protease activity (U/g koji)
d) Amylase activity (U/g koji)
e) Availability for home fermenters

FERMENTATION OPTIMIZATION:
- Substrate: rice vs. barley vs. soybean — which maximizes lipase?
- Temperature: optimal range for each enzyme class
- Moisture content: effect on enzyme profile
- Fermentation time: activity curves (12h, 24h, 36h, 48h)
- pH management: does substrate pH affect enzyme secretion profile?

DOSE EQUIVALENCE:
- Standard prescription enzyme replacement (Creon): 25,000 USP units lipase per meal
- How many grams of optimized koji would provide equivalent lipase activity?
- Is this a realistic serving size? (Must be palatable as a food condiment)

STRAIN IMPROVEMENT (if needed):
- If wild-type strains don't produce enough lipase, what A. oryzae genes could be overexpressed?
- Could we use CRISPR to upregulate lipase production in an otherwise wild-type background?

9. Cross-Checking Our Research

Why This Matters: The most valuable thing a reasoning model can do is honestly stress-test your assumptions. These prompts ask Rosalind to identify the weakest links in the Open Enzyme thesis — the places where the project is most likely to fail or where the science doesn't hold up.

What to expect back: Honest assessments with specific failure modes. The best responses will distinguish between "technically challenging but feasible" and "fundamentally flawed." Be especially attentive to failure modes you haven't considered.

Prompt: Thesis Validation

Critically evaluate the Open Enzyme project thesis. I want honest assessment, not encouragement.

CORE THESIS: Engineering GRAS organisms (S. cerevisiae, A. oryzae) to produce uricase for oral delivery as a food product can meaningfully reduce serum uric acid in gout patients through gut-lumen degradation of uric acid secreted via the ABCG2 transporter.

Evaluate each link in the chain:

1. GUT-LUMEN UA DEGRADATION:
   - Is there sufficient evidence that reducing gut-lumen uric acid affects serum levels?
   - What does the ALLN-346 Phase 2a data actually show? How strong is it?
   - What does the PULSE probiotic study (Cell Reports Medicine, Oct 2025) demonstrate?
   - Is ABCG2 secretion of UA into the gut quantitatively significant enough?

2. ORAL ENZYME SURVIVAL:
   - What % of orally delivered recombinant enzyme actually reaches the small intestine intact?
   - Are we overestimating the protection provided by food matrix/yeast cell wall?
   - What are the real-world numbers from oral enzyme replacement therapy (e.g., pancreatic enzymes)?

3. EXPRESSION FEASIBILITY:
   - Is S. cerevisiae a realistic host for producing A. flavus uricase at therapeutic levels?
   - What expression levels have actually been achieved for similar heterologous enzymes in yeast?
   - The ACS Syn Bio 2025 paper claims 365 μmol/h/OD in S. boulardii — is this reproducible?

4. SAFETY:
   - What are the actual risks of consuming recombinant uricase regularly?
   - Could engineered organisms colonize the gut? Is that a feature or a bug?
   - Horizontal gene transfer risks — real or theoretical?

5. REGULATORY:
   - Can this actually be distributed as a food product?
   - Where does "engineered yeast you grow at home" fall in the FDA/GRAS framework?
   - What precedent exists?

Rate the overall project feasibility on a scale of 1-10 with explicit assumptions for each score point.

Prompt: Risk Assessment

What are the top 10 risks most likely to kill the Open Enzyme project, ranked by (probability × impact)?

For each risk:
- Specific failure mode (not vague — exactly what goes wrong)
- Probability estimate (with reasoning)
- Impact if it happens (project delay vs. project death)
- Mitigation strategy available now
- Key experiment or data point that would de-risk it

Also: What am I NOT thinking about? What failure mode would a pharma R&D team immediately flag that a citizen science project might miss?

Prompt: Feasibility Rating

Rate the feasibility of engineering S. cerevisiae to express A. flavus uricase at therapeutic levels for oral delivery.

Break this into sub-problems and rate each 1-10:

a) Gene synthesis and cloning — can we get the construct built? ($?)
b) Expression level — will yeast produce enough active enzyme? (mg/L?)
c) Protein folding — will the enzyme fold correctly in yeast? (tetrameric assembly?)
d) Enzyme activity — will it degrade UA at the required rate? (kcat/Km?)
e) GI survival — will enough enzyme survive to the small intestine? (% surviving?)
f) Therapeutic effect — will gut-lumen UA degradation lower serum UA? (mmol/L change?)
g) Strain stability — will the engineered strain maintain expression over generations?
h) Home production — can a non-scientist grow this reliably? (as easy as sourdough?)
i) Taste/palatability — will people actually eat this? (flavor impact?)
j) Safety — is regular consumption of engineered yeast safe long-term?

For the lowest-scoring items: what would it take to move each up by 2 points?

Part 04 — Workflow Integration

The AI bio tools aren't a one-shot oracle — they're a continuous feedback loop integrated into your development process. Here's how to wire them into the Open Enzyme workflow.

The AI-Accelerated Development Loop

Step 1: Design (Rosalind / Bio Discovery) → Step 2: Validate (Cross-check predictions) → Step 3: Synthesize (Order gene synthesis) → Step 4: Build (Transform & express) → Step 5: Test (Measure activity) → Step 6: Feed Back (Results → AI → iterate)

Phase 1: In Silico Design (Weeks 1-2)

  1. Variant Selection & Engineering: Run the uricase variant comparison prompt (Prompt 1) using Claude or Rosalind. In parallel, use ColabFold to generate structures for each candidate variant and ESM-2 to score evolutionary plausibility. Run SPURS and RaSP on each structure to predict folding stability (ΔΔG), which serves here as a proxy for acid tolerance rather than a direct measurement of it. Use the protein engineering prompt (Prompt 4) to design improved variants, then validate each suggested mutation with SPURS/DDGemb for ΔΔG prediction. Run GI survival prediction (Prompt 3) and cross-reference with docking (OpenDock, per the Tool Selection Guide) to map protease cleavage site accessibility on your predicted structures. You should end this step with 2-3 candidate sequences (wild-type, minimal mutations, and full engineered variant), each backed by both AI reasoning (Claude/Rosalind) and quantitative structural predictions (open source tools).

  2. Codon Optimization & Cassette Design — Run codon optimization (Prompt 2) for each candidate variant. Cross-validate with CodonTransformer and at least one commercial tool (GenSmart or IDT — both free). Run expression cassette design (Prompt 5) to select promoter, signal peptide, and terminator. Output: complete DNA sequences ready for gene synthesis ordering. If using the koji track in parallel, run Prompt 6 for A. oryzae versions (CodonTransformer supports A. oryzae directly).

  3. Cross-Validation — Run the thesis validation prompt (Prompt 9) to stress-test your design before spending money. Use Boltz-2 to predict whether your engineered variants still bind uric acid correctly and assemble into tetramers. Run the same sequences through Amazon Bio Discovery (if using) to get a second opinion from different models. The open source tools give you independent quantitative checks: does ESM-2 score your engineered sequence as plausible? Does SPURS predict improved stability? Does ColabFold show the structure is intact? Agreement across tools increases confidence; disagreement highlights risks to investigate.
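The "agreement increases confidence; disagreement highlights risks" rule from the cross-validation step can be made mechanical. The sketch below assumes you have already collected per-mutation ΔΔG predictions from each tool into plain dictionaries. The mutation names and numbers are made up for illustration, and the sign convention (negative = stabilizing) is an assumption you should check against each tool's documentation, since conventions differ.

```python
def consensus(ddg_by_tool, threshold=0.0):
    """Classify each mutation by whether all tools agree on the sign of ddG.

    ddg_by_tool maps tool name -> {mutation: predicted ddG in kcal/mol}.
    Assumed convention: negative ddG = stabilizing.
    Only mutations scored by every tool are compared.
    """
    shared = set.intersection(*(set(preds) for preds in ddg_by_tool.values()))
    result = {}
    for mutation in sorted(shared):
        values = [preds[mutation] for preds in ddg_by_tool.values()]
        if all(v < threshold for v in values):
            result[mutation] = "stabilizing (all tools agree)"
        elif all(v > threshold for v in values):
            result[mutation] = "destabilizing (all tools agree)"
        else:
            result[mutation] = "disagreement: investigate"
    return result


# Made-up predictions for three hypothetical mutations.
example = {
    "SPURS": {"A10V": -1.2, "G20D": 0.8, "K30E": -0.3},
    "RaSP":  {"A10V": -0.7, "G20D": 1.1, "K30E": 0.4},
    "FoldX": {"A10V": -0.9, "G20D": 0.5, "K30E": -0.1},
}

for mutation, verdict in consensus(example).items():
    print(mutation, "->", verdict)
```

A sign-only vote is deliberately crude. It flags where the tools disagree without pretending their absolute ΔΔG values are comparable.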

Phase 2: Synthesis & Build (Weeks 3-6)

  1. Gene Synthesis Order: Order codon-optimized gene synthesis from Twist Bioscience or IDT (if using Bio Discovery, this can be routed directly). Order 2-3 variants: the AI's top pick, the safe-bet minimal variant, and the wild-type as a control. Typical cost: $0.07-0.09/bp, so a 900 bp uricase gene costs roughly $63-81 per variant.

  2. Transform & Express — Clone into yeast expression vector, transform S. cerevisiae, select transformants. While waiting for colonies, use AI to predict expression levels and plan your activity assay. The yeast uricase protocol from the engineered-yeast-uricase-proposal.md document provides the detailed bench protocol.
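The gene synthesis budget for this phase is simple arithmetic on the per-base price quoted above. The sketch below uses only the figures from the text (900 bp, $0.07-0.09/bp, 2-3 variants). It ignores minimum order sizes, cloning fees, and shipping, which are assumptions you would need to confirm with the vendor.

```python
def synthesis_cost(length_bp, price_per_bp):
    """Cost in USD to synthesize one gene of the given length."""
    return length_bp * price_per_bp


GENE_LENGTH_BP = 900        # approximate uricase gene length from the text
PRICE_RANGE = (0.07, 0.09)  # USD per bp, from the text

low, high = (synthesis_cost(GENE_LENGTH_BP, p) for p in PRICE_RANGE)
print(f"per variant: ${low:.0f}-${high:.0f}")

for n_variants in (2, 3):
    print(f"{n_variants} variants: ${n_variants * low:.0f}-${n_variants * high:.0f}")
```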

Phase 3: Test & Iterate (Weeks 6-10)

  1. Measure & Feed Back: Measure uricase activity (UA degradation assay, absorbance at 293 nm). Compare actual expression levels to Rosalind's predictions. This is the critical feedback step: take your actual results and feed them back to Rosalind/Bio Discovery. "My wild-type variant expressed at X mg/L, the engineered variant at Y mg/L, acid stability improved by Z%. Based on these results, what should my next round of mutations focus on?" The AI uses your data as context to make better-informed predictions for round 2.
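To keep the feedback step consistent from round to round, the follow-up prompt can be generated from a small results record rather than typed by hand. This is a minimal sketch: the function name, the field layout, and the example numbers are hypothetical, and the numbers are placeholders rather than measurements.

```python
def build_feedback_prompt(expression_mg_per_l, stability_change_pct):
    """Format bench results as a follow-up prompt for the next design round.

    expression_mg_per_l maps variant name -> measured expression in mg/L.
    stability_change_pct is the measured change in acid stability of the
    engineered variant relative to wild-type, in percent.
    """
    lines = [
        f"My {name} variant expressed at {value:g} mg/L."
        for name, value in expression_mg_per_l.items()
    ]
    lines.append(
        f"Acid stability changed by {stability_change_pct:+g}% "
        "relative to wild-type."
    )
    lines.append(
        "Based on these results, what should my next round of "
        "mutations focus on?"
    )
    return "\n".join(lines)


# Placeholder numbers for illustration only, not measurements.
print(build_feedback_prompt({"wild-type": 12.5, "engineered": 8.0}, 35))
```

Keeping the raw numbers in a dictionary also gives you a running dataset across rounds, independent of whichever model you paste the prompt into.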

Note: The $1,200 vs. $400M comparison: This is the exact same design-build-test-learn loop that pharma companies run, except they spend $400M on high-throughput screening that tests millions of random variants. You're spending $1,200 on AI-guided design that tests 3-5 computationally optimized variants. If the AI predictions are even 30% accurate, your per-discovery cost is 100,000x lower. This is not a metaphor — this is literally what MSK just did with Rosalind for antibodies.
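The 100,000x figure follows from the two budgets in the note plus one interpretive assumption: that "30% accurate" means a 30% chance that the small AI-guided round yields a discovery comparable to what the screening campaign yields with certainty. Under that assumption (which is a reading of the note, not something it states), the arithmetic is:

```python
PHARMA_SPEND_USD = 400_000_000  # screening spend quoted in the note
PROJECT_SPEND_USD = 1_200       # project budget quoted in the note


def cost_advantage(hit_rate):
    """Ratio of pharma cost per discovery to project cost per discovery.

    Assumes the pharma campaign always produces one discovery and the
    project produces one with probability hit_rate.
    """
    raw_ratio = PHARMA_SPEND_USD / PROJECT_SPEND_USD
    return raw_ratio * hit_rate


for hit_rate in (0.05, 0.30, 1.00):
    print(f"hit rate {hit_rate:.0%}: {cost_advantage(hit_rate):,.0f}x")
```

The loop shows how strongly the headline ratio depends on the assumed hit rate, which is the quantity the first bench round will actually start to measure.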

Tool Selection Guide

| Task | Best Tool (If You Have Access) | Open Source Alternative (Available Now) | Notes |
|---|---|---|---|
| Protein structure prediction | Rosalind / Bio Discovery | ColabFold (AlphaFold 2) or ESMFold (fast) | ColabFold = gold standard; ESMFold = rapid screening |
| Complex/multimer prediction | Bio Discovery (Boltz model) | Boltz-2 (MIT license) or Protenix-v1 | Handles protein-ligand, protein-protein complexes |
| Variant effect scoring | Rosalind | ESM-2 (log-likelihood scoring) | Zero-shot; no training data needed for your specific protein |
| Stability prediction (ΔΔG) | Rosalind / Bio Discovery | SPURS + RaSP + FoldX (free academic) | Run all three and trust consensus |
| Multi-mutation epistasis | Rosalind | DDGemb | Predicts combined effect of 3-5 simultaneous mutations |
| Codon optimization | Rosalind | CodonTransformer + GenSmart/IDT (free) | Cross-validate across multiple tools |
| De novo enzyme design | Rosalind / Bio Discovery | RFdiffusion2 + ProteinMPNN | Cutting-edge; Baker Lab pipeline; Phase 2 tool |
| Molecular docking | Bio Discovery | DiffDock or OpenDock | DiffDock for ligands; OpenDock for protease susceptibility |
| Binding affinity prediction | Bio Discovery | Boltz-2 | First DL model approaching FEP accuracy |
| Literature synthesis | Rosalind (database-connected) | Claude + PubMed | Claude is your most accessible reasoning tool |
| Experimental design & protocols | Rosalind | Claude | Claude excels at multi-step reasoning and protocol writing |
| Gene synthesis ordering | Bio Discovery (Twist/Ginkgo) | Direct order from Twist/IDT/GenScript | Bio Discovery adds routing convenience, not capability |
| Expression level prediction | Rosalind / Bio Discovery | Limited; use Claude + literature benchmarks | Gap in open source; active research area |
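The "log-likelihood scoring" listed for ESM-2 usually refers to comparing the model's probability for the mutant residue against the wild-type residue at the same position. The sketch below shows only that final scoring step, on made-up logits over four residues. It does not load ESM-2, and a real run would take logits over all 20 amino acids from the language model's output at the mutated position.

```python
import math


def log_softmax(logits):
    """Convert a dict of residue -> logit into residue -> log-probability."""
    peak = max(logits.values())
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {residue: v - log_z for residue, v in logits.items()}


def mutation_score(position_logits, wild_type, mutant):
    """Log-likelihood ratio log p(mutant) - log p(wild_type) at one position.

    Positive scores mean the model finds the mutant residue more plausible
    than the wild-type residue in this sequence context.
    """
    log_probs = log_softmax(position_logits)
    return log_probs[mutant] - log_probs[wild_type]


# Toy logits at a single position (made-up numbers, not model output).
toy_logits = {"A": 2.0, "V": 1.0, "D": -1.0, "G": 0.5}

print(mutation_score(toy_logits, wild_type="A", mutant="V"))
```

Because the score is a ratio, it needs no training data for your specific protein, which is what "zero-shot" means in the table.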

Part 05 — What This Means for Open Enzyme

Step back and look at the timing. Open Enzyme was conceived in April 2026. GPT-Rosalind launched April 17. Amazon Bio Discovery launched April 14. Anthropic acquired Coefficient Bio on April 3. Three major AI biology platforms arrived in the same two-week window that the project started.

This is not merely a coincidence to take advantage of. It is the reason the project is possible at all.

The Capability Shift

  1. Democratized Access to Pharma-Grade AI: Memorial Sloan Kettering used Rosalind to generate roughly 300,000 drug candidates. The same model is free during the research preview, although access runs through the Trusted Access Program and is not guaranteed. The same Twist Bioscience that synthesizes MSK's antibodies synthesizes your uricase gene. The tools are the same. The only difference is scope, and for a single-enzyme project that works in your favor.

  2. The AI-Guided Experiment Changes the Economics: Traditional pharma screens millions of random variants because it cannot predict which ones will work. AI-guided design lets you order 3-5 variants because the model has already done the equivalent of millions of in silico screens. Your $1,200 gene synthesis budget buys only a handful of experiments, but each one is computationally optimized rather than random. The expected value per dollar spent is fundamentally different.

  3. Open Source + AI Biology = Genuinely New — The combination of open source ethos, AI-assisted biology, and citizen science has no real precedent. Open source software democratized code. AI biology is democratizing drug discovery. Open Enzyme sits at the intersection — an open source therapeutic enzyme library designed with the same AI tools that pharma uses, but distributed as a public good rather than a $300K/year prescription.

  4. The Window Is Now — Rosalind is free during research preview. Bio Discovery has a free trial. Claude's life sciences capabilities are expanding rapidly. The competitive landscape among AI providers means they're all racing to offer biology capabilities — and racing to give researchers free or cheap access during the land-grab phase. This won't last forever. The time to establish the project, run the first experiments, and build the dataset is now, while the tools are free and the providers are hungry for use cases.

Note: The bottom line: Two weeks ago, building an AI-designed therapeutic enzyme at home was a thought experiment. Today, you have a frontier biological reasoning model (Rosalind), a platform of 40+ specialized bio AI models with lab integration (Bio Discovery), an AI company investing $400M in biological reasoning (Anthropic/Coefficient), three potential PhD collaborators who understand the gut biology, and a $1,200 budget that buys 3-5 computationally optimized gene constructs. The tools exist. The science is sound. The access is open. Execute.


Appendix — Quick Reference

Commercial / Restricted:

| Tool | URL | Status |
|---|---|---|
| GPT-Rosalind | openai.com (apply for Trusted Access Program) | Research preview (free), US enterprise only |
| Codex Life Sciences Plugin | Available in Codex (GitHub) | Free, available now to all GPT-5.4 users |
| Amazon Bio Discovery | aws.amazon.com/biodiscovery | Early access, $486/mo or free trial (5 experiments) |
| Claude for Life Sciences | anthropic.com | Available now (Benchling, PubMed connectors) |
| Hugging Science | huggingscience.co/llms.txt | Free curated index of open AI-for-science resources |

Open Source (Available Now):

| Tool | URL | License | Hardware |
|---|---|---|---|
| ColabFold (AlphaFold 2) | github.com/sokrypton/ColabFold | MIT | Free Colab |
| ESMFold | huggingface.co/facebook/esmfold_v1 | MIT | Free Colab |
| ESM-2 (protein language model) | github.com/facebookresearch/esm | MIT | Free Colab |
| ESM-3-open | github.com/evolutionaryscale/esm | Non-commercial | Free Colab |
| Boltz-2 (complex prediction) | github.com/jwohlwend/boltz | MIT | Colab Pro or local GPU |
| Protenix-v1 | github.com/bytedance/protenix | Open source | Colab Pro or local GPU |
| RFdiffusion2 (enzyme design) | github.com/RosettaCommons/RFdiffusion2 | BSD | Colab Pro or local GPU |
| ProteinMPNN (sequence design) | github.com/dauparas/ProteinMPNN | MIT | Free Colab |
| SPURS (stability prediction) | github.com/mj-hwang/SPURS | Open source | Free Colab |
| RaSP (rapid stability) | github.com/KULL-Centre/papers | Open source | Free Colab |
| DDGemb (multi-mutation ΔΔG) | github.com/PeppeL-G/DDGemb | Open source | Free Colab |
| FoldX (physics-based stability) | foldxsuite.crg.eu | Free academic | CPU only |
| DiffDock (molecular docking) | github.com/gcorso/DiffDock | MIT | Colab Pro or local GPU |
| CodonTransformer | github.com/Adibvafa/CodonTransformer | Open source | Free Colab |

Free Codon Optimization (Web Tools):

| Tool | URL |
|---|---|
| GenScript GenSmart | genscript.com/tools/gensmart-codon-optimization |
| IDT Codon Optimization | idtdna.com/pages/tools/codon-optimization-tool |
| VectorBuilder | vectorbuilder.com/tool/codon-optimization |

Prompt Execution Order

  1. Project Context Preamble — Paste at the start of every session. Sets the project constraints.
  2. Uricase Variant Selection — Pick your starting enzyme. Everything else depends on this choice.
  3. GI Survival Prediction — Understand the survival challenge before trying to fix it.
  4. Protein Engineering — Design mutations to address GI survival weaknesses.
  5. Codon Optimization + Expression Cassette — Convert your protein design into orderable DNA sequences.
  6. Cross-Check & Validate — Stress-test assumptions before spending money on synthesis.
  7. Parallel Tracks: Koji (6), NLRP3 (7), Digestive Enzymes (8) — Run these concurrently with the yeast primary track.

Key Databases (Connected via Codex Plugin)

| Database | What It Contains | Open Enzyme Use |
|---|---|---|
| UniProt | Protein sequences, annotations, function | Uricase variant sequences, known mutations |
| PDB | 3D protein structures | Crystal structures for docking, stability analysis |
| NCBI/GenBank | Gene sequences, genome data | Gene sequences for codon optimization input |
| KEGG | Metabolic pathways, enzyme reactions | Uric acid degradation pathway, biosynthesis routes |
| ChEMBL | Bioactive compounds, binding data | NLRP3 inhibitor data, known binding affinities |
| PubMed | Biomedical literature | Everything: validation of every claim |

Codex Life Sciences Plugin — Setup & Access

Quick reference for accessing OpenAI's Codex Life Sciences plugin and evaluating GPT-Rosalind access. The plugin is free and public; GPT-Rosalind is invitation-only.

Codex Life Sciences Plugin (free, available now)

Routes research queries to 50+ biology databases automatically — human genetics, functional genomics, protein structure, pathway biology, clinical evidence, public study discovery. Returns integrated answers with citations. Works with existing GPT-5.4; no model upgrade needed.

Install:
- CLI: /plugins install github.com/openai/plugins/life-science-research
- Codex App: Plugins → search "Life Sciences Research" → Install

Requires: Codex CLI or Desktop installed; valid OpenAI API key if using CLI. No special approval. Cost: free.

GPT-Rosalind (restricted — Trusted Access Program)

Domain-specific model (not a plugin) for drug discovery, protein engineering, genomics interpretation, hypothesis generation, multi-step experimental planning. Current partners: Amgen, Moderna, Thermo Fisher, Allen Institute, Los Alamos. US enterprise only.

Apply: https://openai.com/form/life-sciences-access/

Approval typically requires institutional affiliation, published research or credible plan, safety/ethics framework, and an OpenAI enterprise agreement. Not guaranteed and not fast.

For Open Enzyme: the free Codex plugin covers the majority of use cases. GPT-Rosalind adds specialized reasoning but isn't essential to start.

Open Enzyme Research Library

This document is part of the Open Enzyme research documentation.

Project Information:
- Author: Brian Abent (Open Enzyme Project)
- Date: April 2026
- Status: Active research
- License: Open source (see project repository)