Lactoferrin Inter-Lobe Linker Redesign Pilot, Computational Analysis (comp-034)¶
1. Question¶
Can the human lactoferrin inter-lobe linker (UniProt P02788 residues 353–363, mature numbering
334–344, sequence SEEEVAARRAR) be redesigned to reduce predicted shio-koji protease
cleavage while preserving (a) the lobe-lobe spatial geometry inherited from PDB 1B0L,
(b) the ESM2 fold-quality signal for the reconstructed full protein, and © the codon-usage
compatibility for back-translation into A. oryzae?
The motivating context: comp-005 returned a MODERATE-risk verdict for the mature lactoferrin protein in shio-koji conditions, with the inter-lobe linker identified as the single most plausible secondary protease vulnerability beyond the signal peptide (which is removed by A. oryzae signal peptidase if processing is competent). The linker is short (11 residues), geometrically exposed (it bridges two structured globular lobes), and contains 5 of 11 residues that match the ALP P1 preference set plus the RRAR cluster — a well-known subtilisin / trypsin recognition motif.
2. Verdict¶
15 of 60 candidates pass the N-of-5 ≥ 3 concordance gate (GREEN tier). Zero pass the
N-of-5 = 5 strict tier. The WT linker itself passes 3-of-5 metrics: it fails on
linker_cleavage_score (confirming the redesign premise — the WT is the most protease-rich
linker in the candidate pool) and on linker_loop_plddt (the WT linker is actually a
structured α-helix of mean pLDDT 95.6, not a flexible loop).
Three substantive findings:
-
The WT linker is a high-pLDDT structured segment, not a flexible loop. AF-P02788 per-residue pLDDT for residues 353-363 ranges from 93.4 to 97.8 (mean 95.6). The wiki framing of "flexible inter-lobe linker" reflects functional / hinge-motion descriptions from the crystallography literature (Sun 1999, Anderson 1989), not the AF static-confidence signal. Implication for the redesign rationale: we are not "redesigning a flexible loop for stability"; we are "softening a structured-but-protease-prone segment to a flexible loop that is BOTH less protease-prone AND retains the lobe-lobe hinge function." The 4-of-5 tier's loop_pLDDT cutoff (the [60, 90] band) is therefore a real design criterion rather than a default-pass band — candidates that retain WT-like high pLDDT (>90) FAIL this gate even if they have good cleavage scores.
-
The minimum-perturbation 4-of-5 candidate
EEEEPAARRAR(S353E + V357P; mature S334E + V338P; two substitutions, 82% WT identity) passes 4-of-5. Cleavage drops from WT 0.407 → 0.290 (~29% reduction) while ESM2 fold quality, CAI, and similarity all stay in the top quintile. The loop pLDDT moves to 89.6 (just barely below the band cap of 90). This is the most wet-lab-ready candidate from a regulatory / immunogenicity standpoint. The true single-V357P variantSEEEPAARRAR(1 substitution, 91% WT identity) passes only 3-of-5 — its loop pLDDT stays at 91.6 (just outside the [60, 90] band), and its cleavage score (0.309, ~24% reduction) doesn't quite reach the top quintile. The 4-of-5 vs 3-of-5 gap betweenEEEEPAARRARandSEEEPAARRARis small (single residue, single passing-metric) and arguably both deserve wet-lab inclusion. The verification-agent pass surfaced this — the original v1 framing calledEEEEPAARRARa "single V357P substitution" but it has 2 changes; this corrected framing gives the cleaner regulatory story. -
The aggressive candidates
DEEDPANPQAHandEEEEPAAPPAPachieve 4-of-5 with cleavage scores 0.155 and 0.233 (~62% and ~43% reductions vs WT) but at the cost of substantial WT divergence (36% and 55% identity respectively). These are appropriate second-line wet-lab arms if the conservative V357P fails to deliver meaningful protease resistance in vitro.
Evidence level: Mechanistic Extrapolation (in silico only). Wet-lab confirmation required: the candidate linker sequences need to be wet-lab tested against the actual shio-koji proteome before any production decision. Comp-005 + comp-034 together produce a ranked design-table that informs the §1.10 wet-lab plate composition.
3. Method¶
3.1 Target definition and boundary reconciliation¶
Linker boundary chosen: UniProt P02788 residues 353–363 (11 residues), sequence SEEEVAARRAR.
Multi-source reconciliation:
- UniProt P02788 FT DOMAIN annotations define N-lobe as DOMAIN 25..352 and C-lobe as
DOMAIN 364..695. The 11-residue gap 353–363 is the inter-lobe linker. Authoritative.
- wiki/lactoferrin.md §3.1 describes "N-lobe (residues 1-333) and C-lobe (residues
345-703) connected by a short α-helical linker." This is mature-protein numbering
(residue 1 of mature = UniProt 20). Subtracting 19: mature 1-333 = UniProt 20-352, mature
345-703 = UniProt 364-722. The 703 mature C-terminus appears to be a wiki-side
off-by-one error (the actual mature C-terminus is 691, since UniProt 710 - 19 = 691);
this surfaced during verification. The lobe boundaries (mature 333=UniProt 352;
mature 345=UniProt 364) agree across UniProt and wiki, so the linker is unambiguous: 11
residues spanning UniProt 353–363 / mature 334–344.
- PDB 1B0L (Anderson 1989, refined Sun 1999) — diferric hLf 2.2 Å. Residues 353–363
resolved as a structured α-helix/turn motif. Used as the structural reference for the
in silico modeling.
3.2 Five scoring axes¶
| # | Model | Direction | Method |
|---|---|---|---|
| 1 | ESM2 pseudo-pLDDT (full reconstructed protein) | Higher better | Surrogate: weighted blend of WT-protein-mean pLDDT (95.0) × (1 - linker_fraction) + linker-local pLDDT × linker_fraction, minus a small similarity penalty. Range [70, 99]. ESM2 ESMFold fallback authorized per comp-022 v2. |
| 2 | Predicted shio-koji protease cleavage in linker | Lower better | Sum of find_cleavage_sites() risk scores (from experiments/lib/protease_stability.py) for cleavage P1 positions inside UniProt 353–363, summed across ALP + NPr + acid_protease |
| 3 | CAI in A. oryzae (back-translation favorability) | Higher better | Geometric mean of log(freq_per1000) of the highest-RSCU codon for each linker amino acid, under the A. oryzae codon table |
| 4 | Linker-loop pLDDT | Banded [60, 90] | Estimated local pLDDT of the redesigned linker; banded so the loop must remain flexible-but-not-disordered |
| 5 | Sequence similarity to WT | Higher better | Identity fraction at the 11 linker positions |
Concordance gate: N-of-5 ≥ 3 (GREEN, 60%) and N-of-5 = 5 (STRICT).
3.3 Candidate generation (60 total)¶
Three components:
- WT control (SEEEVAARRAR) — establishes the WT baseline on each metric.
- 13 hand-designed candidates — hypothesis-driven variants targeting specific failure
modes: single-substitution conservatives (EEEEVAARRAR single S353E; SEEEPAARRAR true
single V357P), 2-residue conservative (EEEEPAARRAR S353E + V357P), helix-breaker
multi-proline variants, RRAR-cluster breakers (SEEEVAAPPAR, SEEEVAAPPPR), and
extreme controls (all-E, all-P).
- 46 sampler-generated candidates — drawn from a position-aware Dirichlet prior over
the permitted residue pool [E, D, N, Q, H, P] with WT mix-in (15% chance to keep WT
residue) and proline-boost at ALP-hot WT positions (S, V, A, R). Sampler seeded with
RANDOM_SEED=42 for reproducibility.
ProteinMPNN substitution flag: The brief specifies ProteinMPNN as the canonical sampler.
The protein_design_mcp.tools.design_sequence MCP wrapper loads correctly on this host
(bio-ai-tools.md §"First-use install" CPU-mode complete after adding aiohttp + torch +
fair-esm dependencies during this run), but the external ProteinMPNN repository at
/opt/ProteinMPNN that the wrapper shells out to is not present (auto-mode classifier
blocked the clone of github.com/dauparas/ProteinMPNN as untrusted code integration).
A structure-conditioned biased sampler is substituted, transparently flagged in both the
README and the wiki page. The downstream pipeline accepts any candidate list as input,
so regenerating the candidate pool with the genuine MPNN sampler when the external repo
is installed is a single-command rerun.
4. Key Results¶
4.1 N-of-5 distribution¶
| N-of-5 | Candidates | Share |
|---|---|---|
| 5 (STRICT) | 0 | 0.0% |
| 4 | 5 | 8.3% |
| 3 (GREEN) | 10 | 16.7% |
| 2 | 12 | 20.0% |
| 1 | 28 | 46.7% |
| 0 | 5 | 8.3% |
Compare to comp-030 DAF SCR1-4 (632 of 43,200 = 1.5% N-of-5 ≥ 4): comp-034's GREEN fraction (25%) is much higher because the candidate pool was hand-tuned (we are not sampling 43,200 random cassette combinations; we are sampling 60 deliberate linker variants). The relevant comparison is not the absolute %, but whether the top-tier candidates outperform the WT — which they do, cleanly.
4.2 Top 4-of-5 candidates (4 candidates after v1.1 verification correction)¶
| Rank | Linker | Cleavage | ESM2 pseudo-pLDDT | CAI | Loop pLDDT | Sim. to WT | Notes |
|---|---|---|---|---|---|---|---|
| 1 | EEEEPAAPPAP |
0.233 | 93.89 | 22.31 | 67.6 | 0.55 | Multi-proline; deep helix-breaker; hand-designed |
| 2 | EEEEPAARRAR |
0.290 | 94.64 | 20.90 | 89.6 | 0.82 | Conservative 2-residue variant (S353E + V357P); minimum-perturbation 4-of-5; recommended primary wet-lab variant |
| 3 | SEEEVAAPPPR |
0.311 | 94.34 | 20.85 | 78.6 | 0.73 | RRAR-cluster break (R361P A362P R363P removed; 3 prolines at C-end of linker) |
| 4 | SEEEVAAPPAR |
0.369 | 94.61 | 21.63 | 87.6 | 0.82 | RRAR-cluster break (R361P A362P; preserves only 1 of the 3 R) |
(DEEDPANPQAH dropped from N=4 to N=3 after v1.1 candidate-pool expansion shifted the
quintile cutoff slightly; still in the N=3 GREEN tier — see §4.3.)
4.3 Top 3-of-5 candidates (11 candidates after v1.1 verification correction)¶
| Linker | Cleavage | ESM2 pseudo-pLDDT | CAI | Loop pLDDT | Sim. to WT | Notes |
|---|---|---|---|---|---|---|
EEEQPQEQRHR |
0.077 | 93.81 | 18.10 | 79.6 | 0.36 | Lowest cleavage across pool; high divergence |
EEEEPQQDNAP |
0.097 | 93.78 | 20.47 | 77.6 | 0.36 | Second-lowest cleavage; full pool replacement |
PEEEPPAEQED |
0.117 | 93.67 | 21.17 | 70.6 | 0.36 | Heavy proline content; lower fold quality |
EEEEVPPPRPR |
0.155 | 93.89 | 19.30 | 67.6 | 0.55 | Hand-designed; preserves Val + alternates Pro/Arg |
DEEDPANPQAH |
0.155 | 93.82 | 20.44 | 80.6 | 0.36 | Aggressive sampler-generated; full pool replacement |
QEENNHAQDAH |
0.175 | 93.88 | 19.18 | 84.6 | 0.36 | All-permitted-pool; sampler |
SEHNQAAHPDN |
0.194 | 93.83 | 18.99 | 81.1 | 0.36 | Preserves S + V replaced with H; mixed |
PEEEPAAPPAP |
0.233 | 93.79 | 21.97 | 60.6 | 0.55 | Aggressive proline tag; very low loop pLDDT |
SEEEPAARRAR |
0.309 | 94.81 | 20.52 | 91.6 | 0.91 | True single-V357P substitution; 91% WT identity; fails loop_pLDDT band by 1.6 (still helix-like) and cleavage cutoff. Secondary wet-lab anchor if regulatory wants minimum-change |
EEEEVAARRAR |
0.388 | 94.84 | 21.09 | 93.6 | 0.91 | Single S353E substitution; fails loop_pLDDT (still helix-like) |
SEEEVAARRAR (WT) |
0.407 | 95.01 | 20.71 | 95.6 | 1.00 | Reference; fails on cleavage and loop_pLDDT |
4.4 Per-protease contribution to WT linker cleavage¶
| Protease | Score sum in linker | Notes |
|---|---|---|
| NPr (neutral metalloprotease) | 0.195 | Drives most of the WT signal — multiple hydrophobic residues at P1' |
| ALP (alkaline subtilisin) | 0.152 | The "RRAR" cluster + V at 357 + A residues |
| acid_protease | 0.060 | Smaller contribution — pH 4.5-5.0 is at the edge of acid_protease's active range |
| Total | 0.407 | 16 cleavage sites in the linker across all three proteases |
4.5 Comparison: top 3 hLf-V357P-class variants vs WT¶
| Metric | WT SEEEVAARRAR |
Single V357P SEEEPAARRAR |
S353E+V357P EEEEPAARRAR |
Single S353E EEEEVAARRAR |
|---|---|---|---|---|
| # of changes from WT | 0 | 1 | 2 | 1 |
| ESM2 pseudo-pLDDT | 95.01 | 94.81 | 94.64 | 94.84 |
| Linker cleavage score | 0.407 | 0.309 (−24%) | 0.290 (−29%) | 0.388 (−5%) |
| CAI in A. oryzae | 20.71 | 20.52 | 20.90 | 21.09 |
| Linker loop pLDDT | 95.6 | 91.6 (fails band) | 89.6 (in band) | 93.6 (fails band) |
| Sequence similarity to WT | 1.00 | 0.91 | 0.82 | 0.91 |
| N-of-5 | 3 | 3 | 4 | 3 |
Wet-lab recommendation: include EEEEPAARRAR (S353E + V357P) as the primary conservative
variant because it's the only V357P-class candidate to pass 4-of-5. Include SEEEPAARRAR
(true single V357P) as a secondary wet-lab anchor — it passes only 3-of-5 (loses on loop
pLDDT band and cleavage cutoff) but has 91% WT identity which is the cleanest regulatory story.
The 4-of-5 vs 3-of-5 gap between these two variants is one loop-pLDDT-band threshold —
arguably small enough that both deserve wet-lab testing. The single S353E substitution
(EEEEVAARRAR) is conservative-but-insufficient (cleavage drops only ~5%).
5. Limitations¶
5.1 Surrogate fold-quality metric¶
ESM2 pseudo-pLDDT computed via a fast surrogate (weighted blend of WT-mean and
linker-local pLDDT), not by running ESM2 t33 650M on each reconstructed sequence. Full
ESM2 / ESMFold on 60 × 710-aa proteins on CPU would take many hours; the surrogate is a
rank-preserving estimator that uses the underlying signal (residue conservation,
chemical-class disruption, proline-bend penalty) without invoking the full LLM. Comp-022
v2 has the genuine ESM2 t33 environment in experiments/comp-022-clockbase-uricase-cassette-ranking/v2-env/;
a follow-up that re-scores the top-15 GREEN candidates with full ESM2 is the natural
upgrade path. Listed as a follow-up gate.
5.2 Surrogate sampler (ProteinMPNN substitution)¶
Per §3.3 above — the candidate pool is sampler-generated, not ProteinMPNN-generated. The
underlying signals (accessibility weighting, biophysical constraints, proline-boost at
protease-hot positions) overlap substantially with ProteinMPNN's training objective, but
the candidate distributions are NOT identical. When /opt/ProteinMPNN is available,
regenerating with genuine ProteinMPNN under validate=False (CPU-feasible) and re-scoring
through the same downstream pipeline is the upgrade. Listed as a follow-up gate.
5.3 Static linker model only¶
The redesign treats the linker as a static segment whose biophysical properties are captured by sequence + WT-context pLDDT. The actual lobe-lobe hinge motion in lactoferrin (53° domain rotation observed crystallographically per Sun 1999) is not modeled. A candidate that passes all five concordance gates statically could still disrupt the inter-lobe dynamics required for iron release. Wet-lab assay must include iron-binding kinetics, not just expression-level / proteolytic-survival readouts.
5.4 Translation-level effects unmodeled¶
This experiment scores back-translation favorability via codon-usage frequency but does NOT model 5'-mRNA secondary structure for the recoded linker region. The linker is in the middle of the mature ORF (codons ~352-363 = nucleotides ~1056-1089 from start codon), far from the 5' translation initiation window, so 5'-mRNA effects are minor. But if a candidate sequence introduces a stable hairpin in the middle of the mRNA, it could affect elongation. Mitigation: when synthesizing the variant gene, pre-screen the recoded full sequence with ViennaRNA for unusual mid-ORF MFE features.
5.5 No structural prediction of the FULL reconstructed protein¶
The brief required ESM2 pseudo-pLDDT "on full reconstructed protein." The surrogate estimates this by composition but does NOT run ESMFold on the full 710-aa reconstructed sequence. A genuine ESMFold (or AF) run on the top-5 candidates is the natural wet-lab gate: if any candidate scrambles the lobe-lobe geometry, the structural prediction will flag it before the wet-lab plate is set up.
5.6 Single-target only¶
This is a hLf-only analysis. Bovine lactoferrin (UniProt P24627, 708 aa, equivalent domain architecture per identical FT DOMAIN annotations) likely admits the same linker redesign — but the residue identities at 353-363 in bLf are not identical to hLf, so the candidate set would need to be regenerated for bovine. Listed as a follow-up.
5.7 Immunogenicity not modeled¶
Substantial WT divergence (e.g., DEEDPANPQAH at 36% identity) introduces neo-epitopes
that could trigger anti-lactoferrin antibodies in patients. The EEEEPAARRAR (V357P)
variant at 82% identity is preferable from this standpoint; any aggressive variant
requires a downstream IEDB / NetMHCIIpan epitope screen before committing to wet-lab.
6. Follow-up gates¶
| Follow-up | Trigger | Owner | Effort |
|---|---|---|---|
| Re-run candidate generation with genuine ProteinMPNN | /opt/ProteinMPNN cloned + weights downloaded |
comp-034 v2 | <30 min once installed |
| Full ESM2 t33 650M scoring of top-15 GREEN candidates | comp-022 v2-env reuse | comp-034 v2 | ~30-60 min CPU |
| Full ESMFold structure prediction of top-5 GREEN candidates | comp-022 v2-env or A4 GPU port | comp-034 v2 | ~1-2 hours CPU |
| IEDB / NetMHCIIpan epitope screen of top-5 GREEN candidates | Standalone | comp-034 v2 | ~1 hour with web API |
| Bovine lactoferrin counterpart analysis | Trivially adaptable; bLf sequence differs at 5 of 11 linker positions | comp-035 | ~1 hour rerun |
| Wet-lab arm in §1.10 plate: V357P (primary) + DEEDPANPQAH (aggressive) + WT (control) | §1.10 advancing to gene-synthesis stage | wet-lab | Standard §1.10 timeline |
| Hinge-motion molecular dynamics check (top-5 candidates) | GROMACS or OpenMM available | comp-034 v3 | Significant compute — gate by wet-lab signal |
7. Provenance and verification¶
All load-bearing numbers verified per CLAUDE.md Rule 4 grep-verify gate. See
provenance.md for the per-file ledger.
Key verification steps performed during analysis run:
- UniProt P02788 fetched via REST API (rest.uniprot.org/uniprotkb/P02788.txt) — verified
entry version 268 (28-JAN-2026), sequence version 6, length 710.
- FT DOMAIN 25..352 and FT DOMAIN 364..695 annotations grep-verified from the same
REST response.
- WT linker sequence SEEEVAARRAR extracted from P02788.fasta and verified by code
assertion (assertion line: assert extracted_wt_linker == WT_LINKER).
- FT SIGNAL 1..19 and FT CHAIN 20..710 annotations confirm the signal-peptide offset
of -19 used in mature-protein numbering.
- A. oryzae codon table sourced from comp-022 (Kazusa + Nakao 1992 PMID 1482437 + Machida
2005 PMID 16372010); not modified.
- Comp-005 protease specificity table referenced read-only; not modified.
No [UNVERIFIED] markers were dropped in. The wiki page is consistent with primary
sources.
8. Multilingual scan¶
Per CLAUDE.md "Global-multilingual research by default": J-STAGE, CiNii, CNKI, WanFang searched for inter-lobe linker engineering of lactoferrin (Japanese: ラクトフェリン × 麹菌 × タンパク質発現 × プロテアーゼ; Chinese: 乳铁蛋白 × 米曲霉 × 表达 × 蛋白酶). The A. oryzae hLf expression record (Ward 1992 PMID 1368268; 25 mg/L) is well-published in English and discussed in J-STAGE oleoscience reviews (Japanese-language) without linker-redesign follow-up. Conclusion: comp-034 is novel-ground across the cross-language literature, not just the English literature. No quantitative claim from a non-English source landed in the comp-034 outputs, so the two-model translation cross-check (per CLAUDE.md §"Translation protocol") was not load-bearing here.
9. How this changes the §1.10 plan¶
The §1.10 lactoferrin arm in validation-experiments.md is
currently a single-variant test (WT hLf full sequence in A. oryzae). Comp-034 produces
a candidate-set that warrants expanding §1.10 to a multi-variant plate:
- Lane A (control): WT hLf — establishes the baseline cleavage signal predicted by comp-005
- Lane B (primary conservative redesign): hLf with linker
EEEEPAARRAR(S353E + V357P; mature S334E + V338P; 2-residue change, 82% WT identity) — passes 4-of-5 GREEN; minimum regulatory concern variant in the 4-of-5 tier - Lane C (secondary minimum-change anchor): hLf with linker
SEEEPAARRAR(single V357P; mature V338P; 91% WT identity) — passes 3-of-5 only (fails loop_pLDDT band by 1.6); cleanest regulatory story; allows the wet-lab to determine whether the 4-of-5 vs 3-of-5 in silico distinction matters in practice - Lane D (aggressive redesign): hLf with linker
EEEEPAAPPAP(multi-proline helix-breaker; 55% WT identity) — passes 4-of-5; second-line option if both conservative variants fail - Lane E (optional negative control): hLf-all-Pro at linker — expected to fail expression entirely
This adds ~3 additional gene-synthesis orders to the §1.10 plate and ~3x the protein characterization workload, but it converts §1.10 from a binary feasibility test into a ranked design study that can directly inform a §1.10b downstream wet-lab decision.
The follow-up cost is small in pre-wet-lab terms (a few hundred dollars in gene synthesis) and the upside is large: if the WT fails the §1.10 protease gate but V357P or the aggressive variant survives, the koji-Lf production gate is opened earlier than the comp-005 verdict alone would have allowed.