Manual Literature Mining Protocol: Paperclip MCP Discipline
A small but load-bearing methodology page. Codifies the verification discipline for using Paperclip MCP (the full-text PubMed / bioRxiv / medRxiv / arXiv corpus) safely in this project — surfaced 2026-05-05 by the wiki sweep daemon (Pass 2 Connection #3 + Priority Action #1). The motivation: Paperclip's index of ~11M full-text papers is genuinely valuable for deep-dive questions, but the map operator (the convenience-aggregation primitive) hallucinates — wrong organisms, fabricated kinetic numbers, made-up author affiliations — making automated unverified use unsafe. This page defines the safe-use workflow.
Why this protocol exists
Specific failure mode documented 2026-05-05: a Paperclip map query against a multi-paper question returned organism names not present in the source papers AND specific kinetic numbers (Km, IC50) that did not appear in any of the cited sources when grep-verified. The map primitive appears to do model-side aggregation of partial extracts, which the underlying model fills in plausibly but not faithfully. (Memory record: feedback_paperclip_map_unreliable.md, decided 2026-05-05.)
The search, cat, head, grep, and scan primitives DON'T exhibit this failure mode: they return raw paper content that is line-anchored and verifiable. The protocol below confines automated use to those safe primitives and requires explicit verification before any quantitative claim derived from Paperclip enters the wiki.
The five-rule discipline
1. Use safe primitives only
Allowed: search, cat, head, grep, scan.
Forbidden in any automated workflow: map, reduce, or any aggregation primitive that produces synthesized output rather than raw paper content.
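One way to enforce the allowlist in an automated workflow is a default-deny check before any Paperclip call is dispatched. The sketch below assumes the workflow can see the primitive name as a plain string; `is_safe_primitive` is a hypothetical helper, not part of Paperclip.

```shell
# Hypothetical guard: succeeds only for primitives that return raw paper content.
is_safe_primitive() {
  case "$1" in
    search|cat|head|grep|scan) return 0 ;;  # allowed by Rule 1
    *) return 1 ;;                          # map, reduce, and anything unknown
  esac
}

# Example: decide whether a call may be dispatched.
for primitive in grep map; do
  if is_safe_primitive "$primitive"; then
    echo "$primitive: allowed"
  else
    echo "$primitive: blocked"
  fi
done
```

Denying by default means a new aggregation primitive stays blocked until someone deliberately adds it to the list.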
If a finding requires aggregation across multiple papers (e.g., "summarize the IC50 range for compound X across all NLRP3-relevant studies"), do the aggregation in three stages: (1) search + grep to surface the specific lines containing IC50 values; (2) human-verify each value against cat of the source paper; (3) write the synthesis in the wiki page, not in a Paperclip primitive call.
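The first stage of that aggregation (surfacing candidate lines) might look like the sketch below. It assumes the papers are reachable as `<root>/<id>/content.lines`, following the paths used on this page; `surface_lines` is a hypothetical helper name. It deliberately does no synthesis: it only prints raw lines for a human to check.

```shell
# Stage 1 only: print every line that mentions a term, across several papers.
# Output format: <paper_id>:<line_number>:<line text>
# Layout <root>/<id>/content.lines follows the paths used on this page.
surface_lines() {
  local root=$1 term=$2 id file
  shift 2
  for id in "$@"; do
    file="$root/$id/content.lines"
    [ -f "$file" ] || continue          # skip ids with no local content
    # grep exits 1 on "no match"; "|| true" keeps the loop going in strict shells.
    grep -n -i -F -- "$term" "$file" | awk -v id="$id" '{ print id ":" $0 }' || true
  done
}

# Usage (paper ids are placeholders):
#   surface_lines /papers "IC50" <id1> <id2>
```

Each printed line still has to be read in context with cat before its value is written into the wiki.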
2. Anchor identity to meta.json
Every paper Paperclip returns has a /papers/<id>/meta.json file with: title, authors, journal, year, PMID, doi. Cite from meta.json, not from inferred metadata in the paper body. A common failure mode is misattributing a paper based on a title-like string in the body that turns out to be a citation TO another paper, not the paper's own title.
Workflow: cat /papers/<id>/meta.json | jq .title (or read directly) before quoting a paper.
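A citation stub can be built from meta.json alone, so nothing from the paper body leaks into the attribution. This is a sketch using the field names listed above; it assumes `authors` is an array of strings, which this page does not confirm, and `cite_from_meta` is a hypothetical helper name.

```shell
# Build a citation stub from meta.json only (never from the paper body).
# ASSUMPTION: "authors" is a JSON array of strings; adjust if it is a plain string.
cite_from_meta() {
  jq -r '"\(.authors | join(", ")). \"\(.title).\" \(.journal) (\(.year)). PMID \(.PMID). doi:\(.doi)"' "$1"
}

# Usage:
#   cite_from_meta /papers/<id>/meta.json
```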
3. Grep-verify all numbers before they enter the wiki
For any quantitative claim sourced from Paperclip — IC50, Km, dose, sample size, percent change, p-value, cohort n — grep for the specific number in the source paper's content.lines file BEFORE writing the number into the wiki. The verification step takes ~30 seconds and catches the map-class hallucination directly.
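# Example: verify a claimed IC50
# List every line that mentions IC50 (case-insensitive), with line numbers
grep -n -i "IC50" /papers/<id>/content.lines
# Match the claimed value literally (-F: the dot is a dot, not a wildcard)
grep -n -F "5.18" /papers/<id>/content.lines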
If the number doesn't appear in the source, it didn't come from the source. Do not propagate.
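The same check can be wrapped so that a missing number fails loudly instead of silently printing nothing. A minimal sketch, with `verify_number` as a hypothetical helper name:

```shell
# Succeeds only if the literal number appears in the given content.lines file.
# -F matches the string literally, so "." is a dot rather than "any character";
# -n prints line numbers that can be reused for a line-anchored citation.
verify_number() {
  local file=$1 number=$2
  if grep -n -F -- "$number" "$file"; then
    return 0
  fi
  echo "NOT FOUND: $number is absent from $file; do not propagate" >&2
  return 1
}

# Usage:
#   verify_number /papers/<id>/content.lines "5.18"
```

This is a substring match, so "5.18" also hits "15.18". The printed lines still need a human read to confirm the unit and the quantity being reported.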
4. Never propagate map/reduce summaries
If you find yourself looking at output from Paperclip's map or reduce primitives (e.g., from a prior session, from a logged subagent task, from the daemon's intermediate state) — treat that output as un-validated and re-derive from the safe primitives. A map summary that LOOKS reasonable cannot be assumed reasonable.
This applies even when the map output is convenient. Convenience is the failure-mode signal — map is fast precisely because it skips verification.
5. Cite line-anchored, with the project's citation format
Paperclip provides line-anchored URLs of the form https://citations.gxl.ai/papers/<doc_id>#L<n>. Always cite at the line-anchor level, never just at the paper level. This makes verification trivial for a future reader (or for the sweep daemon's Pass 3 reviewer) and forces the citation to point at a specific claim, not a generalized invocation of "the paper."
Citation discipline (from existing wiki convention):
- Inline: [1], [2]
- References block at end of page:
--------
REFERENCES
[1] Authors. "Title." *Journal* vol, pages (year). doi:XX
https://citations.gxl.ai/papers/<doc_id>#L<n>
- No doc_id in prose (only in the URL).
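Line-anchored URLs can be generated straight from grep hits. The sketch below assumes that line numbers in content.lines correspond to the `#L<n>` anchors, which this page implies but does not state outright; `anchor_urls` is a hypothetical helper name.

```shell
# Turn "grep -n" hits into line-anchored citation URLs.
# ASSUMPTION: line numbers in content.lines match the #L<n> anchors.
anchor_urls() {
  local doc_id=$1 file=$2 needle=$3
  grep -n -F -- "$needle" "$file" | cut -d: -f1 | while IFS= read -r n; do
    printf 'https://citations.gxl.ai/papers/%s#L%s\n' "$doc_id" "$n"
  done
}

# Usage:
#   anchor_urls <doc_id> /papers/<id>/content.lines "5.18"
```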
Pre-commit verification gate (the rule that catches errors BEFORE the sweep, not after)
Rule 3 above ("grep-verify all numbers before they enter the wiki") covers Paperclip-sourced numbers specifically. This section generalizes the same discipline to every load-bearing quantitative claim in newly-authored wiki content — disulfide counts, residue counts, sequence lengths, IC50s, Kms, dose-response numbers, cohort sizes, percent changes, kinetic constants, evidence-tier counts. Whether the source is Paperclip, UniProt, ChEMBL, ClinicalTrials.gov, a PMC paper, or a database API response, the verification gate is the same.
The rule:
Before any commit lands a wiki page (especially wiki/<comp-NNN>-*.md interpretive pages and wiki/hypotheses/H<NN>-*.md cards) that introduces a load-bearing quantitative claim, the author must grep-verify the claim against its primary source and add a line-anchored citation inline. A claim that cannot be grep-verified is not committed; it is either re-derived, replaced with a bounded "TBD pending source verification" marker, or dropped.
Why this gate exists:
The wiki sweep daemon (Pass 1 Propagate → Pass 2 Synthesize → Pass 3 Review) is good at catching cross-page inconsistencies — exactly the failure mode that surfaced the DAF SCR1-4 disulfide-count error on 2026-05-06 (Sweep A Connection 2). But that means the discipline currently catches errors ~12–24 hours after they ship into the corpus, by which time:
- The wrong number has propagated to multiple pages (DAF SCR1-4: comp-012 → H05 stub).
- The wrong number has been ingested into the synthesizer's context for downstream reasoning (the chaperone-orthogonal triple-cassette synergy panic — predicting 17+12=29 disulfides, 1.8× Huynh — was based on a fabricated coefficient).
- A second-opinion synthesis pass and a hand-walkthrough are required to find and fix the propagation, instead of catching it at the source.
The sweep is a backstop, not a substitute. The pre-commit verification gate is the right moment to catch hallucinated numbers — at the moment they would enter the corpus, not after they've been laundered through the substrate.
Operational pattern:
When authoring new wiki content (especially comp-NNN interpretive pages, H-card stubs, scope pages, or any page making mechanistic / kinetic / structural claims), follow this micro-protocol per quantitative claim:
- Identify the load-bearing numbers in the draft. Anything that downstream reasoning will depend on (cassette counts, disulfide counts, kinetic constants, evidence-tier verdicts, cohort sizes, percent-change magnitudes). Numbers used for color or rough order-of-magnitude framing are lower-stakes; load-bearing numbers feed into matrices, decision criteria, threshold gates, or other pages' calculations.
- For each load-bearing number, name the primary source. UniProt accession + feature line, PMID + page/section, ChEMBL ID, NCT trial ID, etc. If you cannot name the primary source, the number is suspect: either find it or drop it.
- Grep-verify the number against the primary source. For UniProt: curl -s "https://rest.uniprot.org/uniprotkb/<ACC>.txt" | grep "<feature>". For PMC: grep -i "<number>" /papers/<id>/content.lines. For ChEMBL: pull the bioactivity record and check the value field. The verification should produce the number directly from the source, not be inferred from a summary.
- Cite line-anchored inline. Per Rule 5 above for Paperclip; for UniProt: (per UniProt P08174 DISULFID feature: Cys36-Cys81) is line-anchored enough. For PMC: (PMID 12345678 Table 2) or (PMID 12345678 §Results para 3). The citation must let a future reader (or the sweep's Pass 3 reviewer) re-verify in <30 seconds.
- If the number cannot be verified, do not ship it. Options: re-derive from a different source, drop the claim and note the gap, or write a placeholder like [TBD: pending UniProt verification] and keep it out of the load-bearing path until verified.
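For count-type claims against UniProt, the verify-or-placeholder step can be sketched as below. It assumes the entry has already been saved locally in UniProt flat-file (.txt) format, where feature lines start with `FT` followed by the feature key. `count_features` and `gate_count` are hypothetical helper names.

```shell
# Count feature lines of one type (e.g. DISULFID) in a UniProt flat-file entry on stdin.
count_features() {
  awk -v key="$1" '$1 == "FT" && $2 == key { n++ } END { print n + 0 }'
}

# Gate a claimed count: print OK on a match, otherwise print the placeholder and fail.
gate_count() {
  local claimed=$1 key=$2 entry=$3 actual
  actual=$(count_features "$key" < "$entry")
  if [ "$claimed" -eq "$actual" ]; then
    echo "OK: $claimed $key features verified"
  else
    echo "[TBD: pending UniProt verification] claimed $claimed, source shows $actual"
    return 1
  fi
}

# Usage:
#   curl -s "https://rest.uniprot.org/uniprotkb/<ACC>.txt" > entry.txt
#   gate_count <claimed_count> DISULFID entry.txt
```

The count covers the whole entry. Restricting it to a residue range (for example a subset of domains) is left out of this sketch and would need the feature coordinates parsed as well.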
What counts as load-bearing — heuristics:
- Any number that appears in a downstream comparison, table, matrix, decision criterion, or quantitative threshold ("if synergy <0.7, then..." / "29 vs. 25 total disulfides").
- Any number that, if wrong, would change the evidence-tier verdict, the experimental design, or the platform decision.
- Any number that other wiki pages will cite or reuse.
- Any number that the sweep daemon's Pass 2 synthesizer might pull into a cross-doc connection (which, in practice, is almost any quantitative claim — the synthesizer reads everything).
What counts as low-stakes — exempt from the gate:
- Order-of-magnitude framing where precision doesn't matter ("on the order of millions of dollars," "weeks not months").
- Numbers used purely for color in narrative prose, where the surrounding text would still be correct if the number were ±20% off.
- Numbers explicitly tagged as estimates or projections (with the tag making the uncertainty visible).
The DAF SCR1-4 incident (2026-05-06) — provenance for this rule:
The 2026-05-05 Sonnet subagent that authored wiki/daf-cd55-scr14-truncated-computational.md (comp-012) asserted "3 conserved disulfide bonds per SCR domain → 12 total" in 4 places of prose narrative. The comp-012 pipeline (AlphaFold pLDDT-based protease stability) does not actually count disulfides; comp-012's own Limitations section explicitly says "Disulfide bonds not modelled." The "12" was hallucinated at write-time. The error then propagated into wiki/hypotheses/H05-daf-scr14-cp0-thesis.md (the CP0-closure thesis stub) without independent verification.
The sweep daemon caught the inconsistency the next day, 2026-05-06 (Sweep A Connection 2: the chaperone framework had "8 (4 SCRs × 2 disulfides each)," while comp-012 + H05 had "12"), and surfaced it as a Priority Action. Verification against UniProt P08174 during the 2026-05-06 walkthrough confirmed 8 DISULFID feature annotations across SCR1-4 (the canonical sushi/CCP fold: Cys1-Cys3 + Cys2-Cys4 motif, 2 per domain).
If the pre-commit verification gate had been in place when comp-012 was authored, the number would have been grep-verified against UniProt at write-time and the wrong claim would never have shipped. The sweep would still have run, but it wouldn't have needed to find this class of error — only genuine cross-doc synthesis findings.
This rule generalizes: the sweep daemon should be catching novel cross-doc connections, not catching fabricated coefficients. Numerical hygiene is upstream of synthesis.
Killshot tiering: pick the lowest-cost experiment that resolves the question
Sister discipline to the pre-commit verification gate: before declaring an experiment as the "killshot" for an open hypothesis, walk a cost ladder and pick the lowest-tier option that resolves the question above noise floor for the platform's purposes. The default "killshot" definition in institutional research is grant-tier ($50K–$500K mouse / cell-line / cohort study). For an open-source citizen-science project run by a CTO not a pharma lab, that default is wrong — both literally too expensive AND epistemically over-specified relative to the decision the experiment is supposed to inform.
The OE killshot tier menu:
| Tier | Cost | Time | Source / route | Examples |
|---|---|---|---|---|
| Tier 0 | $0 | hours-days | Public dataset mining; full-text re-read of papers comp-NNN got at abstract-level; published-data synthesis | GTEx + Human Protein Atlas sex-stratified expression mining; GWAS catalog lookup; UniProt feature-annotation queries; full-text retrieval via Sci-Hub / Anna's Archive of papers cited only at abstract-tier; Mendelian randomization via published MR-Base summary statistics |
| Tier 1 | $200–500 | days-weeks | Leverages existing self-experiment infrastructure | n=1 LabCorp / Quest panels (FEUA, hs-CRP, hormone panels); 16S microbiome OTU mining if the panel is already running; spot urinary biomarker measurements; serum-panel snapshots with pre-committed protocol |
| Tier 2 | $0–500 | weeks | Crowdsourced / community-leveraged | Recruiting men's-health forum cohorts to share LabCorp panels; Twitter/Reddit-based n>>1 cohort assembly; community-sourced patient-reported outcomes; OSF-style collaborative protocols on existing self-experimenter populations |
| Tier 3 | $2–5K | 4–12 weeks | Friendly bench (community college, undergrad thesis, sympathetic small lab) | Caco-2 / HEK293 / HepG2 cell-culture work with consumables-only budget; small-scale Western blot / qPCR / ELISA assay; tissue-culture collaboration where the academic partner has the equipment and the OE side funds reagents |
| Tier 4 | $5–15K (academic collaborator) / $30–60K (institutional) | 8 weeks–3 months | Academic-collaboration animal study / formal cell-line work / contracted assay | Mouse castration + replacement studies via academic IACUC-protocol-leveraged route; published-with co-authorship arrangement; CRO-tier work (the institutional baseline) is the last resort |
The discipline: before promoting an experiment to "killshot" status, walk the ladder Tier 0 → Tier 4 and ask at each tier "could I resolve this question above noise floor without spending more?" Default to the lowest tier that answers yes. The mouse experiment is rarely the right first move for an OE-tier question; it is sometimes the right last move.
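The ladder walk reduces to "take the first tier whose answer is yes". A minimal sketch, where the yes/no answers are the author's own judgment per tier and `pick_tier` is a hypothetical helper name:

```shell
# Walk the ladder Tier 0 -> Tier 4. Each argument is the answer ("yes"/"no") to
# "does this tier resolve the question above noise floor?", given in tier order.
pick_tier() {
  local tier=0 answer
  for answer in "$@"; do
    if [ "$answer" = "yes" ]; then
      echo "Tier $tier"
      return 0
    fi
    tier=$((tier + 1))
  done
  echo "unresolved"
  return 1
}

# Example: Tier 0 cannot resolve the question, Tier 1 can.
pick_tier no yes yes yes yes   # prints "Tier 1"
```

Higher tiers that would also answer yes are never consulted, which is the point: cost only rises when a cheaper tier has been explicitly ruled out.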
Why this matters specifically for OE:
- Budget reality. OE is a CTO + AI-substrate research operation, not pharma. A daemon sweep costs ~$0.65; freaking out over a $50 sweep cost is the appropriate calibration. Proposing $30K+ experiments without walking the upstream cost ladder is a category error.
- Tier 0 + Tier 1 frequently resolve the question. comp-016's verdict (T → intestinal ABCG2 suppression WEAK / UNCONFIRMED) was reached entirely from public-literature scanning — the question was already answered by Klyushova 2023, MacLean 2008, Hoque 2020, Yu 2021. The institutional-default mouse experiment was unnecessary; the answer existed. Tier 0 caught what Tier 4 would have only confirmed.
- Existing self-experiment infrastructure makes Tier 1 nearly free. OE has running n=1 self-experiment protocols (self-experiment-protocol.md) with established lab-panel workflows. Adding a new measurement to that workflow costs $50–100 per data point, not $30K.
- Crowdsourced cohorts produce real n>>1 evidence at $0 marginal cost. The men's-health, gout, and microbiome communities on Twitter / Reddit / Hone routinely share lab panels publicly. Treating that as a usable data source (with appropriate methodological caveats) is the open-source-platform thesis applied to evidence gathering.
- The "killshot" framing should be: cheapest experiment that resolves the question, not biggest experiment that proves the answer beyond doubt. Falsification-card discipline (per linter-design.md) is about attempting to falsify, not about overwhelming the question with budget. A $300 experiment that crosses a pre-committed threshold kills (or saves) a hypothesis as decisively as a $30K one.
Worked example — H07 Clomid intestinal-ER-antagonism thesis (hypotheses/H07-clomid-intestinal-er-antagonism.md):
- Tier 0: GTEx + HPA sex-stratified intestinal ABCG2 mining ($0); full-text re-read of the 4 anchor papers Klyushova 2023, MacLean 2008, Hoque 2020, Yu 2021 ($0). Estimated to resolve sub-claim 1 (does the PI3K/Akt → ABCG2 mechanism replicate in vivo?) and partially sub-claim 3 (is the renal arm enough?).
- Tier 1: n=1 FEUA tracking on Clomid dose changes (~$300). Resolves sub-claim 3 for one individual definitively.
- Tier 2: Crowdsourced clomiphene-vs-enclomiphene-vs-TRT cohort labs sharing ($0 in materials, weeks of community work). Resolves sub-claim 4 (the enclomiphene UA-direction question).
- Tier 3: Caco-2 + SERM treatment ($2–5K). Resolves sub-claim 2 (intestinal ER tissue-specificity under clomiphene).
- Tier 4: Mouse castration + T/E2 replacement + intestinal ABCG2 measurement ($5–15K via academic collaborator). Reserved as last resort.
The Tier 0 + Tier 1 combination probably closes the H07 thesis at >80% confidence for ~$300 + a week of analysis. Tier 4 is the institutional default; for OE it's the last resort, not the first move.
This discipline composes with the pre-commit verification gate. Both are "walk the checklist before shipping." The verification gate catches numerical hallucinations; the killshot tiering catches budget over-specification. Different failure modes; same shape of fix (named protocol, applied at the right moment in the workflow).
For falsification-card stub authoring (per hypotheses/README.md), the killshot menu in the stub should be tier-explicit: each killshot listed with its cost tier, so the stub-to-full-card promotion can pick the right starting tier rather than defaulting to institutional.
Specific OE questions worth Paperclip search-and-verify time
These are questions whose primary-literature grounding would meaningfully advance the platform but require non-trivial scanning. Listed here so they're discoverable as a queue (anyone reading this page can pick one up):
- Does Candida utilis uricase have published KEX2-site / cassette-compatibility data? Supports comp-011 (the C. utilis uricase compatibility analysis). Search: "Candida utilis uricase Aspergillus" + grep KEX2, signal peptide, secretion.
- Aspergillus oryzae heterologous expression precedents beyond what comp-010 + Killshot #1 already surfaced, particularly any solid-state-format dual-protein work. Search: "Aspergillus oryzae solid-state heterologous protein" + grep dual cassette, multi-cassette, two genes.
- ABCG2-probiotic intersections: does any published probiotic strain (engineered or natural) modulate intestinal ABCG2 expression in vivo? Informs the LBP track's butyrate-via-F. prausnitzii thesis. Search: "ABCG2 probiotic" + grep intestinal expression, fold-change.
- GRAS-host complement-regulator expression: has anyone published heterologous expression of soluble complement regulators (sCR1, Factor H, DAF/CD55) in any GRAS organism? Informs comp-006 / comp-012 platform implications. Search: "complement regulator heterologous expression yeast Aspergillus" + grep specific protein names.
- Si Miao San and related TCM gout formulas: modern Chinese clinical evidence. Per the global-multilingual default (CLAUDE.md §"Global-multilingual research by default"), search Chinese-language sources directly via CNKI / WanFang / ChiCTR, not just PubMed-indexed translations. Informs the TCM × rigor track (tcm-modern-rigor-intersection.md) Phase 2 P2-1.
- NSlD-ΔP10 strain availability outside the Maruyama lab: verify the operations/ward-1995-lab-access.md finding that the strain isn't in any public repository by direct catalog query at JCM, NBRC, CGMCC, CBS-KNAW, ATCC, FGSC.
Query-framing for non-Western compound classes (added 2026-05-19)
The 2026-05-19 traditional-name re-scan (logs/lit-scan-query-framing-retrospective-audit-2026-05-19.md) closed three independent gaps that mechanism-name-only seeding had silently propagated through the corpus: comp-013 missed mangiferin / Zhi Mu + the Coix four-transporter mechanism + the Plantago acteoside/apigenin attribution; comp-014 declared NLRP3 + Caspase-1 + ASC as "empty chokepoints in fungi" when species-name + traditional-pathology framing surfaces ≥18 PubMed papers including ≥5 at the gout indication; the comp-014 ABCG2 hit list missed Poria cocos despite Sun 2021 PMID 33651969 showing Animal Model magnitude exceeding benzbromarone.
Discipline: for any compound-class scan touching natural products, TCM, Kampo, Ayurveda, or other non-Western traditional medicine subfields, seed queries from multiple framings, not just mechanism-name:
- Mechanism-name query (the default Western frame): <target> + <compound-class> (e.g., "ABCG2 flavonoid", "XO inhibitor anthraquinone"). Catches the Western-curated database overlap.
- Species-name query: scientific binomial + traditional name in original language (e.g., Phellinus igniarius AND 桑黄; Wolfiporia cocos AND 茯苓). Catches the species-anchored evidence regardless of which target the paper frames around.
- Traditional-formula-name query: classical formula composition (e.g., Si Miao San AND component herbs; Bai Hu Jia Gui Zhi Tang AND mangiferin). Catches formula-level RCT evidence + the cardinal-herb attribution chain.
- Traditional-pathology-term query: the original-language pathology framing (e.g., 痛风, 痹证, 湿热痹, 消渴). Catches indication-anchored evidence where the paper doesn't index against a Western target column.
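The four framings can be emitted together so that none is skipped when seeding a scan. The sketch below only builds query strings and does not call Paperclip. The argument values come from the examples above, and `framing_queries` is a hypothetical helper name.

```shell
# Emit one query per framing. All values are supplied by the caller;
# nothing here talks to Paperclip.
framing_queries() {
  local target=$1 class=$2 binomial=$3 trad_name=$4 formula=$5 pathology=$6
  printf 'mechanism: %s %s\n' "$target" "$class"
  printf 'species: %s AND %s\n' "$binomial" "$trad_name"
  printf 'formula: %s\n' "$formula"
  printf 'pathology: %s\n' "$pathology"
}

framing_queries "ABCG2" "flavonoid" "Wolfiporia cocos" "茯苓" "Si Miao San" "痛风"
```

Each emitted line is then run as its own search, so a mechanism-name miss cannot mask a species-name or pathology-term hit.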
When to use which: if a Pass 1 or comp-NNN scan returns "empty" for a fungal / botanical chokepoint after a mechanism-name-only query, treat this as a candidate for query-framing re-scan, not a confirmed empty chokepoint. The recovery rate from the 2026-05-19 audit was substantial — multiple "empty" verdicts were reversed by species-name + traditional-pathology re-seeding.
Pre-commit gate interaction: query-framing failures and the pre-commit grep-verify gate (§"Pre-commit verification gate" above) target different error modes. The pre-commit gate catches fabricated or wrong numbers inside a single paper's claims. Query-framing catches whole subfields of evidence the search never surfaced. A well-verified claim from a query-framing-incomplete scan can still mislead — the absence in the scan output isn't an absence in the literature. Apply both disciplines independently.
Cross-references:
- scripts/sweep-prompt-2-synthesize.md §"Query-framing discipline" — the Pass 2 prompt-level codification of this discipline
- logs/lit-scan-query-framing-retrospective-audit-2026-05-19.md — canonical retrospective with the empirical recovery rate
- CLAUDE.md §"Global-multilingual research by default" — the upstream discipline this operationalizes
When Paperclip is the wrong tool
- For ChEMBL bioactivity data: use the ChEMBL MCP directly. Paperclip's index is paper-level; ChEMBL is target-and-compound-level with curated quantitative bioactivity. See chembl-cross-check.md for the cross-check discipline.
- For patents: Paperclip indexes academic literature, not patent literature. Use Google Patents, Espacenet, Lens.org, JPlatPat, CNIPA directly via WebFetch.
- For non-English-language clinical trial registries: ChiCTR (Chinese), JPRN (Japanese), KISS (Korean) registries are not in Paperclip's index. Direct WebFetch of those registries is the right tool.
- For real-time updates: Paperclip's index has an indexing lag. For the very most recent findings (within the last 1–4 weeks), the journal websites or PubMed direct may have content Paperclip doesn't yet have.
Cross-references
- paperclip-deep-dive.md: full Paperclip MCP capability + limitation audit (the upstream documentation this page operationalizes for OE)
- bio-ai-tools.md: broader AI-tool landscape for biology research
- chembl-cross-check.md: sister discipline for ChEMBL bioactivity verification
- open-source-platform.md §"Multi-model synthesis as guard against epistemic homogenization": the broader epistemic-rigor framework this protocol fits within
- CLAUDE.md §"Global-multilingual research by default" + §"Translation protocol": the multilingual + cross-vendor disciplines that combine with this protocol for non-English Paperclip-adjacent work