DCFN - Research
Currently v0.3.10.
What's changed in the engine's user-visible output, in reverse chronological order. Pre-1.0 versioning convention:
0.x.0 — feature additions, output-shape changes
0.x.Y — quality fixes, prompt refinements, copy updates
1.0.0 — reserved for first paid Tier 1 customer signing a contract

v0.3.5's bidirectional citation walk shipped assuming all reference IDs were Semantic Scholar paperIds (40-char hex). In practice the merged corpus pulls from 4-6 sources and references[] is mixed-format: OpenAlex Work IDs, PubMed UIDs, arXiv IDs. v0.3.8's S2-only filter prevented the resulting 400 Bad Request crash, but at the cost of near-zero expansion on multi-source corpora (one local test: 749 non-S2 IDs filtered, 0 added).
v0.3.10 closes the gap with a hybrid two-pass design:
Pass 1 — DOI translation: resolve non-S2 IDs to DOIs via their native APIs (OpenAlex /works, PubMed esummary), reformat them in S2's DOI:10.x prefix syntax, and send them through the existing S2 batch endpoint. Captures the ~75-85% of academic papers that have DOIs.
Pass 2 — per-source fanout: fetch whatever Pass 1 missed directly from the native APIs (OpenAlex /works?filter=ids.openalex:, PubMed efetch, arXiv query?id_list=). Reuses parsing logic from the existing ingestion_* modules, so the article record shape is identical to the standard pipeline's.
Graceful degradation: if the S2 batch returns 429 (free-tier rate limit, common) or any other error, the IDs that came in via DOI translation are re-attempted via per-source fetch, so the entire walk doesn't depend on S2 cooperating.
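The two-pass flow can be sketched as follows. This is a minimal illustration, not the shipped expand_via_citation_walk code: the injected doi_lookup, s2_batch_fetch, and per_source_fetch callables are hypothetical stand-ins for the real API clients, and only the S2 hex shape and DOI: prefix syntax come from the changelog above.

```python
import re

S2_HEX = re.compile(r"^[0-9a-f]{40}$")  # native Semantic Scholar paperId shape

def resolve_neighbors(ids, doi_lookup, s2_batch_fetch, per_source_fetch):
    """Two-pass resolution of a mixed-format neighbor ID list.

    Returns (article_records, stats). All three callables are injected
    stand-ins for real API clients (illustrative signatures).
    """
    native = [i for i in ids if S2_HEX.match(i)]
    foreign = [i for i in ids if not S2_HEX.match(i)]

    # Pass 1: translate foreign IDs into S2's DOI: prefix syntax and batch
    # them together with the native hex IDs in a single S2 call.
    translated = {f"DOI:{doi_lookup[i]}": i for i in foreign if i in doi_lookup}
    records, resolved_foreign = [], set()
    stats = {"s2_native": 0, "doi_translation": 0, "per_source": 0, "unresolved": 0}
    try:
        for key, record in s2_batch_fetch(native + list(translated)).items():
            records.append(record)
            if key in translated:
                stats["doi_translation"] += 1
                resolved_foreign.add(translated[key])
            else:
                stats["s2_native"] += 1
    except RuntimeError:
        pass  # graceful degradation: e.g. a 429; Pass 2 retries below

    # Pass 2: per-source fanout for every foreign ID S2 did not return.
    for i in foreign:
        if i in resolved_foreign:
            continue
        record = per_source_fetch(i)  # OpenAlex / PubMed / arXiv, by prefix
        if record is None:
            stats["unresolved"] += 1
        else:
            records.append(record)
            stats["per_source"] += 1
    return records, stats
```

Note the degradation path: if the S2 batch call fails entirely, every DOI-translated ID simply falls through to the Pass 2 loop.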
Metadata richness: the expand_via_citation_walk return dict now includes ids_resolved_via_s2_native, ids_resolved_via_doi_translation, ids_resolved_via_per_source, ids_unresolved, and per_source_breakdown. The operator (and Z reading the report) can see exactly where each neighbor came from and which sources contributed.
Empirical validation (local Single-Cell corpus, 100 OpenAlex neighbors, S2 rate-limited): 25 articles added in 8.6s via the Pass 2 OpenAlex fallback alone. With a production S2 API key cooperating, Pass 1 + Pass 2 combined would land substantially more.
New module id_translation.py centralizes paper-ID source recognition and DOI prefix formatting, so downstream code doesn't have to sprinkle prefix-matching logic everywhere.
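A centralized module like this might look as follows. This is a sketch, not id_translation.py's actual API: the function names and the exact regex table are illustrative, though S2's DOI:/PMID:/ARXIV: prefix-tagged batch syntax is real.

```python
import re

# Illustrative recognition table; the shipped patterns may differ.
_PATTERNS = [
    ("s2",       re.compile(r"^[0-9a-f]{40}$")),
    ("openalex", re.compile(r"^(?:openalex:)?W\d+$")),
    ("pubmed",   re.compile(r"^(?:pmid:)?\d{1,8}$")),
    ("arxiv",    re.compile(r"^(?:arxiv:)?\d{4}\.\d{4,5}(?:v\d+)?$")),
]

def classify_paper_id(raw: str) -> str:
    """Return the source a reference ID belongs to, or 'unknown'."""
    for source, pattern in _PATTERNS:
        if pattern.match(raw):
            return source
    return "unknown"

def to_s2_prefix(raw: str, doi=None):
    """Format an ID in S2's prefix-tagged batch syntax where one exists."""
    source = classify_paper_id(raw)
    if source == "s2":
        return raw                      # native hex IDs pass through untouched
    if doi:
        return f"DOI:{doi}"             # preferred: ~75-85% of papers have one
    if source == "pubmed":
        return f"PMID:{raw.split(':')[-1]}"
    if source == "arxiv":
        return f"ARXIV:{raw.split(':')[-1]}"
    return None                          # caller falls back to per-source fetch
```

A None return is the signal for the Pass 2 per-source fanout.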
Open follow-on (tracked separately): some Pass 2 OpenAlex fetches return papers without abstracts (filtered out by the existing _to_article_record schema requirement). Worth a future Bio/Research investigation — abstract-less papers may still carry useful metadata if the engine is willing to operate without abstracts on those nodes.
main.py's user-driven path already had per-stage timing via stage_timings. The autonomous-scheduler path (scheduler.py:_run_autonomous_pipeline) was missing it — only total elapsed was logged. Added per-stage capture for: qeb_encoding, concept_graph, cte_traversal, apriori, svw, hypothesis_generation, calibration, bridge_detection_and_rerank. Surfaces as a single [PIPELINE_TIMING] log line per run + persisted to the report's stage_timings field for downstream tooling.
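The capture pattern can be sketched with a small context manager. The stage names and the [PIPELINE_TIMING] log line are from the changelog; the helper itself is a hypothetical illustration, not scheduler.py's code.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_stage(timings: dict, name: str):
    """Record wall-clock seconds for one pipeline stage into `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = round(time.perf_counter() - start, 3)

stage_timings: dict = {}
with timed_stage(stage_timings, "qeb_encoding"):
    time.sleep(0.01)              # stand-in for the real stage
with timed_stage(stage_timings, "concept_graph"):
    time.sleep(0.01)

# One log line per run, plus persistence into the report for downstream tooling.
print(f"[PIPELINE_TIMING] {stage_timings}")
report = {"stage_timings": stage_timings}
```

The try/finally means a crashing stage still records its elapsed time, which is exactly what you want when diagnosing a 50-minute run.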
Triggered by Charter §16 codification (Patents L1 ran 50 min and we had no per-step data to answer "should we upgrade Render tier?"). This closes the gap on the Research autonomous path so the same question is answerable empirically there too.
Note: the multi-source citation walk hybrid (DOI translation + per-source fanout) flagged in v0.3.8 is now tracked as v0.3.10 (next minor).
Local Research validation surfaced two bugs in code I shipped earlier today.
Numpy truthiness crash (blocking). The topical-coherence term I added in v0.3.7 used d.get("v_unit") or d.get("v_seed") to pick a vector — a classic Python+numpy gotcha: when v_unit is a numpy array, the or operator triggers numpy's __bool__, which raises ValueError: The truth value of an array with more than one element is ambiguous. Two sites in cte_operations.py:golden_token_pathfinding were affected; both now use explicit None-checks. Effect: every autonomous run since v0.3.7 deployed (2026-04-30) crashed silently after the CTE ops stage, so the v0.3.5–v0.3.7 quality fixes never actually produced a usable report. Local re-validation post-fix: 4 succeeded, 0 failed (was 0/4).
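A minimal reproduction of the gotcha and the fix (assuming numpy is installed; the dict keys mirror the description above, the surrounding pathfinding code is omitted):

```python
import numpy as np

d = {"v_unit": np.array([0.1, 0.9]), "v_seed": np.array([0.5, 0.5])}

# Buggy form: `or` evaluates the array's truthiness via __bool__, which
# raises ValueError for any array with more than one element.
try:
    vec = d.get("v_unit") or d.get("v_seed")
except ValueError as e:
    print(f"crash: {e}")

# Fixed form: explicit None-check, never asking the array for a bool.
v = d.get("v_unit")
vec = v if v is not None else d.get("v_seed")
assert vec is d["v_unit"]
```

The same `x or y` idiom is safe for plain Python objects, which is why the bug only surfaced once real numpy vectors flowed through the path.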
Citation-walk batch endpoint rejecting all requests (silent). v0.3.5's bidirectional citation walk was sending paper IDs to Semantic Scholar's /paper/batch endpoint, which refused them with {"error":"No valid paper ids given"}. Cause: in the multi-source merged corpus, references[] contains IDs in mixed formats (S2 hex, openalex:WXX, pmid:NNN, arxiv:XX.XX), while S2's batch endpoint only accepts S2 hex IDs (or its own prefix-tagged syntax, which we don't yet emit). Fix: filter neighbor IDs to S2 format (40-char hex) before batching; non-S2 IDs are dropped with a logged count. Also added 4xx response-body capture so future S2 errors show the actual error message on the first round-trip instead of an opaque 400 Bad Request.
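The stopgap filter is essentially one regex pass. A minimal sketch, with illustrative names rather than the shipped code:

```python
import re

S2_HEX = re.compile(r"^[0-9a-f]{40}$")  # S2-native paperId shape

def filter_s2_ids(neighbor_ids):
    """Keep only S2-format IDs; log how many mixed-format IDs were dropped."""
    kept = [i for i in neighbor_ids if S2_HEX.match(i)]
    dropped = len(neighbor_ids) - len(kept)
    if dropped:
        print(f"citation walk: dropped {dropped} non-S2 reference IDs")
    return kept
```

On a multi-source corpus this drops most of the list (the 749-filtered / 0-walked result below), which is why it is only a stopgap.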
The S2 ID filter prevents the crash but reveals a deeper architecture gap: in corpora dominated by non-S2 sources (OpenAlex, PubMed), most references are non-S2-format and get dropped, so the walk produces near-zero expansion. A test against a real Single-Cell corpus: 749 non-S2 IDs filtered, 0 S2 IDs walked, 0 articles added. The v0.3.5 architectural value (bidirectional cross-source neighbor expansion) doesn't yet land for multi-source corpora.
Follow-on work tracked separately for v0.3.9: translate non-S2 IDs to S2's prefix syntax (DOI:10.x, PMID:NNN) before batching, OR fan out per-source (OpenAlex API for openalex: IDs, PubMed E-utilities for pmid: IDs). Not a v0.3.8 ship — needs design.
Two coupled fixes for Perplexity's 2026-04-30 broad-vocabulary findings.
source_title lookup. Hypotheses now resolve to specific paper titles ("will unlock currently blocked progress toward: …") instead of bare identifiers.

Topical-coherence pathfinding term. A new PATHFINDING_WEIGHTS term favors nodes whose fields_of_study intersect the corpus's dominant fields (≥30% of DOCUMENT nodes). Methodology papers typically declare different fields (Computer Science, Bioinformatics) than the substantive research papers (Biology, Medicine), so the centroid stays anchored to subject matter, not tooling. The other four PATHFINDING_WEIGHTS dropped uniformly (0.25 → 0.2125) so the new mass doesn't compound.

1-hop citation-graph expansion (session_corpus_pull.py). Discovery-driven topic runs (queue-managed via topic_queue_runtime) now expand the corpus with a 1-hop citation-graph walk after multi-source ingest completes. The walk takes the top 50 most-cited papers from the initial pull, collects both their references (backward — ancestral foundations) and their citations (forward — downstream sub-communities), batch-fetches the metadata via Semantic Scholar's /paper/batch endpoint, dedupes against the existing corpus, and appends. Hard-capped at 400 net new neighbor IDs per run to bound API cost and wall-clock; a typical add lands at 100-300 articles in 30-60 seconds. Applies to discovery-driven domains (promoted from pending_items.md proposals into topic_queue.json and registered into DOMAINS at runtime); fixed-config domains have curated query sets and skip this step. Failure modes are non-fatal — on any S2 error the run continues with the unexpanded corpus.

Corpus fingerprint fix. Reports said no-corpus even though the run ingested 542 sources. Root cause: in the autonomous-scheduler code path, the article_index was being built AFTER the report was rendered, so the report's fingerprint check (report.get("article_index", {})) saw an empty dict and emitted "no-corpus" regardless of how many articles ingested. Fix: build article_index and assign it into the report dict BEFORE generate_article / generate_technical_report run. Receipts now show the real corpus signature. Note: this is the surface-symptom fix; the deeper architectural item (citation-graph 1-hop expansion to address flat-cluster structure on broad-vocabulary domains) is tracked separately and is multi-day work.

Tooling-OBI signal detector. _detect_tooling_obi_signals() runs between citation-velocity and the hidden-citation bigram check. Two-layer detector: a STRONG signal (a single GitHub URL, an "R package" / "Python package" / Bioconductor / CRAN / PyPI mention, or an "available at https://" link) is sufficient on its own; a WEAK signal (generic terms like "framework", "library", "implementation") only fires when co-occurring with a package-name pattern in the title (an all-caps acronym like HTSeq/BLAST, or CamelCase like DESeq2/uniCATE). Designed for precision — generic prose like "in our framework we propose..." won't trip it.

Patent-count correction. The footer's "Built on" line was undercounting: it said "6 U.S. Patents Pending" and named only CTE + QECO. The actual total since the Tesseract Composition supplemental landed (2026-04-20) is 8, and the engine rides on more than two substrate patents. The footer now reflects that.
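The two-layer strong/weak heuristic can be sketched with a pair of regexes. This is an illustration of the idea, not _detect_tooling_obi_signals() itself — the exact shipped patterns may differ, and the package-name heuristic here (any token with an internal case transition or a run of capitals) is an assumption.

```python
import re

STRONG = re.compile(
    r"github\.com/|R package|Python package|Bioconductor|CRAN|PyPI|"
    r"available at https?://",
    re.IGNORECASE,
)
WEAK = re.compile(r"\b(framework|library|implementation)\b", re.IGNORECASE)
# Package-name shapes: all-caps acronyms (HTSeq, BLAST) or CamelCase (DESeq2,
# uniCATE) — any token containing "aB" or two consecutive capitals.
PKG_NAME = re.compile(r"\b\w*(?:[a-z][A-Z]|[A-Z]{2})\w*\b")

def is_tooling_paper(title: str, abstract: str) -> bool:
    """True when a paper looks like a software/tooling paper."""
    text = f"{title} {abstract}"
    if STRONG.search(text):
        return True                      # one strong signal suffices
    # Weak signals only count alongside a package-name pattern in the title.
    return bool(WEAK.search(text) and PKG_NAME.search(title))
```

The precision bias is visible in the last line: generic "framework" prose never fires on its own.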
Same correction applied to the Firebase brand site's DCFN-Research card.
The Research engine's autonomous-run path now drives from a discovery-agent-fed queue instead of cycling fixed domains. A discovery agent identifies new research topics worth running by querying Semantic Scholar (with PubMed fallback) for substantive recent activity in curated seed areas, derives a topic configuration from the top results, and proposes it for human review. After a 7-day cooldown without rejection, the proposal auto-promotes into the live run queue, where the engine executes the full pipeline against it once or twice before going dormant.
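The promotion rule reduces to a small predicate. A minimal sketch, assuming an illustrative proposal dict shape (the real topic_queue.json schema and field names are not shown here):

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(days=7)

def should_promote(proposal: dict, now: datetime) -> bool:
    """A proposal auto-promotes after 7 days unless a human rejected it."""
    if proposal.get("rejected"):
        return False
    proposed_at = datetime.fromisoformat(proposal["proposed_at"])
    return now - proposed_at >= COOLDOWN
```

Silence is consent by design: the human reviewer only has to act to stop a promotion, never to approve one.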
Why this matters: it converts the autonomous path from "run the same three domains every day" (which produces noise) into "surface new research territory worth exploring" (which produces signal). Each run feeds the Bridge Inbox + LEF Ai Upstream telemetry channels — autonomous runs are the substrate's input.
Solved Obliteration by Incorporation (OBI) — flagged by a Gemini 2026-04-30 deep-research review of the v0.2-era output. Previously the engine was treating universally adopted methods as "decayed" simply because they'd stopped being explicitly cited (their methods became the field's default vocabulary). The engine now distinguishes "Canonical Foundations (Absorbed by Incorporation)" from genuinely abandoned work. Concrete validation from Gemini: the engine correctly identifies the HTSeq Python framework — "22,482 lifetime citations but 0 in the last 5 years; hasn't decayed; it has just become structural canon" — instead of false-flagging it. This removes a major false-positive class from the engine's untested-foundation analysis.
High-signal convergence anchor detection — the engine surfaces single papers that multiple research clusters orbit without explicitly cross-citing. Gemini 2026-04-30 validated on the Trauma-Informed Care × Restorative Justice run: "353 independent research groups across 42 years converging on the exact same academic success metrics without sharing a direct citation path." Convergence anchors are the engine's strongest signal for "where the field is heading without anyone having named it yet."
Bridge digest format — autonomous runs now produce a structured Bridge Digest containing all bridge intelligence (gaps, severity, gap types, abstracts) suitable for ingestion by future Bridge engines that sit between two DCFN builds.
Syntari Record (JSON twin) — every run now produces a structured JSON twin alongside the prose Article, suitable for downstream machine-readable consumption.
Initial deployment. Single-page intake → multi-source ingest → concept graph construction with typed edges → Cognitive Traversal Engine (5 operations: backward / forward / branch cataloging / entropy / golden token) → SVW convergence detection → Apriori pattern mining → Article + Technical Report generation. Free 5 runs / month / browser; $15 unlock for Layer 2 + Layer 3 deeper traversal.