DCFN - Research
Currently v0.3.10.
What's changed in the engine's user-visible output, in reverse chronological order. Pre-1.0 versioning convention:
0.x.0 — feature additions, output-shape changes
0.x.Y — quality fixes, prompt refinements, copy updates
1.0.0 — reserved for first paid Tier 1 customer signing a contract

v0.3.5's bidirectional citation walk shipped assuming all reference IDs were Semantic Scholar paperIds (40-char hex). In practice the merged corpus pulls from 4-6 sources and references[] is mixed-format: OpenAlex Work IDs, PubMed UIDs, arXiv IDs. v0.3.8's S2-only filter prevented the resulting 400 Bad Request crash, but at the cost of near-zero expansion on multi-source corpora (one local test: 749 non-S2 IDs filtered, 0 added).
v0.3.10 closes the gap with a hybrid two-pass design:
Pass 1 — DOI translation: resolve non-S2 IDs to DOIs via their native APIs (OpenAlex /works, PubMed esummary), reformat them in S2's DOI:10.x prefix syntax, and send them through the existing S2 batch endpoint. Captures the ~75-85% of academic papers that have DOIs.
Pass 2 — per-source fanout: fetch whatever Pass 1 missed directly from the native APIs (OpenAlex /works?filter=ids.openalex:, PubMed efetch, arXiv query?id_list=). Reuses parsing logic from the existing ingestion_* modules, so the article record shape is identical to the standard pipeline's.
Graceful degradation: if the S2 batch returns 429 (free-tier rate limit, common) or any other error, the IDs that came in via DOI translation are re-attempted via per-source fetch, so the entire walk doesn't depend on S2 cooperating.
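The two-pass flow can be sketched as follows. This is a minimal illustration, not the shipped expand_via_citation_walk code: the injected doi_lookup, s2_batch_fetch, and per_source_fetch callables are hypothetical stand-ins for the real API clients, and only the S2 hex shape and DOI: prefix syntax come from the changelog above.

```python
import re

S2_HEX = re.compile(r"^[0-9a-f]{40}$")  # native Semantic Scholar paperId shape

def resolve_neighbors(ids, doi_lookup, s2_batch_fetch, per_source_fetch):
    """Two-pass resolution of a mixed-format neighbor ID list.

    Returns (article_records, stats). All three callables are injected
    stand-ins for real API clients (illustrative signatures).
    """
    native = [i for i in ids if S2_HEX.match(i)]
    foreign = [i for i in ids if not S2_HEX.match(i)]

    # Pass 1: translate foreign IDs into S2's DOI: prefix syntax and batch
    # them together with the native hex IDs in a single S2 call.
    translated = {f"DOI:{doi_lookup[i]}": i for i in foreign if i in doi_lookup}
    records, resolved_foreign = [], set()
    stats = {"s2_native": 0, "doi_translation": 0, "per_source": 0, "unresolved": 0}
    try:
        for key, record in s2_batch_fetch(native + list(translated)).items():
            records.append(record)
            if key in translated:
                stats["doi_translation"] += 1
                resolved_foreign.add(translated[key])
            else:
                stats["s2_native"] += 1
    except RuntimeError:
        pass  # graceful degradation: e.g. a 429; Pass 2 retries below

    # Pass 2: per-source fanout for every foreign ID S2 did not return.
    for i in foreign:
        if i in resolved_foreign:
            continue
        record = per_source_fetch(i)  # OpenAlex / PubMed / arXiv, by prefix
        if record is None:
            stats["unresolved"] += 1
        else:
            records.append(record)
            stats["per_source"] += 1
    return records, stats
```

Note the degradation path: if the S2 batch call fails entirely, every DOI-translated ID simply falls through to the Pass 2 loop.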
Metadata richness: the expand_via_citation_walk return dict now includes ids_resolved_via_s2_native, ids_resolved_via_doi_translation, ids_resolved_via_per_source, ids_unresolved, and per_source_breakdown. The operator (and Z reading the report) can see exactly where each neighbor came from and which sources contributed.
Empirical validation (local Single-Cell corpus, 100 OpenAlex neighbors, S2 rate-limited): 25 articles added in 8.6s via the Pass 2 OpenAlex fallback alone. With a production S2 API key cooperating, Pass 1 + Pass 2 combined would land substantially more.
New module id_translation.py centralizes paper-ID source recognition and DOI prefix formatting, so downstream code doesn't have to sprinkle prefix-matching logic everywhere.
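A centralized module like this might look as follows. This is a sketch, not id_translation.py's actual API: the function names and the exact regex table are illustrative, though S2's DOI:/PMID:/ARXIV: prefix-tagged batch syntax is real.

```python
import re

# Illustrative recognition table; the shipped patterns may differ.
_PATTERNS = [
    ("s2",       re.compile(r"^[0-9a-f]{40}$")),
    ("openalex", re.compile(r"^(?:openalex:)?W\d+$")),
    ("pubmed",   re.compile(r"^(?:pmid:)?\d{1,8}$")),
    ("arxiv",    re.compile(r"^(?:arxiv:)?\d{4}\.\d{4,5}(?:v\d+)?$")),
]

def classify_paper_id(raw: str) -> str:
    """Return the source a reference ID belongs to, or 'unknown'."""
    for source, pattern in _PATTERNS:
        if pattern.match(raw):
            return source
    return "unknown"

def to_s2_prefix(raw: str, doi=None):
    """Format an ID in S2's prefix-tagged batch syntax where one exists."""
    source = classify_paper_id(raw)
    if source == "s2":
        return raw                      # native hex IDs pass through untouched
    if doi:
        return f"DOI:{doi}"             # preferred: ~75-85% of papers have one
    if source == "pubmed":
        return f"PMID:{raw.split(':')[-1]}"
    if source == "arxiv":
        return f"ARXIV:{raw.split(':')[-1]}"
    return None                          # caller falls back to per-source fetch
```

A None return is the signal for the Pass 2 per-source fanout.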
Open follow-on (tracked separately): some Pass 2 OpenAlex fetches return papers without abstracts (filtered out by the existing _to_article_record schema requirement). Worth a future Bio/Research investigation — abstract-less papers may still carry useful metadata if the engine is willing to operate without abstracts on those nodes.
main.py's user-driven path already had per-stage timing via stage_timings. The autonomous-scheduler path (scheduler.py:_run_autonomous_pipeline) was missing it — only total elapsed was logged. Added per-stage capture for: qeb_encoding, concept_graph, cte_traversal, apriori, svw, hypothesis_generation, calibration, bridge_detection_and_rerank. Surfaces as a single [PIPELINE_TIMING] log line per run + persisted to the report's stage_timings field for downstream tooling.
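The capture pattern can be sketched with a small context manager. The stage names and the [PIPELINE_TIMING] log line are from the changelog; the helper itself is a hypothetical illustration, not scheduler.py's code.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_stage(timings: dict, name: str):
    """Record wall-clock seconds for one pipeline stage into `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = round(time.perf_counter() - start, 3)

stage_timings: dict = {}
with timed_stage(stage_timings, "qeb_encoding"):
    time.sleep(0.01)              # stand-in for the real stage
with timed_stage(stage_timings, "concept_graph"):
    time.sleep(0.01)

# One log line per run, plus persistence into the report for downstream tooling.
print(f"[PIPELINE_TIMING] {stage_timings}")
report = {"stage_timings": stage_timings}
```

The try/finally means a crashing stage still records its elapsed time, which is exactly what you want when diagnosing a 50-minute run.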
Triggered by Charter §16 codification (Patents L1 ran 50 min and we had no per-step data to answer "should we upgrade Render tier?"). This closes the gap on the Research autonomous path so the same question is answerable empirically there too.
Note: the multi-source citation walk hybrid (DOI translation + per-source fanout) flagged in v0.3.8 is now tracked as v0.3.10 (next minor).
Local Research validation surfaced two bugs in code I shipped earlier today.
Numpy truthiness crash (blocking). The topical-coherence term I added in v0.3.7 used d.get("v_unit") or d.get("v_seed") to pick a vector — a classic Python+numpy gotcha: when v_unit is a numpy array, the or operator triggers numpy's __bool__, which raises ValueError: The truth value of an array with more than one element is ambiguous. Two sites in cte_operations.py:golden_token_pathfinding were affected; both now use explicit None-checks. Effect: every autonomous run since v0.3.7 deployed (2026-04-30) crashed silently after the CTE ops stage, so the v0.3.5–v0.3.7 quality fixes never actually produced a usable report. Local re-validation post-fix: 4 succeeded, 0 failed (was 0/4).
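A minimal reproduction of the gotcha and the fix (assuming numpy is installed; the dict keys mirror the description above, the surrounding pathfinding code is omitted):

```python
import numpy as np

d = {"v_unit": np.array([0.1, 0.9]), "v_seed": np.array([0.5, 0.5])}

# Buggy form: `or` evaluates the array's truthiness via __bool__, which
# raises ValueError for any array with more than one element.
try:
    vec = d.get("v_unit") or d.get("v_seed")
except ValueError as e:
    print(f"crash: {e}")

# Fixed form: explicit None-check, never asking the array for a bool.
v = d.get("v_unit")
vec = v if v is not None else d.get("v_seed")
assert vec is d["v_unit"]
```

The same `x or y` idiom is safe for plain Python objects, which is why the bug only surfaced once real numpy vectors flowed through the path.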
Citation-walk batch endpoint rejecting all requests (silent). v0.3.5's bidirectional citation walk was sending paper IDs to Semantic Scholar's /paper/batch endpoint, which refused them with {"error":"No valid paper ids given"}. Cause: in the multi-source merged corpus, references[] contains IDs in mixed formats (S2 hex, openalex:WXX, pmid:NNN, arxiv:XX.XX), while S2's batch endpoint only accepts S2 hex IDs (or its own prefix-tagged syntax, which we don't yet emit). Fix: filter neighbor IDs to S2 format (40-char hex) before batching; non-S2 IDs are dropped with a logged count. Also added 4xx response-body capture so future S2 errors show the actual error message on the first round-trip instead of an opaque 400 Bad Request.
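The stopgap filter is essentially one regex pass. A minimal sketch, with illustrative names rather than the shipped code:

```python
import re

S2_HEX = re.compile(r"^[0-9a-f]{40}$")  # S2-native paperId shape

def filter_s2_ids(neighbor_ids):
    """Keep only S2-format IDs; log how many mixed-format IDs were dropped."""
    kept = [i for i in neighbor_ids if S2_HEX.match(i)]
    dropped = len(neighbor_ids) - len(kept)
    if dropped:
        print(f"citation walk: dropped {dropped} non-S2 reference IDs")
    return kept
```

On a multi-source corpus this drops most of the list (the 749-filtered / 0-walked result below), which is why it is only a stopgap.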
The S2 ID filter prevents the crash but reveals a deeper architecture gap: in corpora dominated by non-S2 sources (OpenAlex, PubMed), most references are non-S2-format and get dropped, so the walk produces near-zero expansion. A test against a real Single-Cell corpus: 749 non-S2 IDs filtered, 0 S2 IDs walked, 0 articles added. The v0.3.5 architectural value (bidirectional cross-source neighbor expansion) doesn't yet land for multi-source corpora.
Follow-on work tracked separately for v0.3.9: translate non-S2 IDs to S2's prefix syntax (DOI:10.x, PMID:NNN) before batching, OR fan out per-source (OpenAlex API for openalex: IDs, PubMed E-utilities for pmid: IDs). Not a v0.3.8 ship — needs design.
Two coupled fixes for Perplexity's 2026-04-30 broad-vocabulary findings.
source_title lookup. Hypotheses now resolve to specific paper titles ("will unlock currently blocked progress toward: …") instead of bare identifiers.

Topical-coherence pathfinding term. A new PATHFINDING_WEIGHTS term favors nodes whose fields_of_study intersect the corpus's dominant fields (≥30% of DOCUMENT nodes). Methodology papers typically declare different fields (Computer Science, Bioinformatics) than the substantive research papers (Biology, Medicine), so the centroid stays anchored to subject matter, not tooling. The other four PATHFINDING_WEIGHTS dropped uniformly (0.25 → 0.2125) so the new mass doesn't compound.

1-hop citation-graph expansion (session_corpus_pull.py). Discovery-driven topic runs (queue-managed via topic_queue_runtime) now expand the corpus with a 1-hop citation-graph walk after multi-source ingest completes. The walk takes the top 50 most-cited papers from the initial pull, collects both their references (backward — ancestral foundations) and their citations (forward — downstream sub-communities), batch-fetches the metadata via Semantic Scholar's /paper/batch endpoint, dedupes against the existing corpus, and appends. Hard-capped at 400 net new neighbor IDs per run to bound API cost and wall-clock; a typical add lands at 100-300 articles in 30-60 seconds. Applies to discovery-driven domains (promoted from pending_items.md proposals into topic_queue.json and registered into DOMAINS at runtime); fixed-config domains have curated query sets and skip this step. Failure modes are non-fatal — on any S2 error the run continues with the unexpanded corpus.

Corpus fingerprint fix. Reports said no-corpus even though the run ingested 542 sources. Root cause: in the autonomous-scheduler code path, the article_index was being built AFTER the report was rendered, so the report's fingerprint check (report.get("article_index", {})) saw an empty dict and emitted "no-corpus" regardless of how many articles ingested. Fix: build article_index and assign it into the report dict BEFORE generate_article / generate_technical_report run. Receipts now show the real corpus signature. Note: this is the surface-symptom fix; the deeper architectural item (citation-graph 1-hop expansion to address flat-cluster structure on broad-vocabulary domains) is tracked separately and is multi-day work.

Tooling-OBI signal detector. _detect_tooling_obi_signals() runs between citation-velocity and the hidden-citation bigram check. Two-layer detector: a STRONG signal (a single GitHub URL, an "R package" / "Python package" / Bioconductor / CRAN / PyPI mention, or an "available at https://" link) is sufficient on its own; a WEAK signal (generic terms like "framework", "library", "implementation") only fires when co-occurring with a package-name pattern in the title (an all-caps acronym like HTSeq/BLAST, or CamelCase like DESeq2/uniCATE). Designed for precision — generic prose like "in our framework we propose..." won't trip it.

Patent-count correction. The footer's "Built on" line was undercounting: it said "6 U.S. Patents Pending" and named only CTE + QECO. The actual total since the Tesseract Composition supplemental landed (2026-04-20) is 8, and the engine rides on more than two substrate patents. The footer now reflects that.
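The two-layer strong/weak heuristic can be sketched with a pair of regexes. This is an illustration of the idea, not _detect_tooling_obi_signals() itself — the exact shipped patterns may differ, and the package-name heuristic here (any token with an internal case transition or a run of capitals) is an assumption.

```python
import re

STRONG = re.compile(
    r"github\.com/|R package|Python package|Bioconductor|CRAN|PyPI|"
    r"available at https?://",
    re.IGNORECASE,
)
WEAK = re.compile(r"\b(framework|library|implementation)\b", re.IGNORECASE)
# Package-name shapes: all-caps acronyms (HTSeq, BLAST) or CamelCase (DESeq2,
# uniCATE) — any token containing "aB" or two consecutive capitals.
PKG_NAME = re.compile(r"\b\w*(?:[a-z][A-Z]|[A-Z]{2})\w*\b")

def is_tooling_paper(title: str, abstract: str) -> bool:
    """True when a paper looks like a software/tooling paper."""
    text = f"{title} {abstract}"
    if STRONG.search(text):
        return True                      # one strong signal suffices
    # Weak signals only count alongside a package-name pattern in the title.
    return bool(WEAK.search(text) and PKG_NAME.search(title))
```

The precision bias is visible in the last line: generic "framework" prose never fires on its own.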
Same correction applied to the Firebase brand site's DCFN-Research card.
The Research engine's autonomous-run path now drives from a discovery-agent-fed queue instead of cycling fixed domains. A discovery agent identifies new research topics worth running by querying Semantic Scholar (with PubMed fallback) for substantive recent activity in curated seed areas, derives a topic configuration from the top results, and proposes it for human review. After a 7-day cooldown without rejection, the proposal auto-promotes into the live run queue, where the engine executes the full pipeline against it once or twice before going dormant.
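The promotion rule reduces to a small predicate. A minimal sketch, assuming an illustrative proposal dict shape (the real topic_queue.json schema and field names are not shown here):

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(days=7)

def should_promote(proposal: dict, now: datetime) -> bool:
    """A proposal auto-promotes after 7 days unless a human rejected it."""
    if proposal.get("rejected"):
        return False
    proposed_at = datetime.fromisoformat(proposal["proposed_at"])
    return now - proposed_at >= COOLDOWN
```

Silence is consent by design: the human reviewer only has to act to stop a promotion, never to approve one.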
Why this matters: it converts the autonomous path from "run the same three domains every day" (which produces noise) into "surface new research territory worth exploring" (which produces signal). Each run feeds the Bridge Inbox + LEF Ai Upstream telemetry channels — autonomous runs are the substrate's input.
Solved Obliteration by Incorporation (OBI) — flagged by a Gemini 2026-04-30 deep-research review of the v0.2-era output. Previously the engine was treating universally adopted methods as "decayed" simply because they'd stopped being explicitly cited (their methods became the field's default vocabulary). The engine now distinguishes "Canonical Foundations (Absorbed by Incorporation)" from genuinely abandoned work. Concrete validation from Gemini: the engine correctly identifies the HTSeq Python framework — "22,482 lifetime citations but 0 in the last 5 years; hasn't decayed; it has just become structural canon" — instead of false-flagging it. This removes a major false-positive class from the engine's untested-foundation analysis.
High-signal convergence anchor detection — the engine surfaces single papers that multiple research clusters orbit without explicitly cross-citing. Gemini 2026-04-30 validated on the Trauma-Informed Care × Restorative Justice run: "353 independent research groups across 42 years converging on the exact same academic success metrics without sharing a direct citation path." Convergence anchors are the engine's strongest signal for "where the field is heading without anyone having named it yet."
Bridge digest format — autonomous runs now produce a structured Bridge Digest containing all bridge intelligence (gaps, severity, gap types, abstracts) suitable for ingestion by future Bridge engines that sit between two DCFN builds.
Syntari Record (JSON twin) — every run now produces a structured JSON twin alongside the prose Article, suitable for downstream machine-readable consumption.
Initial deployment. Single-page intake → multi-source ingest → concept graph construction with typed edges → Cognitive Traversal Engine (5 operations: backward / forward / branch cataloging / entropy / golden token) → SVW convergence detection → Apriori pattern mining → Article + Technical Report generation. Free 5 runs / month / browser; $15 unlock for Layer 2 + Layer 3 deeper traversal.