fork: scaling fixes (index-only context + chunking + model wiring)

Fixes upstream issues #3/#5/#9 (whole-wiki in every prompt) and adds large-log chunking. Addresses the audit's P1 scaling findings (C1), the chunking requirement the operator added on top, C8 explicit model wiring across all LLM call sites, and the D3 single-event-loop refactor.

## compile.py

- **Index-only context.** The `existing_articles_context` concatenation of every wiki article has been removed from the prompt. The LLM now receives only the index + schema + daily log and uses the Read tool (already in allowed_tools) to fetch the specific articles it decides are relevant. Prompt size stays bounded regardless of KB growth; upstream's 250K-token prompts past ~100 articles are gone.
- **Chunking.** `_split_log_into_chunks()` splits oversized daily logs along `### ` section boundaries. The threshold is MAX_LOG_CHARS_PER_CHUNK (default 100K chars ≈ 25K tokens, configurable via MEMORIA_MAX_LOG_CHARS). Chunks compile via separate LLM calls whose results merge naturally through Edit on shared files. An oversized single section is emitted as its own chunk rather than being split mid-thought.
- **Atomic state on chunked compile.** State is written only after ALL chunks succeed. A partial failure leaves the log flagged as uncompiled in state.json, so the next run retries it cleanly. This was already correct for single-chunk logs (early return on SDK error) and is now correct for multi-chunk logs too.
- **Explicit model.** `model=COMPILE_MODEL` is passed to ClaudeAgentOptions. Default "sonnet"; override via the MEMORIA_COMPILE_MODEL env var.
- **D3: single asyncio.run.** The per-file `asyncio.run()` in the compile loop is replaced with one outer call wrapping `_compile_all`. This avoids repeated event-loop setup/teardown and matches the pattern used for async resources in the SDK.

## query.py

- **Index-only context.** `read_all_wiki_content()` is replaced with `read_wiki_index()`. The LLM reads the index and uses its Read tool to fetch specific articles. Same rationale as compile.py: it keeps prompt size bounded and cost predictable.
- **Explicit model.** `model=QUERY_MODEL`, default "sonnet", override via MEMORIA_QUERY_MODEL.

## lint.py

- **C9: skip qa/sources in missing-backlink check.** Articles under qa/ or sources/ no longer trigger a suggestion that every referenced concept should backlink to them. Concepts aren't expected to link back to every Q&A that mentions them; doing so would drown real relationships.
- **Alias-aware backlink detection.** Uses `extract_wikilinks()` to parse the target's link list so `[[concepts/foo|Display]]` forms count as valid backlinks. Previously an exact `[[foo]]` match was required, which caused false positives on aliased forms.
- **Explicit model.** `model=LINT_MODEL` in the check_contradictions call, default "sonnet", override via MEMORIA_LINT_MODEL.

## Verified

- Chunking: a 120K-char, 3-section log splits into 80K + 40K and reconstructs byte-exact. An oversized single section (150K) is emitted as its own chunk. A small log (<100K) returns as a single chunk.
- All patched modules import cleanly with the expected config values.
- compile_daily_log / query.run_query / flush.maybe_trigger_compilation / lint.check_missing_backlinks are all callable post-patch.
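
## Chunking sketch

For reviewers, the chunking behaviour described under compile.py can be sketched roughly as below. This is a minimal illustration, not the code in this commit: the name `split_log_into_chunks`, the greedy packing strategy, and the regex are assumptions, and the real `_split_log_into_chunks()` may differ in detail. It only aims to reproduce the properties listed under Verified (80K + 40K split, byte-exact reconstruction, oversized sections kept whole).

```python
import re

# Assumed default, mirroring the 100K-char threshold stated in this message.
MAX_LOG_CHARS_PER_CHUNK = 100_000


def split_log_into_chunks(
    text: str, max_chars: int = MAX_LOG_CHARS_PER_CHUNK
) -> list[str]:
    """Greedily pack '### ' sections into chunks of at most max_chars.

    Hypothetical helper for illustration only. A single section larger
    than max_chars is emitted as its own chunk instead of being split.
    """
    if len(text) <= max_chars:
        return [text]

    # Zero-width split before every line that starts with "### ", so that
    # "".join(sections) reproduces the input exactly.
    sections = [s for s in re.split(r"(?m)^(?=### )", text) if s]

    chunks: list[str] = []
    current = ""
    for section in sections:
        # Close the current chunk if adding this section would overflow it.
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks
```

Because the split points are zero-width, joining the chunks gives back the original log unchanged, which is what makes a byte-exact reconstruction check possible. A caller following the atomic-state rule above would compile every chunk first and mark the log as compiled only once all of them have succeeded.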
2026-04-24 17:48:48 -04:00
commit 03296be47a
parent 39ab2a8b6f
3 changed files with 213 additions and 68 deletions
--- a/scripts/lint.py
+++ b/scripts/lint.py
@@ -13,9 +13,15 @@ from __future__ import annotations

 import argparse
 import asyncio
+import os
 from pathlib import Path

 from config import KNOWLEDGE_DIR, REPORTS_DIR, now_iso, today_iso
+
+# Contradiction-check model. Kept as Sonnet for reasoning quality; override
+# via MEMORIA_LINT_MODEL (e.g. to use a cheaper model for structural runs
+# that happen to include the LLM check).
+LINT_MODEL = os.environ.get("MEMORIA_LINT_MODEL", "sonnet")
 from utils import (
    count_inbound_links,
    extract_wikilinks,
@@ -105,20 +111,36 @@ def check_stale_articles() -> list[dict]:


 def check_missing_backlinks() -> list[dict]:
-    """Check for asymmetric links: A links to B but B doesn't link to A."""
+    """Check for asymmetric links: A links to B but B doesn't link to A.
+
+    Skips any source or target under `qa/` or `sources/`: Q&A articles
+    intentionally reference concepts without requiring a reciprocal link
+    (concepts would otherwise accumulate a backlink per question, which
+    drowns real relationships). Also handles pipe-aliased wikilinks via
+    the alias-aware extract_wikilinks helper.
+    """
    issues = []
+    # Aliased/nested variants of the source link that should count as a
+    # valid backlink on the target side: bare slug, and pipe-aliased form.
    for article in list_wiki_articles():
        content = article.read_text(encoding="utf-8")
        rel = article.relative_to(KNOWLEDGE_DIR)
-        source_link = str(rel).replace(".md", "").replace("\\", "/")
+        rel_str = str(rel).replace("\\", "/")
+
+        # Skip one-way source categories.
+        if rel_str.startswith("qa/") or rel_str.startswith("sources/"):
+            continue
+
+        source_link = rel_str.replace(".md", "")

        for link in extract_wikilinks(content):
-            if link.startswith("daily/"):
+            if link.startswith("daily/") or link.startswith("qa/") or link.startswith("sources/"):
                continue
            target_path = KNOWLEDGE_DIR / f"{link}.md"
            if target_path.exists():
                target_content = target_path.read_text(encoding="utf-8")
-                if f"[[{source_link}]]" not in target_content:
+                target_backlinks = extract_wikilinks(target_content)
+                if source_link not in target_backlinks:
                    issues.append({
                        "severity": "suggestion",
                        "check": "missing_backlink",
@@ -185,6 +207,7 @@ Do NOT output anything else - no preamble, no explanation, just the formatted li
            prompt=prompt,
            options=ClaudeAgentOptions(
                cwd=str(ROOT_DIR),
+                model=LINT_MODEL,
                allowed_tools=[],
                max_turns=2,
            ),