M5: parallel enrichment fetch + deps/rdeps dependency-graph queries #1

Merged
dxtr merged 5 commits from m5-parallel-deps into main 2026-06-06 17:43:05 +02:00

Summary

Implements the two M5 stretch goals: parallelize Tier-2 enrichment and add
dependency-graph queries. No semantic-search work — the core stays lexical FTS.

Parallel enrichment (lparallel)

  • Split enrich-system into a network-only fetch-enrichment (no DB) plus a
    serial DB write. Fetches (ocicl list + oras pull + .asd parse + README)
    now run concurrently while cl-sqlite writes stay serialized on the main thread.
  • enrich-rows fans fetches across an lparallel kernel of --jobs workers,
    writing each result as it arrives off a channel. --jobs 1 (or a single row)
    falls back to the sequential path. *registry* and the output streams are
    propagated into workers.
  • Thread-safety fix: with-temp-dir used (random …) on the shared
    *random-state*, which is not safe to use from several threads at once.
    Replaced it with an atomic counter so parallel workers never collide on
    temp-dir names.
  • Adds lparallel to :depends-on (ocicl.csv lockfile updated).
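
The fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: fake-fetch, enrich-in-parallel, and the write-fn callback are hypothetical names, the *registry* value is made up, and it assumes lparallel is loadable through ASDF.

```lisp
;; Assumption: lparallel is installed where ASDF can find it (ocicl, Quicklisp, ...).
(require "asdf")
(asdf:load-system "lparallel")

(defvar *registry* "unset"
  "Stand-in for the project's *registry* special.")

(defun fake-fetch (name)
  "Hypothetical stand-in for fetch-enrichment: slow, network-like, no DB access."
  (sleep 0.01)
  (list :name name :registry *registry*))

(defun enrich-in-parallel (names jobs write-fn)
  "Run FAKE-FETCH for NAMES on JOBS workers; call WRITE-FN on the calling thread."
  (let* ((lparallel:*kernel*
           (lparallel:make-kernel
            jobs
            ;; Worker threads do not see the caller's dynamic bindings,
            ;; so the current values are handed over explicitly.
            :bindings `((*registry* . ,*registry*)
                        (*standard-output* . ,*standard-output*))))
         (channel (lparallel:make-channel)))
    (unwind-protect
         (progn
           (dolist (name names)
             (lparallel:submit-task channel #'fake-fetch name))
           ;; Results come back in completion order, not submission order.
           ;; WRITE-FN runs only here, so DB writes would stay serialized.
           (loop repeat (length names)
                 do (funcall write-fn (lparallel:receive-result channel))))
      (lparallel:end-kernel :wait t))))
```

A caller would bind *registry*, pass a list of system names and a job count, and do the database write inside write-fn; because only the calling thread drains the channel, the writes never overlap.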

Dependency-graph queries

  • db: deps-rows / rdeps-rows via SQLite json_each over the stored deps
    JSON. deps left-joins back to the catalog so indexed dependencies show their
    version/description; unindexed ones still list by name.
  • cli: new deps NAME and rdeps NAME subcommands, both supporting
    --names-only and --json, with proper exit codes (missing system → 1,
    missing arg → 2). Shared emit-rows output helper.

$ lispsearch deps drakma
$ lispsearch rdeps alexandria --names-only
$ lispsearch deps cl+ssl --json
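
For readers unfamiliar with json_each, the two queries could look something like the sketch below. The systems table, its columns, and the sample rows are assumptions made for illustration; the real schema and the bodies of deps-rows / rdeps-rows may differ. It assumes cl-sqlite (ASDF system "sqlite") is loadable and that the linked libsqlite3 includes JSON1.

```lisp
;; Assumption: cl-sqlite is installed where ASDF can find it.
(require "asdf")
(asdf:load-system "sqlite")

(defun demo-deps-queries ()
  "Return two values: deps of drakma and reverse deps of usocket (toy data)."
  (sqlite:with-open-database (db ":memory:")
    ;; Hypothetical schema; deps holds a JSON array of dependency names.
    (sqlite:execute-non-query
     db "CREATE TABLE systems (name TEXT PRIMARY KEY, version TEXT,
                               description TEXT, deps TEXT)")
    (loop for (name version description deps)
            in '(("drakma" "1.0" "HTTP client" "[\"usocket\",\"cl+ssl\"]")
                 ("cl+ssl" "1.0" "TLS bindings" "[\"usocket\"]"))
          do (sqlite:execute-non-query
              db "INSERT INTO systems VALUES (?, ?, ?, ?)"
              name version description deps))
    (values
     ;; deps: expand the JSON array, then left-join back to the catalog so
     ;; unindexed dependencies still appear, with a NULL version.
     (sqlite:execute-to-list
      db "SELECT j.value, s.version
            FROM systems AS p
            JOIN json_each(p.deps) AS j
            LEFT JOIN systems AS s ON s.name = j.value
           WHERE p.name = ?
           ORDER BY j.value"
      "drakma")
     ;; rdeps: every system whose deps array contains the given name.
     (sqlite:execute-to-list
      db "SELECT p.name
            FROM systems AS p, json_each(p.deps) AS j
           WHERE j.value = ?
           ORDER BY p.name"
      "usocket"))))
```

In this toy data usocket is not indexed, so it is listed by name only while cl+ssl carries its version, which mirrors the indexed/unindexed distinction described above.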


Tests

  • test/m5-stretch.lisp — deterministic, network-free coverage of deps/rdeps
    (incl. json_each, registry-name lookup, sorting, empty/unknown cases) and both
    the parallel and sequential enrich paths (15 checks).
  • test/m4-cli.sh: deps/rdeps wiring (--help) and error-path checks.
  • M1 and M3 still pass; M3 now exercises the parallel path. Binary rebuilds clean.

Note on the parallelism benefit

Early benchmarks made parallel look slower (jobs=8: 591s vs jobs=1: 510s).
That turned out to be a measurement artifact, not a regression:

  1. Warm/cold cache ordering: a cold ocicl list takes ~7–20s, a warm one ~1s
    (the cache expires within minutes). Whichever arm ran first warmed the cache
    for the second, so back-to-back A/B runs on the same systems are meaningless.
  2. A transient registry stall on one cold parallel batch that didn't reproduce.

A controlled test on two disjoint, equally-cold 8-system sets — biased toward
serial (serial ran second, with connections already warm) — confirms parallel wins:

| Run | Total |
|-----|-------|
| Parallel --jobs 8 (cold) | 12.5s |
| Serial --jobs 1 (cold) | 15.7s |

Per-system latency is network/cold-cache bound (~1s → 100s+) and dwarfs the
parallel-vs-serial axis; the parallel total tracks the slowest single chain, as
expected.
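
As a rough model of why the parallel total follows the slowest chain (an idealization, assuming per-system fetch times t_i, at least as many workers as systems, and negligible scheduling overhead), the two totals and the observed ratio work out as:

```latex
T_{\text{serial}} = \sum_{i=1}^{n} t_i ,
\qquad
T_{\text{parallel}} \approx \max_{1 \le i \le n} t_i ,
\qquad
\frac{T_{\text{serial}}}{T_{\text{parallel}}}
  = \frac{15.7\,\text{s}}{12.5\,\text{s}} \approx 1.26 .
```

That is about 20% less wall time (3.2s of 15.7s). Because the two runs used disjoint system sets, the ratio is indicative only: a single slow chain in the parallel set caps the achievable gain.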

Notes

  • program-op build still SBCL-only (uses sb-ext:atomic-incf, consistent with
    the existing build target).
  • deps/rdeps rely on SQLite's JSON1 (json_each), present in the linked
    libsqlite3.
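
To illustrate the atomic-counter approach, here is a minimal SBCL-only sketch. The names *temp-dir-counter* and unique-temp-dir-name are hypothetical, and the real with-temp-dir may build its names differently.

```lisp
;; SBCL-only: sb-ext:atomic-incf works on a restricted set of places;
;; the CAR of a cons holding a fixnum is one of them.
(defvar *temp-dir-counter* (list 0)
  "Hypothetical process-wide counter; only its CAR is ever modified.")

(defun unique-temp-dir-name (&optional (prefix "lispsearch-tmp"))
  "Return a name that no other thread in this process will be handed."
  ;; atomic-incf returns the value held before the increment, so each
  ;; caller observes a distinct number even under contention.
  (format nil "~a-~d" prefix (sb-ext:atomic-incf (car *temp-dir-counter*))))
```

Unlike a draw from a shared random state, the counter cannot hand the same value to two workers, so uniqueness within one process does not depend on luck. Names from separate processes could still clash unless something like the process id is mixed in.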
Scaffold lispsearch: package, config specials, name/util helpers, and the
SQLite + FTS5 database layer (external-content index kept in sync via
triggers). Includes migrate (idempotent), tier-1 name upsert, full enriched
upsert, BM25 full-text search, GLOB name search, get-system, and stats.

test/m1-db.lisp exercises the layer end to end (FTS ranking, prefix queries,
insert/update/delete trigger sync) and all checks pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
catalog.lisp fetches the authoritative ocicl system list over HTTP (dexador),
parses it into deduped (registry . decoded-system) pairs, and hashes it
(catalog-sha) for change detection. index.lisp build-index bulk-loads every
name in one transaction and records catalog_sha/last_refresh; Tier-2 enrichment
is stubbed until M3. search.lisp adds run-search with a GLOB name path and the
FTS path.

test/m2-catalog.lisp loads the full live catalog (2753 systems) and verifies
name search ("trivial" --names-only) and idempotent re-build; all checks pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
asd.lisp parses an .asd safely as DATA (*read-eval* nil, permissive readtable,
never load) and extracts description/author/license/depends-on from the root
defsystem form. enrich.lisp gathers versions (ocicl list, raw name + explicit
registry), pulls the artifact (oras pull + tar), picks the root .asd, reads the
README, and upserts an enriched record; failures degrade gracefully (row stays
enriched=0). index.lisp wires build-index --enrich and adds refresh-index.

Search now uses prefix matching by default so "html" hits "html5" — fixing the
unicode61 tokenization gap so `search "html parser"` ranks cl-html5-parser and
plump as intended.

test/m3-enrich.lisp enriches real systems end to end (incl. parse-asd safety:
#. is never evaluated) and all checks pass; M2 test updated for build-index's
(values loaded enriched) return.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cli.lisp defines the clingon command tree (index/refresh/search/show/stats) with
a global --db option; main.lisp is the program-op entry point. search.lisp gains
format-row/format-show plus jzon-backed --json for search and show. The asd adds
clingon + com.inuoe.jzon and builds the standalone `lispsearch` executable to the
project root (build-pathname ../lispsearch).

test/m4-cli.sh builds the binary and exercises every subcommand end to end (index
with a small enrich slice, search/show in text and JSON, error exit codes, and a
full Tier-1 load with name search); all checks pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Parallelize Tier-2 enrichment and add dependency-graph CLI queries.

Parallel enrichment (lparallel):
- Split enrich-system into a network-only fetch-enrichment (no DB) plus a
  serial DB write, so fetches can run concurrently while cl-sqlite writes
  stay serialized on the main thread.
- enrich-rows fans fetches across an lparallel kernel of --jobs workers,
  writing each result as it arrives off a channel; --jobs 1 or a single row
  falls back to the sequential path. *registry* and the output streams are
  propagated into workers.
- Fix a latent thread-safety bug: with-temp-dir used (random) on the shared
  *random-state* (not thread-safe under concurrency); replaced with an atomic
  counter so parallel workers never collide on temp-dir names.
- Add lparallel to :depends-on (ocicl.csv updated).

Dependency-graph queries:
- db: deps-rows / rdeps-rows via SQLite json_each over the stored deps JSON;
  deps left-joins back to the catalog so indexed deps show version/description.
- cli: deps NAME and rdeps NAME subcommands with --names-only / --json and
  proper exit codes; shared emit-rows helper.

Tests:
- test/m5-stretch.lisp: deterministic, network-free coverage of deps/rdeps and
  the parallel + sequential enrich paths.
- test/m4-cli.sh: deps/rdeps wiring and error-path checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dxtr merged commit a34ee6b365 into main 2026-06-06 17:43:05 +02:00
dxtr deleted branch m5-parallel-deps 2026-06-06 17:43:05 +02:00
Reference
dxtr/ocicl-search!1