Build a Custom Knowledge Base
The Alauda Hyperflux plugin ships with a built-in knowledge base covering Alauda Container Platform (ACP) and Alauda product documentation. For most deployments this is enough. You should follow this guide when you also want Hyperflux to answer questions grounded in:
- Internal runbooks, SRE playbooks, or design documents specific to your organisation.
- Versions or branches of Alauda product docs that don't ship in the bundled dump.
- Customer-facing documentation that lives in private Git repositories.
The output of this workflow is a PostgreSQL dump file that can be restored into the Hyperflux system KB
(docvec_sys_kb) — either replacing the bundled corpus or extending it.
TOC
- How the embedding pipeline works
- Prerequisites
- Step 1 — Describe your corpus
- Step 2 — Set the builder environment
- Step 3 — Generate documents.json (prepare phase)
- Step 4 — Embed into PostgreSQL
- Step 5 — Validate retrieval quality (optional but recommended)
- Step 6 — Deliver to production
  - Mode A — Embed directly into the production PG
  - Mode B — Dump and restore via the chart's data-swap mechanism (recommended)
    - B1. Produce the dump
    - B2. Ship the dump into the plugin image
    - B3. Configure the swap rule
  - Mode C — Manual pg_restore (air-gapped or one-off)
- Re-roll cadence
- Caveats

How the embedding pipeline works
The smart-doc builder turns a list of Git repositories into a vectorised knowledge base in three stages:
- Prepare — clone the repos, split each `.md`/`.mdx` document by heading and chunk size, then call an LLM to generate a one-paragraph summary and a few representative questions per document. Output: `documents.json`.
- Embed — load the `gte-multilingual-base` embedding model, embed both the chunks and the per-document summaries, and write them to a PostgreSQL + ParadeDB instance as a LangChain PGVector collection. A BM25 index is created alongside the vector index.
- Dump — `pg_dump` the resulting collection so it can be shipped to production.
The resulting dump is restored on the production cluster either by the init container (preferred) or with
pg_restore directly.
Prerequisites
- A workstation with Python 3.13+, `uv`, and `git`.
- Read access to every Git repository you want to ingest (HTTPS token or SSH key).
- An LLM endpoint with sufficient quota — the prepare phase calls the LLM once per source document (roughly 1,000–3,000 calls per ACP-sized knowledge base; ~$5–20 on Azure GPT-5-mini at the time of writing).
- The `gte-multilingual-base` embedding model, downloaded once from HuggingFace (see the sketch after this list). The same model is baked into the production image at `/opt/gte-multilingual-base`. If your custom KB uses any other embedding model, the production server will not be able to query it.
- A PostgreSQL + ParadeDB instance reachable from the workstation. The simplest options:
  - Run the same `mlops/paradedb:0.22.6-pg18` image locally with Docker.
  - Connect the builder directly to the production PG (skip the dump-and-restore step).
- A clone of the smart-doc repository for the builder CLI (also shown in the sketch below).
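A minimal fetch sketch; the HuggingFace repo id and the smart-doc remote URL are assumptions, so substitute whatever your organisation actually uses:

```bash
# Assumed HF repo id for gte-multilingual-base; verify before running.
huggingface-cli download Alibaba-NLP/gte-multilingual-base \
  --local-dir ./gte-multilingual-base

# Placeholder URL; clone your smart-doc fork or upstream.
git clone https://github.com/<your-org>/smart-doc.git
cd smart-doc
```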
Step 1 — Describe your corpus
Create a JSON manifest that lists every Git repository you want to ingest. Save it under
builder/data/<your-name>.json.
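A minimal manifest sketch; `sub_dirs` and `doc_type` are field names referenced later in this guide, while the other keys here are assumptions, so treat `builder/README.md` as authoritative:

```json
[
  {
    "repo": "https://github.com/<your-org>/sre-runbooks.git",
    "branch": "main",
    "sub_dirs": ["docs/"],
    "doc_type": "md"
  }
]
```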
See `builder/README.md` for the full field reference.
Step 2 — Set the builder environment
Copy builder/run.sh and fill in the credentials. You only need the LLM and embedding sections for this
workflow:
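A sketch of what those sections might look like; `EMB_MODEL` is the variable checked later in this guide, while the LLM variable names here are assumptions, so mirror whatever names the `builder/run.sh` template actually uses:

```bash
# LLM credentials for the prepare phase (variable names are assumptions;
# follow the template in builder/run.sh).
export LLM_API_BASE="https://<your-endpoint>.openai.azure.com"
export LLM_API_KEY="<token>"

# Path to the downloaded embedding model. Must be gte-multilingual-base,
# the same model the production image bakes in at /opt/gte-multilingual-base.
export EMB_MODEL="./gte-multilingual-base"
```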
Then source it:
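```bash
source builder/run.sh
```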
Step 3 — Generate documents.json (prepare phase)
Clone the repos, split into chunks, and have the LLM produce summaries:
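A hypothetical invocation; the entrypoint name and `--manifest` flag are assumptions (check the builder's README for the real command), while `--dryrun` and the manifest location are from this guide:

```bash
cd builder
# Entrypoint and --manifest are guesses; --dryrun is a real flag (see below).
uv run prepare --manifest data/<your-name>.json --dryrun   # inspect the doc list
uv run prepare --manifest data/<your-name>.json            # full run with LLM calls
```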
Useful flags:
- `--dryrun` — print the document list without calling the LLM. Use this to confirm your `sub_dirs` and `doc_type` filters match what you expect.
This step is idempotent on the LLM side — repeated runs reuse cached LLM responses keyed by document content, so re-running after editing the manifest only costs LLM calls for the changed docs.
Step 4 — Embed into PostgreSQL
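A hypothetical invocation in the same spirit as the prepare step; the entrypoint name is an assumption, while `--pg-conn-str` and `--collection-name` are the flags Mode A below relies on:

```bash
cd builder
# Entrypoint is a guess; both flags are referenced later in this guide.
uv run embed \
  --pg-conn-str "postgresql://postgres:<password>@localhost:5432/docvec_sys_kb" \
  --collection-name "<your-collection-name>"
```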
What the flags do: `--pg-conn-str` points the builder at your PostgreSQL + ParadeDB instance, and `--collection-name` names the PGVector collection that the production server will later be configured to read (via `pgconnect.pgCollectionName`).
The embed step does not call the LLM; it can be re-run cheaply to try a different chunk size or to extend an existing collection with new documents.
Step 5 — Validate retrieval quality (optional but recommended)
If you maintain test cases under evaluator/cases/retrieval/, point the evaluator at your new collection:
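A hypothetical invocation; the evaluator's entrypoint and flag names are assumptions, with only the cases directory and collection name taken from this guide:

```bash
# Entrypoint and flag names are guesses; check the evaluator's README.
uv run evaluate \
  --cases evaluator/cases/retrieval/ \
  --collection-name "<your-collection-name>"
```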
Look for Recall@5 ≥ 0.80 and MRR@5 ≥ 0.55 as a sanity baseline; the bundled ACP corpora hit
Recall@5 ≈ 0.82 on hybrid retrieval.
Step 6 — Deliver to production
Choose one of the two delivery modes below. Mode A is simpler for a first cut; Mode B is the production-ready path.
Mode A — Embed directly into the production PG
If your workstation can reach the production PostgreSQL, point --pg-conn-str at it and re-run Step 4 with
--collection-name set to whatever you want the production server to read. Then update the chart value
pgconnect.pgCollectionName to that name and roll the smart-doc deployment.
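A sketch of that round trip, assuming a Helm-managed install; the release name, chart reference, connection string, and collection name are placeholders:

```bash
# Re-run the embed step against production (placeholders throughout).
uv run embed \
  --pg-conn-str "postgresql://postgres:<password>@<prod-pg-host>:5432/docvec_sys_kb" \
  --collection-name "<your-collection-name>"

# Point the server at the new collection and roll the deployment.
helm upgrade <release> <chart> --reuse-values \
  --set pgconnect.pgCollectionName="<your-collection-name>"
```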
This bypasses the dump-and-restore round-trip but couples the build environment to production. Suitable for single-cluster deployments where you control both ends.
Mode B — Dump and restore via the chart's data-swap mechanism (recommended)
This mirrors how the bundled ACP dumps are shipped. The init container handles the swap atomically and idempotently, so it is safe across pod restarts and multi-replica deployments.
B1. Produce the dump
The dump file name must equal the collection name (without the .dump suffix). The init container's
upgrade rule parser uses the file name as the collection's internal name when restoring.
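A minimal sketch using pg_dump's custom format (the format pg_restore consumes); the connection string is a placeholder. Note the file name matches the collection name, per the rule above:

```bash
# Custom-format dump; the file name (minus .dump) must equal the collection name.
pg_dump --format=custom \
  --dbname="postgresql://postgres:<password>@localhost:5432/docvec_sys_kb" \
  --file="<your-collection-name>.dump"
```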
B2. Ship the dump into the plugin image
The bundled dumps live at /workspace-smart-doc/dumps/ inside the smart-doc container, baked into the
mlops/smart-doc image at build time (see dumps/ in the smart-doc repository). To add a custom dump:
- Drop your `<your-collection-name>.dump` into `dumps/` in your smart-doc fork.
- Rebuild the smart-doc container image and push it to your registry (see the sketch below).
- Update the chart value `global.images.smartdoc.tag` to your new tag.
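A sketch of the rebuild, assuming a standard docker workflow run from the root of your smart-doc fork; the registry and tag are placeholders:

```bash
# Rebuild with your dump now sitting in dumps/, then push.
docker build -t <your-registry>/mlops/smart-doc:<your-tag> .
docker push <your-registry>/mlops/smart-doc:<your-tag>
```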
If you cannot rebuild the image, fall back to Mode A or to manual pg_restore (Mode C below).
B3. Configure the swap rule
In your values.yaml override (or in the install form's advanced YAML editor), set:
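A hypothetical shape for that override; `acpKbDataVersion` is the chart value the swap keys on (see Re-roll cadence below), while the swap-rule key name and syntax here are assumptions, so consult the chart's values reference:

```yaml
# Bump acpKbDataVersion on every re-roll; the init container skips the swap
# when it is unchanged.
acpKbDataVersion: "20260601-internal"
# Hypothetical key and syntax: pairs your new dump with the collection name
# the server already queries.
kbSwapRule: "<your-collection-name> -> <current_collection_name>"
```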
`<current_collection_name>` is whatever the running server is using — typically the bundled
`gte-multilingual-base_20260410` for fresh v1.4.0 deployments. After the swap, the init container renames the
new collection to that old name, so `pgconnect.pgCollectionName` can stay unchanged on the server side.
Apply the chart upgrade. The init container will:
- Acquire an advisory lock (multi-replica safe).
- Record the in-flight swap into `kb_swap_state` (crash-safe — recovery on restart).
- Drop the existing collection's tables in `docvec_sys_kb`.
- `pg_restore` your dump into the same database.
- Rename the new collection to `<current_collection_name>` so the server keeps querying the same name.
- Stamp `schema_migrations` and clear `kb_swap_state`.
You can confirm by tailing the init log:
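A sketch, assuming a standard Kubernetes install; the namespace, label selector, and init-container name are placeholders for whatever your deployment actually uses:

```bash
kubectl logs -n <namespace> -l app=smart-doc -c <init-container-name> -f
```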
Mode C — Manual pg_restore (air-gapped or one-off)
If neither rebuilding the image nor reaching production from the workstation is possible, the dump can be restored by hand into the running PG and the chart told to query it directly:
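A sketch of the manual path; connection details, release, and chart names are placeholders, while the database name and chart value are from this guide:

```bash
# Restore the dump by hand into the system KB database.
pg_restore --no-owner \
  --dbname="postgresql://postgres:<password>@<pg-host>:5432/docvec_sys_kb" \
  "<your-collection-name>.dump"

# Tell the server to query the restored collection directly.
helm upgrade <release> <chart> --reuse-values \
  --set pgconnect.pgCollectionName="<your-collection-name>"
```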
This path skips the chart-level idempotency stamp; if you re-roll the chart it will not re-apply your custom data. Switch to Mode B as soon as the constraint that forced this path is gone.
Re-roll cadence
Custom knowledge bases drift faster than product documentation. Plan to re-run prepare → embed → dump on a schedule that matches your source repos:
- Daily-changing runbooks → nightly cron, swap with a fresh `acpKbDataVersion` each run.
- Quarterly architecture docs → manual re-roll on each release boundary.
The init container only re-runs the swap when acpKbDataVersion changes, so re-deploying the chart with the
same version is a no-op even if the dump file on disk has been replaced.
Caveats
- The custom KB occupies the same `docvec_sys_kb` database as any bundled ACP dump. Mode B replaces the bundled dump entirely. If you want to keep ACP product knowledge alongside your internal docs, ingest both the ACP repos and your internal repos in the same prepare run so the resulting dump contains both.
- The user-facing BYO Knowledge Tool (introduced in v1.3.1) writes to `docvec_user_kb` instead and is unaffected by this workflow. Use BYO Knowledge for end-user document uploads, and use this guide for admin-curated corpora baked into the deployment.
- Embedding-model mismatch is the most common failure mode: vectors built with a non-`gte-multilingual-base` model will be silently retrievable but always score near zero, so the answer quality collapses without an obvious error. If you see "I don't have enough information" answers across the board after a custom-KB swap, double-check `EMB_MODEL` was the same path the production server uses (`/opt/gte-multilingual-base`).