Import articles in bulk from CSV or JSON
Import a list of articles into the knowledge base from any structured source — CSV export, JSON dump, scraped help-center pages. The hard part isn't calling kb_create_document; it's making the import idempotent so a partial run can resume, and applying the right chunking so search performs well later.
Read first:
skill://kb/import-from-google-docscovers the chunking strategy (whole-doc vs split-by-H2), tagging conventions, and what the embedding pipeline actually does. This skill assumes you've absorbed that and focuses on the source-format and idempotency layers.
TL;DR
- Pick or create the target space (
kb_list_spaces/kb_create_space). - Validate the source: required columns/keys are
title+body;tagsandpublicare optional. - Decide an idempotency key (an external id you put into a tag like
import:<source-id>). - For each row:
kb_searchfor the import key tag → if hit, skip; otherwise apply chunking andkb_create_document. - Spot-check with
kb_searchon a known phrase to confirm embeddings are indexed.
Step 1 — pick the space
{ "name": "kb_list_spaces", "arguments": {} }
If none fits the source's audience, kb_create_space first (or guide through skill://kb/create-first-space).
Step 2 — validate the source
For CSV:
- Required:
title(≤300 chars),body(markdown, non-empty). - Optional:
tags(comma-separated → array, ≤32 entries, each ≤64 chars),public(boolean → defaults tofalse). - Watch out for: quoted fields containing newlines, escaped commas inside
tags, smart-quote / em-dash artifacts from copy-paste.
For JSON:
- Same fields. Arrays-of-objects is the easiest shape. Reject anything that doesn't have at least
title+body.
If body is HTML rather than markdown, convert before submission — the KB's renderer expects markdown. (Use a known converter; don't roll your own.)
Step 3 — idempotency key
There is no built-in "external id" field on kb_documents. Use a tag as the key:
tags: ["import:<source-id>", ...other tags]
Where <source-id> is the source system's stable id (Zendesk article id, Confluence page id, row index in the CSV).
Before creating, check whether the import key already exists:
{ "name": "kb_search", "arguments": { "query": "import:<source-id>", "limit": 3 } }
kb_search is full-text + semantic — exact-string matches surface in the FTS layer reliably, especially for an unusual prefix like import:. If the search returns the same article, update is not supported via tags — you'd need to delete and recreate, or skip. For most imports, idempotency means "skip if exists".
Step 4 — chunk + create
Apply the chunking rules from skill://kb/import-from-google-docs:
- ≤2000 words → one document, whole-doc as body.
-
2000 words → split by H2 headings, one document per section, each tagged with the source title so they cluster.
For each chunk:
{
"name": "kb_create_document",
"arguments": {
"spaceId": "<spaceId>",
"title": "<chunk title>",
"body": "<markdown body>",
"tags": ["import:<source-id>", "<source-name>", "<topic-tag>"],
"public": false
}
}
Default audiences: ['admin'] unless the source is explicitly customer-facing material. End-user agents only see docs whose audiences includes 'self_service'. Embeddings are generated server-side asynchronously — you don't wait, but search results for new docs become accurate within ~5 seconds.
Pace: a few documents per second is fine. If you're importing thousands of rows, throttle to leave room for other traffic.
Step 5 — verify
After the loop, spot-check with a known phrase from the source:
{ "name": "kb_search", "arguments": { "query": "<distinctive phrase>", "limit": 5 } }
Then count: search by the import-tag prefix and confirm the count matches your source row count (minus skips).
What NOT to do
- Don't ignore
kb_create_documenterrors. Title-too-long and empty-body errors are common in CSV exports; logging them per-row lets you fix the source data and re-run. - Don't import with
audiences: ['admin', 'self_service']by default. That exposes content to end-user agents viakb_search. Only include'self_service'for explicitly customer-facing knowledge. - Don't import the same source twice without the import-tag idempotency key. Duplicate articles dilute search quality and confuse end-users.
- Don't set hundreds of identical tags across the import. Tags are filterable; that means each tag is part of someone's mental model. Use tags as facets, not as labels.
Related
skill://kb/import-from-google-docs— chunking + embedding details (read first).skill://kb/create-first-space— creating the space if you don't have one.