# Knowledge Ingestion
Enumerate sources, chunk documents, and ingest content through CLI or API with deterministic idempotency.
## Supported Source Types
| Kind | Description | Example keys |
|---|---|---|
| `markdown_glob` | Expand local Markdown files via glob pattern | `glob`, optional `ignore` |
| `url` | Fetch and cache a small remote document (PDF, HTML, text, …) | `url`, optional `etag_header` |
| `cloud_storage` | (Planned) Enumerate objects from S3/GCS prefixes | `provider`, `bucket`, `prefix`; credentials via env |
| `media_transcript` | (Planned) Convert `.srt`/`.vtt` transcripts to documents | `path`, optional `language` metadata |
Only `markdown_glob` and `url` ship in the MVP. Treat the other kinds as planned features unless your deployment explicitly enables them.
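For orientation, a source block using the two MVP kinds might look like the sketch below. The key names `glob`, `ignore`, `url`, and `etag_header` come from the table above; the surrounding `sources:` layout is an assumption, so check the resource schema for your version before copying it:

```yaml
sources:
  - kind: markdown_glob
    glob: "docs/**/*.md"        # expanded locally
    ignore: "docs/archive/**"   # optional exclusion pattern
  - kind: url
    url: "https://example.com/handbook.pdf"
    etag_header: "ETag"         # optional; enables conditional re-fetch
```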
## Chunking & Preprocessing
- **Strategy** – `token` (default) balances embedding quality and throughput.
- **Size** – Default from `config.knowledge.chunk_size` (a practical starting value is `512`).
- **Overlap** – Default from `config.knowledge.chunk_overlap` (a practical starting value is `64`).
- **Normalization** – Hashes are stored per chunk to make ingestion idempotent; unchanged content is skipped automatically.
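The size and overlap defaults imply a sliding window: each chunk carries up to `chunk_size` tokens and repeats the last `chunk_overlap` tokens of its predecessor so context survives the boundary. A minimal sketch of that windowing (plain lists stand in for the real tokenizer, which is an assumption):

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=64):
    """Split a token list into overlapping chunks.

    Each chunk holds up to chunk_size tokens; consecutive chunks
    share chunk_overlap tokens across the boundary.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
# With the defaults, windows start every 448 tokens (512 - 64),
# so 1200 tokens yield chunks starting at 0, 448, and 896.
```

Raising the overlap improves recall across chunk boundaries at the cost of more embeddings per document.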
## Ingest Modes
- `manual` (default) keeps ingestion as an operator-triggered task via CLI or API.
- `on_start` instructs the server to run ingestion immediately after project indexing completes, for both project- and workflow-scoped knowledge bases; startup fails fast if any pipeline returns an error, so issues surface before the API becomes available.
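As a sketch, selecting the mode might look like the fragment below. The exact key path is an assumption (only the mode names `manual` and `on_start` come from this page), so consult the resource schema for your version:

```yaml
knowledge:
  ingest_mode: on_start   # assumed key; "manual" is the default
```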
## Running Ingestion
```bash
# Apply resource definitions first
compozy knowledge apply --file compozy.yaml

# Trigger ingestion
compozy knowledge ingest --id quickstart_docs

# Re-run to pick up changes; unchanged chunks are skipped
compozy knowledge ingest --id quickstart_docs
```
## Job Lifecycle
1. **Enumerate** – Resolve sources, expand globs, and detect remote downloads. Files larger than 100KB are skipped by default.
2. **Chunk & Embed** – Apply the chunking policy, generate embeddings in batches (respecting `config.knowledge.embedder_batch_size`), and retry provider throttles with jitter.
3. **Persist & Commit** – Write vectors and metadata to the configured store, hash results, and emit the `knowledge_ingest_duration_seconds` and `knowledge_chunks_total` metrics.
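The idempotency behind the re-run step can be pictured as a per-chunk content-hash check before embedding. The sketch below is illustrative only: the store layout and normalization rules are internal to Compozy, so the names here are assumptions:

```python
import hashlib


def content_hash(chunk_text: str) -> str:
    # Normalize whitespace so cosmetic edits don't force re-embedding.
    # (The real normalization scheme is an internal detail; this is a stand-in.)
    normalized = " ".join(chunk_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_ingest(chunks, stored_hashes):
    """Return only the chunks whose hash is not already persisted."""
    return [c for c in chunks if content_hash(c) not in stored_hashes]


# A chunk already in the store is skipped even after a whitespace-only edit.
stored = {content_hash("hello world")}
todo = plan_ingest(["hello  world", "brand new chunk"], stored)
# → only "brand new chunk" needs embedding
```

This is why re-running `compozy knowledge ingest` is cheap: only chunks whose hash changed are embedded and persisted again.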
## Monitoring Progress
- CLI commands print structured logs (start, completion, error) sourced from `logger.FromContext(ctx)`.
- The API responds with an updated ETag for the knowledge base; poll `GET /knowledge-bases/{kb_id}` until the response returns the new value.
- Observability metrics are detailed in the Knowledge Observability guide.
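The ETag polling can be wrapped in a small loop. A sketch with an injectable `fetch` callable standing in for a `GET /knowledge-bases/{kb_id}` request (the retry cadence and helper names are assumptions, not part of the API):

```python
import time


def wait_for_new_etag(fetch, old_etag, attempts=30, delay=1.0):
    """Poll fetch() until it reports an ETag different from old_etag.

    fetch is expected to issue GET /knowledge-bases/{kb_id} and
    return the response's ETag header value.
    """
    for _ in range(attempts):
        etag = fetch()
        if etag and etag != old_etag:
            return etag
        time.sleep(delay)
    raise TimeoutError("knowledge base ETag did not change")


# Stubbed fetch for illustration: the third poll sees the refreshed ETag.
responses = iter(['"v1"', '"v1"', '"v2"'])
new_etag = wait_for_new_etag(lambda: next(responses), '"v1"', delay=0)
```

Injecting `fetch` keeps the loop testable; in practice it would wrap your HTTP client of choice.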
## Quickstart Walkthrough
The Quickstart Markdown Glob example demonstrates the full ingestion workflow:
- Clone the example assets and ensure `OPENAI_API_KEY` is present in `.env`.
- Run `compozy knowledge apply` to register resources.
- Execute `compozy knowledge ingest --id quickstart_docs`.
- Launch `compozy run workflows/qa.yaml --input '{"question":"What is Compozy knowledge?"}'` to see retrieval in action.
Because ingestion is idempotent, you can edit Markdown files and repeat the step to refresh embeddings without truncating the vector store manually.