Knowledge Ingestion
Enumerate sources, chunk documents, and ingest content through CLI or API with deterministic idempotency.
Supported Source Types
| Kind | Description | Example keys |
|---|---|---|
| `markdown_glob` | Expand local Markdown files via glob pattern | `glob`, optional `ignore` |
| `url` | Fetch and cache a small remote document (PDF, HTML, text, …) | `url`, optional `etag_header` |
| `cloud_storage` | (Planned) Enumerate objects from S3/GCS prefixes | `provider`, `bucket`, `prefix`; credentials via env |
| `media_transcript` | (Planned) Convert `.srt`/`.vtt` transcripts to documents | `path`, optional language metadata |
Only `markdown_glob` and `url` ship in the MVP. Treat the other kinds as planned features unless your deployment explicitly enables them.
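As an illustration, a knowledge base combining both MVP source kinds might be declared as follows. The `kind` values come from the table above; the surrounding field names (`knowledge`, `id`, `sources`, and the glob/URL keys' placement) are assumptions and may differ in your schema.

```yaml
# Sketch only: field layout is illustrative, not authoritative.
knowledge:
  - id: quickstart_docs
    sources:
      - kind: markdown_glob
        glob: "docs/**/*.md"
        ignore:
          - "docs/drafts/**"
      - kind: url
        url: "https://example.com/handbook.pdf"
```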
Chunking & Preprocessing
- Strategy – `token` (default) balances embedding quality and throughput.
- Size – Default from `config.knowledge.chunk_size` (a practical starting value is 512).
- Overlap – Default from `config.knowledge.chunk_overlap` (a practical starting value is 64).
- Normalization – Hashes are stored per chunk to make ingestion idempotent; unchanged content is skipped automatically.
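The overlap arithmetic above can be sketched as a sliding window over tokens: each chunk begins `chunk_size - chunk_overlap` tokens after the previous one, so neighbors share `chunk_overlap` tokens of context. This is a minimal illustration, not the actual chunker implementation.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=64):
    """Split a token list into overlapping fixed-size windows.

    Consecutive chunks share chunk_overlap tokens, mirroring the
    defaults described above (512-token chunks, 64-token overlap).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # each window starts 448 tokens later
    return [
        tokens[i:i + chunk_size]
        for i in range(0, max(len(tokens) - chunk_overlap, 1), step)
    ]
```

With the defaults, a 1,000-token document yields three chunks, and the last 64 tokens of each chunk reappear at the start of the next.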
Ingest Modes
- `manual` (default) keeps ingestion as an operator-triggered task via CLI or API.
- `on_start` instructs the server to run ingestion immediately after project indexing completes for both project- and workflow-scoped knowledge bases; startup fails fast if any pipeline returns an error, so issues surface before the API becomes available.
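Switching modes is a one-line config change. The exact field name carrying the mode is an assumption here; only the `manual` and `on_start` values come from the text above.

```yaml
# Illustrative fragment; the field name for the mode may differ.
knowledge:
  - id: quickstart_docs
    ingest: on_start  # run ingestion right after project indexing (default: manual)
```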
Running Ingestion
```bash
# Apply resource definitions first
compozy knowledge apply --file compozy.yaml

# Trigger ingestion
compozy knowledge ingest --id quickstart_docs

# Re-run to pick up changes; unchanged chunks are skipped
compozy knowledge ingest --id quickstart_docs
```
Job Lifecycle
Enumerate
Resolve sources, expand globs, and detect remote downloads; files larger than 100 KB are skipped by default.
Chunk & Embed
Apply the chunking policy, generate embeddings in batches (respecting `config.knowledge.embedder_batch_size`), and retry provider throttles with jitter.
Persist & Commit
Write vectors and metadata to the configured store, hash results, and emit `knowledge_ingest_duration_seconds` and `knowledge_chunks_total` metrics.
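The throttle handling in the Chunk & Embed step is a standard exponential-backoff-with-jitter pattern. The sketch below is an assumption about the general technique, not the actual retry code; the callable, retryable error type, and delay bounds are all placeholders.

```python
import random
import time

def with_retries(call, attempts=5, base_delay=0.5, max_delay=8.0,
                 retryable=(TimeoutError,), sleep=time.sleep, rng=random.random):
    """Retry a throttled call with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the provider error
            # Full jitter: sleep a random fraction of the capped backoff window.
            backoff = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * backoff)
```

Injecting `sleep` and `rng` keeps the backoff schedule deterministic in tests while real callers get random jitter.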
Monitoring Progress
- CLI commands print structured logs (start, completion, error) sourced from `logger.FromContext(ctx)`.
- The API responds with an updated ETag for the knowledge base; poll `GET /knowledge-bases/{kb_id}` until the response returns the new value.
- Observability metrics are detailed in the Knowledge Observability guide.
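The ETag polling loop above can be sketched as follows. `fetch_etag` stands in for whatever client code issues `GET /knowledge-bases/{kb_id}` and extracts the ETag; the function name, timeout, and interval are assumptions for illustration.

```python
import time

def wait_for_etag(fetch_etag, expected, timeout_s=60.0, interval_s=1.0,
                  sleep=time.sleep):
    """Poll until fetch_etag() returns the expected ETag or the timeout elapses.

    fetch_etag is any zero-argument callable returning the current ETag,
    e.g. a wrapper around GET /knowledge-bases/{kb_id}.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch_etag() == expected:
            return True
        sleep(interval_s)
    return False
```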
Quickstart Walkthrough
The Quickstart Markdown Glob example demonstrates the full ingestion workflow:
- Clone example assets and ensure `OPENAI_API_KEY` is present in `.env`.
- Run `compozy knowledge apply` to register resources.
- Execute `compozy knowledge ingest --id quickstart_docs`.
- Launch `compozy run workflows/qa.yaml --input '{"question":"What is Compozy knowledge?"}'` to see retrieval in action.
Because ingestion is idempotent, you can edit Markdown files and repeat the step to refresh embeddings without truncating the vector store manually.
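Conceptually, the idempotency check amounts to comparing per-chunk content hashes against the previous run. This is a sketch of the general idea, not the actual persistence code; the function and the shape of the stored-hash map are assumptions.

```python
import hashlib

def plan_ingest(chunks, stored_hashes):
    """Split chunks into (to_embed, skipped) by comparing content hashes.

    chunks is an iterable of (chunk_id, text); stored_hashes maps
    chunk_id -> hex digest from the previous run. Unchanged chunks
    are skipped, mirroring the idempotent re-run behavior above.
    """
    to_embed, skipped = [], []
    for chunk_id, text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(chunk_id) == digest:
            skipped.append(chunk_id)  # content unchanged: no re-embedding
        else:
            to_embed.append((chunk_id, text, digest))
    return to_embed, skipped
```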