Knowledge Ingestion

Enumerate sources, chunk documents, and ingest content through the CLI or API. Ingestion is idempotent, so re-running it skips unchanged content.

Supported Source Types

Kind | Description | Example keys
markdown_glob | Expand local Markdown files via glob pattern | glob, optional ignore
url | Fetch and cache a small remote document (PDF, HTML, text, …) | url, optional etag_header
cloud_storage | (Planned) Enumerate objects from S3/GCS prefixes | provider, bucket, prefix, credentials via env
media_transcript | (Planned) Convert .srt/.vtt transcripts to documents | path, optional language metadata

Only markdown_glob and url ship in the MVP. Mention other kinds as planned features unless your deployment explicitly enables them.
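
To make the markdown_glob behavior concrete, here is a minimal Python sketch of expanding a glob and applying an optional ignore list. It is an illustration only, not Compozy's implementation: the function name expand_markdown_glob and the fnmatch-style matching of ignore patterns are assumptions.

```python
import fnmatch
from pathlib import Path


def expand_markdown_glob(root, pattern, ignore=()):
    """Return sorted relative paths under root that match pattern, minus ignored ones.

    Hypothetical helper: ignore patterns are matched fnmatch-style against the
    POSIX-style relative path, which is an assumption for this sketch.
    """
    root = Path(root)
    matches = []
    for path in root.glob(pattern):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # Drop any file that matches one of the ignore patterns.
        if any(fnmatch.fnmatch(rel, ig) for ig in ignore):
            continue
        matches.append(rel)
    # Sorting keeps enumeration order stable between runs.
    return sorted(matches)
```

Sorting the result is a deliberate choice here: a stable enumeration order makes repeated runs easier to compare.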

Chunking & Preprocessing

  • Strategy – token (default) balances embedding quality and throughput.
  • Size – Default from config.knowledge.chunk_size (a practical starting value is 512).
  • Overlap – Default from config.knowledge.chunk_overlap (a practical starting value is 64).
  • Normalization – Hashes are stored per chunk to make ingestion idempotent; unchanged content is skipped automatically.
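
The interaction between chunk size and overlap can be sketched as a sliding window over a token list. This is a simplified Python illustration, not the actual chunker: chunk_tokens is a hypothetical name, and the caller is assumed to have already tokenized the text (a whitespace split is a rough stand-in for a real tokenizer).

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split tokens into windows of at most `size`, each sharing `overlap` tokens
    with the previous window. Hypothetical helper for illustration."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the input.
        if start + size >= len(tokens):
            break
    return chunks
```

With the starting values above, each window advances by 512 - 64 = 448 tokens, so a 1,000-token document yields three chunks beginning at tokens 0, 448, and 896.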

Ingest Modes

  • manual (default) keeps ingestion as an operator-triggered task via CLI or API.
  • on_start instructs the server to run ingestion immediately after project indexing completes, for both project- and workflow-scoped knowledge bases. Startup fails fast if any pipeline returns an error, so issues surface before the API becomes available.
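
The fail-fast behavior of on_start can be pictured with a short Python sketch. Everything named here is hypothetical: the dictionary keys "id" and "ingest", the function run_startup_ingestion, and the ingest callable are placeholders for illustration, not Compozy's configuration schema or server code.

```python
def run_startup_ingestion(knowledge_bases, ingest):
    """Ingest every knowledge base marked on_start; abort on the first failure.

    `knowledge_bases` is a list of dicts with hypothetical "id" and "ingest" keys.
    `ingest` is a caller-supplied callable that raises on error.
    """
    ingested = []
    for kb in knowledge_bases:
        if kb.get("ingest") != "on_start":
            continue  # manual knowledge bases wait for an operator
        try:
            ingest(kb["id"])
        except Exception as exc:
            # Surface the failure before the API would start serving traffic.
            raise RuntimeError(f"startup ingestion failed for {kb['id']}") from exc
        ingested.append(kb["id"])
    return ingested
```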

Running Ingestion

# Apply resource definitions first
compozy knowledge apply --file compozy.yaml

# Trigger ingestion
compozy knowledge ingest --id quickstart_docs

# Re-run to pick up changes; unchanged chunks are skipped
compozy knowledge ingest --id quickstart_docs

Job Lifecycle

  1. Enumerate – Resolve sources, expand globs, and detect remote downloads. Skips files larger than 100KB by default.
  2. Chunk & Embed – Apply chunking policy, generate embeddings in batches (respecting config.knowledge.embedder_batch_size), and retry provider throttles with jitter.
  3. Persist & Commit – Write vectors and metadata to the configured store, hash results, and emit knowledge_ingest_duration_seconds and knowledge_chunks_total metrics.
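
The batching and retry behavior of the Chunk & Embed stage can be sketched in Python as follows. This is an assumption-laden illustration rather than the real pipeline: ThrottledError, embed_batch, and the exponential backoff with full jitter are stand-ins, since the source only states that throttles are retried with jitter.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error a provider client raises when it is rate limited."""


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def embed_with_retry(embed_batch, batch, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call embed_batch, retrying throttled calls with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return embed_batch(batch)
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Full jitter: wait a random time up to the exponential cap.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


def embed_all(embed_batch, chunks, batch_size, **retry_kwargs):
    """Embed all chunks in batches, preserving input order."""
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_with_retry(embed_batch, batch, **retry_kwargs))
    return vectors
```

Randomizing the delay spreads retries from concurrent workers so they do not all hit the provider again at the same moment.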

Monitoring Progress

  • CLI commands print structured logs (start, completion, error) sourced from logger.FromContext(ctx).
  • The API responds with an updated ETag for the knowledge base; poll GET /knowledge-bases/{kb_id} until the response returns the new value.
  • Observability metrics are detailed in the Knowledge Observability guide.
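
The polling pattern described above might look like the following Python sketch. The HTTP call is deliberately left to the caller: fetch_etag is a hypothetical callable that issues GET /knowledge-bases/{kb_id} and returns the ETag it sees, and wait_for_etag, the interval, and the timeout are illustrative choices rather than documented defaults.

```python
import time


def wait_for_etag(fetch_etag, expected_etag, interval=2.0, timeout=300.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_etag() until it returns expected_etag or the timeout elapses.

    fetch_etag is supplied by the caller, so this sketch assumes no
    particular HTTP client or response shape.
    """
    deadline = clock() + timeout
    while True:
        if fetch_etag() == expected_etag:
            return expected_etag
        if clock() >= deadline:
            raise TimeoutError(f"ETag {expected_etag!r} not observed within {timeout}s")
        sleep(interval)
```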

Quickstart Walkthrough

The Quickstart Markdown Glob example demonstrates the full ingestion workflow:

  1. Clone example assets and ensure OPENAI_API_KEY is present in .env.
  2. Run compozy knowledge apply to register resources.
  3. Execute compozy knowledge ingest --id quickstart_docs.
  4. Launch compozy run workflows/qa.yaml --input '{"question":"What is Compozy knowledge?"}' to see retrieval in action.

Because ingestion is idempotent, you can edit Markdown files and re-run the ingest command to refresh embeddings without manually truncating the vector store.
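
A rough Python sketch of how per-chunk hashes make a re-run skip unchanged content is shown below. The hash function and storage layout are assumptions: SHA-256 over the UTF-8 chunk text is used here for illustration, and plan_ingestion and stored_hashes are hypothetical names.

```python
import hashlib


def chunk_hash(text):
    """Content hash for a chunk (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingestion(chunks, stored_hashes):
    """Split chunks into those that need embedding and those that can be skipped."""
    seen = set(stored_hashes)
    to_embed, skipped = [], []
    for chunk in chunks:
        digest = chunk_hash(chunk)
        if digest in seen:
            skipped.append(chunk)  # identical content was already ingested
        else:
            to_embed.append(chunk)
            seen.add(digest)  # also de-duplicates within this run
    return to_embed, skipped
```

On a second run where only one chunk was edited, only that chunk lands in to_embed and the rest are skipped.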

Common Pitfalls