Document Ingestion Guide

How to create Knowledge Bases, configure ingestion settings, and get documents into Lancy.

Overview

Ingestion is the process of reading your source documents, splitting them into chunks, embedding each chunk into a vector representation, and writing those embeddings into a vector store. Once indexed, chunks are retrieved during queries using a combination of semantic search and BM25 keyword matching.

Lancy organises documents into Knowledge Bases (KBs). Each KB has its own embedding model, vector store, and ingestion settings — they are fully independent and can be switched at runtime without restarting the backend, provided they share the same embedding model.

Creating a Knowledge Base

Via the UI

Log in as admin and open the RAG Parameters panel (right side of the interface)
Click + next to the knowledge base selector
Enter a name, a source directory path (e.g. data/), and choose your embedding backend
Save — the KB is registered but not yet indexed
Click Re-index to run the first ingestion

Via the API

curl -X POST http://localhost:8080/api/v1/kb \
  -H "Content-Type: application/json" \
  -d '{
    "name": "My KB",
    "data_dirs": ["/absolute/path/to/docs"],
    "embedding_backend": "local",
    "embedding_model": "nomic-ai/nomic-embed-text-v1"
  }'

The response includes the generated id — keep it for subsequent API calls.

Note: If you are using the API to ingest files, the source directory path is irrelevant.

KB Settings Reference

These are set at KB creation and stored in knowledge_bases.json. Most can be edited in the RAG Parameters panel. Changing the embedding model on an existing KB requires a full re-index with reset=true — the vector store dimension is fixed at creation time.

Embedding

Setting	Default	Notes
`embedding_backend`	`local`	`local` / `ollama` / `litellm` / `custom`
`embedding_model`	`nomic-ai/nomic-embed-text-v1`	Must match installed weights for `local`; must be available on the configured server for others
`embedding_ollama_host`	(localhost:11434)	Only used when `embedding_backend=ollama`
`embedding_custom_base_url`	—	OpenAI-compatible base URL; used when `embedding_backend=custom`
`embedding_custom_api_key`	—	API key for the custom embedding endpoint
`nomic_prefix`	`true`	Prepend task prefix to queries and chunks — required for Nomic models, harmless for others
`embedding_batch_size`	`50`	Chunks per embedding call. Higher = faster but more VRAM. Reduce if you see OOM errors during ingestion.

Chunking

Setting	Default	Notes
`max_chunk_tokens`	`0`	Token ceiling per chunk. `0` = auto-size (each chunker picks a sensible default, typically 512–1024 tokens). Override only if retrieval quality suffers.
`max_file_size_mb`	`20`	Files exceeding this are skipped with a warning. Raise for large PDFs; lower to prevent accidental ingestion of binaries.
`pdf_ocr_enabled`	`true`	Enable OCR for scanned PDFs. Adds significant processing time per page. Disable for native-text PDFs to speed up ingestion.

Image support

Setting	Default	Notes
`image_indexing_enabled`	`false`	Extract and embed images found in documents. Requires a GPU for reasonable performance.
`image_retrieval_enabled`	`false`	Allow image similarity search in queries. Only meaningful if `image_indexing_enabled` is on.
`image_embedding_model`	`Qwen/Qwen3-VL-Embedding-2B`	Vision embedding model. Must be pre-downloaded before the backend starts.
`image_captioning_enabled`	`false`	Generate LLM captions for each image during ingestion. Captions are indexed alongside the image embedding, improving text-based retrieval of visual content. Uses the session LLM.

Vector store

Setting	Default	Notes
`vs_type`	`chromadb`	`chromadb` (embedded, no external service) or `pgvector` (external PostgreSQL)
`vs_connection_string`	—	PostgreSQL connection string; required when `vs_type=pgvector`. E.g. `postgresql://user:pass@host:5432/lancy`

Supported File Formats

Format	Extensions	Notes
PDF	`.pdf`	Full layout extraction via docling. OCR available for scanned pages (`pdf_ocr_enabled`). Image extraction available (`image_indexing_enabled`).
Excel	`.xlsx` `.xls`	Each sheet becomes separate chunks; row structure preserved.
Word	`.docx`	Processed via MarkItDown for layout preservation.
Text / Markdown	`.txt` `.md`	Markdown-aware splitting that respects heading hierarchy.
Images	`.png` `.jpg` `.jpeg` `.gif` `.tiff` `.bmp` `.webp`	Only ingested when `image_indexing_enabled=true`. Direct upload via UI or API only — the batch script skips image files.

Files with unsupported extensions are logged as warnings and skipped. Warnings appear in logs/backend.log and are also recorded in the ingest event history (see Troubleshooting).

Ingestion Methods

Method 1 — Folder scan (UI or API)

The simplest approach when documents are on the same machine as the backend or accessible to it (local disk, NFS mount, etc.). Set data_dirs to one or more absolute paths, then trigger indexing from the UI:

Incremental indexing — processes only new or changed files. Files already present in the vector store are skipped (deduplication via SHA-256 hash). Use this for routine updates.
Re-index — clears the entire vector store first, then re-embeds everything from scratch. Use this after changing the embedding model, or to recover from a corrupted store.

You can also trigger these actions via the API:

# Incremental — only re-embeds files not already in the vector store
curl -X POST http://localhost:8080/api/v1/rag/reindex \
  -H "Content-Type: application/json" \
  -d '{"reset": false}'

# Full reset — clears the vector store before re-embedding everything
curl -X POST http://localhost:8080/api/v1/rag/reindex \
  -H "Content-Type: application/json" \
  -d '{"reset": true}'

Method 2 — HTTP upload (no shared filesystem)

Use this when documents live on a different host than the backend (e.g. Profile 2 or 3 split deployments).

Batch upload — using the provided script:

scripts/upload-docs.sh http://<backend-host>:8080 <kb-id> /path/to/docs/

upload-docs.sh is a working reference implementation rather than a polished tool. It recursively finds all supported files in the given directory, uploads them one at a time via the document upload API, and polls the reindex-status endpoint after each file before sending the next. It waits up to 40 minutes per file and tolerates up to 10 minutes of backend unavailability before giving up. For production use, you may want to adapt it — for example, to parallelise uploads, handle authentication headers, or integrate with a document management system. The script is intentionally simple enough to read and modify.

Single file:

curl -X POST http://<backend-host>:8080/api/v1/kb/<kb-id>/documents \
  -F "file=@/path/to/document.pdf" \
  -F 'metadata={"document_id": "my-doc", "source_file": "document.pdf"}'

document_id is required and must be unique within the KB — the backend will reject duplicates. Use the filename stem as a safe default. The source_file field is stored as metadata and surfaced in retrieval results so users can trace a chunk back to its origin.

For the full ingestion API reference including request/response schemas, see 03-API-endpoints.md.

Monitoring Ingestion

Poll the status endpoint to track progress:

curl http://localhost:8080/api/v1/rag/reindex-status

Key fields in the response:

Field	Meaning
`indexing`	`true` while a job is running
`phase`	Current step: `loading` → `chunking` → `embedding` → `captioning`
`current_file`	Filename being processed right now
`file_index` / `total_files`	Progress through the file list
`chunks_so_far`	Cumulative chunks produced in this run
`embed_batch` / `embed_total_batches`	Embedding progress within the current file
`last_result.files_skipped_store`	Files skipped because they were already in the vector store (dedup)
`last_result.files_skipped_batch`	Files skipped because duplicate content was detected within this run
`queued`	Files in the upload queue waiting to start

The UI's ingestion panel reads this endpoint live and displays the same information.

To cancel a running job:

curl -X POST http://localhost:8080/api/v1/rag/reindex-cancel

Cancellation is cooperative — the backend finishes the current chunk batch before stopping.

Troubleshooting

Check the backend log for per-file warnings and errors:

tail -f logs/backend.log

Check ingest history — the backend records the outcome of every ingestion run:

curl http://localhost:8080/api/admin/ingest-events

Common issues:

Symptom	Likely cause
File skipped with no error	Exceeds `max_file_size_mb`, or unsupported format
Embedding OOM during ingestion	Reduce `embedding_batch_size` (e.g. to 10–20)
OCR very slow	Expected for scanned PDFs — set `pdf_ocr_enabled=false` if pages have native text
Re-index doesn't pick up changed files	File content unchanged (same hash) — modify the file or use `reset=true`
`reset=true` required after changing model	Embedding dimensions are fixed at collection creation; changing the model requires clearing the store