Add RAG knowledge bases

Attach documents, web content, and data sources to an agent app so it can answer questions grounded in your content.

A knowledge base is a collection of documents that Helix indexes and searches at query time. When the agent receives a message, Helix finds the most relevant chunks and injects them into the context. This is Retrieval-Augmented Generation (RAG).
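
The retrieve-then-inject flow described above can be sketched in a few lines of Python. This is a minimal illustration, not how Helix implements it: the bag-of-words cosine similarity stands in for a real embedding model, and the names `tokenize`, `cosine`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=4):
    """Return the k chunks most similar to the query, best first."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks, k=4):
    """Inject the retrieved chunks into the context ahead of the user's question."""
    context = "\n\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


chunks = [
    "Items may be returned within 30 days with a receipt for a full refund.",
    "Our support line is open on weekdays from 9am to 5pm.",
    "Shipping is free on orders over 50 dollars.",
]
top = retrieve("Can items be returned for a refund?", chunks, k=1)
```

Here `top` contains only the return-policy chunk, because it is the only one that shares words with the question. A production system would compare embedding vectors rather than word counts, but the ranking and injection steps have the same shape.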

This guide covers knowledge bases for agent apps (the chatbot builder). For indexing codebases used by spec tasks, see Build an internal knowledge base.

Add a knowledge source

In your agent app, open the Knowledge tab and click Add Knowledge. Give it a name and description — the description helps the agent decide when to query this source versus another.

Each agent app can have multiple knowledge bases. The agent queries each one independently and combines the results.

Source types

Web URLs

Index one or more public URLs. Helix fetches and chunks the page content.

source:
  web:
    urls:
      - https://docs.example.com/
      - https://blog.example.com/release-notes

Enable Crawler to follow links recursively:

source:
  web:
    urls:
      - https://docs.example.com/
    crawler:
      enabled: true
      max_depth: 3        # link depth from the seed URL
      max_pages: 200      # hard cap on pages fetched
      readability: true   # strip nav/footer noise, keep article body
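
To see how a depth limit and a page cap interact, here is a rough Python sketch of a bounded crawl over an in-memory link graph. It assumes a breadth-first traversal in which the seed URL sits at depth 0; the actual traversal order and depth counting used by Helix are not stated in this guide, and `crawl` and `site` are hypothetical names.

```python
from collections import deque


def crawl(seed, links, max_depth=3, max_pages=200):
    """Breadth-first crawl over `links` (page -> list of linked pages).

    Links are not followed beyond max_depth hops from the seed, and no
    more than max_pages pages are ever returned.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    fetched = []
    while queue and len(fetched) < max_pages:
        page, depth = queue.popleft()
        fetched.append(page)  # a real crawler would fetch and chunk the page here
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return fetched


site = {
    "/": ["/guide", "/api"],
    "/guide": ["/guide/install"],
    "/guide/install": ["/guide/install/linux"],
}
pages = crawl("/", site, max_depth=2, max_pages=200)
```

With `max_depth=2` the crawl reaches `/guide/install` but not `/guide/install/linux`, which is three links away from the seed. Lowering `max_pages` cuts the crawl short regardless of depth.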

For password-protected sites:

source:
  web:
    urls:
      - https://intranet.example.com/
    auth:
      username: ${INTRANET_USER}
      password: ${INTRANET_PASS}

File uploads

Drag and drop files into the Knowledge tab, or browse to select them. Supported formats: PDF, DOCX, PPTX, XLSX, CSV, Markdown, and plain text.

Uploaded files go into the agent's Helix Drive storage and are indexed immediately.

Helix Drive path

Reference a folder already in Helix Drive (the built-in file storage):

source:
  helix_drive:
    path: /my-team/product-docs

S3

source:
  s3:
    bucket: my-company-docs
    path: knowledge-base/         # prefix, optional

Helix uses the AWS credentials configured in your installation. For Helix Cloud, contact support to configure S3 access.

Google Cloud Storage

source:
  gcs:
    bucket: my-company-docs
    path: knowledge-base/

SharePoint

source:
  sharepoint:
    site_id: <sharepoint-site-id>
    drive_id: <drive-id>          # optional; defaults to the default drive
    folder_path: /Documentation   # optional
    oauth_provider_id: <your-sharepoint-oauth-provider>
    filter_extensions:
      - .pdf
      - .docx
    recursive: true

SharePoint requires an OAuth provider configured under Organisation → OAuth Connections.

Inline text

For small, stable content that doesn't need an external source:

source:
  text: |
    Our return policy: items may be returned within 30 days
    with a receipt for a full refund...

Vision RAG

By default, Helix indexes the text content of documents. Enable Vision to also index images and scanned PDFs using a multimodal embedding model:

rag_settings:
  enable_vision: true

With vision enabled, the agent can answer questions about diagrams, screenshots, charts, and scanned pages that contain no machine-readable text.

Vision indexing is slower and consumes more tokens than text-only indexing, so enable it only when your documents contain meaningful visual content.

Refresh schedule

Keep the index current with a cron schedule:

refresh_enabled: true
refresh_schedule: "0 */6 * * *"   # every 6 hours

The schedule uses standard cron syntax. The agent serves the previous index while a refresh runs, so there is no downtime.
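
As a quick check on what a schedule such as `0 */6 * * *` means, the following Python sketch expands the minute and hour fields into the times of day at which a refresh would fire. It is a simplified illustration that handles only `*`, `*/n`, and single numbers; `expand_field` is a hypothetical helper, and the time zone Helix applies to the schedule is not covered here.

```python
def expand_field(field, low, high):
    """Expand one cron field ('*', '*/n', or a single number) into the values it matches."""
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        return list(range(low, high + 1, int(field[2:])))
    return [int(field)]


# The first two cron fields are minute and hour; the rest are ignored in this sketch.
minute, hour, *_ = "0 */6 * * *".split()
run_times = [
    f"{h:02d}:{m:02d}"
    for h in expand_field(hour, 0, 23)
    for m in expand_field(minute, 0, 59)
]
```

For this schedule, `run_times` is `["00:00", "06:00", "12:00", "18:00"]`: minute 0 of every sixth hour, starting at midnight.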

Tuning RAG retrieval

Under Advanced Settings in the Knowledge editor:

| Setting | Default | Effect |
| --- | --- | --- |
| Results count | 4 | Number of chunks retrieved per query. More chunks = more context but higher cost. |
| Chunk size | 1024 | Maximum tokens per chunk when splitting documents. Smaller = more precise retrieval; larger = more context per result. |
| Chunk overflow | 64 | Token overlap between adjacent chunks, to avoid splitting mid-sentence. |

Start with the defaults. If the agent's answers miss relevant context, increase the results count; if they are too verbose, reduce the chunk size.
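
The relationship between chunk size and overlap is easiest to see with a small sliding-window sketch. The Python below is an illustration only: it treats a list of strings as tokens, whereas Helix's actual tokenizer and splitting rules are not described in this guide, and `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens, chunk_size=1024, overlap=64):
    """Split tokens into windows of at most chunk_size tokens.

    Each window after the first starts `overlap` tokens before the end
    of the previous one, so neighbouring chunks share that many tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
```

Ten tokens with a chunk size of 4 and an overlap of 1 produce three chunks, and the last token of each chunk is repeated as the first token of the next. A smaller chunk size yields more, narrower chunks; a larger overlap repeats more text across chunk boundaries.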

Monitoring index state

Each knowledge source shows its current state:

  • Indexing — currently being processed
  • Ready — indexed and searchable
  • Error — indexing failed (check the message for details)

You can trigger a manual re-index at any time with the Re-index button.

Multiple knowledge bases

One agent can have multiple knowledge bases. During a conversation, Helix queries all of them and combines the results. Use separate knowledge bases when you have logically distinct content that the agent should search independently — for example, a product manual and a customer FAQ.
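
One plausible way to combine results from several knowledge bases is to pool the scored hits and keep the best few. The Python sketch below assumes that scores are comparable across knowledge bases and that a single global top-k is taken; this guide does not say how Helix actually merges results, and `combine_results` and the sample data are hypothetical.

```python
def combine_results(results_by_kb, k=4):
    """Merge (score, chunk) hits from several knowledge bases and keep the k best."""
    merged = [
        (score, kb, chunk)
        for kb, hits in results_by_kb.items()
        for score, chunk in hits
    ]
    # Highest score first; the sort is stable, so ties keep their input order.
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:k]


results = {
    "product-manual": [(0.91, "Hold the reset button for five seconds."), (0.40, "Box contents.")],
    "customer-faq": [(0.78, "A reset does not erase saved settings.")],
}
best = combine_results(results, k=2)
```

In this example `best` holds one hit from each knowledge base, which is the situation separate knowledge bases are meant for: the manual supplies the procedure and the FAQ supplies the reassurance.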