Text Chunker

The Chunker interface and built-in strategies for splitting documents into embeddable chunks.

The chunker splits raw document text into smaller chunks, each returned as a ChunkResult, before embedding. Weave ships with five built-in strategies and a pluggable Chunker interface.

Chunker interface

// Package chunker
type Chunker interface {
    Chunk(ctx context.Context, text string, opts *Options) ([]ChunkResult, error)
}

type Options struct {
    ChunkSize    int    // target chunk size in tokens
    ChunkOverlap int    // overlapping tokens between adjacent chunks
    Strategy     string // "recursive" | "fixed" | "sliding" | "semantic" | "code"
}

Register a custom chunker with engine.WithChunker(myChunker). If none is provided, Weave uses the recursive chunker by default.

ChunkResult

type ChunkResult struct {
    Content     string            // text content of the chunk
    Index       int               // zero-based position in the document
    StartOffset int               // byte offset of chunk start in original text
    EndOffset   int               // byte offset of chunk end in original text
    TokenCount  int               // estimated number of tokens
    Metadata    map[string]string // chunker-specific metadata
}

Built-in strategies

Strategy	Package	Description
`recursive`	`chunker/recursive`	Default. Splits on paragraph, sentence, then word boundaries to stay within `ChunkSize`
`fixed`	`chunker/fixed`	Splits at exact token boundaries with `ChunkOverlap` tokens of context overlap
`sliding`	`chunker/sliding`	Sliding window over the text — every chunk advances by `ChunkSize - ChunkOverlap` tokens
`semantic`	`chunker/semantic`	Groups sentences by semantic similarity — chunks are topic-coherent
`code`	`chunker/code`	Splits on function / class / block boundaries — preserves code structure
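The stride arithmetic behind the `sliding` strategy can be illustrated with a short sketch. This is not Weave's implementation: the function name slidingWindows is hypothetical, and tokens are approximated as whitespace-separated words purely for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// slidingWindows returns windows of at most size tokens, where each window
// starts size-overlap tokens after the previous one. Tokens are approximated
// as whitespace-separated words (an assumption made for this sketch).
func slidingWindows(text string, size, overlap int) ([]string, error) {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil, fmt.Errorf("invalid window: size=%d overlap=%d", size, overlap)
	}
	words := strings.Fields(text)
	step := size - overlap // how far each window advances
	var out []string
	for start := 0; start < len(words); start += step {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		out = append(out, strings.Join(words[start:end], " "))
		if end == len(words) {
			break // the last window already reached the end of the text
		}
	}
	return out, nil
}

func main() {
	chunks, err := slidingWindows("a b c d e f g", 4, 2)
	if err != nil {
		panic(err)
	}
	for i, c := range chunks {
		fmt.Printf("%d: %s\n", i, c)
	}
}
```

With a size of 4 and an overlap of 2 the stride is 2, so adjacent windows share two tokens. The sketch rejects an overlap that is not smaller than the size, because the stride would be zero or negative and the loop would never advance.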

Per-collection configuration

Chunk settings are stored on the collection and applied automatically during ingestion:

col := &collection.Collection{
    ChunkStrategy: "fixed",
    ChunkSize:     1024, // tokens
    ChunkOverlap:  100,  // tokens
}

When ChunkSize or ChunkOverlap is zero, the engine's DefaultChunkSize (512) or DefaultChunkOverlap (50) applies to that field.
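The zero-value fallback can be sketched as follows. The type chunkSettings and the function resolve are hypothetical names for this illustration, not Weave identifiers; only the default values 512 and 50 come from the text above.

```go
package main

import "fmt"

// Defaults as documented above; the constant names are local to this sketch.
const (
	defaultChunkSize    = 512
	defaultChunkOverlap = 50
)

// chunkSettings is a hypothetical holder for the effective values.
type chunkSettings struct {
	Size    int
	Overlap int
}

// resolve replaces zero-valued fields with the engine defaults.
func resolve(size, overlap int) chunkSettings {
	if size == 0 {
		size = defaultChunkSize
	}
	if overlap == 0 {
		overlap = defaultChunkOverlap
	}
	return chunkSettings{Size: size, Overlap: overlap}
}

func main() {
	fmt.Println(resolve(0, 0))      // both fields fall back to the defaults
	fmt.Println(resolve(1024, 100)) // explicit values are kept
}
```

Under this reading, an explicit overlap of 0 cannot be told apart from an unset field and would also resolve to the default.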

Custom chunker

Implement chunker.Chunker to split text with your own logic:

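type MyChunker struct{}

// Chunk emits one ChunkResult per sentence. splitSentences and estimateTokens
// are helpers defined in your own package.
func (c *MyChunker) Chunk(_ context.Context, text string, opts *chunker.Options) ([]chunker.ChunkResult, error) {
    sentences := splitSentences(text)
    results := make([]chunker.ChunkResult, 0, len(sentences))
    for i, sentence := range sentences {
        results = append(results, chunker.ChunkResult{
            Content:    sentence,
            Index:      i, // zero-based position in the document
            TokenCount: estimateTokens(sentence),
        })
    }
    // This minimal example ignores opts and leaves StartOffset, EndOffset,
    // and Metadata unset.
    return results, nil
}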
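The custom chunker relies on two helpers, splitSentences and estimateTokens, that the example does not define. One possible shape for them is sketched below as an assumption. The variant here, splitSentenceSpans, returns byte ranges rather than strings so that StartOffset and EndOffset could be filled in as well. The type sentenceSpan and both function bodies are hypothetical, and the sentence rule (a '.', '!' or '?' followed by whitespace or the end of the text) is deliberately naive.

```go
package main

import "fmt"

// sentenceSpan is a hypothetical helper type: the byte range [Start, End)
// of one sentence within the original text.
type sentenceSpan struct {
	Start, End int
}

// isSpace reports whether b is a space, tab, or newline.
func isSpace(b byte) bool {
	return b == ' ' || b == '\t' || b == '\n'
}

// splitSentenceSpans ends a sentence at '.', '!' or '?' when the terminator
// is followed by whitespace or the end of the text. Scanning bytes is safe
// for UTF-8 input because these ASCII bytes never occur inside a multi-byte
// sequence.
func splitSentenceSpans(text string) []sentenceSpan {
	var spans []sentenceSpan
	start := 0
	for i := 0; i < len(text); i++ {
		c := text[i]
		if c != '.' && c != '!' && c != '?' {
			continue
		}
		if i+1 < len(text) && !isSpace(text[i+1]) {
			continue // e.g. the dot in "3.14" or "v1.2"
		}
		spans = append(spans, sentenceSpan{Start: start, End: i + 1})
		// Skip the whitespace between sentences.
		start = i + 1
		for start < len(text) && isSpace(text[start]) {
			start++
		}
		i = start - 1
	}
	if start < len(text) {
		spans = append(spans, sentenceSpan{Start: start, End: len(text)})
	}
	return spans
}

// estimateTokens applies the rough "1 token per 4 characters" rule,
// rounding up and counting bytes.
func estimateTokens(s string) int {
	return (len(s) + 3) / 4
}

func main() {
	text := "Hello world. How are you? Fine"
	for i, sp := range splitSentenceSpans(text) {
		sentence := text[sp.Start:sp.End]
		fmt.Printf("%d [%d,%d) tokens=%d %q\n", i, sp.Start, sp.End, estimateTokens(sentence), sentence)
	}
}
```

Inside Chunk, each span would map to a ChunkResult with Content set to text[sp.Start:sp.End], StartOffset to sp.Start, and EndOffset to sp.End.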

Token counting

Weave's built-in chunkers estimate token counts using a simple character-based heuristic (≈ 1 token per 4 characters). For precise token counts, implement assembler.TokenCounter:

type TokenCounter interface {
    Count(text string) int
}

Pass it to the assembler via assembler.WithTokenCounter(myCounter) for accurate token budgeting during context assembly.
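A minimal TokenCounter might look like the sketch below. The interface is redeclared locally so the snippet compiles on its own; in a real program the type would satisfy assembler.TokenCounter instead. The type runeCounter is a hypothetical name, and its four-characters-per-token rule is only a placeholder for a real tokenizer.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// TokenCounter mirrors the interface shown above so this sketch is
// self-contained.
type TokenCounter interface {
	Count(text string) int
}

// runeCounter is a hypothetical counter that estimates one token per four
// characters, counting runes rather than bytes and rounding up.
type runeCounter struct{}

func (runeCounter) Count(text string) int {
	n := utf8.RuneCountInString(text)
	return (n + 3) / 4
}

// Compile-time check that runeCounter satisfies the interface.
var _ TokenCounter = runeCounter{}

func main() {
	var tc TokenCounter = runeCounter{}
	fmt.Println(tc.Count("abcdefgh")) // 8 characters
	fmt.Println(tc.Count("héllo"))    // 5 characters, 6 bytes
}
```

For accurate budgeting, the body of Count would call the tokenizer that matches the target model; the interface stays the same.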