Biscuit is a specialized PostgreSQL index for fast pattern-matching LIKE queries.

Original link: https://github.com/CrystallineCore/Biscuit

## Biscuit: A Fast PostgreSQL Index for LIKE Queries

Biscuit is a new PostgreSQL index access method (IAM) designed to dramatically speed up pattern matching with `LIKE` and `ILIKE` queries, especially wildcard-heavy ones. By building on bitmap index techniques, it avoids the costly recheck overhead of traditional trigram indexes such as `pg_trgm`.

**Key features:**

* **Speed:** Significant performance gains on wildcard-heavy searches.
* **Multi-column support:** Native support for searching across multiple columns, with query order automatically optimized by selectivity.
* **Automatic optimization:** 12 performance optimizations, including wildcard skipping, early termination, and efficient bitmap handling.
* **Build introspection:** SQL functions to inspect the build configuration, CRoaring support (used for extra performance), and overall status.
* **Data type support:** Works with text, numeric, date/time, and boolean types.

**How it works:** Biscuit builds forward and backward character-position bitmaps, enabling fast candidate filtering based on the pattern.

**Installation:** Requires `gcc`, `make`, and `pg_config`; the CRoaring library can optionally be used for further speed. Installation consists of cloning the repository, building with `make`, and enabling the extension in PostgreSQL with `CREATE EXTENSION biscuit;`.

**Use cases:** Biscuit shines under frequent, complex `LIKE` queries, particularly in applications such as e-commerce search, log analysis, and CRM systems.

## Biscuit: A New PostgreSQL Index for LIKE Queries

A new PostgreSQL index named "Biscuit" aims to accelerate `LIKE` pattern-matching queries, particularly where traditional full-text search (tsvector) would be too heavyweight. Posted on Hacker News, Biscuit targets keyword-search-like scenarios in which the stemming done by full-text search can be counterproductive. Commenters noted its similarity to trigram indexes, which work well for `ILIKE '%something%'` searches, but suggested that a comparison of index sizes would be valuable. The index uses Roaring bitmaps, as highlighted by Daniel Lemire, for efficient storage and retrieval. Where exact keyword matching matters more than natural-language processing, Biscuit offers a potential alternative to full-text search that may improve performance for specific use cases. The project is available on GitHub: [github.com/crystallinecore](github.com/crystallinecore).

License: MIT · PostgreSQL: 12+

Biscuit is a specialized PostgreSQL index access method (IAM) designed for blazing-fast pattern matching on LIKE and ILIKE queries, with native support for multi-column searches. It eliminates the recheck overhead of trigram indexes while delivering significant performance improvements on wildcard-heavy queries. It stands for Bitmap Indexed Searching with Comprehensive Union and Intersection Techniques.


🛠️ Build & Packaging

  • Improved Makefile detection logic for CRoaring bitmap support by checking multiple common installation paths, increasing portability across systems and build environments.

Build and configuration introspection

Added SQL functions to inspect Biscuit build-time configuration, useful for debugging, reproducibility, and deployment verification.

  • biscuit_version() → text

Returns the Biscuit extension version string.

  • biscuit_build_info() → table

Returns detailed build-time configuration information.

  • biscuit_build_info_json() → text

Returns build configuration as a JSON string for automation and scripting.

Roaring Bitmap support introspection

Added built-in SQL functions to inspect CRoaring bitmap support in Biscuit.

  • biscuit_has_roaring() → boolean

Checks whether the extension was compiled with CRoaring bitmap support.

  • biscuit_roaring_version() → text

Returns the CRoaring library version if available.

Added a built-in diagnostic view for quick inspection of Biscuit status and configuration.

  • biscuit_status
    A single-row view providing an overview of:
    • extension version
    • CRoaring enablement
    • bitmap backend in use
    • total number of Biscuit indexes
    • combined on-disk index size

  • Build tools: gcc, make, pg_config
  • Optional: CRoaring library for enhanced performance
# Clone repository
git clone https://github.com/Crystallinecore/biscuit.git
cd biscuit

# Build and install
make
sudo make install

# Enable in PostgreSQL
psql -d your_database -c "CREATE EXTENSION biscuit;"
# Or install from PGXN
pgxn install biscuit
psql -d your_database -c "CREATE EXTENSION biscuit;"

-- Create a Biscuit index
CREATE INDEX idx_users_name ON users USING biscuit(name);

-- Query with wildcard patterns
SELECT * FROM users WHERE name LIKE '%john%';
SELECT * FROM users WHERE name NOT LIKE 'a%b%c';
SELECT COUNT(*) FROM users WHERE name LIKE '%test%';
-- Create multi-column index
CREATE INDEX idx_products_search 
ON products USING biscuit(name, description, category);

-- Multi-column query (optimized automatically)
SELECT * FROM products 
WHERE name LIKE '%widget%' 
  AND description LIKE '%blue%'
  AND category LIKE 'electronics%'
ORDER BY score DESC 
LIMIT 10;

Biscuit automatically converts various types to searchable text:

-- Text types (native)
CREATE INDEX ON logs USING biscuit(message);

-- Numeric types (converted to sortable strings)
CREATE INDEX ON events USING biscuit(user_id, event_code);

-- Date/Time types (converted to sortable timestamps)
CREATE INDEX ON orders USING biscuit(order_date, customer_name);

-- Boolean (converted to 't'/'f')
CREATE INDEX ON flags USING biscuit(is_active, status);

Core Concept: Bitmap Position Indices

Biscuit builds two types of character-position bitmaps for every string:

1. Positive Indices (Forward)

Tracks which records have character c at position p:

String: "Hello"
Bitmaps:
  H@0 → {record_ids...}
  e@1 → {record_ids...}
  l@2 → {record_ids...}
  l@3 → {record_ids...}
  o@4 → {record_ids...}

2. Negative Indices (Backward)

Tracks which records have character c at position -p from the end:

String: "Hello"
Bitmaps:
  o@-1 → {record_ids...}  (last char)
  l@-2 → {record_ids...}  (second to last)
  l@-3 → {record_ids...}
  e@-4 → {record_ids...}
  H@-5 → {record_ids...}

3. Positive Indices (Case-insensitive)

Tracks which records have case-folded character c at position p:

String: "Hello"
Bitmaps:
  h@0 → {record_ids...}
  e@1 → {record_ids...}
  l@2 → {record_ids...}
  l@3 → {record_ids...}
  o@4 → {record_ids...}

4. Negative Indices (Case-insensitive)

Tracks which records have case-folded character c at position -p from the end:

String: "Hello"
Bitmaps:
  o@-1 → {record_ids...}  (last char)
  l@-2 → {record_ids...}  (second to last)
  l@-3 → {record_ids...}
  e@-4 → {record_ids...}
  h@-5 → {record_ids...}

Length Bitmaps

Two types for fast length filtering:

  • Exact length: length[5] → all 5-character strings
  • Minimum length: length_ge[3] → all strings ≥ 3 characters
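To make the structures above concrete, here is a minimal sketch (in Python, not Biscuit's actual C implementation) that builds the forward bitmaps, backward bitmaps, and exact-length bitmaps for a tiny corpus, using plain sets of record ids as stand-ins for roaring bitmaps:

```python
# Illustrative only: sets of record ids play the role of roaring bitmaps.
from collections import defaultdict

def build_index(strings):
    pos = defaultdict(set)     # (char, p)  -> ids with char at forward position p
    neg = defaultdict(set)     # (char, -p) -> ids with char at position p from the end
    length = defaultdict(set)  # exact length -> ids
    for rid, s in enumerate(strings):
        for p, c in enumerate(s):
            pos[(c, p)].add(rid)
            neg[(c, p - len(s))].add(rid)  # the last character lands at -1
        length[len(s)].add(rid)
    return pos, neg, length

pos, neg, length = build_index(["Hello", "world", "Help"])
# "Hello" and "Help" share H@0 and l@2, so pos[('l', 2)] == {0, 2}
```

A lookup such as `pos[('l', 2)]` then answers "which records have `l` at position 2" with a single set, which is the operation the matching algorithm below intersects repeatedly.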

Pattern Matching Algorithm

Step 1: Parse pattern into parts

Pattern: 'abc%def'
Parts: ["abc", "def"]
Starts with %: NO
Ends with %: NO

Step 2: Match first part as prefix

-- "abc" must start at position 0
Candidates = pos[a@0] ∩ pos[b@1] ∩ pos[c@2]

Step 3: Match last part at end (negative indexing)

-- "def" must end at string end
Candidates = Candidates ∩ neg[f@-1] ∩ neg[e@-2] ∩ neg[d@-3]

Step 4: Apply length constraint

-- String must be at least 6 chars (abc + def)
Candidates = Candidates ∩ length_ge[6]

Result: Exact matches, zero false positives
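The four steps can be sketched end to end in a few lines. This is a self-contained Python toy (sets instead of roaring bitmaps; `build`, `match_anchored`, and the hard-coded parts `["abc", "def"]` are illustrative names, not Biscuit's API) for the anchored pattern `'abc%def'`:

```python
# Toy model of the matching steps above; not Biscuit's C implementation.
from collections import defaultdict

def build(strings):
    pos, neg, length_ge = defaultdict(set), defaultdict(set), defaultdict(set)
    for rid, s in enumerate(strings):
        for p, c in enumerate(s):
            pos[(c, p)].add(rid)
            neg[(c, p - len(s))].add(rid)
        for n in range(len(s) + 1):          # length_ge[n]: ids of strings >= n chars
            length_ge[n].add(rid)
    return pos, neg, length_ge

def match_anchored(pos, neg, length_ge, universe, prefix, suffix):
    cand = set(universe)
    for p, c in enumerate(prefix):            # step 2: prefix pinned at position 0
        cand &= pos[(c, p)]
    for i, c in enumerate(reversed(suffix)):  # step 3: suffix via negative indices
        cand &= neg[(c, -(i + 1))]
    return cand & length_ge[len(prefix) + len(suffix)]  # step 4: length constraint

rows = ["abcdef", "abcXdef", "abcdefX", "zabcdef"]
hits = match_anchored(*build(rows), range(len(rows)), "abc", "def")
# hits == {0, 1}: exactly the rows matching LIKE 'abc%def'
```

Every operation is a set (bitmap) intersection, which is why no per-row recheck against the heap is needed.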


1. Pure Bitmap Operations

// Traditional approach (pg_trgm)
for each trigram in pattern:
    candidates = scan_trigram_index(trigram)
    for each candidate:
        if !heap_fetch_and_recheck(candidate):  // SLOW: Random I/O
            remove candidate

// Biscuit approach
for each character at position:
    candidates &= bitmap[char][pos]  // FAST: In-memory AND
// No recheck needed!

2. Compressed Bitmap Representation

  • Sparse data: array of integers
  • Dense data: bitset
  • Automatic conversion for optimal memory
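The sparse/dense trade-off can be illustrated with a tiny adaptive container. The class, its threshold, and the conversion rule below are a hedged sketch of the roaring-style idea, not CRoaring's actual containers (CRoaring works on 16-bit chunks with several container types):

```python
# Illustrative adaptive bitmap: array-of-integers while sparse, bitset once dense.
class AdaptiveBitmap:
    THRESHOLD = 4096  # roaring-style cutoff; illustrative here

    def __init__(self):
        self.sparse = set()   # sparse representation: a set of integers
        self.dense = None     # dense representation: a Python int used as a bitset

    def add(self, n):
        if self.dense is not None:
            self.dense |= 1 << n
            return
        self.sparse.add(n)
        if len(self.sparse) > self.THRESHOLD:  # convert when density pays off
            self.dense = 0
            for v in self.sparse:
                self.dense |= 1 << v
            self.sparse = None

    def __contains__(self, n):
        if self.dense is not None:
            return bool(self.dense >> n & 1)
        return n in self.sparse

b = AdaptiveBitmap()
for i in range(0, 10_000, 2):  # 5,000 members: enough to trigger conversion
    b.add(i)
```

Small posting lists stay as cheap integer arrays; once a list grows past the threshold, membership tests and AND/OR operations become flat bit operations.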

3. Negative Indexing Optimization

-- Pattern: '%xyz'
-- Traditional: Scan all strings, check suffix
-- Biscuit: Direct lookup in neg[z@-1] ∩ neg[y@-2] ∩ neg[x@-3]

12 Performance Optimizations

1. Skip Wildcard Intersections

// Pattern: "a_c" (underscore = any char)
// OLD: Intersect all 256 chars at position 1
// NEW: Skip position 1 entirely, only check a@0 and c@2

2. Early Termination on Empty

result = bitmap[a][0];
result &= bitmap[b][1];
if (result.empty()) return empty;  // Don't process remaining chars

3. Avoid Redundant Bitmap Copies

// OLD: Copy bitmap for every operation
// NEW: Operate in-place, copy only when branching

4. Optimized Single-Part Patterns

Fast paths for common cases:

  • Exact: 'abc' → Check position 0-2 and length = 3
  • Prefix: 'abc%' → Check position 0-2 and length ≥ 3
  • Suffix: '%xyz' → Check negative positions -3 to -1
  • Substring: '%abc%' → Check all positions, OR results

5. Skip Unnecessary Length Operations

// Pure wildcard patterns
if (pattern == "%%%___%%")  // 3 underscores
    return length_ge[3];     // No character checks needed!

6. TID Sorting for Sequential Heap Access

// Sort TIDs by (block_number, offset) before returning
// Converts random I/O into sequential I/O
// Uses radix sort for >5000 TIDs, quicksort for smaller sets

7. Batch TID Insertion

// For bitmap scans, insert TIDs in chunks
for (i = 0; i < num_results; i += 10000) {
    int batch_size = Min(10000, num_results - i);
    tbm_add_tuples(tbm, &tids[i], batch_size, false);
}

8. Direct Roaring Iteration

// OLD: Convert bitmap to array, then iterate
// NEW: Direct iterator, no intermediate allocation
roaring_uint32_iterator_t *iter = roaring_create_iterator(bitmap);
while (iter->has_value) {
    process(iter->current_value);
    roaring_advance_uint32_iterator(iter);
}

9. Batch Cleanup on Threshold

// After 1000 deletes, clean tombstones from all bitmaps
if (tombstone_count >= 1000) {
    for each bitmap:
        bitmap &= ~tombstones;  // Batch operation
    tombstones.clear();
}

10. Aggregate Query Detection

// COUNT(*), EXISTS, etc. don't need sorted TIDs
if (!scan->xs_want_itup) {
    skip_sorting = true;  // Save sorting time
}

11. LIMIT-Aware TID Collection

// If LIMIT 10 in query, don't collect more than needed
if (limit_hint > 0 && collected >= limit_hint)
    break;  // Early termination

12. Multi-Column Query Optimization

Analyzes each column's pattern and executes in order of selectivity:

-- Query:
WHERE name LIKE '%common%'           -- Low selectivity
  AND sku LIKE 'PROD-2024-%'         -- High selectivity (prefix)
  AND description LIKE '%rare_word%' -- Medium selectivity

-- Execution order (Biscuit automatically reorders):
1. sku LIKE 'PROD-2024-%'         (PREFIX, priority=20, selectivity=0.02)
2. description LIKE '%rare_word%' (SUBSTRING, priority=35, selectivity=0.15)
3. name LIKE '%common%'           (SUBSTRING, priority=55, selectivity=0.60)

Selectivity scoring formula:

score = 1.0 / (concrete_chars + 1)
      - (underscore_count × 0.05)
      + (partition_count × 0.15)
      - (anchor_strength / 200)

Priority tiers:

  1. 0-10: Exact matches, many underscores
  2. 10-20: Non-% patterns with underscores
  3. 20-30: Strong anchored patterns (prefix/suffix)
  4. 30-40: Weak anchored patterns
  5. 40-50: Multi-partition patterns
  6. 50-60: Substring patterns (lowest priority)
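The scoring formula can be applied directly to the three example predicates. The sketch below implements the formula as written; how Biscuit actually counts `partition_count` and scales `anchor_strength` is not spelled out here, so those two definitions are assumptions for illustration (lower score = more selective = run first):

```python
# Illustrative scoring: concrete_chars and underscore_count follow LIKE syntax;
# partition_count and anchor_strength use assumed definitions.
def score(pattern):
    concrete = sum(c not in '%_' for c in pattern)       # non-wildcard chars
    underscores = pattern.count('_')
    partitions = len([p for p in pattern.split('%') if p])
    anchored = (not pattern.startswith('%')) + (not pattern.endswith('%'))
    anchor_strength = 100 * anchored                     # assumed scale
    return (1.0 / (concrete + 1)
            - underscores * 0.05
            + partitions * 0.15
            - anchor_strength / 200)

patterns = ["%common%", "PROD-2024-%", "%rare_word%"]
ranked = sorted(patterns, key=score)
# ranked == ["PROD-2024-%", "%rare_word%", "%common%"], matching the
# execution order shown above
```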

-- Create 1M row test table
CREATE TABLE benchmark (
    id SERIAL PRIMARY KEY,
    name TEXT,
    description TEXT,
    category TEXT,
    score FLOAT
);

INSERT INTO benchmark (name, description, category, score)
SELECT 
    'Name_' || md5(random()::text),
    'Description_' || md5(random()::text),
    'Category_' || (random() * 100)::int,
    random() * 1000
FROM generate_series(1, 1000000);

-- Create indexes
CREATE INDEX idx_trgm ON benchmark 
    USING gin(name gin_trgm_ops, description gin_trgm_ops);

CREATE INDEX idx_biscuit ON benchmark 
    USING biscuit(name, description, category);

ANALYZE benchmark;
-- Single column, simple pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark WHERE name LIKE '%abc%' LIMIT 100;

-- Multi-column, complex pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark 
WHERE name LIKE '%a%b' 
  AND description LIKE '%bc%cd%'
ORDER BY score DESC 
LIMIT 10;

-- Aggregate query (COUNT)
EXPLAIN ANALYZE
SELECT COUNT(*) FROM benchmark 
WHERE name LIKE 'a%l%' 
  AND category LIKE 'f%d';

-- Complex multi-part pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark 
WHERE description LIKE 'u%dc%x'
LIMIT 50;
-- Show internal statistics
SELECT biscuit_index_stats('idx_biscuit'::regclass);

Output:

----------------------------------------------------
 Biscuit Index Statistics (FULLY OPTIMIZED)        +
 ==========================================        +
 Index: idx_biscuit                                +
 Active records: 1000002                           +
 Total slots: 1000002                              +
 Free slots: 0                                     +
 Tombstones: 0                                     +
 Max length: 44                                    +
 ------------------------                          +
 CRUD Statistics:                                  +
   Inserts: 0                                      +
   Updates: 0                                      +
   Deletes: 0                                      +
 ------------------------                          +
 Active Optimizations:                             +
   ✓ 1. Skip wildcard intersections                +
   ✓ 2. Early termination on empty                 +
   ✓ 3. Avoid redundant copies                     +
   ✓ 4. Optimized single-part patterns             +
   ✓ 5. Skip unnecessary length ops                +
   ✓ 6. TID sorting for sequential I/O             +
   ✓ 7. Batch TID insertion                        +
   ✓ 8. Direct bitmap iteration                    +
   ✓ 9. Parallel bitmap scan support               +
   ✓ 10. Batch cleanup on threshold                +
   ✓ 11. Skip sorting for bitmap scans (aggregates)+
   ✓ 12. LIMIT-aware TID collection                +

Index     Command                        Build Time
pg_trgm   CREATE INDEX idx_trgm ...      20,358.655 ms
biscuit   CREATE INDEX idx_biscuit ...    2,734.310 ms

1. Full-Text Search Applications

-- E-commerce product search
CREATE INDEX idx_products ON products 
    USING biscuit(name, brand, description);

SELECT * FROM products 
WHERE name LIKE '%laptop%' 
  AND brand LIKE 'ABC%'
  AND description LIKE '%gaming%'
ORDER BY price DESC 
LIMIT 20;
2. Log Analysis

-- Search error logs
CREATE INDEX idx_logs ON logs 
    USING biscuit(message, source, level);

SELECT * FROM logs 
WHERE message LIKE '%ERROR%connection%timeout%'
  AND source LIKE 'api.%'
  AND timestamp > NOW() - INTERVAL '1 hour'
LIMIT 100;

3. Customer Support / CRM

-- Search tickets by multiple fields
CREATE INDEX idx_tickets ON tickets 
    USING biscuit(subject, description, customer_name);

SELECT * FROM tickets 
WHERE subject LIKE '%refund%'
  AND customer_name LIKE 'John%'
  AND status = 'open'
ORDER BY created_at DESC;

4. Code Search / Documentation

-- Search code repositories
CREATE INDEX idx_files ON code_files 
    USING biscuit(filename, content, author);

SELECT * FROM code_files 
WHERE filename LIKE '%.py'
  AND content LIKE '%def%parse%json%'
  AND author LIKE 'team-%';

5. Analytics with Aggregates

-- Fast COUNT queries (no sorting overhead)
CREATE INDEX idx_events ON events 
    USING biscuit(event_type, user_agent, referrer);

SELECT COUNT(*) FROM events 
WHERE event_type LIKE 'click%'
  AND user_agent LIKE '%Mobile%'
  AND referrer LIKE '%google%';

Enable CRoaring for better performance by building against the CRoaring library (the Makefile detects it automatically). Beyond that, Biscuit doesn't expose tunable options; all optimizations are automatic.


Limitations and Trade-offs

What Biscuit Does NOT Support

  1. Regular expressions - Only LIKE / ILIKE patterns with % and _
  2. Locale-specific collations - String comparisons are byte-based
  3. Ordered scans - amcanorder = false, so the index cannot return rows in sorted order directly (but see below)

ORDER BY + LIMIT Behavior

Biscuit doesn't support ordered index scans (amcanorder = false), BUT:

PostgreSQL's planner handles this efficiently:

SELECT * FROM table WHERE col LIKE '%pattern%' ORDER BY score LIMIT 10;

Execution plan:

Limit
  -> Sort (cheap, small result set)
    -> Biscuit Index Scan (fast filtering)

Why this works:

  • Biscuit filters candidates extremely fast
  • Result set is small after filtering
  • Sorting 100-1000 rows in memory is negligible (<1ms)
  • Net result: Still much faster than pg_trgm with recheck overhead

Biscuit stores its bitmaps in memory; note the following maintenance and write-path behavior:

  • Use REINDEX to rebuild if index grows too large
  • INSERT: Similar to B-tree (must update bitmaps)
  • UPDATE: Two operations (remove old, insert new)
  • DELETE: Marks as tombstone, batch cleanup at threshold

Feature             Biscuit                   pg_trgm (GIN)
Wildcard patterns   ✔ Native, exact           ✔ Approximate
Recheck overhead    ✔ None (deterministic)    ✗ Always required
Multi-column        ✔ Optimized               ⚠️ Via btree_gist
Aggregate queries   ✔ Optimized               ✗ Same cost
ORDER BY + LIMIT    ✔ Works well              ✔ Ordered scans
Regex support       ✗ No                      ✔ Yes
Similarity search   ✗ No                      ✔ Yes
ILIKE support       ✔ Full                    ✔ Native

When to use Biscuit:

  • Wildcard-heavy LIKE / ILIKE queries (%, _)
  • Multi-column pattern matching
  • Need exact results (no false positives)
  • COUNT(*) / aggregate queries
  • High query volume, can afford memory

When to use pg_trgm:

  • Fuzzy/similarity search (word <-> pattern)
  • Regular expressions
  • Memory-constrained environments
  • Write-heavy workloads

git clone https://github.com/Crystallinecore/biscuit.git
cd biscuit

# Development build with debug symbols
make clean
CFLAGS="-g -O0 -DDEBUG" make

# Run tests
make installcheck

# Install
sudo make install
# Unit tests
make installcheck

# Manual testing
psql -d testdb

CREATE EXTENSION biscuit;

-- Create test table
CREATE TABLE test (id SERIAL, name TEXT);
INSERT INTO test (name) VALUES ('hello'), ('world'), ('test');

-- Create index
CREATE INDEX idx_test ON test USING biscuit(name);

-- Test queries
EXPLAIN ANALYZE SELECT * FROM test WHERE name LIKE '%ell%';

Enable PostgreSQL debug logging:

SET client_min_messages = DEBUG1;
SET log_min_messages = DEBUG1;

-- Now run queries to see Biscuit's internal logs
SELECT * FROM test WHERE name LIKE '%pattern%';

BiscuitIndex
├── num_columns: int
├── column_indices[]: ColumnIndex[]
│   ├── pos_idx[256]: CharIndex    // Forward position bitmaps
│   │   └── entries[]: PosEntry[]
│   │       ├── pos: int
│   │       └── bitmap: RoaringBitmap
│   ├── neg_idx[256]: CharIndex    // Backward position bitmaps
│   ├── char_cache[256]: RoaringBitmap  // Character existence
│   ├── length_bitmaps[]: RoaringBitmap[]  // Exact lengths
│   └── length_ge_bitmaps[]: RoaringBitmap[]  // Min lengths
├── insensitive_column_indices[]: ColumnIndex[]
│   ├── insensitive_pos_idx[256]: CharIndex    // Forward position bitmaps
│   │   └── entries[]: PosEntry[]
│   │       ├── pos: int
│   │       └── bitmap: RoaringBitmap
│   ├── insensitive_neg_idx[256]: CharIndex    // Backward position bitmaps
│   └── insensitive_char_cache[256]: RoaringBitmap  // Character existence
├── tids[]: ItemPointerData[]      // Record TIDs
├── column_data_cache[][]: char**  // Cached string data
└── tombstones: RoaringBitmap      // Deleted records
1. biscuit_rescan()
   ├─> Parse LIKE pattern into parts
   ├─> Analyze pattern selectivity (multi-column)
   ├─> Reorder predicates by priority
   └─> For each predicate:
       ├─> biscuit_query_column_pattern()
       │   ├─> Check fast paths (empty, %, pure wildcards)
       │   ├─> Match pattern parts using bitmaps
       │   └─> Return candidate bitmap
       └─> Intersect with previous candidates

2. biscuit_collect_tids_optimized()
   ├─> Detect aggregate vs. regular query
   ├─> Estimate LIMIT hint
   ├─> Collect TIDs from final bitmap
   ├─> Sort if needed (skip for aggregates)
   └─> Apply LIMIT early termination

3. biscuit_gettuple() or biscuit_getbitmap()
   └─> Return results to PostgreSQL executor

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing)
  3. Make your changes with tests
  4. Submit a pull request

MIT License - See LICENSE file for details.


Sivaprasad Murali


  • PostgreSQL community for the extensible index AM framework
  • CRoaring library for efficient bitmap operations
  • Inspired by the need for faster LIKE query performance in production systems


Happy pattern matching! Grab a biscuit 🍪 when pg_trgm feels half-baked!

