Show HN: CocoIndex – Open-Source Data framework for AI, built for data freshness

Original link: https://github.com/cocoindex-io/cocoindex

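CocoIndex is an open-source engine for efficient data indexing that supports custom transformations and incremental updates. Users define the transformation logic, and CocoIndex automatically creates and maintains an up-to-date index as the source data changes. Supported functionality includes splitting documents into chunks and embedding them with Transformer models. Getting started is straightforward: we provide ready-made documentation, a quick start guide, and a video tutorial. CocoIndex installs as a Python library and integrates with PostgreSQL through the pgvector extension. Users define indexing flows with a declarative API, specifying data sources, transformations, and export targets (for example, PostgreSQL). The example flow shows how to process documents, create embeddings, and build a vector index. We encourage community contributions, including code improvements, documentation, and feature suggestions. Join the Discord community for discussion and support. CocoIndex is released under the Apache 2.0 license.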

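CocoIndex, an open-source data ETL framework optimized for data freshness in AI, is gaining attention on Hacker News. Created by a former Google engineer, it targets the need for customizable data transformation pipelines beyond prepackaged RAG solutions. CocoIndex lets users define data flows the way they would define spreadsheet formulas, so that derived data reacts to changes in the source data. The framework emphasizes incremental processing, similar to materialized views: it tracks pipeline state and handles change data capture to minimize recomputation. This keeps updates to target stores such as Qdrant, Postgres, and Neo4j efficient. Users can plug in custom logic like building blocks to assemble custom AI pipelines and data processing workflows. Early feedback highlights its ease of use for tasks such as updating vector embeddings, and its applicability beyond RAG, for example parsing documents and loading them into a database. The creator encourages users to send feedback and support requests through GitHub. CocoIndex is licensed under Apache 2.0.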
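To make the "efficient updates to target stores" idea concrete, here is a minimal sketch of how a pipeline might work out the smallest set of writes for a target table. It is not CocoIndex's implementation: the function name `plan_target_changes` and the row layout (rows keyed by a `(filename, location)` tuple) are assumptions made for illustration only.

```python
def plan_target_changes(current_rows, desired_rows):
    """Return (upserts, deletes) that turn current_rows into desired_rows.

    Both arguments map a primary key to a row dict. Hypothetical helper,
    not part of the CocoIndex API.
    """
    # A row is written only if it is new or its content differs.
    upserts = {
        key: row
        for key, row in desired_rows.items()
        if key not in current_rows or current_rows[key] != row
    }
    # A row is deleted only if its key no longer exists in the desired state.
    deletes = sorted(key for key in current_rows if key not in desired_rows)
    return upserts, deletes


current = {
    ("a.md", 0): {"text": "hello"},
    ("a.md", 1): {"text": "world"},
    ("b.md", 0): {"text": "old"},
}
desired = {
    ("a.md", 0): {"text": "hello"},   # unchanged: no write needed
    ("a.md", 1): {"text": "world!"},  # changed: upsert
    ("c.md", 0): {"text": "new"},     # new: upsert
}
upserts, deletes = plan_target_changes(current, desired)
print(upserts)
print(deletes)
```

Unchanged rows produce no write at all, which is the property that keeps target updates cheap when only a small part of the source has changed.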

Original text

CocoIndex

Extract, Transform, Index Data. Easy and Fresh. 🌴

CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental updates specialized for data indexing.

With CocoIndex, users declare the transformation; CocoIndex then creates and maintains the index and keeps it up to date as the source changes, with minimal computation and changes.
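One common way to get "minimal computation" on update is to fingerprint each source item and re-run the transformation only when the fingerprint changes. The sketch below shows that general technique in plain Python. It is an assumption about how such an engine could work, not a description of CocoIndex internals, and the class `IncrementalIndex` is hypothetical.

```python
import hashlib


def fingerprint(content):
    """Stable content hash used to detect changed source items."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class IncrementalIndex:
    """Toy derived index that recomputes only changed entries (illustrative)."""

    def __init__(self, transform):
        self.transform = transform
        self.fingerprints = {}    # source key -> fingerprint of last seen content
        self.derived = {}         # source key -> transformed value
        self.transform_calls = 0  # counts how much work was actually done

    def update(self, source):
        # Drop derived entries whose source item has disappeared.
        for key in list(self.derived):
            if key not in source:
                del self.derived[key]
                del self.fingerprints[key]
        # Recompute only new or changed items.
        for key, content in source.items():
            fp = fingerprint(content)
            if self.fingerprints.get(key) == fp:
                continue  # unchanged: reuse the existing derived value
            self.derived[key] = self.transform(content)
            self.fingerprints[key] = fp
            self.transform_calls += 1


index = IncrementalIndex(transform=str.upper)
index.update({"a.md": "hello", "b.md": "world"})   # two transforms
index.update({"a.md": "hello", "b.md": "world!"})  # only b.md is recomputed
print(index.transform_calls, index.derived)
```

The second `update` call touches only `b.md`, so the transform runs three times in total rather than four.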

If you're new to CocoIndex 🤗, we recommend checking out the 📖 Documentation and ⚡ Quick Start Guide. We also have a ▶️ quick start video tutorial to help you get started.

  1. Install the CocoIndex Python library
  2. Set up Postgres with the pgvector extension, or bring up a Postgres database using Docker Compose:

    • Make sure Docker Compose is installed: docs
    • Start a PostgreSQL database for CocoIndex using our Docker Compose config:
    docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
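After the container starts, it can be useful to confirm that the database is accepting connections before defining a flow. The check below uses only the Python standard library. It assumes the database is exposed on `localhost` at PostgreSQL's default port 5432, which the compose file may or may not do, so adjust the host and port to match your setup. The helper name `is_port_open` is illustrative.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    # Assumed host and port; change these if your compose config differs.
    print("Postgres reachable:", is_port_open("localhost", 5432))
```

This only verifies that something is listening on the port. It does not check credentials or that the pgvector extension is installed.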

Start your first indexing flow!

Follow the Quick Start Guide to define your first indexing flow. A common indexing flow looks like:

import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

It defines an indexing flow, illustrated by the flow diagram in the repository README.
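To give a feel for what the `chunk_size` and `chunk_overlap` parameters control, here is a simplified fixed-window splitter. It is not the algorithm behind `SplitRecursively`, which may split on document structure, and it makes two assumptions of its own: sizes are counted in characters, and a chunk's location is its starting character offset. The function `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into (start_offset, chunk) pairs with overlapping windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
# [(0, 'abcd'), (2, 'cdef'), (4, 'efgh'), (6, 'ghij')]

# Each chunk can then be keyed the way the flow above keys its rows:
rows = {("example.md", start): {"text": chunk} for start, chunk in chunks}
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk. A key built from the filename and the chunk's position mirrors the role of `primary_key_fields=["filename", "location"]` in the export step.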

Play with existing examples and demos

Go to the examples directory to try any of the examples, following the instructions in each example's directory.

More are coming, so stay tuned! If there are specific examples you would like to see, please let us know in our Discord community 🌱.

For detailed documentation, visit the CocoIndex Documentation, which includes a Quickstart guide.

We love contributions from our community ❤️. For details on contributing or running the project for development, check out our contributing guide.

Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.

Join our community here:

CocoIndex is Apache 2.0 licensed.
