LightlyStudio – an open-source multimodal data curation and labeling tool

Original link: https://github.com/lightly-ai/lightly-studio

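## LightlyStudio: an open-source data management tool for AI

LightlyStudio is an open-source tool built with Rust that aims to simplify the machine learning data workflow, from curation and annotation to management. It supports popular formats such as COCO and YOLO and runs efficiently on standard hardware such as a MacBook Pro. Installation is simple: `pip install lightly-studio`. The tool provides a Python interface for indexing datasets (including data in cloud storage such as S3 and GCS) and for querying and manipulating samples. Users can add data, access sample attributes (tags, metadata, file paths), and run complex queries with expressions for filtering, sorting, and slicing.

LightlyStudio also offers automated data selection, which uses typicality and diversity to identify the samples most worth labeling, potentially reducing cost and improving model quality. LightlyStudio is currently in preview and welcomes community contributions through the issues page on GitHub. Example datasets and scripts are provided for getting started quickly with images, object detection, instance segmentation, and image captioning.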

Hacker News: LightlyStudio – an open-source multimodal data curation and labeling tool (github.com/lightly-ai). 53 points, posted by masakljun 1 day ago, 2 comments.

jononor, 1 day ago: Nice. Looks like it could at least be a decent image annotation tool. But no support for audio or time-series data? :( Maybe in the future?

toddmorey, 1 day ago: labelstud.io is also great: open source, highly configurable, and it supports audio and time-series data. I've been using it for a video annotation project.

Original text

Curate, Annotate, and Manage Your Data in LightlyStudio.

We at Lightly created LightlyStudio, an open-source tool designed to unify your data workflows, from curation and annotation to management, in a single tool. Since we're big fans of Rust, we used it to speed things up. You can work with COCO and ImageNet on a MacBook Pro with M1 and 16GB of memory!

Runs on Python 3.8 or higher on Windows, Linux and macOS.

pip install lightly-studio

Download example datasets by cloning the example repository or directly use your own YOLO/COCO dataset:

git clone https://github.com/lightly-ai/dataset_examples dataset_examples

To run an example using an image-only dataset, create a file named example_image.py with the following contents in the same directory that contains the dataset_examples/ folder:

import lightly_studio as ls

# Indexes the dataset, creates embeddings and stores everything in the database. Here we only load images.
dataset = ls.Dataset.create()
dataset.add_samples_from_path(path="dataset_examples/coco_subset_128_images/images")

# Start the UI server on localhost:8001.
# Use env variables LIGHTLY_STUDIO_HOST and LIGHTLY_STUDIO_PORT to customize it.
ls.start_gui()

Run the script with python example_image.py. Now you can inspect samples in the app.
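
The comment in the script mentions the environment variables LIGHTLY_STUDIO_HOST and LIGHTLY_STUDIO_PORT. One way to set them without touching the script is to launch it in a child process with a modified environment. The following is a minimal sketch: the helper names `studio_env` and `run_example` and the values `127.0.0.1` and `9000` are illustrative, and it assumes the variables are read when the script starts.

```python
import os
import subprocess
import sys


def studio_env(host: str, port: int) -> dict:
    """Return a copy of the current environment with the two variables set."""
    env = dict(os.environ)
    env["LIGHTLY_STUDIO_HOST"] = host
    env["LIGHTLY_STUDIO_PORT"] = str(port)  # environment values must be strings
    return env


def run_example(script: str, host: str, port: int) -> int:
    """Run a script such as example_image.py in a child process with the overrides."""
    completed = subprocess.run([sys.executable, script], env=studio_env(host, port))
    return completed.returncode


# Build the environment only; call run_example("example_image.py", ...) to launch.
env = studio_env("127.0.0.1", 9000)
print(env["LIGHTLY_STUDIO_HOST"], env["LIGHTLY_STUDIO_PORT"])
```

On Linux and macOS the same effect can be had from the shell by prefixing the command, for example `LIGHTLY_STUDIO_PORT=9000 python example_image.py`.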

To run an object detection example using a YOLO dataset, create a file named example_yolo.py with the following contents in the same directory that contains the dataset_examples/ folder:

import lightly_studio as ls

dataset = ls.Dataset.create()
dataset.add_samples_from_yolo(
   data_yaml="dataset_examples/road_signs_yolo/data.yaml",
)

ls.start_gui()

Run the script with python example_yolo.py. Now you can inspect samples with their assigned annotations in the app.
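
For readers bringing their own YOLO dataset, it may help to recall what the label files contain. In the common YOLO text layout, each line of a label file describes one object as `class x_center y_center width height`, with coordinates normalized to the image size. The sketch below converts one such line to pixel corners; the function name `yolo_to_pixel_box` and the sample line are illustrative, and the code does not use LightlyStudio itself.

```python
def yolo_to_pixel_box(line: str, image_width: int, image_height: int):
    """Convert 'class x_center y_center width height' (normalized) to pixel corners."""
    class_id, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
    x_min = (xc - w / 2) * image_width
    y_min = (yc - h / 2) * image_height
    x_max = (xc + w / 2) * image_width
    y_max = (yc + h / 2) * image_height
    return int(class_id), (x_min, y_min, x_max, y_max)


# A box centered in a 640x480 image, a quarter as wide and half as tall as the image.
class_id, box = yolo_to_pixel_box("2 0.5 0.5 0.25 0.5", image_width=640, image_height=480)
print(class_id, box)  # 2 (240.0, 120.0, 400.0, 360.0)
```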

COCO Instance Segmentation

To run an instance segmentation example using a COCO dataset, create a file named example_coco.py with the following contents in the same directory that contains the dataset_examples/ folder:

import lightly_studio as ls

dataset = ls.Dataset.create()
dataset.add_samples_from_coco(
   annotations_json="dataset_examples/coco_subset_128_images/instances_train2017.json",
   images_path="dataset_examples/coco_subset_128_images/images",
   annotation_type=ls.AnnotationType.INSTANCE_SEGMENTATION,
)

ls.start_gui()

Run the script via python example_coco.py. Now you can inspect samples with their assigned annotations in the app.
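
Before indexing your own COCO file, a quick sanity check of its contents can be useful. A COCO instances file is a JSON object with `images`, `categories`, and `annotations` lists, where each annotation references an image and a category by ID and stores `bbox` as `[x, y, width, height]`. The sketch below counts instances per category on a tiny hand-written dictionary; the data and the function name `instances_per_category` are made up for illustration.

```python
from collections import Counter

# Tiny hand-written stand-in for the contents of an instances JSON file.
coco = {
    "images": [{"id": 1, "file_name": "img1.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "dog"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 200],
         "segmentation": [[10, 20, 110, 20, 110, 220, 10, 220]], "iscrowd": 0},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [300, 100, 50, 40],
         "segmentation": [[300, 100, 350, 100, 350, 140, 300, 140]], "iscrowd": 0},
        {"id": 12, "image_id": 1, "category_id": 1, "bbox": [400, 50, 80, 160],
         "segmentation": [[400, 50, 480, 50, 480, 210, 400, 210]], "iscrowd": 0},
    ],
}


def instances_per_category(data: dict) -> Counter:
    """Count annotations per category name."""
    names = {c["id"]: c["name"] for c in data["categories"]}
    return Counter(names[a["category_id"]] for a in data["annotations"])


counts = instances_per_category(coco)
print(dict(counts))  # {'person': 2, 'dog': 1}
```

For a real file, load it with `json.load` and pass the resulting dictionary to the same function.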

To run a caption example using a COCO dataset, create a file named example_coco_captions.py with the following contents in the same directory that contains the dataset_examples/ folder:

import lightly_studio as ls

dataset = ls.Dataset.create()
dataset.add_samples_from_coco_caption(
   annotations_json="dataset_examples/coco_subset_128_images/captions_train2017.json",
   images_path="dataset_examples/coco_subset_128_images/images",
)

ls.start_gui()

Run the script with python example_coco_captions.py. Now you can inspect samples with their assigned captions in the app.
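
A COCO captions file differs from an instances file in that each annotation carries an `image_id` and a free-text `caption` rather than a box or mask, and one image usually has several captions. The sketch below groups captions by file name on a small hand-written dictionary; the data and the function name `captions_by_file` are illustrative.

```python
from collections import defaultdict

# Tiny hand-written stand-in for the contents of a captions JSON file.
coco_captions = {
    "images": [
        {"id": 1, "file_name": "img1.jpg"},
        {"id": 2, "file_name": "img2.jpg"},
    ],
    "annotations": [
        {"id": 100, "image_id": 1, "caption": "A dog running on a beach."},
        {"id": 101, "image_id": 1, "caption": "A brown dog near the sea."},
        {"id": 102, "image_id": 2, "caption": "A red stop sign."},
    ],
}


def captions_by_file(data: dict) -> dict:
    """Map each image file name to the list of its captions."""
    file_names = {img["id"]: img["file_name"] for img in data["images"]}
    grouped = defaultdict(list)
    for ann in data["annotations"]:
        grouped[file_names[ann["image_id"]]].append(ann["caption"])
    return dict(grouped)


grouped = captions_by_file(coco_captions)
print(len(grouped["img1.jpg"]), len(grouped["img2.jpg"]))  # 2 1
```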

LightlyStudio has a powerful Python interface. You can not only index datasets but also query and manipulate them using code.

The dataset is the main entity of the Python interface. It is used to set up the dataset, start the GUI, run queries and perform selections. It holds the connection to the database file.

import lightly_studio as ls

# Different loading options:
dataset = ls.Dataset.create()

# You can also load data from cloud storage
dataset.add_samples_from_path(path="s3://my-bucket/path/to/images/")

# And at any given time you can append more data (even across sources)
dataset.add_samples_from_path(path="gcs://my-bucket-2/path/to/more-images/")
dataset.add_samples_from_path(path="local-folder/some-data-not-in-the-cloud-yet")

# Load existing .db file
dataset = ls.Dataset.load()

A sample is a single data instance; a dataset holds references to all of its samples. You can access samples individually and read or write a sample's attributes.

# Iterating over the data in the dataset
for sample in dataset:
    # Access the sample: see the attribute examples below
    pass

# Get all samples as list
samples = list(dataset)

# Access sample attributes
s = samples[0]
s.sample_id        # Sample ID (UUID)
s.file_name        # Image file name (str), e.g. "img1.png"
s.file_path_abs    # Full image file path (str), e.g. "full/path/img1.png"
s.tags             # The list of sample tags (list[str]), e.g. ["tag1", "tag2"]
s.metadata["key"]  # dict-like access for metadata (any)

# Set sample attributes
s.tags = {"tag1", "tag2"}
s.metadata["key"] = 123

# Adding/removing tags
s.add_tag("some_tag")
s.remove_tag("some_tag")

...

Dataset queries combine filtering, sorting and slicing operations. They are built from expressions.

from lightly_studio.core.dataset_query import AND, OR, NOT, OrderByField, SampleField 

# QUERY: Define a lazy query composed of: match, order_by, slice
# match: Find all samples that need labeling, plus small samples (< 500px wide) that haven't been reviewed.
query = dataset.match(
    OR(
        AND(
            SampleField.width < 500,
            NOT(SampleField.tags.contains("reviewed"))
        ),
        SampleField.tags.contains("needs-labeling")
    )
)

# order_by: Sort the samples by their width descending.
query.order_by(
    OrderByField(SampleField.width).desc()
)

# slice: Extract a slice of samples.
query[10:20]

# chaining: The query can also be constructed in chained way
query = dataset.match(...).order_by(...)[...]

# Ways to consume the query
# Tag this subset for easy filtering in the UI.
query.add_tag("needs-review")

# Iterate over resulting samples
for sample in query:
    # Access the sample: see previous section
    pass

# Collect all resulting samples as list
samples = query.to_list()

# Export all resulting samples in coco format
query.export().to_coco_object_detections()
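
To make the semantics of match, order_by and slice concrete, the following sketch applies the same logic to a plain list of dictionaries. It does not use LightlyStudio; the sample data and the helper `matches` are made up, and the real query is lazy and runs against the database rather than in Python lists.

```python
samples = [
    {"file_name": "a.png", "width": 320, "tags": {"reviewed"}},
    {"file_name": "b.png", "width": 480, "tags": set()},
    {"file_name": "c.png", "width": 1024, "tags": {"needs-labeling"}},
    {"file_name": "d.png", "width": 800, "tags": {"reviewed"}},
    {"file_name": "e.png", "width": 200, "tags": set()},
]


def matches(sample: dict) -> bool:
    """OR(AND(width < 500, NOT(reviewed)), needs-labeling) in plain Python."""
    small_and_unreviewed = sample["width"] < 500 and "reviewed" not in sample["tags"]
    return small_and_unreviewed or "needs-labeling" in sample["tags"]


matched = [s for s in samples if matches(s)]                        # match
ordered = sorted(matched, key=lambda s: s["width"], reverse=True)  # order_by ... desc
page = ordered[0:2]                                                 # slice
print([s["file_name"] for s in page])  # ['c.png', 'b.png']
```

Here `a.png` is small but already reviewed and `d.png` is neither small nor tagged `needs-labeling`, so both are filtered out before sorting.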

LightlyStudio offers a premium feature for automated data selection. Selecting the right subset of your data can save labeling cost and training time while improving model quality. Selection in LightlyStudio automatically picks the most useful samples: those that are both representative (typical) and diverse (novel).

You can balance these two aspects to fit your goal: stable core data, edge cases, or a mix of both.

from lightly_studio.selection.selection_config import (
    MetadataWeightingStrategy,
    EmbeddingDiversityStrategy,
)

...

# Compute typicality and store it as `typicality` metadata
dataset.compute_typicality_metadata(metadata_name="typicality")

# Select 10 samples by combining typicality and diversity; diversity has twice the strength (2.0 vs. 1.0)
dataset.query().selection().multi_strategies(
    n_samples_to_select=10,
    selection_result_tag_name="multi_strategy_selection",
    selection_strategies=[
        MetadataWeightingStrategy(metadata_key="typicality", strength=1.0),
        EmbeddingDiversityStrategy(embedding_model_name="my_model_name", strength=2.0),
    ],
)
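
The idea of balancing typicality against diversity can be illustrated with a small greedy sketch on 2D points. This is not LightlyStudio's selection algorithm, which is not described here. It is one simple way to combine the two signals under stated assumptions: typicality is taken as closeness to the k nearest neighbours, diversity as the normalized distance to the closest already-selected point, and the function names, parameters and data are all hypothetical.

```python
import math


def typicality(points, k=2):
    """Score each point by closeness to its k nearest neighbours (higher = more typical)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(1.0 / (1.0 + sum(dists[:k]) / k))
    return scores


def select(points, n, typicality_strength=1.0, diversity_strength=2.0):
    """Greedily pick n indices, trading typicality against distance to earlier picks."""
    n = min(n, len(points))
    typ = typicality(points)
    # Normalize distances by the largest pairwise distance so both terms are comparable.
    scale = max(math.dist(p, q) for p in points for q in points) or 1.0
    selected = []
    while len(selected) < n:
        def score(i):
            # Diversity term: normalized distance to the closest already-selected point.
            nearest = min((math.dist(points[i], points[j]) for j in selected), default=0.0)
            return typicality_strength * typ[i] + diversity_strength * nearest / scale

        candidates = [i for i in range(len(points)) if i not in selected]
        selected.append(max(candidates, key=score))
    return selected


points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # dense cluster
    (5.0, 5.0), (5.1, 5.0),                            # small second cluster
    (10.0, 0.0),                                       # isolated point
]
picked = select(points, 3)
print(picked)
```

With these weights the first pick comes from the dense cluster (most typical), the second is the isolated point (farthest from the first), and the third comes from the small second cluster. Raising `typicality_strength` relative to `diversity_strength` would keep more picks inside the dense cluster.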

[0.4.0] - 2025-10-21 LightlyStudio released as preview version

We welcome contributions! Please check our issues page for current tasks and improvements, or propose new issues yourself.
