Show HN: 30k IKEA items in flat text (CommerceTXT). 24% smaller than JSON

Original link: https://huggingface.co/datasets/tsazan/ikea-us-commercetxt

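## IKEA US CommerceTXT Dataset Summary

This dataset contains 30,511 products from IKEA US in CommerceTXT v1.0.1 format, a token-optimized, human-readable alternative to JSON for e-commerce data. The data is dated July 15, 2025 and is organized into 632 categories, each holding individual product files.

CommerceTXT is designed for efficient AI/LLM consumption and uses **24% fewer tokens (3.6 million saved)** than JSON, catalog structure included. With models such as GPT-4o this can translate into significant cost savings: up to $26,900 per month at 100 queries per day. The format is also easy to read and parse, which simplifies debugging and version control.

The dataset includes root and category index files, providing structured navigation that a standard JSON implementation lacks. It is available through the `datasets` library and by direct file access. Sample data includes product names, SKUs, prices, and specifications.

**Important:** This is an *unofficial* research dataset, not affiliated with IKEA, and intended for non-commercial, educational use only. The data is a static snapshot from July 2025 and may be outdated.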

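A developer converted the entire IKEA US product catalog (30,511 items) into a new plain-text format called CommerceTXT and shared it on Hugging Face. The project explores whether a simpler data structure can make large language models (LLMs) more efficient by reducing token usage.

The results show that CommerceTXT is about **24%** smaller than the equivalent minified JSON version, saving 3.6 million tokens. The data is organized hierarchically by category, which makes it convenient for testing LLM retrieval methods.

The developer also published the parsing code on GitHub and welcomes questions about the conversion process, as part of exploring alternative data formats for LLM processing of e-commerce data.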
Original text

Standard AI Context

30,511 IKEA US products in CommerceTXT v1.0.1 format - A token-optimized, human-readable alternative to JSON for e-commerce data.

📊 Dataset Statistics

| Metric | Value |
|---|---|
| Products | 30,511 |
| Categories | 632 |
| Format | CommerceTXT v1.0.1 |
| Data Date | 2025-07-15 |
| Token Savings | 24% vs JSON |
| Tokens Saved | 3.6M |

🎯 What is CommerceTXT?

CommerceTXT is a lightweight, text-based protocol designed for AI/LLM consumption of e-commerce data. It eliminates JSON overhead while maintaining structure and readability.

Key Benefits:

  • 24% fewer tokens than JSON (3.6M saved including catalog structure)
  • Human-readable - easy to debug and version control
  • AI-optimized - clean format for RAG and LLM processing
  • Structured - parseable with simple rules

📁 Dataset Structure

ikea-us-commercetxt/
├── commerce.txt                    # Root with @CATALOG (632 categories)
├── products/                       # 30,511 files organized by category
│   ├── frames/
│   │   ├── 00263858.txt
│   │   └── ...
│   ├── tables-and-desks/
│   │   └── ...
│   └── ... (632 category folders)
├── categories/                     # 632 category index files
│   ├── frames.txt
│   ├── tables-and-desks.txt
│   └── ...
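
As a quick sanity check of the layout above, the following sketch counts product files per category folder. It assumes the repository has been downloaded to a local directory; the path `ikea-us-commercetxt` and the helper name `count_products_by_category` are illustrative, not part of the dataset or the CommerceTXT tooling.

```python
from collections import Counter
from pathlib import Path


def count_products_by_category(root: Path) -> Counter:
    """Count .txt product files under root/products/<category>/."""
    counts: Counter = Counter()
    products_dir = root / "products"
    if not products_dir.is_dir():
        return counts  # nothing to count if the layout is missing
    for category_dir in sorted(products_dir.iterdir()):
        if category_dir.is_dir():
            counts[category_dir.name] = sum(1 for _ in category_dir.glob("*.txt"))
    return counts


if __name__ == "__main__":
    # Assumed local checkout path; adjust to wherever the files live.
    counts = count_products_by_category(Path("ikea-us-commercetxt"))
    print(f"{len(counts)} categories, {sum(counts.values())} product files")
```

If the local copy matches the tree above, the totals should line up with the 632 categories and 30,511 products reported in the statistics.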

🚀 Usage

Load with datasets library

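from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
dataset = load_dataset("tsazan/ikea-us-commercetxt")

# Read the root catalog and the product files from the first record
commerce_txt = dataset['train'][0]['commerce.txt']
product_files = dataset['train'][0]['products']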

Direct file access


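# Read the root catalog (UTF-8, since product names contain non-ASCII characters)
with open("commerce.txt", encoding="utf-8") as f:
    catalog = f.read()
    print(catalog)

# Read a single product file
with open("products/frames/00263858.txt", encoding="utf-8") as f:
    product = f.read()
    print(product)

# Read a category index file
with open("categories/frames.txt", encoding="utf-8") as f:
    category = f.read()
    print(category)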

Parse with CommerceTXT parser

from commercetxt import parse_file

# Parse one product file
result = parse_file("products/frames/00263858.txt")

# Look up the @PRODUCT and @OFFER sections (empty dict if a section is missing)
product = result.directives.get('PRODUCT', {})
offer = result.directives.get('OFFER', {})

print(f"Product: {product.get('Name')}")
print(f"Price: ${offer.get('Price')}")
print(f"Brand: {product.get('Brand')}")

📝 File Format Example

# @PRODUCT
Name: KNOPPÄNG frame, black
SKU: 00263858
Brand: IKEA
LastUpdated: 2025-07-15T00:00:00Z
URL: https://www.ikea.com/us/en/p/knoppaeng-frame-black-00263858/
Category: Frames

# @OFFER
Price: 5.99
Currency: USD
Availability: InStock
Condition: New
TaxIncluded: False

# @SPECS
Materials: Wood
Dimensions: Width: 12", Height: 16"
Care: Wipe clean with a cloth

# @IMAGES
- https://www.ikea.com/us/en/images/products/knoppaeng-frame-black__0638237_pe698788_s5.jpg
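
To illustrate how simple the parsing rules can be, here is a minimal sketch that handles only the three line shapes visible in the example above: `# @SECTION` headers, `Key: Value` pairs, and `- item` list lines. It is not the official `commercetxt` parser and makes no claim to cover the full v1.0.1 specification; the function name `parse_commercetxt` and the returned dict layout are assumptions made for this illustration.

```python
def parse_commercetxt(text: str) -> dict:
    """Parse '# @SECTION' headers, 'Key: Value' pairs, and '- item' lines."""
    sections: dict = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# @"):
            current = line[3:].strip()
            sections[current] = {"fields": {}, "items": []}
        elif current is None:
            continue  # ignore anything before the first section header
        elif line.startswith("- "):
            sections[current]["items"].append(line[2:].strip())
        elif ": " in line:
            # Split on the first ': ' only, so values may contain colons
            key, value = line.split(": ", 1)
            sections[current]["fields"][key.strip()] = value.strip()
    return sections


SAMPLE = """\
# @PRODUCT
Name: KNOPPÄNG frame, black
SKU: 00263858

# @OFFER
Price: 5.99
Currency: USD

# @SPECS
Dimensions: Width: 12", Height: 16"

# @IMAGES
- https://www.ikea.com/us/en/images/products/knoppaeng-frame-black__0638237_pe698788_s5.jpg
"""

parsed = parse_commercetxt(SAMPLE)
print(parsed["PRODUCT"]["fields"]["Name"])
print(parsed["OFFER"]["fields"]["Price"])
print(parsed["SPECS"]["fields"]["Dimensions"])
```

Splitting on the first `": "` keeps values such as the `Dimensions` line and URLs intact. For real use, the project's own parser is the better choice.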

💰 Token Efficiency

Full Dataset Comparison (including catalog structure):

Clarification: The disclaimer section is not included in any of the token counts or savings calculations.

| Component | JSON Tokens | CommerceTXT Tokens | Savings |
|---|---|---|---|
| Products (30,511) | 14,894,623 | 10,212,452 | 31.44% |
| Categories (632) | N/A* | 1,073,051 | - |
| Root Catalog | N/A* | 11,180 | - |
| TOTAL | 14,894,623 | 11,296,683 | 24.16% |

* JSON has no built-in catalog structure (requires separate database/index)

Per Product Average:

  • JSON: 488 tokens/product
  • CommerceTXT: 370 tokens/product (including catalog overhead)
  • Savings: 118 tokens/product (24%)
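
The totals and per-product averages follow directly from the token counts in the comparison table. The short script below redoes that arithmetic using only the published figures; the variable names are illustrative.

```python
# Token counts as published in the comparison table
json_tokens = 14_894_623
txt_products = 10_212_452
txt_categories = 1_073_051
txt_root = 11_180
n_products = 30_511

txt_total = txt_products + txt_categories + txt_root
saved = json_tokens - txt_total

# Savings for product files alone, and for the full dataset with catalog files
product_savings_pct = 100 * (json_tokens - txt_products) / json_tokens
total_savings_pct = 100 * saved / json_tokens

print(f"CommerceTXT total: {txt_total:,}")
print(f"Tokens saved:      {saved:,}")
print(f"Product savings:   {product_savings_pct:.2f}%")
print(f"Overall savings:   {total_savings_pct:.2f}%")
print(f"JSON per product:  {json_tokens / n_products:.0f}")
print(f"TXT per product:   {txt_total / n_products:.0f}")
```

The gap between the product-only and overall percentages is the category and root catalog files, which exist only on the CommerceTXT side of the comparison.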

Cost Impact (GPT-4o at $2.50/1M input tokens; the figures correspond to each query sending the full dataset as input, over a 30-day month):

  • 1 query/day: $269/month saved
  • 10 queries/day: $2,690/month saved
  • 100 queries/day: $26,900/month saved
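
The monthly figures can be approximated with a few lines of arithmetic. The sketch below assumes, since the source does not state it, that every query sends the entire dataset as input and that a month has 30 days; the constant and function names are illustrative.

```python
PRICE_PER_MILLION_USD = 2.50          # GPT-4o input price quoted above
TOKENS_SAVED_PER_QUERY = 3_597_940    # 14,894,623 - 11,296,683
DAYS_PER_MONTH = 30                   # assumption


def monthly_savings_usd(queries_per_day: int) -> float:
    """Estimated monthly savings if each query carries the full dataset."""
    per_query = TOKENS_SAVED_PER_QUERY / 1_000_000 * PRICE_PER_MILLION_USD
    return per_query * queries_per_day * DAYS_PER_MONTH


for queries in (1, 10, 100):
    print(f"{queries:>3} queries/day: ${monthly_savings_usd(queries):,.2f}/month")
```

Under these assumptions one query per day comes to just under $270 per month; the listed $2,690 and $26,900 appear to be the rounded-down $269 scaled by 10 and 100. A retrieval setup that sends only a few category or product files per query would save proportionally less in absolute terms.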

Note: CommerceTXT includes structured navigation via @CATALOG and category files, which JSON lacks. Categories list all products, adding ~1.08M tokens. Even with this catalog overhead, CommerceTXT saves 3.6M tokens (24%)!

🔍 Use Cases

1. RAG (Retrieval-Augmented Generation)
2. Product Search
3. AI Shopping Assistant

📊 Token Savings Distribution

Product-level savings distribution (30,511 products):

When comparing individual products (JSON → CommerceTXT), before adding catalog overhead:

  0-10%:    111 products (0.4%)
 10-20%:  5,934 products (19.4%)
 20-30%: 10,018 products (32.8%)
 30-40%: 10,433 products (34.2%)  ← Most common
 40-50%:  3,239 products (10.6%)
   >50%:    776 products (2.5%)

Product average: ~31% savings per product
Dataset total (with catalog): 24% savings overall

Note: Individual products save ~31% on average, but the full dataset (including 632 category files with product listings) saves 24% overall. The catalog structure adds navigation value that JSON lacks.
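
The bucket counts can be cross-checked against the product total and the listed percentages. This snippet uses only the published counts; the dictionary and variable names are illustrative.

```python
# Published per-product savings buckets (JSON -> CommerceTXT, before catalog overhead)
buckets = {
    "0-10%": 111,
    "10-20%": 5_934,
    "20-30%": 10_018,
    "30-40%": 10_433,
    "40-50%": 3_239,
    ">50%": 776,
}

total = sum(buckets.values())
shares = {label: round(100 * count / total, 1) for label, count in buckets.items()}
most_common = max(buckets, key=buckets.get)

print(f"Total products: {total:,}")
for label, share in shares.items():
    print(f"{label:>7}: {share}%")
print(f"Largest bucket: {most_common}")
```

The counts sum to the full 30,511 products, and roughly two thirds of them fall in the 20-40% range.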

⚖️ Legal & Disclaimer

Important: This is an unofficial research dataset for demonstrating CommerceTXT protocol.

  • NOT affiliated with Inter IKEA Systems B.V.
  • ⚠️ Static snapshot from July 2025 - data may be outdated
  • 🔒 Research/educational use only - not for commercial purposes
  • ™️ IKEA® is a registered trademark of Inter IKEA Systems B.V.

No warranty provided. Use at your own risk.

📚 Resources

🛠️ Generation

This dataset was generated from the IKEA US Product Dataset (July 2025) by converting it to CommerceTXT v1.0.1 format.

Conversion process:

  1. Parsed JSON from source dataset
  2. Extracted clean product names (removed measurements, IKEA US suffix)
  3. Organized products into 632 category folders
  4. Converted to CommerceTXT structured format
  5. Generated category index files with full product listings
  6. Created root @CATALOG with all 632 categories
  7. Validated all 30,511 product files for spec compliance
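
Step 4 of the process above can be sketched as a small serializer. This is not the author's conversion script: the input record shape (a dict already grouped by section) and the function name `to_commercetxt` are hypothetical, and the real source JSON may use different field names that would need mapping first.

```python
def to_commercetxt(record: dict) -> str:
    """Render a sectioned dict as '# @SECTION' blocks of 'Key: Value' lines."""
    lines = []
    for section, body in record.items():
        lines.append(f"# @{section}")
        if isinstance(body, dict):
            lines.extend(f"{key}: {value}" for key, value in body.items())
        else:
            # Any non-dict body is treated as a list of items
            lines.extend(f"- {item}" for item in body)
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical, already-sectioned record built from the example product
record = {
    "PRODUCT": {"Name": "KNOPPÄNG frame, black", "SKU": "00263858", "Brand": "IKEA"},
    "OFFER": {"Price": "5.99", "Currency": "USD"},
    "IMAGES": [
        "https://www.ikea.com/us/en/images/products/"
        "knoppaeng-frame-black__0638237_pe698788_s5.jpg"
    ],
}

text = to_commercetxt(record)
print(text)
```

Most of the savings over JSON in a layout like this come from dropping quotes, braces, and commas around every key and value.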

📜 Citation

If you use this dataset, please cite:

@dataset{ikea_us_commercetxt_2025,
  title = {IKEA US CommerceTXT Dataset},
  author = {Tsanko Zanov},
  year = {2026},
  url = {https://huggingface.co/datasets/tsazan/ikea-us-commercetxt}
}

Original data source:

@misc{ikea_us_products_2025,
  title = {IKEA US Product Dataset (July 2025)},
  author = {Jeffrey Zhou},
  year = {2025},
  url = {https://huggingface.co/datasets/jeffreyszhou/ikea-us-products-2025}
}

⚖️ Legal & Disclaimer

License: CC0 1.0 (Public Domain Dedication)

📬 Contact


Built with ❤️ for the AI & e-commerce community
