Show HN: Misata – synthetic data engine using LLM and Vectorized NumPy

Original link: https://github.com/rasinmuhammed/misata

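## Misata: Realistic Synthetic Data Generation

Misata is a tool that generates realistic multi-table datasets directly from natural-language descriptions, with no schema design or training data required. Describe the data you need (for example, "e-commerce with products and orders") and Misata automatically generates a relational database with an appropriate schema, relationships, and business constraints. Key features include automatic schema generation, relational integrity, support for large datasets (10M+ rows via streaming), and the ability to define custom business rules. Misata uses large language models (LLMs) through providers such as Groq, OpenAI, and Ollama (for local, private generation) to parse descriptions. Users can tailor generation with options such as row count, a seed for reproducibility, and noise injection for added realism. Advanced features include temporal drift simulation and custom column overrides. Misata is available as a command-line tool and offers enterprise solutions for complex scenarios and integration with existing pipelines. It is MIT-licensed and built by Muhammed Rasin.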

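## Misata: LLM-Based Synthetic Data Generation

Misata is a new open-source synthetic data engine that aims to overcome the limitations of existing tools such as Faker and Mimesis in relational and temporal data integrity. Created by rasinmuhammed (github.com/rasinmuhammed), Misata takes a two-layer approach: an LLM (Groq/Llama-3.3) interprets natural-language rules that define the data relationships, and a high-performance vectorized NumPy simulation layer generates the data itself. Misata is currently in early alpha and generates roughly 250K rows per second on an M1 Air. It ensures referential integrity by building a dependency graph of the tables. The author is looking for feedback on the architecture, in particular on how to scale past the current 10M-row limit that comes from using in-memory Pandas DataFrames, with DuckDB under consideration as a potential solution. One distinctive, experimental feature generates data from a description of a chart.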
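The dependency-graph idea in the summary above can be illustrated with a minimal sketch. This is not Misata's implementation: the `SCHEMA` dictionary, the `generate` function, and the `<parent>_id` column naming are hypothetical, chosen only to show how a topological order lets parent tables exist before child rows draw foreign keys from them in a single vectorized call.

```python
from graphlib import TopologicalSorter

import numpy as np

# Hypothetical schema: table -> (row count, parent tables it references).
SCHEMA = {
    "users": (1_000, []),
    "plans": (5, []),
    "subscriptions": (900, ["users", "plans"]),
    "payments": (3_000, ["subscriptions"]),
}


def generate(schema, seed=0):
    """Generate primary and foreign keys so every child row has a valid parent."""
    rng = np.random.default_rng(seed)
    # Map each table to its parents; static_order() yields parents first.
    graph = {name: set(parents) for name, (_, parents) in schema.items()}
    tables = {}
    for name in TopologicalSorter(graph).static_order():
        n_rows, parents = schema[name]
        columns = {"id": np.arange(1, n_rows + 1)}
        for parent in parents:
            # Vectorized draw: one call samples all foreign keys at once.
            columns[f"{parent}_id"] = rng.choice(tables[parent]["id"], size=n_rows)
        tables[name] = columns
    return tables


tables = generate(SCHEMA)
orphans = ~np.isin(tables["payments"]["subscriptions_id"], tables["subscriptions"]["id"])
print("orphaned payments:", int(orphans.sum()))
```

Because foreign keys are sampled only from IDs that were already generated, no child row can point at a missing parent. A cycle in the schema would make `TopologicalSorter` raise `graphlib.CycleError`, which is a reasonable place to report an invalid schema.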

Original text

Generate realistic multi-table datasets from natural language.

No schema writing. No training data. Just describe what you need.

✨ What Makes Misata Different

| Feature | Faker | SDV | Misata |
| --- | --- | --- | --- |
| Natural language input | ❌ | ❌ | ✅ |
| Auto schema generation | ❌ | ❌ | ✅ |
| Relational integrity | ❌ | ✅ | ✅ |
| Business constraints | ❌ | ❌ | ✅ |
| No training data needed | ✅ | ❌ | ✅ |
| Streaming (10M+ rows) | ❌ | ❌ | ✅ |

With Groq

export GROQ_API_KEY=your_key  # Get free: https://console.groq.com
misata generate --story "A SaaS with 50K users, subscriptions, and payments" --use-llm

With OpenAI

export OPENAI_API_KEY=your_key
misata generate --story "E-commerce with products and orders" --use-llm --provider openai

With Ollama (Local, Free, Private)

ollama run llama3  # Start Ollama first
misata generate --story "Fitness app with workouts" --use-llm --provider ollama

$ misata generate --story "A fitness app with 50K users" --use-llm

🧠 Using Groq (llama-3.3-70b-versatile) for intelligent parsing...
✅ LLM schema generated successfully!

📋 Schema: FitnessApp
   Tables: 5
   Relationships: 4

🔧 Generating 5 table(s)...

   ✓ exercises     (10 rows)
   ✓ plans         (5 rows)
   ✓ users         (50,000 rows)
   ✓ subscriptions (45,000 rows)
   ✓ workouts      (500,000 rows)

⏱️  Generation time: 2.34 seconds
🚀 Performance: 213,675 rows/second
💾 Data saved to: ./generated_data

from misata import DataSimulator, SchemaConfig
from misata.llm_parser import LLMSchemaGenerator

# Generate schema from story
llm = LLMSchemaGenerator(provider="groq")  # or "openai", "ollama"
config = llm.generate_from_story(
    "A mobile fitness app with 50K users, workout tracking, "
    "premium subscriptions, and January signup spikes"
)

# Generate data
for table_name, batch in DataSimulator(config).generate_all():
    print(f"Generated {len(batch)} rows for {table_name}")

# Basic generation (rule-based, no API key needed)
misata generate --story "SaaS company with users and subscriptions"

# LLM-powered generation
misata generate --story "..." --use-llm

# Specify provider and model
misata generate --story "..." --use-llm --provider ollama --model llama3

# Custom output directory
misata generate --story "..." --use-llm --output-dir ./my_data

# Set row count
misata generate --story "..." --use-llm --rows 100000

# Reproducible with seed
misata generate --story "..." --use-llm --seed 42
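How `--seed` is wired up inside Misata is not described here. As a general illustration of seed-based reproducibility in NumPy, the sketch below uses a hypothetical helper, `sample_signup_ages`, that creates a local generator from the seed, so two runs with the same seed produce identical columns.

```python
import numpy as np


def sample_signup_ages(n_rows, seed):
    """Draw ages from a seeded generator; the same seed gives the same rows."""
    rng = np.random.default_rng(seed)  # local generator, no global state
    return rng.normal(35, 12, n_rows).clip(18, 80).astype(int)


first = sample_signup_ages(1_000, seed=42)
second = sample_signup_ages(1_000, seed=42)
other = sample_signup_ages(1_000, seed=43)

print(np.array_equal(first, second))  # same seed, same data
print(np.array_equal(first, other))   # different seed, different data
```

Passing a generator object around, instead of calling the module-level `np.random.*` functions, keeps results independent of any other code that touches NumPy's global random state.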

🎯 Business Rule Constraints

Define rules like "employees can't log >8 hours/day":

from misata import Constraint, Table

timesheets = Table(
    name="timesheets",
    row_count=10000,
    constraints=[
        Constraint(
            name="max_daily_hours",
            type="sum_limit",
            group_by=["employee_id", "date"],
            column="hours",
            value=8.0,
            action="redistribute"
        )
    ]
)
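To make the idea of a per-group sum limit concrete, here is a minimal NumPy sketch. It is an assumption-laden stand-in, not Misata's `sum_limit` logic: the function `cap_group_sums` is hypothetical, and it simply scales values down proportionally in any group whose total exceeds the limit, which may differ from what `action="redistribute"` does in the library.

```python
import numpy as np


def cap_group_sums(group_keys, values, limit):
    """Scale values down so that no group's total exceeds `limit`."""
    # inverse[i] is the index of the group that row i belongs to.
    _, inverse = np.unique(group_keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    totals = np.bincount(inverse, weights=values)
    # Groups at or under the limit keep a factor of 1.0.
    factors = np.minimum(1.0, limit / np.maximum(totals, 1e-12))
    return values * factors[inverse]


# Columns: employee_id, day index (integers standing in for dates).
keys = np.array([[1, 0], [1, 0], [1, 1], [2, 0], [2, 0]])
hours = np.array([6.0, 6.0, 7.0, 3.0, 4.0])

capped = cap_group_sums(keys, hours, limit=8.0)
print(capped)
```

Employee 1 logged 12 hours on day 0, so both of those rows are scaled by 8/12; the other groups are already within the limit and are left unchanged.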

| Provider | Env Variable | Free Tier | Notes |
| --- | --- | --- | --- |
| Groq | `GROQ_API_KEY` | ✅ 30 req/min | Fastest, recommended |
| OpenAI | `OPENAI_API_KEY` | ❌ | Best quality |
| Ollama | None | ✅ Local | Private, no internet |

📈 Extending Data Pools

from misata import TextGenerator

# Add custom names
TextGenerator.extend_pool("first_names", ["Arjun", "Priya", "Rahul"])

# Load from file
TextGenerator.load_pools_from_file("custom_pools.json")

# Save for reuse
TextGenerator.save_pools_to_file("expanded_pools.json")

Make your synthetic data look more like messy real-world data with noise injection:

from misata import add_noise, NoiseInjector

# Quick noise injection
noisy_df = add_noise(df,
    null_rate=0.05,      # 5% missing values
    outlier_rate=0.02,   # 2% statistical outliers
    typo_rate=0.01,      # 1% typos in text
    duplicate_rate=0.03, # 3% duplicate rows
    seed=42
)

# Advanced: Temporal distribution drift
injector = NoiseInjector(seed=42)
df = injector.apply_temporal_drift(df, 
    date_column="created_at",
    value_column="revenue", 
    drift_rate=0.15,      # 15% increase over time
    drift_direction="up"
)
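For readers curious what null injection and temporal drift can look like at the array level, here is a rough sketch in plain NumPy. The helpers `inject_nulls` and `apply_linear_drift` are hypothetical and are not taken from Misata; in particular, the linear shape of the drift is an assumption, since the calls above do not say how `drift_rate` is applied over time.

```python
import numpy as np


def inject_nulls(values, null_rate, rng):
    """Return a float copy with roughly `null_rate` of entries set to NaN."""
    out = values.astype(float)  # astype copies, so the input is untouched
    mask = rng.random(out.shape[0]) < null_rate
    out[mask] = np.nan
    return out


def apply_linear_drift(timestamps, values, drift_rate):
    """Scale values by a factor rising linearly from 1.0 to 1.0 + drift_rate."""
    t = timestamps.astype("datetime64[s]").astype(np.int64)
    span = t.max() - t.min()
    position = (t - t.min()) / span if span > 0 else np.zeros(t.shape)
    return values * (1.0 + drift_rate * position)


rng = np.random.default_rng(42)
created_at = np.datetime64("2024-01-01") + np.arange(365).astype("timedelta64[D]")
revenue = np.full(365, 100.0)

drifted = apply_linear_drift(created_at, revenue, drift_rate=0.15)
noisy = inject_nulls(drifted, null_rate=0.05, rng=rng)
print(drifted[0], drifted[-1], int(np.isnan(noisy).sum()))
```

Applying drift before nulls keeps the drift calculation free of NaN handling; the null mask is drawn independently per row, so the realized missing rate only approximates `null_rate`.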

from misata import Customizer, ColumnOverride
import numpy as np

customizer = Customizer(seed=42)

# Custom age distribution (realistic, not uniform)
customizer.add_override("users", ColumnOverride(
    name="age",
    generator=lambda n: np.random.normal(35, 12, n).clip(18, 80).astype(int)
))

# Conditional values based on other columns
customizer.add_conditional("orders", "shipping_cost", {
    "country": {"US": 5.99, "UK": 9.99, "DE": 7.99}
})

# Apply to generated data
df = customizer.apply(df, "users")
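The conditional override above maps a value in one column to a value in another. A vectorized lookup of that kind can be sketched as follows; `conditional_column` is a hypothetical helper, not part of Misata, and the choice to return NaN for unmapped keys is an assumption made for the example.

```python
import numpy as np

SHIPPING_BY_COUNTRY = {"US": 5.99, "UK": 9.99, "DE": 7.99}


def conditional_column(keys, mapping, default=np.nan):
    """Look up a value for every key at once; unknown keys get `default`."""
    known = np.array(list(mapping.keys()))
    mapped = np.array(list(mapping.values()), dtype=float)
    order = np.argsort(known)
    sorted_known = known[order]
    # Binary-search each key, then confirm the hit is an exact match.
    idx = np.clip(np.searchsorted(sorted_known, keys), 0, len(sorted_known) - 1)
    matched = sorted_known[idx] == keys
    return np.where(matched, mapped[order][idx], default)


country = np.array(["US", "DE", "FR", "UK"])
shipping_cost = conditional_column(country, SHIPPING_BY_COUNTRY)
print(shipping_cost)
```

The lookup runs without a Python-level loop over rows, which matters when the column has millions of entries; "FR" has no entry in the mapping, so it falls back to the default.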

| Rows | Time | Speed |
| --- | --- | --- |
| 10K | 0.03s | 333K rows/sec |
| 100K | 0.26s | 385K rows/sec |
| 1M | 2.6s | 390K rows/sec |
| 10M | 26s | 390K rows/sec (streaming) |
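The 10M-row figure is labeled as streaming. The sketch below shows one common way to stream generated data: produce fixed-size batches from a generator and append each one to disk, so peak memory depends on the batch size instead of the total row count. The functions `iter_batches` and `stream_to_csv`, the column names, and the gamma-distributed amounts are all hypothetical; this is not Misata's streaming code.

```python
import csv
import os
import tempfile

import numpy as np


def iter_batches(total_rows, batch_size, seed=0):
    """Yield column dicts of at most `batch_size` rows until `total_rows` is reached."""
    rng = np.random.default_rng(seed)
    for start in range(0, total_rows, batch_size):
        n = min(batch_size, total_rows - start)
        yield {
            "id": np.arange(start + 1, start + n + 1),
            "amount": rng.gamma(2.0, 25.0, n).round(2),
        }


def stream_to_csv(path, batches):
    """Write each batch to `path` as it arrives and return the rows written."""
    written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for batch in batches:
            if written == 0:
                writer.writerow(batch.keys())  # header, once
            writer.writerows(zip(*(column.tolist() for column in batch.values())))
            written += len(next(iter(batch.values())))
    return written


path = os.path.join(tempfile.mkdtemp(), "payments.csv")
rows = stream_to_csv(path, iter_batches(total_rows=250_000, batch_size=100_000))
print(rows)
```

Only one batch is held in memory at a time, and IDs stay unique across batches because each batch starts numbering where the previous one ended.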

Try Misata in your browser without installing anything!

💼 Enterprise & Consulting

Need help with complex scenarios?

🏢 Custom enterprise data schemas (10M+ rows)
🔧 Integration with your existing pipelines
📊 Industry-specific realistic data generation
🎓 Training and onboarding for your team

📧 Contact: [email protected]

MIT License

Built by Muhammed Rasin

Misata - From story to synthetic database in one command.