Show HN: Misata – synthetic data engine using LLM and vectorized NumPy

Original link: https://github.com/rasinmuhammed/misata

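## Misata: Realistic Synthetic Data Generation

Misata is a tool that generates realistic multi-table datasets directly from natural language descriptions, with no schema design or training data required. Describe the data you need (for example, "E-commerce with products and orders") and Misata automatically generates a relational database with an appropriate schema, relationships, and business constraints.

Key features include automatic schema generation, relational integrity, support for large datasets (10M+ rows via streaming), and the ability to define custom business rules. Misata uses large language models (LLMs), through providers such as Groq, OpenAI, and Ollama (for local, private generation), to parse the descriptions.

Users can customize generation with options such as row count, a seed for reproducibility, and noise injection for added realism. Advanced features include temporal drift simulation and custom column overrides. Misata is available as a command-line tool and offers enterprise solutions for complex scenarios and integration with existing pipelines. It is MIT-licensed and built by Muhammed Rasin.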

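## Misata: LLM-Based Synthetic Data Generation

Misata is a new open-source synthetic data engine that aims to overcome the limitations of existing tools such as Faker and Mimesis in relational and temporal data integrity. Created by rasinmuhammed (github.com/rasinmuhammed), Misata takes a two-layer approach: an LLM (Groq/Llama-3.3) interprets natural language rules that define data relationships, and a high-performance vectorized NumPy simulation layer generates the data itself.

Currently in early alpha, Misata generates roughly 250K rows per second on an M1 Air. It ensures referential integrity by building a dependency graph of the tables. The author is seeking feedback on the architecture, in particular on how to scale past the current 10M-row limit of in-memory Pandas DataFrames; DuckDB is being considered as a potential solution. One distinctive, experimental feature generates data from a description of a chart.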
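
The dependency-graph idea mentioned above can be illustrated with a minimal sketch. It is not Misata's implementation: the table names, relationships, and the `generation_order` and `generate_tables` helpers are hypothetical, and it uses only the Python standard library (`graphlib`, Python 3.9+). Parents are generated before children, so every foreign key is drawn from ids that already exist.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the parent tables its foreign keys point at.
# These names and relationships are illustrative, not Misata's actual output.
PARENTS = {
    "exercises": [],
    "plans": [],
    "users": [],
    "subscriptions": ["users", "plans"],
    "workouts": ["users", "exercises"],
}
ROW_COUNTS = {"exercises": 10, "plans": 5, "users": 50, "subscriptions": 45, "workouts": 500}


def generation_order(parents):
    """Return table names so that every parent precedes its children."""
    return list(TopologicalSorter(parents).static_order())


def generate_tables(parents, row_counts, seed=42):
    """Generate rows table by table, drawing foreign keys from existing parent ids."""
    rng = random.Random(seed)
    tables = {}
    for name in generation_order(parents):
        rows = []
        for row_id in range(1, row_counts[name] + 1):
            row = {"id": row_id}
            for parent in parents[name]:
                # The parent table is already generated, so the reference is always valid.
                row[f"{parent}_id"] = rng.choice(tables[parent])["id"]
            rows.append(row)
        tables[name] = rows
    return tables


tables = generate_tables(PARENTS, ROW_COUNTS)
user_ids = {row["id"] for row in tables["users"]}
print(all(w["users_id"] in user_ids for w in tables["workouts"]))
```

A cycle between tables would make `TopologicalSorter` raise `CycleError`, which is a convenient early check on a generated schema.
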

Original text

Generate realistic multi-table datasets from natural language.

No schema writing. No training data. Just describe what you need.


✨ What Makes Misata Different

Feature comparison (Faker, SDV, Misata):

  • Natural language input
  • Auto schema generation
  • Relational integrity
  • Business constraints
  • No training data needed
  • Streaming (10M+ rows)

With Groq

export GROQ_API_KEY=your_key  # Get free: https://console.groq.com
misata generate --story "A SaaS with 50K users, subscriptions, and payments" --use-llm

With OpenAI

export OPENAI_API_KEY=your_key
misata generate --story "E-commerce with products and orders" --use-llm --provider openai

With Ollama (Local, Free, Private)

ollama run llama3  # Start Ollama first
misata generate --story "Fitness app with workouts" --use-llm --provider ollama

$ misata generate --story "A fitness app with 50K users" --use-llm

🧠 Using Groq (llama-3.3-70b-versatile) for intelligent parsing...
✅ LLM schema generated successfully!

📋 Schema: FitnessApp
   Tables: 5
   Relationships: 4

🔧 Generating 5 table(s)...

   ✓ exercises     (10 rows)
   ✓ plans         (5 rows)
   ✓ users         (50,000 rows)
   ✓ subscriptions (45,000 rows)
   ✓ workouts      (500,000 rows)

⏱️  Generation time: 2.34 seconds
🚀 Performance: 213,675 rows/second
💾 Data saved to: ./generated_data

from misata import DataSimulator, SchemaConfig
from misata.llm_parser import LLMSchemaGenerator

# Generate schema from story
llm = LLMSchemaGenerator(provider="groq")  # or "openai", "ollama"
config = llm.generate_from_story(
    "A mobile fitness app with 50K users, workout tracking, "
    "premium subscriptions, and January signup spikes"
)

# Generate data
for table_name, batch in DataSimulator(config).generate_all():
    print(f"Generated {len(batch)} rows for {table_name}")
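
The loop above consumes one batch at a time, which is what makes streaming large tables possible without holding everything in memory. The sketch below shows that pattern in isolation using only the standard library. The `generate_batches` and `stream_to_csv` helpers and the `duration_min` column are hypothetical and are not part of Misata's API.

```python
import csv
import io
import random


def generate_batches(total_rows, batch_size, seed=42):
    """Yield lists of rows, never holding more than batch_size rows at once."""
    rng = random.Random(seed)
    next_id = 1
    while next_id <= total_rows:
        size = min(batch_size, total_rows - next_id + 1)
        yield [
            {"id": next_id + i, "duration_min": rng.randint(10, 120)}
            for i in range(size)
        ]
        next_id += size


def stream_to_csv(batches, handle):
    """Write batches to an open text handle and return the number of rows written."""
    writer = None
    written = 0
    for batch in batches:
        if writer is None:
            # Create the writer lazily so the header comes from the first batch.
            writer = csv.DictWriter(handle, fieldnames=list(batch[0]))
            writer.writeheader()
        writer.writerows(batch)
        written += len(batch)
    return written


buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
rows_written = stream_to_csv(generate_batches(total_rows=25_000, batch_size=10_000), buffer)
print(rows_written)
```

Peak memory is bounded by `batch_size` rather than by the table size, at the cost of not being able to apply whole-table operations to the output in one pass.
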

# Basic generation (rule-based, no API key needed)
misata generate --story "SaaS company with users and subscriptions"

# LLM-powered generation
misata generate --story "..." --use-llm

# Specify provider and model
misata generate --story "..." --use-llm --provider ollama --model llama3

# Custom output directory
misata generate --story "..." --use-llm --output-dir ./my_data

# Set row count
misata generate --story "..." --use-llm --rows 100000

# Reproducible with seed
misata generate --story "..." --use-llm --seed 42
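
How Misata wires `--seed` through its generators is not shown here. As a general illustration of seeded reproducibility, the sketch below uses a private `random.Random` instance: the same seed yields identical rows, and a different seed yields different ones. The `generate_users` helper and its columns are hypothetical.

```python
import random


def generate_users(n, seed):
    """Generate n toy user rows from a private generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    plans = ["free", "basic", "premium"]
    return [
        {"id": i, "plan": rng.choice(plans), "age": rng.randint(18, 80)}
        for i in range(1, n + 1)
    ]


first = generate_users(1000, seed=42)
second = generate_users(1000, seed=42)
other = generate_users(1000, seed=7)
print(first == second)  # same seed, same data
print(first == other)   # different seed, different data
```

Note that with `--use-llm` the schema itself comes from a model response, so a seed alone may not pin down the full output unless the schema is also fixed.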

🎯 Business Rule Constraints

Define rules like "employees can't log >8 hours/day":

from misata import Constraint, Table

timesheets = Table(
    name="timesheets",
    row_count=10000,
    constraints=[
        Constraint(
            name="max_daily_hours",
            type="sum_limit",
            group_by=["employee_id", "date"],
            column="hours",
            value=8.0,
            action="redistribute"
        )
    ]
)
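
To make the `sum_limit` idea concrete, here is a minimal standalone sketch. It assumes that `action="redistribute"` means scaling the values in an over-limit group proportionally so the group total equals the limit; the source does not specify this, and Misata may behave differently. The `enforce_sum_limit` helper is hypothetical.

```python
from collections import defaultdict


def enforce_sum_limit(rows, group_by, column, limit):
    """Scale `column` down within each group whose total exceeds `limit`.

    Proportional scaling is one possible reading of action="redistribute";
    the real behaviour may differ.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in group_by)] += row[column]

    adjusted = []
    for row in rows:
        total = totals[tuple(row[k] for k in group_by)]
        factor = limit / total if total > limit else 1.0
        adjusted.append({**row, column: row[column] * factor})
    return adjusted


timesheets = [
    {"employee_id": 1, "date": "2024-01-02", "hours": 6.0},
    {"employee_id": 1, "date": "2024-01-02", "hours": 6.0},
    {"employee_id": 1, "date": "2024-01-03", "hours": 7.5},
    {"employee_id": 2, "date": "2024-01-02", "hours": 3.0},
]
for row in enforce_sum_limit(timesheets, ["employee_id", "date"], "hours", 8.0):
    print(row)
```

Only the first two rows share a group whose total (12 hours) exceeds the limit, so only they are scaled; groups already under the limit are left unchanged.
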

| Provider | Env Variable | Free Tier | Notes |
|---|---|---|---|
| Groq | GROQ_API_KEY | ✅ 30 req/min | Fastest, recommended |
| OpenAI | OPENAI_API_KEY | | Best quality |
| Ollama | None | ✅ Local | Private, no internet |

📈 Extending Data Pools

from misata import TextGenerator

# Add custom names
TextGenerator.extend_pool("first_names", ["Arjun", "Priya", "Rahul"])

# Load from file
TextGenerator.load_pools_from_file("custom_pools.json")

# Save for reuse
TextGenerator.save_pools_to_file("expanded_pools.json")

Make synthetic data look closer to real-world data with noise injection:

from misata import add_noise, NoiseInjector

# Quick noise injection
noisy_df = add_noise(df,
    null_rate=0.05,      # 5% missing values
    outlier_rate=0.02,   # 2% statistical outliers
    typo_rate=0.01,      # 1% typos in text
    duplicate_rate=0.03, # 3% duplicate rows
    seed=42
)

# Advanced: Temporal distribution drift
injector = NoiseInjector(seed=42)
df = injector.apply_temporal_drift(df, 
    date_column="created_at",
    value_column="revenue", 
    drift_rate=0.15,      # 15% increase over time
    drift_direction="up"
)
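
The two operations above can be sketched without the library to show what they do to the data. This is an illustration under assumptions, not Misata's code: `inject_nulls` and `apply_linear_drift` are hypothetical helpers, and the drift is assumed to be a linear scale factor growing from 1.0 at the earliest date to `1 + drift_rate` at the latest.

```python
import random
from datetime import date


def inject_nulls(rows, column, null_rate, seed=42):
    """Return copies of rows with roughly null_rate of `column` values set to None."""
    rng = random.Random(seed)
    return [
        {**row, column: None} if rng.random() < null_rate else dict(row)
        for row in rows
    ]


def apply_linear_drift(rows, date_column, value_column, drift_rate):
    """Scale values by 1.0 at the earliest date up to 1 + drift_rate at the latest."""
    if not rows:
        return []
    ordinals = [row[date_column].toordinal() for row in rows]
    start = min(ordinals)
    span = max(max(ordinals) - start, 1)  # avoid dividing by zero for single-day data
    return [
        {**row, value_column: row[value_column] * (1 + drift_rate * (o - start) / span)}
        for row, o in zip(rows, ordinals)
    ]


rows = [{"created_at": date(2024, 1, 1 + i), "revenue": 100.0} for i in range(30)]
drifted = apply_linear_drift(rows, "created_at", "revenue", drift_rate=0.15)
noisy = inject_nulls(drifted, "revenue", null_rate=0.05)
print(drifted[0]["revenue"], drifted[-1]["revenue"])
```

Applying drift before null injection keeps the arithmetic away from `None` values; the reverse order would need an explicit null check.
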

from misata import Customizer, ColumnOverride
import numpy as np

customizer = Customizer(seed=42)

# Custom age distribution (realistic, not uniform)
customizer.add_override("users", ColumnOverride(
    name="age",
    generator=lambda n: np.random.normal(35, 12, n).clip(18, 80).astype(int)
))

# Conditional values based on other columns
customizer.add_conditional("orders", "shipping_cost", {
    "country": {"US": 5.99, "UK": 9.99, "DE": 7.99}
})

# Apply to generated data
df = customizer.apply(df, "users")
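
For readers who want to see what the two overrides above amount to, here is a standalone sketch with the standard library only. The `clipped_normal_ages` and `add_conditional` helpers are hypothetical and are not Misata's API. Unlike the lambda above, which draws from NumPy's global random state, this version takes its own seed, so the ages are reproducible regardless of other code.

```python
import random


def clipped_normal_ages(n, mean=35, std=12, low=18, high=80, seed=42):
    """Draw n integer ages from a normal distribution, clamped to [low, high]."""
    rng = random.Random(seed)
    return [int(min(max(rng.gauss(mean, std), low), high)) for _ in range(n)]


def add_conditional(rows, target, source, mapping, default=None):
    """Return copies of rows with `target` looked up from each row's `source` value."""
    return [{**row, target: mapping.get(row[source], default)} for row in rows]


ages = clipped_normal_ages(1000)
orders = [{"id": 1, "country": "US"}, {"id": 2, "country": "DE"}, {"id": 3, "country": "FR"}]
shipping = {"US": 5.99, "UK": 9.99, "DE": 7.99}
for row in add_conditional(orders, "shipping_cost", "country", shipping):
    print(row)
```

Clamping piles the tails up at the boundary values (here 18 and 80), and countries missing from the mapping receive the default, so both are worth checking when the result feeds downstream tests.
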

| Rows | Time | Speed |
|---|---|---|
| 10K | 0.03s | 333K rows/sec |
| 100K | 0.26s | 385K rows/sec |
| 1M | 2.6s | 390K rows/sec |
| 10M | 26s | 390K rows/sec (streaming) |

Try Misata in your browser on Google Colab without installing anything!

💼 Enterprise & Consulting

Need help with complex scenarios?

  • 🏢 Custom enterprise data schemas (10M+ rows)
  • 🔧 Integration with your existing pipelines
  • 📊 Industry-specific realistic data generation
  • 🎓 Training and onboarding for your team

📧 Contact: [email protected]

MIT License

Built by Muhammed Rasin


Misata - From story to synthetic database in one command.
