Show HN: Spam classifier in Go using Naive Bayes

Original link: https://github.com/igomez10/nspammer

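## Naive Bayes spam classifier in Go

This Go package, `nspammer`, implements a Naive Bayes classifier for text-based spam detection. It applies Bayes' theorem with the naive independence assumption, and uses Laplace smoothing so that words unseen during training do not produce zero probabilities. The classifier is trained on a dataset of labeled messages (spam / not spam) and exposes a simple API for classifying new text. During training it computes the prior probabilities of the spam and non-spam classes along with per-class word frequencies. Classification compares log probabilities to decide whether a message is more likely spam or not spam. The package includes tests on simple examples and on a Kaggle spam dataset to evaluate accuracy on real data. It can be installed with `go get github.com/igomez10/nspammer`.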

Show HN: Spam classifier in Go using Naive Bayes (github.com/igomez10)
18 points by igomeza 4 hours ago | 3 comments

cipherself, 1 hour ago:
12 (13?) years ago I also wrote a Naive Bayes classifier, in Perl: https://github.com/cipherself/NaiveBayes_perl. If I remember correctly, the next item on the TODO list was to add vectorization. Like the author's, it uses log probabilities to avoid floating-point underflow.

esafak, 2 hours ago:
https://www.paulgraham.com/better.html

leetrout, 1 hour ago:
Can you add a license so we know how we can use it?

Original text

A Naive Bayes spam classifier implemented in Go. It uses the Naive Bayes algorithm with Laplace smoothing to classify text messages as spam or not spam.

- Naive Bayes Classification: Uses probabilistic classification based on Bayes' theorem with naive independence assumptions
- Laplace Smoothing: Implements additive smoothing to handle zero probabilities for unseen words
- Training & Classification: Simple API for training on labeled datasets and classifying new messages
- Real Dataset Testing: Includes tests with actual spam/ham email datasets

```sh
go get github.com/igomez10/nspammer
```

```go
package main

import (
	"fmt"

	"github.com/igomez10/nspammer"
)

func main() {
	// Create training dataset (map[string]bool where true = spam, false = not spam)
	trainingData := map[string]bool{
		"buy viagra now":        true,
		"get rich quick":        true,
		"meeting at 3pm":        false,
		"project update report": false,
	}

	// Create and train classifier
	classifier := nspammer.NewSpamClassifier(trainingData)

	// Classify new messages
	isSpam := classifier.Classify("buy now")
	fmt.Printf("Is spam: %v\n", isSpam)
}
```

`NewSpamClassifier(dataset map[string]bool) *SpamClassifier`

Creates a new spam classifier and trains it on the provided dataset. The dataset is a map where keys are text messages and values indicate whether the message is spam (true) or not spam (false).

`(*SpamClassifier).Classify(input string) bool`

Classifies the input text as spam (true) or not spam (false) based on the trained model.

The classifier uses the Naive Bayes algorithm:

Training Phase:

- Calculates prior probabilities: P(spam) and P(not spam)
- Builds a vocabulary from all training messages
- Counts word occurrences in spam and non-spam messages
- Stores word frequencies for likelihood calculations

Classification Phase:

- Calculates log probabilities to avoid numerical underflow
- Computes: log(P(spam)) + Σ log(P(word|spam))
- Computes: log(P(not spam)) + Σ log(P(word|not spam))
- Returns true (spam) if the spam score is higher

Laplace Smoothing:

- Adds a smoothing constant to avoid zero probabilities for unseen words
- Formula: P(word|class) = (count + α) / (total + α × vocabulary_size)
- Default α = 1.0

The project includes support for the Kaggle Spam Mails Dataset. To download it:

This script requires the Kaggle CLI to be installed and configured.

Run the test suite:

The tests include:

- Simple classification examples
- Real-world email dataset evaluation
- Accuracy measurements on train/test splits