What I learned while trying to build a production-ready nearest neighbor system

Original link: https://github.com/thatipamula-jashwanth/smart-knn

## SmartKNN: An Enhanced Nearest-Neighbor Algorithm

SmartKNN is a modern refinement of the classic k-nearest-neighbors (KNN) algorithm, designed to improve accuracy, robustness, and scalability. Unlike traditional KNN, which treats all features equally, SmartKNN **learns feature importance** using methods such as MSE relevance, mutual information, or random-forest importance, effectively suppressing irrelevant dimensions.

It uses **adaptive neighbor search**, offering both brute-force and approximate nearest-neighbor (ANN) options for large datasets, with optional GPU acceleration. SmartKNN supports both **regression and classification** tasks, using distance-weighted voting and automatic data preprocessing (NaN handling, normalization, and more).

SmartKNN exposes a scikit-learn-compatible API and is optimized with NumPy/Numba. It prioritizes practical performance and maintains a stable API (the v2.x series). It is released under the MIT License, and research collaboration is encouraged.

A developer shared the experience of building a production-grade nearest-neighbor system, challenging their initial assumption that scikit-learn's KNN implementation had already "solved" the problem. They found that beyond basic KNN there are many variants and optimizations focused on CPU performance, predictability, and deployability.

Key takeaways include the importance of feature importance and pruning, the practical impact of the curse of dimensionality, and the need for careful scaling and normalization. They found that in real applications, inference time and memory footprint often matter more than marginal accuracy gains.

The project involved learned feature weighting, vectorized distance computation, and approximate neighbor search. The developer emphasizes that no algorithm is universally "best"; performance depends heavily on data characteristics and constraints.

They are seeking feedback on CPU vectorization, experience with large-scale deployment, and research on metric learning and scalable distance methods, and share their implementation on GitHub to encourage discussion.
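The vectorized distance computation mentioned above can be sketched in plain NumPy; the function name and shapes here are illustrative, not taken from the SmartKNN codebase:

```python
import numpy as np

def pairwise_sq_dists(X, Q):
    # ||q - x||^2 expanded as ||q||^2 - 2 q·x + ||x||^2, so the whole
    # query-vs-reference distance matrix is computed with no Python loops.
    return (Q ** 2).sum(axis=1)[:, None] - 2.0 * (Q @ X.T) + (X ** 2).sum(axis=1)[None, :]

rng = np.random.default_rng(0)
X = rng.random((1000, 16))   # reference points
Q = rng.random((5, 16))      # query points

D = pairwise_sq_dists(X, Q)  # shape (5, 1000)

# Select the k nearest neighbors per query with argpartition,
# which is O(n) per row instead of a full O(n log n) sort.
k = 3
knn_idx = np.argpartition(D, k, axis=1)[:, :k]
```

The `argpartition` trick matters at scale: for brute-force search over large reference sets, avoiding a full sort of every distance row is a cheap, CPU-friendly win.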

Original article

SmartKNN logo

A modern, weighted nearest-neighbor learning algorithm with learned feature importance and adaptive neighbor search.


SmartKNN is a nearest-neighbor–based learning method that belongs to the broader KNN family of algorithms.

It is designed to address common limitations observed in classical KNN approaches, including:

  • uniform treatment of all features
  • sensitivity to noisy or weakly informative dimensions
  • limited scalability as dataset size grows

SmartKNN incorporates data-driven feature importance estimation, dimension suppression, and adaptive neighbor search strategies. Depending on dataset characteristics, it can operate using either a brute-force search or an approximate nearest-neighbor (ANN) backend, while exposing a consistent, scikit-learn–compatible API.

The method supports both regression and classification tasks and prioritizes robustness, predictive accuracy, and practical inference latency across a range of dataset sizes.
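To make "scikit-learn-compatible API" concrete, here is a minimal sketch of a distance-weighted KNN regressor following sklearn's fit/predict conventions; the class name and parameters are illustrative assumptions, not SmartKNN's actual interface:

```python
import numpy as np

class WeightedKNNRegressor:
    """Minimal sketch of an sklearn-style weighted-KNN regressor.

    Illustrative only: the real SmartKNN API, defaults, and weighting
    scheme may differ.
    """

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # KNN is a lazy learner: "fitting" just stores the training data.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y, dtype=float)
        return self  # sklearn convention: fit returns self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared Euclidean distances, queries x training points.
        d2 = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=-1)
        idx = np.argpartition(d2, self.n_neighbors, axis=1)[:, : self.n_neighbors]
        nd = np.take_along_axis(d2, idx, axis=1)
        w = 1.0 / (nd + 1e-12)  # inverse-distance weights (epsilon avoids /0)
        return (w * self.y_[idx]).sum(axis=1) / w.sum(axis=1)

# Usage sketch on toy data
rng = np.random.default_rng(1)
X_train = rng.random((200, 4))
y_train = X_train.sum(axis=1)
model = WeightedKNNRegressor(n_neighbors=5).fit(X_train, y_train)
preds = model.predict(X_train[:3])
```

Because the estimator follows the fit/predict contract, it can drop into sklearn pipelines and cross-validation tooling; that interchangeability is the practical payoff of API compatibility.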


  • Learned feature weighting
    • MSE relevance
    • Mutual Information
    • Random Forest importance
      (method configurable depending on task and dataset)
  • Automatic preprocessing
    • normalization
    • NaN / Inf handling
    • feature masking
  • Distance-weighted neighbor voting
  • Brute-force and ANN backends
    • designed to scale to large datasets (hardware and tuning dependent)
    • optional GPU-accelerated neighbor search
  • Vectorized NumPy with Numba acceleration
  • Scikit-learn–compatible API
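As one concrete sketch of learned feature weighting: a simple relevance score (here, each feature's squared correlation with the target, an illustrative stand-in rather than SmartKNN's exact "MSE relevance" method) can be used to rescale the feature space before distances are computed, suppressing noisy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
# Only the first two features carry signal; the rest are noise.
y = 3.0 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

# Relevance score: squared Pearson correlation of each feature with y.
xc = X - X.mean(axis=0)
yc = y - y.mean()
r2 = (xc.T @ yc) ** 2 / ((xc ** 2).sum(axis=0) * (yc ** 2).sum())

w = r2 / r2.sum()            # normalized feature weights
X_weighted = X * np.sqrt(w)  # sqrt so squared distances scale by w per dimension
```

Euclidean distances in `X_weighted` now behave like a weighted metric: irrelevant features contribute almost nothing, which is the core idea behind suppressing weakly informative dimensions in a nearest-neighbor search.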


Detailed documentation and design notes are maintained externally. This repository README is intentionally kept concise.


Runnable examples are available in the examples/ directory:

python examples/regression_example.py
python examples/classification_example.py

  • Comprehensive benchmark suites for regression and classification
  • GitHub Actions CI for tests and benchmarks
  • Reproducible, engineering-focused evaluation

Benchmark details are documented in benchmarks/README.md.


  • SmartKNN v2 is stable
  • API is frozen for the v2.x series (backward-compatible improvements only)
  • Actively maintained
  • Open to research and engineering collaboration

SmartKNN is released under the MIT License. See LICENSE for details.
