What I learned while trying to build a production-ready nearest neighbor system

Original link: https://github.com/thatipamula-jashwanth/smart-knn

## SmartKNN: An Enhanced Nearest-Neighbor Algorithm

SmartKNN is a modern refinement of the classic k-nearest-neighbors (KNN) algorithm, designed to improve accuracy, robustness, and scalability. Unlike traditional KNN, which treats all features equally, SmartKNN **learns feature importance** using methods such as MSE relevance, mutual information, or random-forest importance, effectively suppressing irrelevant dimensions.

It uses **adaptive neighbor search**, offering both brute-force and approximate nearest-neighbor (ANN) options for large datasets, with optional GPU acceleration. SmartKNN supports both **regression and classification** tasks, using distance-weighted voting and automatic data preprocessing (NaN handling, normalization, and more).

SmartKNN exposes a scikit-learn-compatible API and is optimized with NumPy/Numba. It prioritizes practical performance and maintains a stable API (the v2.x series). It is released under the MIT License, and research collaboration is encouraged.

A developer shared the experience of building a production-grade nearest-neighbor system, challenging their initial assumption that scikit-learn's KNN implementation had already "solved" the problem. They found that beyond basic KNN there are many variants and optimizations focused on CPU performance, predictability, and deployability.

Key takeaways include the importance of feature importance and pruning, the practical impact of the curse of dimensionality, and the need for careful scaling and normalization. They found that in real applications, inference time and memory footprint often matter more than marginal accuracy gains.

The project involved learned feature weighting, vectorized distance computation, and approximate neighbor search. The developer emphasizes that no algorithm is universally "best"; performance depends heavily on data characteristics and constraints.

They are seeking feedback on CPU vectorization, experience with large-scale deployment, and research on metric learning and scalable distance methods, and share their implementation on GitHub to encourage discussion.
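The vectorized distance computation mentioned above can be sketched in plain NumPy; the function name and shapes here are illustrative, not taken from the SmartKNN codebase:

```python
import numpy as np

def pairwise_sq_dists(X, Q):
    # ||q - x||^2 expanded as ||q||^2 - 2 q·x + ||x||^2, so the whole
    # query-vs-reference distance matrix is computed with no Python loops.
    return (Q ** 2).sum(axis=1)[:, None] - 2.0 * (Q @ X.T) + (X ** 2).sum(axis=1)[None, :]

rng = np.random.default_rng(0)
X = rng.random((1000, 16))   # reference points
Q = rng.random((5, 16))      # query points

D = pairwise_sq_dists(X, Q)  # shape (5, 1000)

# Select the k nearest neighbors per query with argpartition,
# which is O(n) per row instead of a full O(n log n) sort.
k = 3
knn_idx = np.argpartition(D, k, axis=1)[:, :k]
```

The `argpartition` trick matters at scale: for brute-force search over large reference sets, avoiding a full sort of every distance row is a cheap, CPU-friendly win.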

Original article

SmartKNN logo

A modern, weighted nearest-neighbor learning algorithm with learned feature importance and adaptive neighbor search.


SmartKNN is a nearest-neighbor–based learning method that belongs to the broader KNN family of algorithms.

It is designed to address common limitations observed in classical KNN approaches, including:

  • uniform treatment of all features
  • sensitivity to noisy or weakly informative dimensions
  • limited scalability as dataset size grows

SmartKNN incorporates data-driven feature importance estimation, dimension suppression, and adaptive neighbor search strategies. Depending on dataset characteristics, it can operate using either a brute-force search or an approximate nearest-neighbor (ANN) backend, while exposing a consistent, scikit-learn–compatible API.

The method supports both regression and classification tasks and prioritizes robustness, predictive accuracy, and practical inference latency across a range of dataset sizes.
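To make "scikit-learn-compatible API" concrete, here is a minimal sketch of a distance-weighted KNN regressor following sklearn's fit/predict conventions; the class name and parameters are illustrative assumptions, not SmartKNN's actual interface:

```python
import numpy as np

class WeightedKNNRegressor:
    """Minimal sketch of an sklearn-style weighted-KNN regressor.

    Illustrative only: the real SmartKNN API, defaults, and weighting
    scheme may differ.
    """

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # KNN is a lazy learner: "fitting" just stores the training data.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y, dtype=float)
        return self  # sklearn convention: fit returns self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared Euclidean distances, queries x training points.
        d2 = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=-1)
        idx = np.argpartition(d2, self.n_neighbors, axis=1)[:, : self.n_neighbors]
        nd = np.take_along_axis(d2, idx, axis=1)
        w = 1.0 / (nd + 1e-12)  # inverse-distance weights (epsilon avoids /0)
        return (w * self.y_[idx]).sum(axis=1) / w.sum(axis=1)

# Usage sketch on toy data
rng = np.random.default_rng(1)
X_train = rng.random((200, 4))
y_train = X_train.sum(axis=1)
model = WeightedKNNRegressor(n_neighbors=5).fit(X_train, y_train)
preds = model.predict(X_train[:3])
```

Because the estimator follows the fit/predict contract, it can drop into sklearn pipelines and cross-validation tooling; that interchangeability is the practical payoff of API compatibility.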


  • Learned feature weighting
    • MSE relevance
    • Mutual Information
    • Random Forest importance
      (method configurable depending on task and dataset)
  • Automatic preprocessing
    • normalization
    • NaN / Inf handling
    • feature masking
  • Distance-weighted neighbor voting
  • Brute-force and ANN backends
    • designed to scale to large datasets (hardware and tuning dependent)
    • optional GPU-accelerated neighbor search
  • Vectorized NumPy with Numba acceleration
  • Scikit-learn–compatible API
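As one concrete sketch of learned feature weighting: a simple relevance score (here, each feature's squared correlation with the target, an illustrative stand-in rather than SmartKNN's exact "MSE relevance" method) can be used to rescale the feature space before distances are computed, suppressing noisy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
# Only the first two features carry signal; the rest are noise.
y = 3.0 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

# Relevance score: squared Pearson correlation of each feature with y.
xc = X - X.mean(axis=0)
yc = y - y.mean()
r2 = (xc.T @ yc) ** 2 / ((xc ** 2).sum(axis=0) * (yc ** 2).sum())

w = r2 / r2.sum()            # normalized feature weights
X_weighted = X * np.sqrt(w)  # sqrt so squared distances scale by w per dimension
```

Euclidean distances in `X_weighted` now behave like a weighted metric: irrelevant features contribute almost nothing, which is the core idea behind suppressing weakly informative dimensions in a nearest-neighbor search.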


Detailed documentation and design notes are maintained externally. This repository README is intentionally kept concise.


Runnable examples are available in the examples/ directory:

python examples/regression_example.py
python examples/classification_example.py

  • Comprehensive benchmark suites for regression and classification
  • GitHub Actions CI for tests and benchmarks
  • Reproducible, engineering-focused evaluation

Benchmark details are documented in benchmarks/README.md.


  • SmartKNN v2 is stable
  • API is frozen for the v2.x series (backward-compatible improvements only)
  • Actively maintained
  • Open to research and engineering collaboration

SmartKNN is released under the MIT License. See LICENSE for details.
