Feature Extraction with KNN

原始链接: https://davpinto.github.io/fastknn/articles/knn-extraction.html

The `fastknn` package provides a function called `knnExtract` for feature extraction with K-nearest neighbors. The technique generates new features from the distances between a data point and its *k* nearest neighbors *within each class*: for each class it computes the distance to the 1st, 2nd, 3rd, ..., *k*-th nearest neighbor, yielding *k* * *c* new features (where *c* is the number of classes). To prevent overfitting, the training features are generated with cross-validation, and the procedure can be parallelized via the `nthread` parameter. The approach is inspired by the winning solution of the Otto Group Product Classification Challenge on Kaggle and effectively maps nonlinear data into a space where it is linearly separable. Demonstrations on the Ionosphere, chess, and spirals datasets show that KNN-extracted features can substantially improve model accuracy compared to the original features alone (e.g., from 83.81% to 95.24% with a GLM on the Ionosphere data), especially on datasets with complex relationships. The `knnDecision` function visualizes how the KNN features produce cleaner decision boundaries.


Original Article

The fastknn package provides a function, knnExtract(), to do feature extraction using KNN. It generates k * c new features, where c is the number of class labels. The new features are computed from the distances between the observations and their k nearest neighbors inside each class, as follows:

  1. The first test feature contains the distances between each test instance and its nearest neighbor inside the first class.
  2. The second test feature contains the sums of distances between each test instance and its 2 nearest neighbors inside the first class.
  3. The third test feature contains the sums of distances between each test instance and its 3 nearest neighbors inside the first class.
  4. And so on.

This procedure repeats for each class label, generating k * c new features. Then, the new training features are generated using an n-fold CV approach, in order to avoid overfitting. Parallelization is available; you can specify the number of threads via the nthread parameter.
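To make the construction concrete, here is a minimal sketch of the test-feature computation in plain R. It assumes Euclidean distance and a factor ytr, and it skips the cross-validated training features and the parallelization that knnExtract() handles internally; toy_knn_features() is a hypothetical illustration, not part of fastknn:

## Minimal sketch of the per-class distance features
## (assumptions: Euclidean distance, test features only, no CV)
toy_knn_features <- function(xtr, ytr, xte, k = 3) {
   feats <- lapply(levels(ytr), function(cl) {
      x.cl <- xtr[ytr == cl, , drop = FALSE]
      ## Test-to-train distances, restricted to class `cl`
      d <- as.matrix(dist(rbind(xte, x.cl)))
      d <- d[seq_len(nrow(xte)), -seq_len(nrow(xte)), drop = FALSE]
      ## Feature j = sum of the distances to the j nearest neighbors in `cl`
      t(apply(d, 1, function(di) cumsum(sort(di)[seq_len(k)])))
   })
   do.call(cbind, feats) # k features per class, k * c columns in total
}

On the two-class Ionosphere data used below, a call like toy_knn_features(x.tr, y.tr, x.te, k = 3) would yield k * c = 6 columns, matching the shape of the test features returned by knnExtract().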

The feature extraction technique proposed here is based on the ideas presented in the winning solution of the Otto Group Product Classification Challenge on Kaggle.

The following example shows that the KNN features carry information about the original data that cannot be extracted by a linear learner, like a GLM model:

library("mlbench")
library("caTools")
library("fastknn")
library("glmnet")

#### Load data
data("Ionosphere", package = "mlbench")
x <- data.matrix(subset(Ionosphere, select = -Class))
y <- Ionosphere$Class

#### Remove near zero variance columns
x <- x[, -c(1,2)]

#### Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]

#### GLM with original features
glm <- glmnet(x = x.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, x.te, type = "class"))
yhat1 <- factor(yhat, levels = levels(y.tr))

#### Generate KNN features
set.seed(123)
new.data <- knnExtract(xtr = x.tr, ytr = y.tr, xte = x.te, k = 3)
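## new.data is a list holding the train and test features: new.tr and new.te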

#### GLM with KNN features
glm <- glmnet(x = new.data$new.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, new.data$new.te, type = "class"))
yhat2 <- factor(yhat, levels = levels(y.tr))

#### Performance
sprintf("Accuracy with original features: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat1)))
sprintf("Accuracy with KNN features: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat2)))
## [1] "Accuracy with original features: 83.81"
## [1] "Accuracy with KNN features: 95.24"

For a more complete example, take a look at this Kaggle Kernel showing how knnExtract() performs on a large dataset.

KNN makes a nonlinear mapping of the original space and projects it into a new one in which the classes are linearly separable.

Mapping the chess dataset

library("caTools")
library("fastknn")
library("ggplot2")
library("gridExtra")

## Load data
data("chess")
x <- data.matrix(chess$x)
y <- chess$y

## Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]

## Feature extraction with KNN
set.seed(123)
new.data <- knnExtract(x.tr, y.tr, x.te, k = 1)

## Decision boundaries
g1 <- knnDecision(x.tr, y.tr, x.te, y.te, k = 10) +
   labs(title = "Original Features")
g2 <- knnDecision(new.data$new.tr, y.tr, new.data$new.te, y.te, k = 10) +
   labs(title = "KNN Features")
grid.arrange(g1, g2, ncol = 2)

Mapping the spirals dataset

## Load data
data("spirals")
x <- data.matrix(spirals$x)
y <- spirals$y

## Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]

## Feature extraction with KNN
set.seed(123)
new.data <- knnExtract(x.tr, y.tr, x.te, k = 1)

## Decision boundaries
g1 <- knnDecision(x.tr, y.tr, x.te, y.te, k = 10) +
   labs(title = "Original Features")
g2 <- knnDecision(new.data$new.tr, y.tr, new.data$new.te, y.te, k = 10) +
   labs(title = "KNN Features")
grid.arrange(g1, g2, ncol = 2)
