A Visual Introduction to Machine Learning (2015)

原始链接: https://r2d3.us/visual-intro-to-machine-learning-part-1/

## Decision Trees and Finding the Best Boundary

This article explains how decision trees classify data in machine learning, specifically how they distinguish New York homes from San Francisco homes. The core idea is to find the best "boundary", or *split point*, that separates the two groups using features such as elevation and price. An initial elevation boundary of 240 ft is considered, but a histogram shows that most homes sit at far *lower* elevations. Choosing a split point involves tradeoffs: a higher split point risks *false negatives* (San Francisco homes misclassified as New York ones), while a lower one produces *false positives*. The "best" split point maximizes the homogeneity within each branch, making each group as "pure" as possible. The process is not one-shot: through *recursion*, the algorithm repeatedly splits the resulting subsets using different features (such as price per square foot) to refine the classification and build a more accurate tree. Even the best split point is imperfect, which underscores how hard it is to fully separate real data.


Original article

Finding better boundaries

Let's revisit the 240-ft elevation boundary proposed previously to see how we can improve upon our intuition.

Clearly, this requires a different perspective.


By transforming our visualization into a histogram, we can better see how frequently homes appear at each elevation.

While the highest home in New York is ~240 ft, the majority of them seem to have far lower elevations.
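The histogram view can be sketched in a few lines: bucket home elevations into fixed-width bins and count how many homes land in each. The elevations and bin width below are invented toy values, not the article's data:

```python
# Bucket elevations into 60-ft bins and count homes per bin.
# The elevations are invented toy values for illustration.

elevations = [3, 8, 12, 20, 25, 33, 47, 60, 110, 238]  # toy NY homes (ft)

bin_width = 60
counts = {}
for e in elevations:
    lo = (e // bin_width) * bin_width   # lower edge of this home's bin
    counts[lo] = counts.get(lo, 0) + 1

for lo in sorted(counts):
    print(f"{lo:>3}-{lo + bin_width:<3} ft: {'#' * counts[lo]} ({counts[lo]})")
```

Even on this toy data the same pattern appears: most homes cluster in the lowest bin, with only a straggler near the top.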


Your first fork

A decision tree uses if-then statements to define patterns in data.

For example, if a home's elevation is above some number, then the home is probably in San Francisco.


In machine learning, these statements are called forks, and they split the data into two branches based on some value.

That value between the branches is called a split point. Homes to the left of that point get categorized in one way, while those to the right are categorized in another. A split point is the decision tree's version of a boundary.
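A fork is just an if-then statement wrapped around a split point. A minimal sketch in Python, using the article's 240-ft value; the example inputs are invented:

```python
# One "fork": homes left of the split point get one label,
# homes right of it get the other. 240 ft follows the article;
# the example elevations are invented.

def classify(elevation_ft, split_point=240):
    """Above the split point -> San Francisco, otherwise -> New York."""
    if elevation_ft > split_point:
        return "San Francisco"
    return "New York"

print(classify(10))    # a low-lying home
print(classify(350))   # a hillside home
```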


Tradeoffs

Picking a split point has tradeoffs. Our initial split (~240 ft) incorrectly classifies some San Francisco homes as New York ones.

Look at that large slice of green in the left pie chart; those are all the San Francisco homes that are misclassified. These are called false negatives.


However, a split point meant to capture every San Francisco home will include many New York homes as well. These are called false positives.
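The tradeoff can be made concrete by counting both error types for a candidate split point, treating San Francisco as the positive class as the pie charts do. The (elevation, city) pairs below are invented toy data:

```python
# Count false negatives (SF homes below the split) and false
# positives (NY homes above it) for a candidate split point.
# The toy (elevation, city) pairs are invented.

homes = [(15, "New York"), (80, "New York"), (230, "New York"),
         (5, "San Francisco"), (150, "San Francisco"), (400, "San Francisco")]

def confusion(split_point):
    fn = sum(1 for elev, city in homes
             if city == "San Francisco" and elev <= split_point)
    fp = sum(1 for elev, city in homes
             if city == "New York" and elev > split_point)
    return fn, fp

print(confusion(240))  # high split: SF homes below it become false negatives
print(confusion(0))    # low split: NY homes above it become false positives
```

Sliding the split point down trades false negatives for false positives; no single value on this feature eliminates both.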


The best split

At the best split, the results of each branch should be as homogeneous (or pure) as possible. There are several mathematical methods you can choose between to calculate the best split.
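The article doesn't name a particular purity measure, so as one common choice, here is a sketch using Gini impurity: scan the candidate split points and keep the one with the lowest weighted impurity across the two branches. The toy data is invented and deliberately not perfectly separable:

```python
# Find the split point minimizing weighted Gini impurity of the
# two branches. Gini impurity is one common purity measure; the
# article doesn't specify which it uses. Toy data invented.

def gini(labels):
    """Gini impurity for two classes: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = labels.count("San Francisco") / len(labels)
    return 2 * p * (1 - p)

def best_split(points):
    """points: list of (elevation, city). Returns (split, impurity)."""
    best = None
    for split in sorted({elev for elev, _ in points}):
        left = [city for elev, city in points if elev <= split]
        right = [city for elev, city in points if elev > split]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
        if best is None or score < best[1]:
            best = (split, score)
    return best

homes = [(15, "New York"), (80, "New York"), (230, "New York"),
         (150, "San Francisco"), (300, "San Francisco"), (400, "San Francisco")]
print(best_split(homes))   # -> (80, 0.25)
```

Note the winning split still has nonzero impurity: the San Francisco home at 150 ft sits below it, mirroring the article's point that even the best single-feature split is imperfect.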


As we see here, even the best split on a single feature does not fully separate the San Francisco homes from the New York ones.



Recursion

To add another split point, the algorithm repeats the process above on the subsets of data. This repetition is called recursion, and it is a concept that appears frequently in training models.
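The recursive step can be sketched as: find the best split, partition the data, and call the same procedure on each subset until a branch is pure. This self-contained sketch uses Gini impurity as the purity measure (one common choice; the article doesn't name one) on invented toy data:

```python
# Recursively build a decision tree on one feature: split, partition,
# and recurse on each branch until it is pure. Toy data invented.

def gini(labels):
    if not labels:
        return 0.0
    p = labels.count("San Francisco") / len(labels)
    return 2 * p * (1 - p)

def weighted_impurity(points, s):
    left = [c for e, c in points if e <= s]
    right = [c for e, c in points if e > s]
    return len(left) * gini(left) + len(right) * gini(right)

def build_tree(points):
    labels = [c for _, c in points]
    if gini(labels) == 0.0:               # pure branch: stop recursing
        return {"leaf": labels[0]}
    candidates = sorted({e for e, _ in points})
    split = min(candidates, key=lambda s: weighted_impurity(points, s))
    left = [(e, c) for e, c in points if e <= split]
    right = [(e, c) for e, c in points if e > split]
    if not left or not right:             # no useful split: majority leaf
        return {"leaf": max(set(labels), key=labels.count)}
    return {"split": split,
            "left": build_tree(left),
            "right": build_tree(right)}

homes = [(15, "New York"), (80, "New York"), (230, "New York"),
         (150, "San Francisco"), (300, "San Francisco"), (400, "San Francisco")]
print(build_tree(homes))
```

Each recursive call sees only its branch's subset, so the best split it finds can differ from the root's, which is exactly the behavior described next.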

The histograms to the left show the distribution of each subset, repeated for each variable.


The best split will vary based on which branch of the tree you are looking at.

For lower elevation homes, price per square foot, at X dollars per sqft, is the best variable for the next if-then statement. For higher elevation homes, it is price, at Y dollars.

