A little comparison between R and Kap


Original link: https://blog.dhsdevelopments.com/a-little-comparison-between-r-and-kap

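This article explores how R and Kap differ when processing data, using the examples from a blog post that compares Pandas (Python) with R. The author reimplements those examples in Kap to highlight the approach each language takes. While the Kap solutions are generally more concise, R benefits from helpful defaults, such as automatically parsing data types when reading a CSV, which have to be handled explicitly in Kap. For example, loading a CSV file in R automatically recognizes numeric columns, whereas Kap initially reads everything as strings and needs separate steps to define the column headers and convert the data types. Common operations such as summing a column or grouping by country are possible in both, but Kap requires more to be specified directly. The author demonstrates tasks such as computing totals, applying discounts and removing outliers, showing off Kap's powerful array operations. In the end, the choice between R and Kap (or Pandas) comes down to personal preference: R prioritizes convenience through defaults, while Kap offers a more explicit and possibly more efficient approach.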


Original text

Some time ago, I read this article: Why pandas feels clunky when coming from R. In it, the author explains why they feel that R is a much smoother tool than Pandas.

I'm not familiar with Pandas, but I do know a bit of R, so when I recently implemented some new features in Kap, I decided that reimplementing the examples from the blog post in Kap might be a good way to demonstrate the differences between the languages.

Spoiler: the Kap solutions are shorter, but R has some nice defaults that have to be specified explicitly in Kap. At the end of the day, it all comes down to individual preference.

Loading the dataset

In R, the function read_csv is used to load CSV data. This function automatically parses things that look like numeric values as numbers, while the corresponding function in Kap returns strings. The Kap function also makes no attempt to process the column headers.

    purchases ← io:readCsv "purchases.csv"
┌→──────────────────────────────┐
↓  "country" "amount" "discount"│
│      "USA"   "2000"       "10"│
│      "USA"   "3500"       "15"│
│      "USA"   "3000"       "20"│
│   "Canada"    "120"       "12"│
│   "Canada"    "180"       "18"│
│   "Canada"   "3100"       "21"│
...
└───────────────────────────────┘

So, the first thing we want to do is to remove the first row and use it as column labels. The simplest way to do this is to combine these using a fork:

    purchases ← (>1↑)«labels»(1↓) purchases
┌───────────┬──────┬────────┐
│    country│amount│discount│
├→──────────┴──────┴────────┤
↓      "USA" "2000"     "10"│
│      "USA" "3500"     "15"│
│      "USA" "3000"     "20"│
│   "Canada"  "120"     "12"│
│   "Canada"  "180"     "18"│
│   "Canada" "3100"     "21"│
...
└───────────────────────────┘

All the above does is take the first row (1↑) and turn it into a 1-dimensional array of strings (using >), then drop the first row (using 1↓), and finally pass these two arrays to labels, which constructs the final result.
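For readers who have not met forks before, the idea can be sketched in plain Python. This is only an illustration of the semantics, not of how Kap implements it: the names fork, first_row_as_list, drop_first_row and make_labelled are made up for this sketch, and the table contains just the rows visible in the output above.

```python
# Hypothetical sketch of a fork: apply two functions to the same argument
# and combine their results with a third, two-argument function.
def fork(left, middle, right):
    return lambda x: middle(left(x), right(x))

# Only the rows visible in the output above; every cell is still a string.
raw = [
    ["country", "amount", "discount"],
    ["USA", "2000", "10"],
    ["USA", "3500", "15"],
    ["USA", "3000", "20"],
    ["Canada", "120", "12"],
    ["Canada", "180", "18"],
    ["Canada", "3100", "21"],
]

def first_row_as_list(table):
    # Rough counterpart of taking the first row as a 1-dimensional array.
    return list(table[0])

def drop_first_row(table):
    # Rough counterpart of 1↓: everything except the header row.
    return table[1:]

def make_labelled(names, rows):
    # Stand-in for labels: pair the column names with the data rows.
    return {"labels": names, "rows": rows}

purchases = fork(first_row_as_list, make_labelled, drop_first_row)(raw)
print(purchases["labels"])     # ['country', 'amount', 'discount']
print(len(purchases["rows"]))  # 6
```

The point of the fork is that the input is named only once, while both the header and the body are derived from it.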

We still have to convert the strings into numbers. The function to do that is ⍎, but we don't want to call it on the first column. This is achieved by applying the parsing under (⍢) a drop of the first column, so that only the remaining columns are parsed and the result is put back in place:

purchases ← ⍎¨⍢(0 1↓) purchases

Now we have the data in the correct format. Perhaps there should be a variant of readCsv that can do all of this automatically. After all, it's a common enough operation that R does it automatically.
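The effect of under can be approximated in Python as "select a part, transform it, and put the result back where it came from". The sketch below assumes the six visible rows, uses int as a stand-in for ⍎, and relies on a hypothetical helper called parse_under_drop_first_column. It mimics the outcome of the Kap expression, not how Kap implements ⍢.

```python
# Only the data rows visible in the output above, still as strings.
rows = [
    ["USA", "2000", "10"],
    ["USA", "3500", "15"],
    ["USA", "3000", "20"],
    ["Canada", "120", "12"],
    ["Canada", "180", "18"],
    ["Canada", "3100", "21"],
]

def parse_under_drop_first_column(table):
    """Parse every cell except those in the first column (hypothetical helper)."""
    result = []
    for row in table:
        kept, selected = row[:1], row[1:]          # "drop the first column"
        parsed = [int(cell) for cell in selected]  # parse each selected cell
        result.append(kept + parsed)               # put the parsed part back
    return result

purchases = parse_under_drop_first_column(rows)
print(purchases[0])  # ['USA', 2000, 10]
```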

Taking the total sum

This is simple enough. Just take the values in the amount column and do a reduction over add:

+/ purchases.amount

Grouping

Kap provides the group function and the key operator for grouping. Here we can use key:

    purchases.country +/⌸ purchases.amount
┌→───────────────┐
↓      "USA" 8500│
│   "Canada" 3400│
│       "UK"  480│
│   "France"  500│
│  "Germany"  570│
│"Australia"  600│
│    "Italy"  630│
│    "Spain"  660│
│    "Japan"  690│
│    "India"  720│
│   "Brazil"  460│
└────────────────┘

Deducting the discount is just a calculation prior to the grouping:

purchases.country +/⌸ -/purchases[;1 2]

The above selects the second and third columns (amount and discount) and performs a reduction over minus along each row. Since each row of the selection holds just two values, the reduction simply yields the amount minus the discount.
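As a rough cross-check of what the grouped expression computes, here is the same calculation as a plain Python loop. It is a sketch under the assumption that only the six rows shown earlier are present, so it covers USA and Canada only.

```python
# (country, amount, discount) for the rows visible in the earlier output.
rows = [
    ("USA", 2000, 10),
    ("USA", 3500, 15),
    ("USA", 3000, 20),
    ("Canada", 120, 12),
    ("Canada", 180, 18),
    ("Canada", 3100, 21),
]

totals = {}
for country, amount, discount in rows:
    net = amount - discount                         # the per-row reduction over minus
    totals[country] = totals.get(country, 0) + net  # the per-group sum

print(totals)  # {'USA': 8455, 'Canada': 3349}
```

Without the subtraction, the same loop gives 8500 for USA and 3400 for Canada, which matches the first two rows of the grouped output shown earlier.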

Removing outliers

When removing outliers, we will rely on selection. This just means that we'll create a bitmap of the elements we want to keep, and filter out the rest using ⌿.

Here's how we create the bitmap:

    (10×stat:median)⍛> purchases.amount
┌→──────────────────────────────────────────────────────────────┐
│1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1│
└───────────────────────────────────────────────────────────────┘
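The same mask-and-filter idea in Python, as a small sketch. The first six amounts are the ones shown earlier; the remaining five are made up for illustration, chosen only so that the threshold falls between 2000 and 3000 as it does in the output above.

```python
from statistics import median

# First six values come from the visible rows; the last five are hypothetical.
amounts = [2000, 3500, 3000, 120, 180, 3100, 200, 250, 300, 280, 260]

threshold = 10 * median(amounts)                     # median is 280, so 2800
mask = [1 if threshold > a else 0 for a in amounts]  # 1 means "keep"
kept = [a for a, keep in zip(amounts, mask) if keep] # the filtering step

print(mask)  # [1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1]
print(kept)  # [2000, 120, 180, 200, 250, 300, 280, 260]
```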

And putting it all together:

{⍵.country +/⌸ -/⍵[;1 2]} ((10×stat:median)⍛> purchases.amount)⌿purchases

In the final example, the median is supposed to be computed within each country rather than over the whole dataset. This just moves the filter inside the grouping function:

purchases.country {+/ ((10×stat:median)⍛> ⍵.amount) ⌿ -/⍵}⌸ purchases[;1 2]
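To make the difference from the global filter concrete, here is a Python sketch of the per-country version. It assumes only the six rows shown earlier, so the result covers USA and Canada only.

```python
from statistics import median

# (country, amount, discount) for the rows visible in the earlier output.
rows = [
    ("USA", 2000, 10),
    ("USA", 3500, 15),
    ("USA", 3000, 20),
    ("Canada", 120, 12),
    ("Canada", 180, 18),
    ("Canada", 3100, 21),
]

# Group the rows by country first.
by_country = {}
for country, amount, discount in rows:
    by_country.setdefault(country, []).append((amount, discount))

# Then filter inside each group, using that group's own median.
totals = {}
for country, group in by_country.items():
    threshold = 10 * median(a for a, _ in group)
    totals[country] = sum(a - d for a, d in group if threshold > a)

print(totals)  # {'USA': 8455, 'Canada': 270}
```

With these rows, the Canadian purchase of 3100 is dropped because it exceeds ten times the Canadian median of 180, while all three USA purchases stay because the USA median is 3000.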