Moldova broke our data pipeline

Original link: https://www.avraam.dev/blog/moldova-broke-our-pipeline

The data pipeline failed due to an unexpected issue: the Shopify API returned the country name "Moldova, Republic of". The DMS replication process wrote CSV files to S3 for loading into Redshift without quoting field values, so Redshift misread the comma inside the country name as a new column, corrupting the data and failing the pipeline.

The initial fix (correcting the data in the source database) was only temporary, since Shopify would resend the problematic data. A better solution was to sanitize the country name in the sync job, replacing the comma with a hyphen. The most robust approach, however, is to fix the problem at the transport layer: switch from CSV to Parquet (which is schema-aware and handles commas natively), *and* sanitize data at the point of ingestion. This keeps the pipeline resilient and the stored data clean.

The key takeaway is to handle data validation and normalization at the boundary, where data enters the system, rather than patching problems downstream, because unexpected edge cases (such as country names containing commas) will inevitably appear.

Hacker News discussion (4 points, by almoner, 2 comments):

waiwai933: If you control the pipeline and can change the delimiter, as mentioned in the article, then you could also quote the fields rather than switch to Parquet.

nivertech: Use TSV instead of CSV by default.

Original article

This is one of those funny stories about a bug you want to remember to tell your grandkids before bed (omg, what a terrible grandpa I'm going to be). Anyway, a few days ago our DMS replication started failing. Redshift was rejecting rows. The pipeline was choking. Nothing had changed in the pipeline, so how on earth was this happening?

The culprit? Moldova. A land between Romania and Ukraine that's essentially the European version of a hidden track on a CD, a country in "intense flirting" stage with the European Union while the ex-boyfriend (Vladimir Putin) fights to win her back. So the problem was Moldova. More specifically: Moldova, Republic of, the official Shopify name for the country, complete with a comma sitting right in the middle of it.

Moldova on the map

Here's what was happening. A background job was consuming merchants' Delivery Countries from the Shopify API, and AWS DMS was replicating some of that data to another database for analytics. DMS replicates our RDS tables to Redshift by writing CSV files to S3 and then issuing a COPY command. The problem is that DMS wasn't quoting field values. So a row that should look like:

id,zoneId,"Moldova, Republic of",MD,2026-02-27 09:15:01

was being written as:

id,zoneId,Moldova, Republic of,MD,2026-02-27 09:15:01

Redshift sees 6 columns instead of 5. MD ends up in the createdAt field. Redshift screams. Pipeline dies.
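To see the failure concretely, here's a minimal sketch (not DMS's actual parser, just a naive comma split, which is effectively what happens when field values aren't quoted):

```javascript
// Naive comma split: what an unquoted CSV row degenerates into.
function parseUnquoted(row) {
  return row.split(",");
}

// Schema: id, zoneId, name, code, createdAt  (5 columns)
const row = "id,zoneId,Moldova, Republic of,MD,2026-02-27 09:15:01";
const fields = parseUnquoted(row);

console.log(fields.length); // 6 columns instead of 5
console.log(fields[4]);     // "MD" lands in the createdAt slot (index 4)
```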

Thank God for Claude, these days it can help tremendously with debugging by chewing through tons of logs.

Diagram showing how the unquoted comma in Moldova, Republic of breaks the Redshift COPY pipeline

The wrong fix

The obvious move is to rename the record in the source database. Takes 3 seconds, pipeline recovers, everyone goes home happy.

Except the sync job runs on a schedule, and Shopify will happily send Moldova, Republic of right back on the next run, restoring the chaos. You've fixed the symptom, not the cause, and you've introduced a silent landmine that resets itself.

The "kinda ok" solution

The kinda ok solution is to sanitize the name at write time in the sync job itself, before the data ever touches the database. Replace , with - at the boundary:

name: country.name.replace(/, /g, " - ")

Moldova, Republic of becomes Moldova - Republic of. Perfectly readable, CSV-safe, and idempotent: every sync writes the same clean value regardless of what Shopify sends. You can run it a thousand times and the result is the same.
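Wrapped as a tiny helper (the function name is ours, not the sync job's actual code), the idempotency is easy to check:

```javascript
// Hypothetical helper mirroring the one-liner above. Replacing ", "
// with " - " means the stored value can never add a CSV column.
function sanitizeCountryName(name) {
  return name.replace(/, /g, " - ");
}

const once = sanitizeCountryName("Moldova, Republic of");
const twice = sanitizeCountryName(once);

console.log(once);           // "Moldova - Republic of"
console.log(once === twice); // true: re-running the sync is a no-op
```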

The real real fix

There are two other ways to fix the issue. The first is to change the CSV delimiter. DMS lets you configure the delimiter on the S3 target endpoint via extra connection attributes. Swap the comma for a pipe or a tab (something unlikely to appear in your data) and embedded commas stop being a problem entirely. The catch is that it's hard to know for certain which special characters may appear in your data.
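If you go the delimiter route, the change lives on the DMS S3 target endpoint. A sketch of the extra connection attributes (attribute names per the AWS DMS S3 target settings; the pipe is just an example choice):

```text
csvDelimiter=|;csvRowDelimiter=\n
```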

The second, more thorough option is to stop using CSV altogether and switch to Parquet. Parquet is columnar and schema-aware. There's no delimiter character, so embedded commas, newlines, quotes — all of it just works. Redshift's COPY command supports Parquet natively and actually performs better loading it than CSV.
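Switching the format is likewise an endpoint-level change. A sketch, assuming the AWS DMS S3 target's `dataFormat` setting and Redshift's native Parquet support (the table name, bucket path, and IAM role below are placeholders):

```text
dataFormat=parquet;parquetVersion=parquet-2-0
```

```sql
COPY delivery_countries
FROM 's3://your-dms-bucket/path/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET;
```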

Both of these fix the transport layer, which is the right abstraction for a transport problem. But they don't change what's in RDS. If anything reads from the database directly (another job, a report, an API), it still gets Moldova, Republic of. The sync-job sanitization fixes the data, not just the delivery format.

The best version is both: switch DMS to Parquet so the pipeline is resilient to whatever comes through, and still sanitize at the sync boundary so the stored data is clean. Each layer handles what it owns.

Sanitize at the boundary

There's a broader pattern here. External data sources (Shopify, third-party APIs, user input) will always find the exact edge case your infrastructure doesn't handle. They don't do it maliciously. They just don't know (or care) about your CSV pipeline, your column count, your COPY command. They're just sending you the data they have.

The temptation is to fix these things downstream. Patch the COPY command config, add a cleanup step in the warehouse, write a one-time migration script. These all work until the next Moldova shows up, and there's always a next Moldova.

The right place to deal with external data is at the boundary, the moment you first ingest it. Validate it, normalize it, sanitize it there, not two services later when the damage is already done and the error message is "invalid column count on row 4,721."
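As a sketch of what boundary normalization can look like (names here are illustrative, not the post's actual code): clean every external record exactly once, at ingestion, before it reaches the database or any downstream reader.

```javascript
// Normalize an external record at the ingestion boundary: trim
// whitespace, neutralize CSV-hostile commas, canonicalize the code.
function normalizeDeliveryCountry(raw) {
  return {
    name: String(raw.name ?? "").trim().replace(/, /g, " - "),
    code: String(raw.code ?? "").trim().toUpperCase(),
  };
}

const record = normalizeDeliveryCountry({ name: " Moldova, Republic of", code: "md " });
console.log(record.name); // "Moldova - Republic of"
console.log(record.code); // "MD"
```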

It’s a strange world when a sovereign nation’s formal title is the main character in your error logs. Moldova, if you’re reading this, I’ll support your EU bid, no matter how you spell your name.
