How to automate Instagram engagements with computer vision (and get banned)

Original link: https://blog.florianherrengt.com/how-to-automate-instagram-engagements.html

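Automating interactions with dynamic web apps such as Instagram is hard because their constantly changing DOM structure makes traditional scraping techniques extremely brittle. A more reliable alternative is to treat the interface the way a human does: use computer vision to work with pixels rather than code. From a screenshot, the system identifies "landmarks" (such as a post's menu icon and action bar) to pin down a precise search region. This removes the noise of the full screen and lets the script adapt to different post layouts. Inside the cropped region, sliding-window template matching detects the heart icons, and a vertical alignment filter eliminates false positives by identifying the fixed column in which hearts appear. Once the coordinates are known, the system simulates natural human movement to click the target. Although this approach effectively bypasses platform-specific obfuscation and DOM changes, it remains vulnerable to behavioral bot detection. Ultimately, the experiment shows that by treating the screen as a canvas for computer vision, beyond the limits of APIs and selectors, any visual user interface can be automated.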

Obviously, Instagram does not want you to automate engagement. Their HTML is a mess of randomly generated class names and deeply nested divs. The structure changes every deployment. Any script that relies on DOM selectors breaks within weeks because the class name doesn't exist anymore.

But it doesn't matter anyway. Instagram can obfuscate their code all they want because code is for machines. But UI... The UI is for humans. A heart icon has to look like a heart icon. A comment button has to be where users expect it. The layout has to be consistent enough that a person can easily navigate it.

So instead of fighting the DOM, let's just bypass it entirely. Take a screenshot. Find the heart by its visual appearance. Get its coordinates. Move the cursor there. Click. Done.

This works on anything that renders to pixels. Web apps, native apps, games, terminals. If a human can see it and click it, a computer can too. No selectors, no APIs, no platform-specific hooks. Just computer vision and cursor automation.

Unfortunately, you can't just hardcode a position. Things move around all the time. A long caption pushes the action bar down. A location tag adds a line. A carousel of images takes up more vertical space. Every post compresses or expands the layout differently.

Navigate between 2 posts and watch what happens to the hearts' position:

Computer vision solves this. Instead of guessing where the hearts should be, you look at the screen and find where they actually are.

Too Much Screen, Too Many False Positives

The naive approach is simple: take the heart icon as a template and find it on the screen. Wherever it matches, that's a heart. It's the most basic computer vision operation you can do.

It doesn't work very well.

A full screenshot is huge. Over 7 million pixels on a typical screen. And a heart is small, roughly 70x60 pixels. That's a lot of surface area to search. In that sea of pixels there are plenty of things that vaguely resemble a heart. You get too many false positives.

The detection technique is fine. The search space is the problem. The screen is full of noise. The more area you search, the more noise you find.

Shrink the Search Space

The fix is to stop searching the whole screen. Instead, find things on screen that are easy to detect and use them to figure out where the hearts must be.

On Instagram, two things are consistently easy to find:

The triple-dots menu (⋯) in the top-right corner of every post. It's small, high-contrast and visually distinctive. It's always present.
The action bar (like/comment/share) at the bottom of every post. It's a wide, predictable pattern of icons.

Both can be found with basic template matching in milliseconds. But we don't care about them for their own sake. We care about what they tell us: the triple-dots sits directly above the heart column. The action bar sits directly below it. If we know where those two landmarks are, we know exactly where the hearts are. They're in the vertical strip between them.

crop.x      = triple_dots.x
crop.y      = triple_dots.y + triple_dots.height
crop.width  = triple_dots.width
crop.height = action_bar.y - crop.y - action_bar.height * 0.2
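
The same geometry can be written as a small runnable function. This is only a sketch of the formula above: the `Box` type, the `heart_search_region` name, and the landmark coordinates in the example are made up for illustration, and the 0.2 margin is taken from the pseudocode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in screen pixels; (x, y) is the top-left corner."""
    x: int
    y: int
    width: int
    height: int


def heart_search_region(triple_dots: Box, action_bar: Box, margin: float = 0.2) -> Box:
    """Return the strip below the triple-dots menu and above the action bar."""
    top = triple_dots.y + triple_dots.height
    # Stop slightly above the action bar (a fraction of its height).
    height = int(action_bar.y - top - action_bar.height * margin)
    if height <= 0:
        raise ValueError("action bar must sit below the triple-dots menu")
    return Box(triple_dots.x, top, triple_dots.width, height)


# Hypothetical landmark positions, for illustration only.
dots = Box(x=1400, y=200, width=60, height=40)
bar = Box(x=900, y=1240, width=560, height=100)
region = heart_search_region(dots, bar)
```

Because the region is computed from wherever the landmarks were found, a longer caption or a taller image only changes the inputs, not the function.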

The only things left in the search region are actual hearts and whatever happens to be in that exact column.

And since the crop region is derived from the actual positions of the landmarks on screen rather than being hardcoded, it adapts to every post automatically. The landmarks might be higher or lower depending on the post content, but the geometric relationship between them and the hearts is always the same.

The Sliding Window

Now that the search space is tiny and clean, you can run the sliding window. Take the heart template and slide it across the search region pixel by pixel. Score every position. The better the match, the more likely it's a heart.
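
To make "slide and score" concrete, here is a minimal pure-Python sketch. It assumes grayscale images stored as nested lists and scores each position with a normalized sum of absolute differences; the post does not say which scoring metric it uses, and the function names and the 0.8 threshold are illustrative. It also matches at a single scale only.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels, 0-255, indexed as image[row][col]


def match_score(image: Image, template: Image, left: int, top: int) -> float:
    """Score one window position: 1.0 is a pixel-perfect match, 0.0 the worst case."""
    rows, cols = len(template), len(template[0])
    total_diff = sum(
        abs(image[top + r][left + c] - template[r][c])
        for r in range(rows)
        for c in range(cols)
    )
    return 1.0 - total_diff / (255 * rows * cols)


def sliding_window(image: Image, template: Image,
                   threshold: float = 0.8) -> List[Tuple[int, int, float]]:
    """Return (x, y, score) for every window position scoring at least `threshold`."""
    rows, cols = len(template), len(template[0])
    hits = []
    for top in range(len(image) - rows + 1):
        for left in range(len(image[0]) - cols + 1):
            score = match_score(image, template, left, top)
            if score >= threshold:
                hits.append((left, top, score))
    return hits


# Tiny synthetic example: a 2x2 pattern hidden in a 4x4 image.
template = [[255, 0],
            [0, 255]]
image = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 255, 0, 0],
         [0, 0, 255, 0]]
hits = sliding_window(image, template)
```

On real screenshots a library routine such as OpenCV's `cv2.matchTemplate` does the same job far faster. Lowering `threshold` is what makes the search "loose".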

The sliding window is deliberately loose to catch every possible heart. But that means it also catches things that aren't hearts.

Hearts on Instagram are all in one vertical column. Every single one. Most detections will be on that line. Anything not on it is an outlier:

·   ·   ·   ·   ·   ·   ·   ·
·   ·   ·   ·   ·   ♡   ·   ·     ← most detections are here
·   ·   ·   ·   ·   ·   ·   ·
·   ·   ·   ·   ·   ♡   ·   ·     ← same column
·   ✕   ·   ·   ·   ·   ·   ·     ← outlier (off to the left)
·   ·   ·   ·   ·   ♡   ·   ·     ← same column
·   ·   ·   ·   ·   ·   ·   ·
·   ·   ·   ·   ·   ♡   ·   ·     ← same column
·   ·   ·   ·   ·   ·   ·   ·
·   ·   ·   ·   ·   ·   ✕   ·     ← outlier (off to the right)
·   ·   ·   ·   ·   ♡   ·   ·     ← same column
·   ·   ·   ·   ·   ·   ·   ·

The hearts (♡) cluster on one X coordinate. The false positives (✕) are scattered. The sliding window thought they looked heart-shaped, but they're not in the column.

So we find the most common X among all detections, the consensus line, and discard anything more than 10 pixels away. It's just finding the mode of the X values and treating everything else as noise. A few lines of code and nearly all false positives are gone. The sliding window was deliberately loose to catch every possible heart. This filter is tight to remove everything that isn't one.
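
A minimal sketch of that filter, assuming detections arrive as `(x, y)` tuples; the function name and the sample coordinates are hypothetical, while the 10-pixel tolerance comes from the text above.

```python
from collections import Counter
from typing import List, Tuple


def filter_by_consensus_column(points: List[Tuple[int, int]],
                               tolerance: int = 10) -> List[Tuple[int, int]]:
    """Keep only detections whose X lies within `tolerance` px of the most common X."""
    if not points:
        return []
    # The mode of the X values is the consensus line.
    consensus_x, _ = Counter(x for x, _ in points).most_common(1)[0]
    return [(x, y) for x, y in points if abs(x - consensus_x) <= tolerance]


# Hypothetical detections: five in the heart column, two outliers.
detections = [(500, 100), (500, 300), (120, 380), (503, 460),
              (500, 620), (610, 780), (500, 860)]
hearts = filter_by_consensus_column(detections)
```

Here the mode is 500, so the points at x=120 and x=610 are dropped while x=503 survives the tolerance check.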

The Full Detection Pipeline

Take screenshot
  → Find triple-dots and action bar via template matching
  → Calculate crop region from their positions
  → Crop to a 60-pixel-wide strip
  → Run sliding window template matching at multiple scales
  → Vertical alignment filter
  → Deduplicate, sort top to bottom
  → Return [{x, y}, ...]
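
The last two steps might look like the sketch below. The post does not describe how duplicates are merged, so this assumes a simple rule: keep the best-scoring hit and drop any other hit within `min_gap` pixels vertically. That rule is reasonable only because the alignment filter has already put every hit in one column. The function name and the 30-pixel gap are hypothetical.

```python
from typing import Dict, List, Tuple


def dedupe_and_sort(hits: List[Tuple[int, int, float]],
                    min_gap: int = 30) -> List[Dict[str, int]]:
    """Keep the best-scoring hit in each vertical cluster, then order top to bottom."""
    kept: List[Tuple[int, int, float]] = []
    # Visit hits from best to worst score so the strongest one in a cluster wins.
    for x, y, score in sorted(hits, key=lambda h: h[2], reverse=True):
        if all(abs(y - kept_y) >= min_gap for _, kept_y, _ in kept):
            kept.append((x, y, score))
    return [{"x": x, "y": y} for x, y, _ in sorted(kept, key=lambda h: h[1])]


# Hypothetical hits: the same two hearts detected several times each.
raw_hits = [(500, 100, 0.91), (500, 102, 0.97), (501, 99, 0.88),
            (500, 300, 0.93), (500, 301, 0.90)]
targets = dedupe_and_sort(raw_hits)
```

Overlapping matches are expected because the window moves one pixel at a time and runs at several scales, so neighbouring positions tend to score almost as well as the true one.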

The explainer video shows this whole process: