One million (small web) screenshots

Original link: https://nry.me/posts/2025-10-09/small-web-screenshots/

Inspired by onemillionscreenshots.com, the author started a project to surface hidden gems of the "small web", the part of the internet beyond popular, heavily commercialized sites. Finding that the original site's dataset, sourced from the most popular domains in Common Crawl, skewed toward popularity (and therefore, arguably, lower quality), they set out to build a discovery tool that prioritizes substance over clicks. The core of the project is capturing website screenshots and mapping them visually with machine learning, using self-organizing maps (SOMs) for dimensionality reduction and assignment, a technique the author found surprisingly effective despite its simplicity. Models like DinoV3 proved too fine-grained, so a custom encoder trained with a triplet loss and focused on high-level aesthetic details was used instead. To further shape the map, a second SOM was trained on color distributions, letting color influence the layout alongside visual similarity. The result is a visual map of the small web, even if some mainstream sites inevitably slip in. With a newly crawled dataset of 250,000 additional domains, an update to the map is planned, reflecting the author's renewed appreciation for the elegance and power of SOMs.


Original article

Background

Last month I came across onemillionscreenshots.com and was pleasantly surprised at how well it worked as a tool for discovery. We all know the adage about judging book covers, but here, …it just kinda works. Skip over the sites with bright flashy colors begging for attention and instead, seek out the negative space in between.

The one nitpick I have though is in how they sourced the websites. They used the most popular websites from Common Crawl which is fine, but not really what I’m interested in…

There are of course exceptions, but the relationship between popularity and quality is loose. McDonald's isn't popular because it serves the best cheeseburger, it's popular because it serves a cheeseburger that meets the minimum level of satisfaction for the maximum number of people. It's a profit-maximizing local minimum on the cheeseburger landscape.

This isn’t limited to just food either, the NYT Best Sellers list, Spotify Top 50, and Amazon review volume are other good examples. For me, what’s “popular” has become a filter for what to avoid. Lucky for us though, there’s a corner of the internet where substance still outweighs click-through rates. A place that’s largely immune to the corrosive influence of monetization. It’s called the small web and it’s a beautiful place.

A small web variant

The timing of this couldn’t have been better. I’m currently working on a couple of tools specifically focused on small web discovery/recommendation and happen to already have most of the data required to pull this off. I just needed to take some screenshots, sooo… you’re welcome armchairhacker!

[full screen version: screenshots.nry.me]

Technical details

Because I plan on discussing how I gathered the domains in the near future, I’ll skip it for now (it’s pretty interesting). Suffice it to say though, once the domains are available, capturing the screenshots is trivial. And once those are ready, we have a fairly well worn path to follow:

  1. generate visual embeddings
  2. dimensionality reduction
  3. assignment

I find the last two steps particularly repetitive so I decided to combine them this time via self-organizing maps (SOMs). I tried using SOMs a few years ago to help solve a TSP problem (well, actually the exact opposite…) but ended up going in a different direction. Anyway, despite their trivial implementation they can be extremely useful. A bare bones SOM clocks in at about 10 lines with torch.

import numpy as np
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# globals the step function relies on (sizes and random data are illustrative placeholders):
#   x: (n_samples, dim) training embeddings, W: (rows, cols, dim) SOM weight grid
#   W_i / W_j: (rows, cols) row / column coordinate of every node
rows, cols, dim = 40, 40, 64
x = np.random.rand(1000, dim).astype(np.float32)
W = torch.rand(rows, cols, dim, device=DEVICE)
W_i, W_j = torch.meshgrid(
    torch.arange(rows, device=DEVICE, dtype=torch.float32),
    torch.arange(cols, device=DEVICE, dtype=torch.float32),
    indexing="ij",
)

@torch.no_grad()
def som_step(alpha, sigma):

    # pick a random training sample
    sample_index = np.random.randint(0, x.shape[0])
    D_t = torch.from_numpy(x[sample_index]).to(DEVICE)

    # find the BMU using cosine similarity
    bmu_flat_idx = torch.argmax(torch.nn.CosineSimilarity(dim=2)(D_t, W).flatten())

    # convert flat index to 2d coordinates (i, j)
    u_i = bmu_flat_idx // W.shape[1]
    u_j = bmu_flat_idx % W.shape[1]

    # compute the l2 distance between the bmu and all other neurons
    dists_u = torch.sqrt((W_i - u_i)**2 + (W_j - u_j)**2)

    # apply neighborhood smoothing (theta)
    theta = torch.exp(-(dists_u / sigma)**2)

    # update the weights in-place
    W.add_((theta * alpha).unsqueeze(2) * (D_t - W))

At their core, most SOMs have two elements: a monotonically decreasing learning rate and a neighborhood function with an influence (radius) that is also monotonically decreasing. During training, each step consists of the following:

  1. Randomly select a training sample.
  2. Compare this training sample against all nodes in the SOM. The node with the smallest (quantization) error becomes the BMU (best matching unit).
  3. Update the SOM node weights proportional to how far away from the BMU they are. Nodes closer to the BMU become more like the training sample.

There are numerous modifications that can be made, but that’s basically it! If I’ve piqued your interest, I highly recommend the book Self-Organizing Maps by Teuvo Kohonen, it’s a fairly quick read and covers the core aspects of SOMs.
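To make the decaying learning rate and neighborhood radius concrete, here's a minimal sketch of a driver loop for the som_step function above; the step count and the start/end values of the schedules are my own illustrative assumptions, not the settings used for the actual map.

# minimal training loop with exponentially decaying schedules
# (step count and start/end values are illustrative assumptions)
N_STEPS = 20_000
ALPHA_START, ALPHA_END = 0.5, 0.01    # learning rate
SIGMA_START, SIGMA_END = 20.0, 1.0    # neighborhood radius in grid units

for t in range(N_STEPS):
    frac  = t / N_STEPS
    alpha = ALPHA_START * (ALPHA_END / ALPHA_START) ** frac
    sigma = SIGMA_START * (SIGMA_END / SIGMA_START) ** frac
    som_step(alpha, sigma)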

With dimensionality reduction and assignment resolved, we just need the visual embeddings now. I started with the brand new DinoV3 model, but was left rather disappointed. The progression of Meta’s self-supervised vision transformers has been truly incredible, but the latent space captures waaay more information than what I actually need. I just want to encode the high level aesthetic details of webpage screenshots. Because of this, I fell back on an old friend: the triplet loss on top of a small encoder. The resulting output dimension of 64 afforded ample room for describing the visual range while maintaining a considerably smaller footprint.
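For anyone curious what that might look like, here's a minimal sketch of a small encoder trained with PyTorch's built-in triplet loss; only the 64-dimensional output comes from the post, while the architecture, margin, and input size are assumptions of mine.

import torch
import torch.nn as nn

# a small convolutional encoder mapping screenshots to 64-d embeddings
# (layer sizes and margin are illustrative; only the 64-d output is from the post)
class ScreenshotEncoder(nn.Module):
    def __init__(self, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, out_dim),
        )

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=1)

encoder = ScreenshotEncoder()
loss_fn = nn.TripletMarginLoss(margin=0.2)

# anchor / positive / negative batches of screenshot crops, shape (N, 3, H, W)
anchor, positive, negative = (torch.rand(8, 3, 256, 256) for _ in range(3))
loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()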

This got me 90% of the way there, but it was still lacking the visual layout I had envisioned. I wanted a stronger correlation with color at the expense of visual similarity. To achieve this, I had to manually enforce this bias by training two SOMs in parallel: one SOM operated on the encoder output (visual), the second on the color distribution, and the two were linked using the following:

When the quantization error is low, the BMU pulling force is dominated by the visual similarity. As quantization error increases, the pulling force due to visual similarity wanes and is slowly overpowered by the pulling force from the color distribution. In essence, the color distribution controls the macro placement while the visual similarity controls the micro placement. The only controllable hyperparameter with this approach is selecting a threshold for where the crossover point occurs.
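The post doesn't show code for this linkage, so the following is only a sketch of one way I'd interpret it: a pair of weights on the two pulling forces that shifts from the visual SOM to the color SOM as the visual quantization error crosses a threshold. The sigmoid blend, the sharpness, and the Q_CROSS value are all my own assumptions.

import torch

# hypothetical blending of the two "pulling forces" described above
# (Q_CROSS stands in for the crossover threshold; its value is made up)
Q_CROSS = 0.5

def pulling_weights(q_err, sharpness=10.0):
    # q_err: visual quantization error for the current sample (float or 0-d tensor)
    # returns (w_visual, w_color): low q_err -> visual dominates,
    # high q_err -> the color distribution takes over
    q_err = torch.as_tensor(q_err, dtype=torch.float32)
    w_color = torch.sigmoid(sharpness * (q_err - Q_CROSS))
    return 1.0 - w_color, w_color

# e.g. scale each SOM's update by its weight:
#   W_vis.add_(w_vis * (theta_vis * alpha).unsqueeze(2) * (D_vis - W_vis))
#   W_col.add_(w_col * (theta_col * alpha).unsqueeze(2) * (D_col - W_col))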

I didn’t spend much time trying to find the optimal point, it’s currently peak fall and well, I’d much rather be outside. A quick look at the overall quantization error (below left) and the U-matrix (below right) was sufficient.

There’s still a lot of cruft that slipped in (substack, medium.com, linkedin, etc…) but overall, I’d say it’s not too bad for a first pass. In the time since generating this initial map I’ve already crawled an additional ~250k new domains so I suppose this means I’ll be doing an update. What I do know for certain though is that self-organizing maps have earned a coveted spot in my heart for things that are simple to the point of being elegant and yet, deceptively powerful (the others of course being panel methods, LBM, Metropolis-Hastings, and the bicycle).
