I found a seashell in the middle of the desert

Original link: https://github.com/Hawzen/I-found-a-seashell-in-the-middle-of-the-desert#i-found-a-seashell-in-the-middle-of-the-desert

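After finding a mysterious shell-shaped fossil in the Alghat desert, the author set out to identify it with a homemade morphological analysis. The region was submerged under the sea during the Jurassic. Knowing that formal paleontology lay outside their expertise, the author built a computational pipeline that classifies the fossil by shape alone.

Using a dataset of nearly 60,000 shell images, the author normalized each sample's contour and applied Principal Component Analysis (PCA) to map it into a two-dimensional "latent space". This makes a mathematical comparison of shell shapes possible, and PCA also captured key features such as "pointiness" and symmetry.

The analysis showed that *Sphincterochila candidissima* is the species closest in shape to the fossil. Although the fossil probably dates from the Jurassic and *S. candidissima* appeared much later, the similarity between the two is striking. The author concludes that, while morphology has limits for determining lineage, the project shows how data science can offer insight into evolutionary patterns and convergent evolution. Interested readers can explore the resulting shell latent space tool online: https://shell.hawzen.me.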

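This Hacker News thread discusses a blog post by Hawzen. The author found a shell-like rock in the desert and tried to identify it using Principal Component Analysis (PCA). The discussion centers on the following themes:

* **Methodology versus AI:** Some users criticized the author for spending effort on a custom PCA pipeline instead of simply using a modern AI model or ChatGPT. Others saw the project as an educational "adventure" and noted that building the tool was more valuable than the final identification (which many commenters considered inaccurate).
* **Geological context:** Many participants pointed out that finding marine fossils in deserts is geologically common, since many inland regions were once covered by ancient seas (such as the Tethys Ocean).
* **Technical debate:** Skeptical users discussed whether the shell is a true fossil or merely a rock formation, while others questioned the validity of the author's 2D image analysis.

Ultimately, the thread is a critique of "solution-first" thinking. It highlights the tension between amateur engineering exploration and the efficiency of modern AI tools, while also confirming the geological fact that fossils exist in deserts.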
Original article

To my amazement, I found a fully solid rock that eerily resembles a seashell at the base of a cliff in the Alghat desert, Saudi Arabia. I didn't know what to make of it at first: it had the swirls and shape of a seashell but was solid rock. More importantly, it shouldn't be here; the nearest coastline is Dammam's, 500 km away.

Figure (ancient fossil, fossil location): This looks impossible

Carbonate rocks (e.g. limestone), marine fossils, coral fossils, and sedimentary structures (like ripples or bioturbation) all exist in and around Alghat, which points to the fact that parts of the Arabian Peninsula were once submerged under the sea, specifically in the Late Jurassic (~150 million years ago)[1].

Figure: Stratigraphic distribution of areas near Najd[1]

Nevertheless, I was still super curious about the fossil I found: what animal inhabited it? What did it look like back in the Jurassic? Does it have any modern relatives or lookalikes?

The proper way to answer these questions is a detailed analysis of the fossil (e.g. inspecting the sediment it was found in, its shape, etc.), done by an expert paleontologist. However, I know no paleontology and no paleontologists, so I figured I could DIY it (how hard could it be..?), strictly via its shape, or what's called its morphology. Morphology alone is probably not accurate enough to discern lineage, since species from different lineages can look alike, so this is probably not the best way to do it. But it sounded fun and intuitive, so I gave it a try.

Concretely, I plan on:

  1. Mathematically representing the shape of a shell
  2. Defining a distance metric between shapes (so that I can find shells similar to the fossil's)
  3. Mapping out the space of shapes

The Zhang et al. shell dataset[2] contains 59,244 images of shells covering 7,894 different species; good enough for me!

Capturing 'shape' is actually a very hard problem; any object can be rotated (pitch, yaw, roll), scaled, and translated. Before starting any statistical analysis, I followed a few guidelines to isolate the shape from these other factors:

  1. The shell must be centered on the midpoint of the picture
  2. The scale of the shell must be the same across all images (specifically, the maximum distance from the origin is 1)
  3. Orientation is the hardest part
    • Pitch and yaw can be fixed by only choosing samples where the shell's opening is facing the camera. This is not perfect, but I found the dataset to be pretty consistent with its angles
    • Roll is difficult. A shell can be rotated in any way around the axis (even whilst the opening is facing the camera). My fix was to use the longest radius as the reference point, and rotate the shell so that the longest radius always points to the right. This is not perfect either, but it was good enough for me.
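
The centering, scaling, and roll steps above can be sketched for a set of contour points as follows. This is a minimal sketch, not the author's code: the function name `normalize_contour` is hypothetical, the shell's center is assumed to be the centroid of its contour points, and "longest radius on the right" is read as placing the farthest point on the positive x-axis. Pitch and yaw are not handled, since the text deals with them by sample selection.

```python
import numpy as np

def normalize_contour(points: np.ndarray) -> np.ndarray:
    """Center, scale, and roll-align an (N, 2) array of contour points."""
    pts = np.asarray(points, dtype=float)

    # 1. Center: move the centroid of the points to the origin (assumption).
    pts = pts - pts.mean(axis=0)

    # 2. Scale: make the farthest point sit at distance 1 from the origin.
    radii = np.linalg.norm(pts, axis=1)
    if radii.max() == 0:
        raise ValueError("all contour points coincide; shape is undefined")
    pts = pts / radii.max()

    # 3. Roll: rotate so the farthest point lies on the positive x-axis.
    fx, fy = pts[np.argmax(radii)]
    angle = np.arctan2(fy, fx)
    c, s = np.cos(-angle), np.sin(-angle)
    rotation = np.array([[c, -s], [s, c]])  # rotation by -angle
    return pts @ rotation.T
```

After this step the farthest contour point sits at (1, 0), so two photos of the same shell that differ only by position, size, or in-plane rotation should produce nearly the same point set.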

Then I extracted the contour of the shell as 256 points relative to the center. This way, each shell is represented by a 256x2 matrix, where each row holds the (x, y) coordinates of a point on the contour. Example:

> contours[0].shape
(256, 2)
> contours[0].tolist()[:5]
[[-0.38561132550239563, 0.9804982542991638],
 [-0.4204626679420471, 0.9785506725311279],
 [-0.4553140103816986, 0.976603090763092],
 [-0.4901654124259949, 0.9746555089950562],
 [-0.5230183005332947, 0.9685550928115845]]
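
The text does not say how the 256 points were chosen along the outline. One plausible way, shown here as an assumption rather than the author's method, is to resample the closed outline at equal arc-length steps. The function name `resample_closed_contour` is hypothetical.

```python
import numpy as np

def resample_closed_contour(points: np.ndarray, n: int = 256) -> np.ndarray:
    """Resample a closed polyline to n points evenly spaced by arc length."""
    pts = np.asarray(points, dtype=float)
    closed = np.vstack([pts, pts[:1]])  # repeat the first point to close the loop

    # Cumulative arc length at every vertex of the closed outline.
    segment_lengths = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(segment_lengths)])

    # n target positions along the perimeter; the endpoint is excluded
    # because it coincides with the starting point.
    targets = np.linspace(0.0, arc[-1], n, endpoint=False)
    x = np.interp(targets, arc, closed[:, 0])
    y = np.interp(targets, arc, closed[:, 1])
    return np.column_stack([x, y])
```

Giving every shell the same number of points in the same order is what allows point i of one shell to be compared with point i of another.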

Figure: Normalization pipeline

Naturally, the distance between two shells s1 and s2 is the squared Euclidean distance between their corresponding contour points:

$$ d(s_1, s_2) = \sum_{i=1}^{256} \left[ (s_1.x_i - s_2.x_i)^2 + (s_1.y_i - s_2.y_i)^2 \right] $$
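
As a minimal sketch, the distance above and a brute-force search for the most similar shells might look like this. The names `shell_distance` and `nearest_shells` are hypothetical, and the text does not state whether the author searched in contour space (as here) or in the reduced space.

```python
import numpy as np

def shell_distance(s1, s2) -> float:
    """Squared Euclidean distance between two contours of equal shape (N, 2)."""
    diff = np.asarray(s1, dtype=float) - np.asarray(s2, dtype=float)
    return float(np.sum(diff ** 2))

def nearest_shells(query, contours, k: int = 5) -> np.ndarray:
    """Indices of the k contours closest to query.

    contours has shape (M, N, 2); query has shape (N, 2).
    """
    contours = np.asarray(contours, dtype=float)
    diffs = contours - np.asarray(query, dtype=float)  # broadcast over M
    dists = np.sum(diffs ** 2, axis=(1, 2))            # one distance per shell
    return np.argsort(dists)[:k]
```

The comparison is only meaningful once every contour has been normalized and sampled in the same order, because point i of one shell is matched against point i of the other.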

Representing the space will require 256 dimensions, which is a little more than just the 2 I need to plot it over x and y. Given the normalized shell contour above, it's clear that many of these dimensions are redundant (for instance, the space of all possible 256 contour points allows intersection, while the space of possible shells doesn't, AFAIK), so the space of possible shells can be condensed into a smaller latent space. To drive my point home, I'll show three examples of fully random contours (i.e. pseudo-random points around the origin).

Figure: Probably not a real shell

Dimensionality reduction techniques map the original 256 dimensions onto a smaller number of dimensions (e.g. 2 or 3) while trying to preserve the distance between shells as much as possible. One such technique, which I'll be using, is Principal Component Analysis (PCA). Here's an excellent answer that explains how PCA works: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579.

After applying PCA, I retained 56.50% of the variance using only the first principal component (PC1), and 67.25% using the first two. This means we can describe a shell's shape by only two numbers, and be pretty close to the original shape!
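
For readers who want to see where such percentages come from, here is a minimal PCA sketch built on NumPy's SVD. It is not the author's code: `fit_pca` is a hypothetical name, and each contour is simply flattened into one vector (a 256x2 contour gives 512 numbers; the text speaks of 256 dimensions, so the author's exact representation may differ).

```python
import numpy as np

def fit_pca(contours: np.ndarray, n_components: int = 2):
    """PCA on flattened contours via SVD.

    Returns (mean, components, Z, explained) where Z holds the latent
    coordinates and explained the variance fraction of each kept component.
    """
    X = np.asarray(contours, dtype=float).reshape(len(contours), -1)
    mean = X.mean(axis=0)
    Xc = X - mean  # PCA works on mean-centered data

    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)

    components = Vt[:n_components]
    Z = Xc @ components.T  # project every shell onto the kept axes
    return mean, components, Z, explained[:n_components]
```

With two components, `explained[0]` corresponds to the share attributed to PC1 in the text, and `explained.sum()` to the share retained by the first two components together.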

The interesting part is trying to understand what these two numbers mean; dimension 1 in the original 256-dimensional space encodes the location of the first contour point of the shell, whereas dimension 1 of the latent space encodes a high-level feature, learned by the PCA algorithm. We can visually try to understand what PC1 represents by finding two shells that are diametrically opposite in the PC1 dimension, yet similar in all other dimensions.

Essentially, we want to find two shells i and j such that the following score is maximized:

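$$ \text{score}(i,j) = \frac{\lvert z_{i,1} - z_{j,1} \rvert}{\lVert \mathbf{z}_{i,2:k} - \mathbf{z}_{j,2:k} \rVert_2} $$

Here $z_{i,1}$ is the PC1 coordinate of shell i, and $\mathbf{z}_{i,2:k}$ collects its coordinates on the remaining components 2 through k.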
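
A brute-force search for the pair that maximizes this score could be sketched as below. The name `best_contrast_pair` and the small `eps` added to the denominator (to avoid dividing by zero when two shells agree on every other component) are assumptions, not part of the original. The pairwise arrays grow with the square of the number of shells, so this is only practical on a subsample.

```python
import numpy as np

def best_contrast_pair(Z: np.ndarray, pc: int = 0, eps: float = 1e-9):
    """Find shells i, j far apart on component pc but close on all others.

    Z has shape (M, k): one row of latent coordinates per shell.
    Returns (i, j, score).
    """
    Z = np.asarray(Z, dtype=float)
    target = Z[:, pc]                   # coordinate of interest
    others = np.delete(Z, pc, axis=1)   # every remaining coordinate

    numerator = np.abs(target[:, None] - target[None, :])
    denominator = np.linalg.norm(others[:, None, :] - others[None, :, :], axis=2)
    score = numerator / (denominator + eps)

    np.fill_diagonal(score, -np.inf)    # a shell is never paired with itself
    i, j = np.unravel_index(np.argmax(score), score.shape)
    return int(i), int(j), float(score[i, j])
```

Calling it with `pc=0` looks for a pair that isolates PC1, and `pc=1` does the same for PC2.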

PC1 seems to capture the 'pointiness' of the shell, i.e. more than 50% of variance in shell shapes can be explained by how pointy they are. PC2 seems to capture the symmetry of the shell, or perhaps the mass distribution over the vertical axis. I'll leave the interpretation of the other dimensions as an exercise for the reader (I have no idea).

Figure: PCA

And now for the grand finale, we can plot the shells in the latent space, and see where our Alghat fossil fits in it. But first, for dramatic tension, I will discuss the plot.

The plot represents PC1 on the x-axis and PC2 on the y-axis, while color represents the roughness of a shell (computed as the difference in slope between consecutive points). The following observations are worth noting:

  1. Negative PC1 values (representing roundness) are way more common than positive PC1 values (representing pointiness). Yet round shells are less diverse and occupy less of the space than pointy shells
  2. Pointy shells seem to be much rougher than round shells
  3. Shells with negative PC1 values always have PC2 values close to zero; no shell in the dataset has a round but asymmetric shape. Below, I will project such points back from latent space to the shape space, imagining impossible shells
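
The roughness used for coloring is described only as the difference in slope between consecutive points. One way to turn that into a number, offered as an assumption, is to measure the mean absolute change in edge direction around the outline. Angles are used instead of raw slopes so that vertical edges do not blow up, and the name `roughness` is hypothetical.

```python
import numpy as np

def roughness(contour: np.ndarray) -> float:
    """Mean absolute turning angle (radians) along a closed contour (N, 2)."""
    pts = np.asarray(contour, dtype=float)

    # Vector from each point to the next one, wrapping around at the end.
    edges = np.roll(pts, -1, axis=0) - pts
    angles = np.arctan2(edges[:, 1], edges[:, 0])

    # Change of direction between consecutive edges, wrapped into [-pi, pi).
    turn = np.diff(np.append(angles, angles[0]))
    turn = (turn + np.pi) % (2 * np.pi) - np.pi
    return float(np.mean(np.abs(turn)))
```

A smooth outline turns a little at every point and scores low, while a spiky outline zigzags and scores high, which fits the observation that pointy shells tend to be rougher.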

Figure: Map of shell latent space with example shells

Figure (PC1 animation, PC2 animation): Modifying Principal Components against the mean shell

Figure: Projecting 'impossible' shells
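
Projecting a latent point back to shape space amounts to adding a weighted sum of the principal axes to the mean shell. The sketch below illustrates this on random stand-in data rather than the real shell contours; `reconstruct` is a hypothetical name, and the latent values chosen for the "impossible" point are arbitrary.

```python
import numpy as np

def reconstruct(z, mean, components) -> np.ndarray:
    """Map latent coordinates z, shape (k,) or (m, k), back to flattened contours."""
    z = np.atleast_2d(np.asarray(z, dtype=float))
    return mean + z @ components

# Toy stand-in for the real data: 100 random "contours" of 8 points each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8 * 2))
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:2]  # first two principal axes, shape (2, 16)

# A latent point chosen by hand, e.g. strongly negative PC1 with a large PC2.
impossible = reconstruct([-3.0, 4.0], mean, components).reshape(-1, 8, 2)
```

Sweeping one coordinate of `z` while holding the other at zero gives the kind of morph from the mean shell shown in the PC1 and PC2 animations.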

So, what shell most closely resembles our Alghat fossil? It's Sphincterochila candidissima (try to pronounce it). However, it is really young, nowhere near Jurassic age: its earliest fossil dates back only 38 million years[4]. Ultimately, shape is not the best way of determining shell lineage, but its eerie similarity to the Alghat fossil is still fascinating, and perhaps points to some sort of convergent evolution, where two different species evolve similar shapes under similar environmental pressures.

Figure: Left: Alghat fossil, Right: Sphincterochila candidissima[3]

Feel free to explore the tool and try to figure out where a shell of your choice fits in the shell latent space!

https://shell.hawzen.me

  1. Aba Alkhayl, S. S. (2022). Marine macro-invertebrate fossils from the Lower Hanifa Formation (Hawtah Member), central Saudi Arabia. Arabian Journal of Geosciences, 15, 1410. https://doi.org/10.1007/s12517-022-10581-w
  2. Zhang, Q., Zhou, J., He, J. et al. A shell dataset, for shell features extraction and recognition. Sci Data 6, 226 (2019). https://doi.org/10.1038/s41597-019-0230-3
  3. https://en.wikipedia.org/wiki/Sphincterochila_candidissima
  4. Tracey, S., Todd, J. A., & Erwin, D. H. (1993). Mollusca: Gastropoda. In M. J. Benton (Ed.), The Fossil Record 2 (pp. 131–167). London: Chapman &