Show HN: Classify mechanical faults using Contrastive Language-Audio Pretraining (CLAP)


Original link: https://github.com/adam-s/car-diagnosis

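**cardiag** is an end-to-end machine-learning pipeline for triaging car faults from audio recordings. It is not a final diagnostic tool but an objective aid that identifies whether a sound is abnormal, localizes the problem, and suggests likely faulty parts. When the evidence is insufficient, the system returns "uncertain" rather than a false diagnosis. The pipeline extracts and cleans raw clips (removing speech, music, and road noise), converts them into 512-dimensional embeddings with a frozen CLAP model, and classifies them with lightweight linear heads. The method is highly reusable for other audio classification tasks. Although limited by the inherent difficulty of analyzing raw phone audio, the model is validated rigorously with leakage-safe, grouped cross-validation. The project is available through a command-line interface (CLI) and a live web app, which let users scrape data, train models, and run inference. It aims to support education and triage, prioritizing transparency and calibration over leaderboard metrics.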

Hacker News: Show HN: Classify mechanical faults using Contrastive Language-Audio Pretraining (CLAP) (github.com/adam-s), 10 points, posted by dataviz1000 3 hours ago

Original text

cardiag is an end-to-end audio-ML pipeline. It scrapes fault-sound clips from YouTube/TikTok, cleans the audio (isolating the mechanical sound from speech, music, and noise), embeds it with a frozen CLAP model, and trains small linear heads to triage the fault. It is exposed as a CLI and a live web app.

Demo video: cardiag-demo.mp4

This is a proof of concept, and honest about what that means. Diagnosing a car fault from a phone recording is genuinely hard, so cardiag is built as a calibrated triage aid rather than a diagnoser: it tells you whether something sounds wrong, roughly where in the car it is, and a ranked shortlist of likely parts. When the audio won't support a call, it says "uncertain" instead of bluffing.
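The "uncertain instead of bluffing" behavior can be pictured as an abstention band around a calibrated probability. The following is a minimal sketch under stated assumptions: `Verdict`, `decide`, and the `low`/`high` thresholds are hypothetical names and values chosen for illustration, not cardiag's actual decision rule.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    label: str      # "FAULT", "NORMAL", or "UNCERTAIN"
    p_fault: float  # calibrated probability that the clip contains a fault


def decide(p_fault: float, low: float = 0.35, high: float = 0.65) -> Verdict:
    """Abstain when the calibrated probability falls strictly between low and high.

    The thresholds are illustrative assumptions, not values from the project.
    """
    if not 0.0 <= p_fault <= 1.0:
        raise ValueError("p_fault must be a probability in [0, 1]")
    if p_fault >= high:
        return Verdict("FAULT", p_fault)
    if p_fault <= low:
        return Verdict("NORMAL", p_fault)
    return Verdict("UNCERTAIN", p_fault)
```

A rule like this only makes sense when the probability is calibrated, which is why calibration is reported alongside accuracy.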

The real contribution is the cleaning + honest-training recipe, which is reusable on other audio datasets. The modest accuracy here reflects how hard the problem is from crude phone audio (we hit the literature ceiling); the same method reaches 0.93 AUROC on clean engine audio. See docs/DEFENSE.md.

Two pages visualize the first two stages of the pipeline:

Isolating the engine audio — an interactive look at the clean() cascade pulling a short mechanical span out of noisy YouTube audio (speech, music, road noise).
CLAP, visualized — how the frozen CLAP model turns those spans into the 512-d embedding the linear heads classify.

What it actually achieves

Measured out-of-sample, leakage-safe (by-video grouped CV over 1,031 video groups; permutation p = 0.0005). These are honest numbers, not a leaderboard.
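By-video grouping means every clip cut from the same source video lands in the same fold, so a model is never tested on audio from a video it trained on. The sketch below shows one way to build such folds and to turn a set of label-shuffled scores into a permutation p-value. The function names and the round-robin fold assignment are assumptions for illustration; the project's own splitter may differ.

```python
import random
from collections import defaultdict


def grouped_folds(group_ids, n_folds=5, seed=0):
    """Assign each group (e.g. a source video) to exactly one fold.

    Returns a list of (train_indices, test_indices) pairs in which no group
    appears on both sides of a split.
    """
    by_group = defaultdict(list)
    for idx, gid in enumerate(group_ids):
        by_group[gid].append(idx)

    groups = sorted(by_group)              # deterministic order before shuffling
    random.Random(seed).shuffle(groups)
    fold_of = {gid: i % n_folds for i, gid in enumerate(groups)}

    splits = []
    for fold in range(n_folds):
        test = [i for i, g in enumerate(group_ids) if fold_of[g] == fold]
        train = [i for i, g in enumerate(group_ids) if fold_of[g] != fold]
        splits.append((train, test))
    return splits


def permutation_p_value(observed, null_scores):
    """One-sided p-value with the add-one correction.

    null_scores are scores obtained after shuffling the labels.
    """
    at_least = sum(1 for s in null_scores if s >= observed)
    return (at_least + 1) / (len(null_scores) + 1)
```

With the add-one correction, the smallest reachable p-value for N permutations is 1/(N+1). A reported 0.0005 would be consistent with 1,999 permutations, none of which matched the observed score, though the source does not state the permutation count.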

| Capability | Result | vs. chance |
| --- | --- | --- |
| Is something wrong? (fault/normal) | AUROC 0.79 [0.76, 0.83] | 0.50 |
| Where in the car? (6 zones) | right zone in top-3 ≈ 75% | 2× |
| Which part? (12+ families) | right part in top-3 ≈ 45–65% | 3–4× |
| Knows when it doesn't know | calibrated (ECE ≈ 0.04), returns `UNCERTAIN` | n/a |
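For readers unfamiliar with the two headline metrics, here is a small, dependency-free sketch of how AUROC and expected calibration error (ECE) can be computed for a binary fault/normal head. It is an illustration, not the project's evaluation code: the ECE shown is the positive-class variant with ten equal-width bins, and both the bin count and the variant are assumptions.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_calibration_error(labels, probs, n_bins=10):
    """Bin-weighted gap between mean predicted probability and observed positive rate."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))

    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for _, p in members) / len(members)
        pos_rate = sum(y for y, _ in members) / len(members)
        ece += len(members) / total * abs(pos_rate - mean_prob)
    return ece
```

An AUROC of 0.50 corresponds to random ranking, which is why the table lists 0.50 as the chance level for the fault/normal row. A low ECE means that, among clips scored near probability p, roughly a fraction p really are faults.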

Full details, and the one head we demoted for failing out-of-sample (knock), are in docs/MODEL_CARD.md.

Quickstart: clone to inference

A fresh clone is immediately usable. A small pre-trained model ships in models/, and a synthetic demo clip is bundled, so nothing needs to be downloaded or scraped.

```bash
git clone <this-repo> && cd car-diagnosis
uv venv && source .venv/bin/activate
uv pip install -e ".[scrape,web,dev,viz]"     # Python 3.11

cardiag doctor                 # preflight: what's installed
cardiag train --fixtures       # a working model offline in ~2s (no scrape, no 2 GB download)
cardiag diagnose <clip.wav>    # verdict + where-in-the-car + ranked parts
cardiag serve --model models   # live web app: drop a clip / paste a link, "explain why"
```

Verify the whole thing end-to-end in an isolated worktree: bash scripts/clone_verify.sh.

audio ──► clean() cascade ──► CLAP embedding ──► linear heads ──► Diagnosis
          (isolate spans)     (frozen, 512-d)    (fault/region/    (calibrated,
                                                  part/knock)       UNCERTAIN-aware)

There is one segmentation path. Scraped clips, your own recordings (cardiag ingest, any length), and uploads at inference all flow through the same clean() cascade that isolates short mechanical spans. Spans over ~10 s are split into windows so CLAP never silently truncates them. Training and serving share one embedding contract, so there is no train/serve skew.
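The windowing step can be sketched as follows. This is a hedged illustration only: `split_span` is a hypothetical helper, and splitting into equal, non-overlapping windows is an assumption. The real cascade may use a fixed hop or overlapping windows.

```python
import math


def split_span(start_s, end_s, max_len_s=10.0):
    """Split [start_s, end_s) into equal consecutive windows no longer than max_len_s."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    duration = end_s - start_s
    n_windows = math.ceil(duration / max_len_s)
    step = duration / n_windows

    windows = []
    for i in range(n_windows):
        w_start = start_s + i * step
        # Pin the final boundary to end_s so rounding never drops audio.
        w_end = end_s if i == n_windows - 1 else start_s + (i + 1) * step
        windows.append((w_start, w_end))
    return windows
```

Under these assumptions a 25 s span becomes three windows of about 8.3 s each, and a 7 s span passes through as a single window. Each window would then be embedded separately, so nothing beyond the model's input length is silently discarded.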

```bash
cardiag diagnose clip.wav            # full model: verdict + region + ranked parts
cardiag triage   clip.wav            # calibrated engine-vs-running-gear
cardiag clean    clip.wav            # isolate the mechanical sound (no model needed)
cardiag inspect  clip.wav -o r.html  # SEE/HEAR the pipeline: spans, spectrograms, scores
cardiag ingest   ./my_audio --kind fault --cause wheel_bearing   # bring your own audio
cardiag scrape   youtube|tiktok      # build a corpus (Reddit is deprecated — too noisy)
cardiag train                        # train on your corpus
```

Add --json to any inference command for machine-readable output.
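A script can consume that output with the standard library alone. The sketch below assumes the command prints a single JSON document to standard output. The README does not document the schema, so the example inspects the top-level keys rather than assuming any field names. `run_json` is a hypothetical helper, not part of cardiag.

```python
import json
import subprocess


def run_json(cmd):
    """Run a command and parse its standard output as one JSON document.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    json.JSONDecodeError if the output is not valid JSON.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Example (requires cardiag on PATH and a real clip):
# report = run_json(["cardiag", "diagnose", "clip.wav", "--json"])
# print(sorted(report))  # look at the top-level keys before relying on any of them
```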

Valid for social-style / targeted-upload audio (YouTube, TikTok, or a phone clip a user records deliberately). It is not a safety-critical or standalone diagnostic. It is a triage assistant that narrows where to look and is honest about its uncertainty. Model files are joblib artifacts: load only ones you trust.
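The warning about model files matters because joblib artifacts are pickle-based, and unpickling can execute arbitrary code. One common precaution is to pin a SHA-256 digest for each artifact and check it before loading. The sketch below is an illustration under that assumption: `sha256_of` and `is_trusted` are hypothetical helpers, and the project does not necessarily publish digests.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_trusted(path, expected_sha256):
    """True only if the file's digest matches the pinned value."""
    return hmac.compare_digest(sha256_of(path), expected_sha256.lower())


# Only after is_trusted(...) returns True would one call joblib.load(path).
```

A matching digest shows the file is the one you expected, not that it is safe. The pinned value itself still has to come from a source you trust.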

License: see LICENSE.