Show HN: CLI tool for detecting non-exact code duplication with embedding models

Original link: https://github.com/rafal-qa/slopo

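Slopo is a lightweight CLI tool for detecting non-exact code duplication: subtle, high-risk repetition hidden across different modules or files. Unlike traditional tools that focus on identical code, Slopo uses embedding models to identify clusters of semantically similar code. It computes an embedding for each code unit, identifies potential duplicates, and ranks them by cosine similarity and by physical distance in the codebase. The tool is designed to fit into AI-assisted development workflows: Slopo identifies the clusters, while an AI agent verifies them and records false positives in `slopo.ignore.txt` so they are ignored. Key features include:

* **Broad language support**: Python, TS/JS, Java, Kotlin, C#, Go, and Rust.
* **Flexible integration**: uses LiteLLM as the embedding provider layer and `uv` for installation.
* **Incremental analysis**: re-indexes changed files and maintains a persistent ignore list that can be shared across a team.
* **Configurable filtering**: lets users tune similarity thresholds and abstract syntax tree (AST) node complexity to reduce noise.

Slopo provides the insight needed to clean up redundant code, so that developers and AI agents can locate and refactor technical debt in large codebases efficiently.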

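A developer has released "Slopo", a command-line (CLI) tool that uses embedding models to identify non-exact code duplication. Unlike traditional tools, which mainly detect copy-paste clones, Slopo can identify code that looks similar even when the snippets sit far apart in the codebase. The developer acknowledges that the tool produces a share of false positives alongside real duplicates, and recommends that users or AI agents verify the results. The tool is particularly effective at surfacing hidden redundancy or potential bugs that other methods may miss. Example prompts for integrating it with coding agents are available at slopo.dev. The community has expressed interest in extending language support to PHP/WordPress.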

Original text

A lightweight CLI tool for detecting non-exact code duplication using embedding models.

It focuses on the similar code that is hardest to detect and most harmful: snippets written similarly, sitting far apart in the codebase, often spread across different modules or separated within a large file. Exact copy-paste is easy to spot by other tools, and duplicates that are close together are easy to spot by humans or AI.

For a higher-level description of the problem, see slopo.dev.

Supported languages: Python, TypeScript, JavaScript, Java, Kotlin, C#, Go, Rust

It takes a different approach than typical duplication detection. For every code unit, it calculates an embedding, then looks for pairs whose embeddings are close. Similar code is not necessarily a duplicate, so each pair is a potential duplicate to confirm. Code doing the same thing but implemented in a completely different way produces distant embeddings and won't be detected.
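The pairing idea can be sketched in a few lines of Python. This is an illustration only, not Slopo's implementation: the toy 3-dimensional vectors stand in for real embeddings, and the names `cosine_similarity`, `similar_pairs`, and the unit identifiers are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar_pairs(embeddings, threshold):
    """Return (name_a, name_b, score) for every pair scoring at or above threshold."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical data).
toy_embeddings = {
    "billing/invoice.py::format_total": [0.9, 0.1, 0.0],
    "reports/summary.py::render_total": [0.8, 0.2, 0.1],
    "auth/session.py::refresh_token": [0.0, 0.1, 0.9],
}

for name_a, name_b, score in similar_pairs(toy_embeddings, threshold=0.9):
    print(f"{score:.3f}  {name_a}  <->  {name_b}")
```

Comparing every pair is quadratic in the number of code units, which is fine for a sketch; a real tool would likely need something faster on large codebases.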

The result is clusters of similar code units, ranked by similarity and by distance in the codebase. These clusters are meant as input for your AI coding agent, which can check whether a cluster is a real duplicate. Reviewed clusters can be marked as ignored or passed on for refactoring.

See doc/example-report, generated from Slopo's own code (src directory, git tag v0.2.0).

This example confirmed that the code parsers for each language contain a lot of duplication: some exact copies, some similar variants. They need to be refactored.

Installation uses uv (see installing uv), a Python package manager, to install Slopo from PyPI in an isolated virtual environment. There is no need to install Python separately.

Run slopo init to create a config file template containing further instructions. Only the source directory to analyze and the embedding model configuration are required.

Embeddings are calculated using an external provider. For best results, consider models dedicated to code, e.g. Voyage AI (it works fine with low dimensions like 512).

You can use any model provider compatible with LiteLLM, see details here.

The provider API key can be set as an environment variable for better security.

Run slopo show-config to validate your config and show all configurable parameters; most are optional with sensible defaults.

Now you are ready to index code, calculate embeddings and generate a report:

slopo index
slopo embed
slopo analyze

This section demonstrates how Slopo can be used in a real development workflow.

It utilizes incremental re-indexing (update index with changed files only) and slopo.ignore.txt to discard already reviewed clusters.

  1. Create your first analysis and check the results. The report contains index.md, which lists all clusters, plus cluster details per file.
  2. You may want to exclude some directories or file patterns; excluding tests is usually a good idea. You can also tune the thresholds if the result is too big or too small.
  3. Once satisfied with the analysis results, ask your AI coding agent to filter out clusters that are not real duplicates. This is a common case, because not every piece of similar code is a duplicate worth acting on. Ask the AI agent to add the discarded cluster hashes to slopo.ignore.txt.
  4. Re-run the analysis to generate a report without the reviewed clusters. This is the basis for refactoring, which can be done by an AI agent.
  5. The ignore file can be committed to your Git repository and reused across the team. New and modified clusters will reappear in the report. A configuration file without an API key can also be committed. Don't commit slopo.db; it holds your local data.
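One common way to implement "changed files only" is to compare content digests between runs. The sketch below shows that general technique under stated assumptions; how Slopo actually detects changes is not described here and may differ (for example, it could rely on modification times). The function names and the `known_digests` mapping are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(source_dir, known_digests, pattern="*.py"):
    """Return relative paths under source_dir that are new or whose content changed.

    known_digests maps relative path -> digest from the previous run
    (a hypothetical stand-in for whatever a tool stores between runs).
    """
    root = Path(source_dir)
    changed = []
    for path in sorted(root.rglob(pattern)):
        rel = path.relative_to(root).as_posix()
        if known_digests.get(rel) != file_digest(path):
            changed.append(rel)
    return changed
```

Only the paths returned by `changed_files` would need to be re-parsed and re-embedded; everything else can reuse stored results.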

Run slopo --help and slopo show-config to explore the tool yourself at any time.

Most configuration is done in the configuration file, with two exceptions:

  1. The location of the configuration file can be overridden with the --config option.
  2. The API key can be set with the SLOPO_EMBEDDING_API_KEY environment variable, also picked up from a .env file in the current directory.

Be aware that some parameters can't be changed after the first indexing: source_dir, embedding_model, embedding_dimensions, body_node_count_threshold. To change any of them, remove slopo.db and run indexing and embedding again from the beginning.

All configurable parameters

  • source_dir: Source directory with code to index, absolute or relative path.
  • source_dir_exclude: .gitignore-style patterns to exclude from indexing.
  • db_file: SQLite database file with tool data.
  • report_dir: Output directory for analysis report.
  • ignore_file: Text file with ignored clusters.
  • embedding_model: Embedding model name in LiteLLM format.
  • embedding_dimensions: Embedding dimensions compatible with the used model.
  • embedding_api_key: API key for embedding provider. Optional if configured with an environment variable.
  • embedding_batch_size and embedding_batch_chars: Requests to the embedding API are batched for performance. Defaults are fine for most cases.
  • similarity_threshold: Controls minimal cosine similarity between embeddings.
  • rerank_threshold: Controls minimal similarity after applying a boost reflecting distance in the codebase.
  • body_node_count_threshold: Number of AST nodes inside the body (excluding signature and annotations). This value reflects the minimum code complexity of the included code unit, more precise than text length. Increase if you notice unwanted, too-small code units in the report.
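To make the idea behind body_node_count_threshold concrete, here is a minimal sketch that counts AST nodes in function bodies using Python's standard `ast` module. It is an illustration only: Slopo supports several languages, and its parser and exact counting rules are not described here, so the numbers it reports may differ. The helper names are hypothetical.

```python
import ast


def body_node_count(func):
    """Count AST nodes in a function body, skipping signature and decorators."""
    return sum(1 for stmt in func.body for _ in ast.walk(stmt))


def units_above_threshold(source, threshold):
    """Return (name, node_count) for functions whose body meets the threshold."""
    tree = ast.parse(source)
    result = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            count = body_node_count(node)
            if count >= threshold:
                result.append((node.name, count))
    return result


SAMPLE = '''
def tiny(x):
    return x

def bigger(items):
    total = 0
    for item in items:
        if item > 0:
            total += item
    return total
'''

# With a threshold of 10, the one-line function is dropped and only
# the function with real logic is kept.
for name, count in units_above_threshold(SAMPLE, threshold=10):
    print(name, count)
```

A node count tracks structural complexity rather than formatting, so a long but trivial unit (for example, one with a large string literal) does not pass the filter just because it has many characters.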

Similar code units are filtered in two passes, each with its own configurable threshold. The pipeline is as follows:

  1. similarity_threshold filters out code unit pairs whose embeddings are not similar enough. The calculated value is cosine similarity, ranging from -1 to 1, where 1 means the embeddings point in the same direction.
  2. Similar pairs are grouped in clusters.
  3. Units in clusters are reranked after applying a boost. Boost is calculated based on the number of directory hops required to reach the other file in the pair (max. 15%). If they are in the same file, the boost is calculated based on distance in number of lines (max. 10%). rerank_threshold filters out clusters whose highest-scoring pair is not high enough.
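The three steps above can be sketched as follows. Only the two caps (15% and 10%) come from the text; everything else is an assumption made for illustration: the linear scaling, the saturation constants `HOPS_FOR_MAX` and `LINES_FOR_MAX`, applying the boost as a multiplier, and forming clusters as connected components. Slopo's actual formulas may differ.

```python
from pathlib import PurePosixPath

MAX_DIR_BOOST = 0.15   # from the text: at most 15% across files
MAX_LINE_BOOST = 0.10  # from the text: at most 10% within one file
HOPS_FOR_MAX = 6       # assumption: hops at which the directory boost saturates
LINES_FOR_MAX = 500    # assumption: line distance at which the line boost saturates


def directory_hops(path_a, path_b):
    """Steps up from a's directory to the common ancestor, then down to b's."""
    dir_a = PurePosixPath(path_a).parent.parts
    dir_b = PurePosixPath(path_b).parent.parts
    common = 0
    for part_a, part_b in zip(dir_a, dir_b):
        if part_a != part_b:
            break
        common += 1
    return (len(dir_a) - common) + (len(dir_b) - common)


def boost(unit_a, unit_b):
    """Distance boost for a pair of (path, start_line) units."""
    (path_a, line_a), (path_b, line_b) = unit_a, unit_b
    if path_a == path_b:
        return MAX_LINE_BOOST * min(abs(line_a - line_b) / LINES_FOR_MAX, 1.0)
    return MAX_DIR_BOOST * min(directory_hops(path_a, path_b) / HOPS_FOR_MAX, 1.0)


def cluster_pairs(pairs):
    """Group pairs into clusters: connected components over similar pairs."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b, _ in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for a, b, score in pairs:
        clusters.setdefault(find(a), []).append((a, b, score))
    return list(clusters.values())


def run_pipeline(scored_pairs, similarity_threshold, rerank_threshold):
    """Return (best_boosted_score, cluster) tuples, best first."""
    # Pass 1: drop pairs below the raw cosine-similarity threshold.
    kept = [p for p in scored_pairs if p[2] >= similarity_threshold]
    ranked = []
    for cluster in cluster_pairs(kept):
        # Pass 2: boost each pair, then keep the cluster if its best pair qualifies.
        best = max(score * (1 + boost(a, b)) for a, b, score in cluster)
        if best >= rerank_threshold:
            ranked.append((best, cluster))
    return sorted(ranked, key=lambda item: item[0], reverse=True)
```

In this sketch a pair with a modest raw similarity can still survive the second pass if its two units live far apart, which matches the stated focus on duplicates that are hard to spot because of their distance.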

The main goal of this tool is to detect non-exact code duplication, but exact copies (identical code at multiple paths) are reported too, just handled a little differently from merely similar code:

  • The report shows the code once, listing every path where it appears, instead of repeating identical snippets.
  • The analyze command reports the "similarity ratio" (the share of code units flagged as similar) in two variants: including and excluding exact copies.
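Showing identical code once with all of its paths amounts to grouping code units by content. A minimal sketch of that grouping, assuming exact copies are matched on their raw text (Slopo may normalize code or match differently), with the hypothetical helper `group_exact_copies`:

```python
import hashlib
from collections import defaultdict


def group_exact_copies(units):
    """Map each duplicated code text to the sorted list of paths where it appears.

    units is an iterable of (path, code) tuples. Only texts that occur at
    more than one path are returned.
    """
    paths_by_digest = defaultdict(list)
    code_by_digest = {}
    for path, code in units:
        digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
        paths_by_digest[digest].append(path)
        code_by_digest[digest] = code
    return {
        code_by_digest[digest]: sorted(paths)
        for digest, paths in paths_by_digest.items()
        if len(paths) > 1
    }
```

Hashing keeps the grouping linear in the number of code units, so exact copies can be separated out cheaply before any embedding comparison is considered.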