避免 Django 查询集中出现重复对象

避免 Django 查询集中出现重复对象
Avoiding duplicate objects in Django querysets

原始链接: https://johnnymetz.com/posts/avoiding-duplicate-objects-in-django-querysets/

## 避免 Django Queryset 中的重复对象在跨关系（一对多或多对多）过滤 Django Queryset 时，由于 SQL JOIN 操作，结果中可能会出现重复的对象。如果一个父对象有多个相关的对象匹配你的过滤器，它会在输出中重复出现。像 `.distinct()` 这样的简单解决方案可能效率低下，尤其是在处理大型字段（如 TextFields 或 JSONFields）时，因为它们会比较 *所有* 字段以确定唯一性。PostgreSQL 提供了 `.distinct(*fields)` 用于比较特定字段，但它会限制排序并且可能很复杂。 **最佳方法**是使用 **Exists 子查询**。它有效地检查是否存在匹配你标准的关联对象，而无需检索重复项。 ```python from django.db.models import Exists, OuterRef Author.objects.filter( Exists( Book.objects.filter( author=OuterRef("id"), title__startswith="Book" ) ) ).order_by("name") ``` Exists 子查询性能良好，避免排序限制，清晰地表达意图，并且与数据库无关，使其成为处理这个常见 Django 问题的首选方法。

## 避免 Django Queryset 中的重复对象 johnnymetz.com 上的一篇最近文章讨论了在使用 Django Queryset 时避免重复对象的方法。核心问题在于查询无意中返回冗余数据，通常需要使用 `.distinct()`。然而，评论者指出 `.distinct()` 并非总是最佳解决方案。使用 `Exists` 子查询通常更简洁，并且在数据库大小和查询细节方面可能性能更高。某些数据库甚至提供诸如 `any` 连接等功能，以固有地防止重复。一些用户建议，频繁需要 `.distinct()` 可能表明查询*设计*本身存在问题，并提倡重新审视上游的过滤逻辑。另一些人指出，当 ORM 抽象成为性能瓶颈时，切换到原始 SQL 可以提供更大的控制和可维护性，从而可以显式理解查询计划。最终，讨论强调根据特定上下文和数据库系统选择合适的工具——`Exists`、优化的查询或原始 SQL。

原文

When filtering Django querysets across relationships, you can easily end up with duplicate objects in your results. This is a common gotcha that happens with both one-to-many (1:N) and many-to-many (N:N) relationships. Let’s explore why this happens and the best way to avoid it.

The Problem

When you filter a queryset by traversing a relationship, Django performs a SQL JOIN. If a parent object has multiple related objects that match your filter, the parent object appears multiple times in the result set.

Let’s look at a concrete example with a one-to-many relationship:

class Author(models.Model):
    name = models.CharField(max_length=255)

class Book(models.Model):
    title = models.CharField(max_length=255)
    author = models.ForeignKey(Author, on_delete=models.CASCADE, related_name='books')

Here’s what the database tables might look like:

Author Table

Book Table

id	title	author_id
1	Book A	1
2	Book B	2
3	Book C	2
4	Novel D	3

If we want to find all authors who have written books whose titles start with “Book”:

Author.objects.filter(books__title__startswith="Book")
# [<Author: Charlie>, <Author: Alice>, <Author: Alice>]

Notice that Alice appears twice in the result set. This happens because when Django executes the query, it performs a JOIN across the relationship:

author_id	name	book_id	title
1	Charlie	1	Book A
2	Alice	2	Book B
2	Alice	3	Book C

Since Alice wrote both “Book B” and “Book C”, the JOIN produces two rows for her. The same issue occurs with many-to-many relationships — if an object belongs to multiple groups that match your filter, the object will appear multiple times.

Common Solutions (and Their Problems)

Using `distinct()`

The most straightforward solution is to use distinct():

Author.objects.filter(books__title__startswith="Book").distinct()
# [<Author: Charlie>, <Author: Alice>]

This works, but can be expensive. distinct() compares all selected fields to determine uniqueness, which is problematic if your model has large fields like JSONField or TextField. The database has to compare all these fields for every row, which can kill performance on large datasets.

Using `distinct(*fields)` (PostgreSQL only)

PostgreSQL supports DISTINCT ON specific fields, which is more efficient than comparing all fields. You can use any unique field (typically id):

Author.objects.filter(books__title__startswith="Book").distinct("id")
# [<Author: Charlie>, <Author: Alice>]

This only compares the specified unique field, avoiding the performance issues with large fields. However, distinct(*fields) has several limitations and gotchas (see Django documentation and PostgreSQL documentation). The most notable is that you can’t easily order by other fields. PostgreSQL requires that SELECT DISTINCT ON expressions match the initial ORDER BY expressions. For example, if you try to do:

Author.objects.filter(books__title__startswith="Book")
.order_by("name")
.distinct("id")

You’ll get this error:

django.db.utils.ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

A workaround is to use a subquery to get the distinct objects, then filter by those objects:

subquery = Author.objects.filter(books__title__startswith="Book").distinct("id")
Author.objects.filter(id__in=subquery).order_by("name")
# [<Author: Alice>, <Author: Charlie>]

This solves the ordering problem, but it’s verbose and requires two separate queries (though Django optimizes this into a subquery). It’s also not immediately clear what’s happening when reading the code.

The Best Solution: `Exists` Subquery

The cleanest and most performant solution is to use an Exists subquery:

from django.db.models import Exists, OuterRef

Author.objects.filter(
    Exists(
        Book.objects.filter(
            author=OuterRef("id"),
            title__startswith="Book",
        )
    )
).order_by("name")
# [<Author: Alice>, <Author: Charlie>]

Performance: Exists stops evaluation as soon as it finds a matching row, making it very efficient. Per the Django documentation:
Exists is a Subquery subclass that uses an SQL EXISTS statement. In many cases it will perform better than a subquery since the database is able to stop evaluation of the subquery when a first matching row is found.
No ordering restrictions: Unlike distinct(*fields), you can order by any field without issues.
Clear intent: The code clearly expresses “find authors who have books whose titles start with ‘Book’” without the complexity of subqueries or distinct operations.
Works everywhere: This solution works with all databases, not just PostgreSQL.

May your Django querysets be duplicate-free and fast.