X For You Feed Algorithm

Original link: https://github.com/xai-org/x-algorithm

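## X's "For You" Feed Recommendation System

X's "For You" feed uses a machine-learning recommendation system to personalize content. It combines posts from accounts you follow ("in-network" content, served by **Thunder**) with discovered posts ("out-of-network" content, found by **Phoenix Retrieval**). The two sources are merged and ranked by **Phoenix, a Grok-based transformer model** adapted from xAI's open-source release. The system predicts engagement probabilities (like, reply, repost, and so on) from the user's engagement history, relying on the transformer to learn relevance rather than on hand-engineered features. A modular **Candidate Pipeline** coordinates the whole process: retrieving candidates, hydrating them with additional data, filtering out ineligible posts, scoring with the Phoenix model, and selecting the top results. Key features include **candidate isolation** during ranking to keep scores consistent, **multi-action prediction** for a more nuanced signal, and a **composable architecture** that makes the pipeline easy to modify. Filters are applied both before and after scoring to maintain quality and diversity. The final ranking score is a weighted combination of predicted engagements that favors positive interactions and pushes down posts the user would likely dislike.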

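## X (formerly Twitter) Algorithm Release - Summary

X (formerly Twitter) recently published the source code of its "For You" feed algorithm on GitHub, prompting discussion in the technical community. Although the release is promoted as open source, many commenters argue that it is "source-available" at best, because it lacks key components such as model weights and build instructions and appears to be a simplified proof of concept. The algorithm now relies heavily on X's Grok model, which in effect turns the ranking process into a black box. The discussion centers on whether the release genuinely benefits the open-source movement or is a public-relations move to appease regulators and attract talent. Key details revealed by the code include a Rust-based data pipeline, a two-tower retrieval system, and the Phoenix ranker, which predicts user engagement (likes, replies, and so on). The core intelligence, however, still resides in the Grok model itself. Many argue that X's competitive advantage lies not in its code but in its user base and network effects, which makes competition at the code level irrelevant.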

This repository contains the core recommendation system powering the "For You" feed on X. It combines in-network content (from accounts you follow) with out-of-network content (discovered through ML-based retrieval) and ranks everything using a Grok-based transformer model.

Note: The transformer implementation is ported from the Grok-1 open source release by xAI, adapted for recommendation system use cases.

The For You feed algorithm retrieves, ranks, and filters posts from two sources:

- In-Network (Thunder): Posts from accounts you follow
- Out-of-Network (Phoenix Retrieval): Posts discovered from a global corpus

Both sources are combined and ranked together using Phoenix, a Grok-based transformer model that predicts engagement probabilities for each post. The final score is a weighted combination of these predicted engagements.

We have eliminated every single hand-engineered feature and most heuristics from the system. The Grok-based transformer does all the heavy lifting by understanding your engagement history (what you liked, replied to, shared, etc.) and using that to determine what content is relevant to you.

┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    FOR YOU FEED REQUEST                                     │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
                                               │
                                               ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                         HOME MIXER                                          │
│                                    (Orchestration Layer)                                    │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                   QUERY HYDRATION                                   │   │
│   │  ┌──────────────────────────┐    ┌──────────────────────────────────────────────┐   │   │
│   │  │ User Action Sequence     │    │ User Features                                │   │   │
│   │  │ (engagement history)     │    │ (following list, preferences, etc.)          │   │   │
│   │  └──────────────────────────┘    └──────────────────────────────────────────────┘   │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                  CANDIDATE SOURCES                                  │   │
│   │         ┌─────────────────────────────┐    ┌────────────────────────────────┐       │   │
│   │         │        THUNDER              │    │     PHOENIX RETRIEVAL          │       │   │
│   │         │    (In-Network Posts)       │    │   (Out-of-Network Posts)       │       │   │
│   │         │                             │    │                                │       │   │
│   │         │  Posts from accounts        │    │  ML-based similarity search    │       │   │
│   │         │  you follow                 │    │  across global corpus          │       │   │
│   │         └─────────────────────────────┘    └────────────────────────────────┘       │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                      HYDRATION                                      │   │
│   │  Fetch additional data: core post metadata, author info, media entities, etc.       │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                      FILTERING                                      │   │
│   │  Remove: duplicates, old posts, self-posts, blocked authors, muted keywords, etc.   │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                       SCORING                                       │   │
│   │  ┌──────────────────────────┐                                                       │   │
│   │  │  Phoenix Scorer          │    Grok-based Transformer predicts:                   │   │
│   │  │  (ML Predictions)        │    P(like), P(reply), P(repost), P(click)...          │   │
│   │  └──────────────────────────┘                                                       │   │
│   │               │                                                                     │   │
│   │               ▼                                                                     │   │
│   │  ┌──────────────────────────┐                                                       │   │
│   │  │  Weighted Scorer         │    Weighted Score = Σ (weight × P(action))            │   │
│   │  │  (Combine predictions)   │                                                       │   │
│   │  └──────────────────────────┘                                                       │   │
│   │               │                                                                     │   │
│   │               ▼                                                                     │   │
│   │  ┌──────────────────────────┐                                                       │   │
│   │  │  Author Diversity        │    Attenuate repeated author scores                   │   │
│   │  │  Scorer                  │    to ensure feed diversity                           │   │
│   │  └──────────────────────────┘                                                       │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                                      SELECTION                                      │   │
│   │                    Sort by final score, select top K candidates                     │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                              │                                              │
│                                              ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────────────────────────┐   │
│   │                              FILTERING (Post-Selection)                             │   │
│   │                 Visibility filtering (deleted/spam/violence/gore etc)               │   │
│   └─────────────────────────────────────────────────────────────────────────────────────┘   │
│                                                                                             │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
                                               │
                                               ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                     RANKED FEED RESPONSE                                    │
└─────────────────────────────────────────────────────────────────────────────────────────────┘

Location: home-mixer/

The orchestration layer that assembles the For You feed. It leverages the CandidatePipeline framework with the following stages:

| Stage | Description |
| --- | --- |
| Query Hydrators | Fetch user context (engagement history, following list) |
| Sources | Retrieve candidates from Thunder and Phoenix |
| Hydrators | Enrich candidates with additional data |
| Filters | Remove ineligible candidates |
| Scorers | Predict engagement and compute final scores |
| Selector | Sort by score and select top K |
| Post-Selection Filters | Final visibility and dedup checks |
| Side Effects | Cache request info for future use |

The server exposes a gRPC endpoint (ScoredPostsService) that returns ranked posts for a given user.

Location: thunder/

An in-memory post store and realtime ingestion pipeline that tracks recent posts from all users. It:

- Consumes post create/delete events from Kafka
- Maintains per-user stores for original posts, replies/reposts, and video posts
- Serves "in-network" post candidates from accounts the requesting user follows
- Automatically trims posts older than the retention period

Thunder enables sub-millisecond lookups for in-network content without hitting an external database.
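The store-and-trim behaviour described above can be sketched roughly as follows. This is an illustrative Python toy, not Thunder's actual code: the class and method names (`InMemoryPostStore`, `on_create`, `on_delete`, `in_network_candidates`) are hypothetical, a single store per author stands in for the separate original/reply/video stores, and plain method calls stand in for Kafka events.

```python
from collections import defaultdict, deque


class InMemoryPostStore:
    """Toy per-author post store with time-based retention (hypothetical names)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        # author_id -> deque of (created_at, post_id); assumes events arrive oldest first
        self.posts_by_author = defaultdict(deque)
        self.deleted = set()

    def on_create(self, author_id, post_id, created_at):
        # Stand-in for consuming a "post create" event.
        self.posts_by_author[author_id].append((created_at, post_id))

    def on_delete(self, post_id):
        # Stand-in for consuming a "post delete" event.
        self.deleted.add(post_id)

    def trim(self, now):
        # Drop posts older than the retention period.
        cutoff = now - self.retention_seconds
        for posts in self.posts_by_author.values():
            while posts and posts[0][0] < cutoff:
                posts.popleft()

    def in_network_candidates(self, following, now):
        # Recent, non-deleted posts from followed accounts, newest first.
        self.trim(now)
        found = [
            (created_at, post_id)
            for author_id in following
            for created_at, post_id in self.posts_by_author.get(author_id, ())
            if post_id not in self.deleted
        ]
        return [post_id for _, post_id in sorted(found, reverse=True)]
```

Because every lookup is a dictionary access followed by a scan of short per-author queues, no external database call is needed on the request path, which is the property the paragraph above emphasizes.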

Location: phoenix/

The ML component with two main functions:

1. Retrieval (Two-Tower Model)

Finds relevant out-of-network posts:

- User Tower: Encodes user features and engagement history into an embedding
- Candidate Tower: Encodes all posts into embeddings
- Similarity Search: Retrieves top-K posts via dot product similarity
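The similarity-search step can be illustrated with a brute-force sketch. It assumes the two towers have already produced embeddings, represented here as plain lists of floats; the function names are hypothetical, and a production system would presumably use an index rather than scanning every post.

```python
import heapq


def dot(u, v):
    # Dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(user_embedding, post_embeddings, k):
    """Return the k post IDs whose embeddings score highest against the user embedding.

    `post_embeddings` maps post_id -> embedding (output of the candidate tower).
    """
    scored = (
        (dot(user_embedding, embedding), post_id)
        for post_id, embedding in post_embeddings.items()
    )
    # nlargest returns the k highest (score, post_id) pairs, best first.
    return [post_id for _, post_id in heapq.nlargest(k, scored)]
```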

2. Ranking (Transformer with Candidate Isolation)

Predicts engagement probabilities for each candidate:

- Takes user context (engagement history) and candidate posts as input
- Uses special attention masking so candidates cannot attend to each other
- Outputs probabilities for each action type (like, reply, repost, click, etc.)
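One way to picture the masking is as a boolean matrix over sequence positions. The sketch below is an assumption about the layout (user-context tokens first, then one token per candidate) and is not taken from the Phoenix code; it also omits any causal masking inside the user context.

```python
def candidate_isolation_mask(num_context, num_candidates):
    """mask[q][k] is True when query position q may attend to key position k.

    Positions 0..num_context-1 hold user context; the remaining positions
    hold candidates (an assumed layout for illustration).
    """
    total = num_context + num_candidates
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_context:
                # Every position may read the user context.
                mask[q][k] = True
            elif q == k:
                # A candidate may attend to itself, but never to another candidate.
                mask[q][k] = True
    return mask
```

With this mask, the row for a given candidate is the same no matter which other candidates share the batch, which is why its score does not depend on its neighbours.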

See phoenix/README.md for detailed architecture documentation.

Location: candidate-pipeline/

A reusable framework for building recommendation pipelines. Defines traits for:

| Trait | Purpose |
| --- | --- |
| `Source` | Fetch candidates from a data source |
| `Hydrator` | Enrich candidates with additional features |
| `Filter` | Remove candidates that shouldn't be shown |
| `Scorer` | Compute scores for ranking |
| `Selector` | Sort and select top candidates |
| `SideEffect` | Run async side effects (caching, logging) |

The framework runs sources and hydrators in parallel where possible, with configurable error handling and logging.

1. Query Hydration: Fetch the user's recent engagement history and metadata (e.g., following list)
2. Candidate Sourcing: Retrieve candidates from:
   - Thunder: Recent posts from followed accounts (in-network)
   - Phoenix Retrieval: ML-discovered posts from the global corpus (out-of-network)
3. Candidate Hydration: Enrich candidates with:
   - Core post data (text, media, etc.)
   - Author information (username, verification status)
   - Video duration (for video posts)
   - Subscription status
4. Pre-Scoring Filters: Remove posts that are:
   - Duplicates
   - Too old
   - From the viewer themselves
   - From blocked/muted accounts
   - Containing muted keywords
   - Previously seen or recently served
   - Ineligible subscription content
5. Scoring: Apply multiple scorers sequentially:
   - Phoenix Scorer: Get ML predictions from the Phoenix transformer model
   - Weighted Scorer: Combine predictions into a final relevance score
   - Author Diversity Scorer: Attenuate repeated author scores for diversity
   - OON Scorer: Adjust scores for out-of-network content
6. Selection: Sort by score and select the top K candidates
7. Post-Selection Processing: Final validation of post candidates to be served
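The author-diversity and selection steps can be sketched as below. The source only says that repeated authors are attenuated, so the geometric decay, the default of 0.5, and the function names are assumptions for illustration; the sketch also assumes non-negative scores.

```python
def attenuate_repeated_authors(candidates, decay=0.5):
    """Scale down the score of each additional post by the same author.

    `candidates` is a list of (post_id, author_id, score). Walking from the
    highest score down, the n-th post seen from an author (n = 0, 1, 2, ...)
    keeps score * decay**n.
    """
    seen = {}
    adjusted = []
    for post_id, author_id, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        n = seen.get(author_id, 0)
        adjusted.append((post_id, author_id, score * decay ** n))
        seen[author_id] = n + 1
    return adjusted


def select_top_k(candidates, k):
    # Sort by (adjusted) score and keep the best k.
    return sorted(candidates, key=lambda c: c[2], reverse=True)[:k]
```

In this toy version, an author's second-best post has to beat other authors' posts at half its original score, so a single prolific account cannot fill the top of the feed on raw score alone.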

The Phoenix Grok-based transformer model predicts probabilities for multiple engagement types:

Predictions:
├── P(favorite)
├── P(reply)
├── P(repost)
├── P(quote)
├── P(click)
├── P(profile_click)
├── P(video_view)
├── P(photo_expand)
├── P(share)
├── P(dwell)
├── P(follow_author)
├── P(not_interested)
├── P(block_author)
├── P(mute_author)
└── P(report)

The Weighted Scorer combines these into a final score:

Final Score = Σ (weight_i × P(action_i))

Positive actions (like, repost, share) have positive weights. Negative actions (block, mute, report) have negative weights, pushing down content the user would likely dislike.
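As a worked example of the formula, the sketch below combines a handful of predicted probabilities. The weight values are invented for illustration; the source states only that positive actions carry positive weights and negative actions carry negative ones.

```python
# Hypothetical weights: the source does not state the actual values.
WEIGHTS = {
    "favorite": 1.0,
    "reply": 2.0,
    "repost": 1.5,
    "not_interested": -3.0,
    "report": -10.0,
}


def weighted_score(predictions, weights=WEIGHTS):
    """Final Score = sum over actions of weight_i * P(action_i).

    Actions missing from `predictions` contribute nothing.
    """
    return sum(weight * predictions.get(action, 0.0) for action, weight in weights.items())
```

With these made-up weights, a post with P(favorite) = 0.5 and P(reply) = 0.25 scores 0.5 + 0.5 = 1.0, while adding P(report) = 0.125 subtracts 1.25 and drops the same post to -0.25.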

Filters run at two stages:

Pre-Scoring Filters:

| Filter | Purpose |
| --- | --- |
| `DropDuplicatesFilter` | Remove duplicate post IDs |
| `CoreDataHydrationFilter` | Remove posts that failed to hydrate core metadata |
| `AgeFilter` | Remove posts older than threshold |
| `SelfpostFilter` | Remove user's own posts |
| `RepostDeduplicationFilter` | Dedupe reposts of same content |
| `IneligibleSubscriptionFilter` | Remove paywalled content user can't access |
| `PreviouslySeenPostsFilter` | Remove posts user has already seen |
| `PreviouslyServedPostsFilter` | Remove posts already served in session |
| `MutedKeywordFilter` | Remove posts with user's muted keywords |
| `AuthorSocialgraphFilter` | Remove posts from blocked/muted authors |

Post-Selection Filters:

| Filter | Purpose |
| --- | --- |
| `VFFilter` | Remove posts that are deleted/spam/violence/gore etc. |
| `DedupConversationFilter` | Deduplicate multiple branches of the same conversation thread |
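A filter chain of this kind can be sketched as a list of functions applied in order. The three functions below loosely correspond to `DropDuplicatesFilter`, `AgeFilter`, and `SelfpostFilter`; their signatures, the dictionary-shaped posts, and the one-day age threshold are assumptions, not the repository's definitions.

```python
def drop_duplicates(posts, viewer_id, now):
    # Keep only the first occurrence of each post ID.
    seen, kept = set(), []
    for post in posts:
        if post["id"] not in seen:
            seen.add(post["id"])
            kept.append(post)
    return kept


def age_filter(posts, viewer_id, now, max_age_seconds=86_400):
    # The one-day threshold is an arbitrary placeholder.
    return [p for p in posts if now - p["created_at"] <= max_age_seconds]


def self_post_filter(posts, viewer_id, now):
    # Remove the viewer's own posts.
    return [p for p in posts if p["author_id"] != viewer_id]


PRE_SCORING_FILTERS = [drop_duplicates, age_filter, self_post_filter]


def apply_filters(posts, filters, viewer_id, now):
    # Each filter sees only what the previous one let through.
    for f in filters:
        posts = f(posts, viewer_id, now)
    return posts
```

Running the cheap eligibility checks before scoring means the transformer never spends inference time on posts that could not be shown anyway.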

1. No Hand-Engineered Features

The system relies entirely on the Grok-based transformer to learn relevance from user engagement sequences. No manual feature engineering for content relevance. This significantly reduces the complexity in our data pipelines and serving infrastructure.

2. Candidate Isolation in Ranking

During transformer inference, candidates cannot attend to each other—only to the user context. This ensures the score for a post doesn't depend on which other posts are in the batch, making scores consistent and cacheable.

3. Hash-Based Embeddings

Both retrieval and ranking use multiple hash functions for embedding lookup.
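The general technique of looking up embeddings through several hash functions can be sketched as follows. Everything specific here is an assumption: the source does not say which hash functions are used, how many tables there are, or how the rows are combined, so this example uses seeded SHA-256 and sums one row per table.

```python
import hashlib


def hash_bucket(item_id, seed, num_buckets):
    # Derive a different hash function per seed by mixing the seed into the input.
    digest = hashlib.sha256(f"{seed}:{item_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def lookup_embedding(item_id, tables):
    """Sum one row from each table, each row chosen by a differently seeded hash.

    `tables` is a list of embedding tables; each table is a list of rows,
    and every row has the same dimension.
    """
    dim = len(tables[0][0])
    out = [0.0] * dim
    for seed, table in enumerate(tables):
        row = table[hash_bucket(item_id, seed, len(table))]
        out = [a + b for a, b in zip(out, row)]
    return out
```

Two IDs that collide in one table are unlikely to collide in all of them, so a fixed-size set of tables can serve an unbounded ID space while still giving most items a distinct combined embedding.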

4. Multi-Action Prediction

Rather than predicting a single "relevance" score, the model predicts probabilities for many actions.

5. Composable Pipeline Architecture

The candidate-pipeline crate provides a flexible framework for building recommendation pipelines with:

- Separation of pipeline execution and monitoring from business logic
- Parallel execution of independent stages and graceful error handling
- Easy addition of new sources, hydrations, filters, and scorers

This project is licensed under the Apache License 2.0. See LICENSE for details.