Troubleshooting Email Delivery to Microsoft Users


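## Microsoft Email Delivery Issue - February 2026

In February 2026, a major issue caused emails sent to Microsoft (Hotmail, Live, MSN, Outlook) users to be deferred with a "451 ... rate limited due to IP reputation" error. Although Sendgrid showed a 99% sending reputation and the Gmail Postmaster Tools metrics looked normal, all email to Microsoft addresses stopped being delivered. The initial investigation pointed to a problem on Microsoft's side, possibly an outage or a misconfiguration, because similar reports were appearing online. The company received user complaints, particularly about delayed transactional emails (login, billing, and so on). As a stopgap, the customer support team was briefed to manage user expectations. Further research showed that Microsoft is very sensitive to "spiky" email traffic. Analysis of the send logs found that, shortly before the issue began, the number of emails sent to Microsoft users within a single minute was higher than usual, which may have triggered rate limiting. A Redis-backed rate limiter was quickly implemented to control the send rate to Microsoft addresses, capped at 10 emails per minute per IP. After an escalation to Microsoft, a human support representative confirmed that the connection limits had been adjusted. Delivery resumed within 72 hours. The incident underlined the need for a dedicated IP address for transactional email and for a more robust throttling system at the Sendgrid level.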

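## Microsoft Email Delivery Issues

A Hacker News discussion highlighted the persistent difficulty of sending email to Microsoft (Hotmail, Live, Outlook) users. Many users reported unexplained delivery failures, usually accompanied by vague error messages and unhelpful support from Microsoft. One user described a years-long problem in which emails with "summer camp" in the subject line were blocked, which was resolved only after the wording was changed slightly to "summer lamp". Others described similarly frustrating experiences of being blocked suddenly and without explanation, even while following email best practices (SPF, DKIM, DMARC). A key takeaway was the importance of separating transactional and marketing email streams, ideally by using subdomains or even separate domains. Many commenters felt that Microsoft's filtering is overly sensitive and may rely on heuristics that penalize legitimate senders. Many ultimately resorted to alternative services such as Amazon SES to get around Microsoft's restrictions, citing the lack of support and their belief that Microsoft is pushing senders toward its own Office 365 platform.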

Original text

On Feb 24, 2026, we started to see a rise in user complaints saying they weren't receiving our emails. They all had one thing in common: Microsoft as their email provider. Since I'd never encountered this before, my first reaction was to check my Sendgrid logs for details. According to the logs, emails to Hotmail, Live, MSN, and Outlook email addresses were being deferred. Emails to other email addresses were flowing normally. Digging deeper, Sendgrid showed this reason for the deferral:

451 4.7.650 The mail server [redacted] has been temporarily rate limited due to IP reputation. For e-mail delivery information, see https://aka.ms/postmaster (S775) [Name=Protocol Filter Agent][AGT=PFA][MxId=redacted]

This raised even more questions. First, no emails were being delivered to Microsoft users at all. It wasn't a 'temporary rate limit'. It was more like a temporary ban. For how long? Unclear. And due to IP reputation? Sendgrid showed a 99% sending reputation. I opened Gmail Postmaster Tools and the spam rate looked normal. I was following all the best practices with SPF, DKIM, and DMARC. I also hadn't made any changes to my email setup recently.
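For reference, the deferral line has a regular shape: a three-digit SMTP reply code, an enhanced status code, then free text. In SMTP, 4xx replies signal a transient failure that the sender may retry, which is consistent with Sendgrid deferring these emails instead of dropping them. The sketch below pulls those pieces apart; `parse_smtp_reply` and `SmtpReply` are hypothetical names used only for illustration, not part of Sendgrid's or Microsoft's tooling.

```python
import re
from typing import NamedTuple, Optional


class SmtpReply(NamedTuple):
    code: int                 # basic three-digit reply code, e.g. 451
    enhanced: Optional[str]   # enhanced status code, e.g. "4.7.650"
    transient: bool           # True for 4xx replies, which may be retried


# Reply code, then an optional class.subject.detail enhanced status code.
_REPLY_RE = re.compile(r"^(\d{3})[ -](?:(\d\.\d{1,3}\.\d{1,3})\s)?")


def parse_smtp_reply(line: str) -> SmtpReply:
    """Split an SMTP reply line into its codes (hypothetical helper)."""
    match = _REPLY_RE.match(line)
    if match is None:
        raise ValueError(f"not an SMTP reply: {line!r}")
    code = int(match.group(1))
    return SmtpReply(code, match.group(2), 400 <= code < 500)


reply = parse_smtp_reply(
    "451 4.7.650 The mail server [redacted] has been temporarily rate limited "
    "due to IP reputation."
)
print(reply)
```

The codes say only that the failure is temporary. They do not say how long it will last, which matches the uncertainty described above.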

A quick Google search for this error code led me to this: https://learn.microsoft.com/en-us/answers/questions/5786144/all-sending-ips-temporarily-rate-limited-451-4-7. Interestingly, it had been posted less than 24 hours earlier. It seemed I wasn't alone. This led me to believe that the problem lay with Microsoft and not me; either an outage or a misconfiguration on their end was penalizing legitimate senders. Surely, it would resolve itself...

Email deliverability is crucial for the smooth running of our business. We send ~350k emails per month, of which ~39k are transactional (login, billing, password reset, etc.). Currently, all of these emails are sent from two dedicated IPs, whether they are transactional or not. Both IPs were being rate limited.

Complaints were flowing in on Helpscout. I first instructed my CX reps to create a saved reply to calm users down. Lucky for us, emails sent via Helpscout don't go through our Sendgrid IPs.

Hi {%customer.firstName,fallback=there%},
It appears Microsoft is throttling some of our emails which is preventing the email from reaching your inbox in a timely manner.
We recommend waiting for 24 hours. If you still haven't received the email, please reply back.
We apologize for the inconvenience.

Even though I firmly believed this was a Microsoft problem which would eventually fix itself, I couldn't rest easy knowing this was happening. So I busied myself with research about email deliverability to Microsoft users. I've picked up many email deliverability quirks during times of crisis like this. This time, I learned that Microsoft has a reputation for being hypersensitive compared to other email providers like Gmail.

I also learned about SNDS, Microsoft's version of Gmail Postmaster Tools, and immediately created an account. This allowed me to confirm my IP reputation from Microsoft's perspective. As I expected, everything was normal. The complaint rate was < 0.1%, and the days leading up to the incident showed all green boxes and no red flags. This further convinced me it was a Microsoft issue.

Nevertheless, some mysterious inner drive prevented me from resigning myself to waiting for Microsoft to fix the problem or lift the ban. It bothered me that loyal users of my site were being affected, and I couldn't remedy it immediately.

As a next step, I emailed Microsoft via their support portal: https://olcsupport.office.com/. I made sure to include as much relevant information as possible, including a subset of Sendgrid logs. This also forced me to clarify my understanding of the issue. I hit submit, not very optimistic that I'd hear back anytime soon.

The process of gathering all the evidence and submitting that ticket brought my attention to something I had ignored earlier. The error said the server "has been temporarily rate limited due to IP reputation". So far, I had been focused on the IP reputation part of that message. But during my research on Microsoft deliverability, I learned that senders can be rate limited for other reasons too, most notably spiky send traffic. Part of Microsoft's motivation is to contain a potential threat before it does more harm. A temporary ban, like the one I was experiencing, fit this theory.

We send out a weekly personalized newsletter. We split the sends into four batches, Monday through Thursday, to spread out the traffic. Each batch is sent as fast as our system can call Sendgrid's Mail API. We don't throttle. And I confirmed with Sendgrid's support team that they don't throttle either.

Is it possible Microsoft imposed a temporary ban on my IPs after it saw a sudden spike in emails originating from them?

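-- Count send attempts to Microsoft-hosted addresses per minute.
SELECT
  DATE_TRUNC('minute', created_at) AS minute_group,
  COUNT(*) AS attempts
FROM notifications
WHERE
  -- Unanchored pattern: it matches these strings anywhere in the address,
  -- so it can also catch local parts such as 'olive@...'.
  email ~ 'hotmail|live|msn|outlook' AND
  DATE(created_at) <= '2026-02-25'
GROUP BY 1
ORDER BY 2 DESC;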

This gave the number of emails we attempted to send to Microsoft users each minute leading up to the incident. After filtering for counts > 50, here's the result:

2026-02-23 21:48:00,53
2025-11-26 15:36:00,66
2025-10-28 21:48:00,52
2025-08-05 21:51:00,54

The first row looked suspicious because it occurred right before the incident started. Before that, we hadn't sent that many emails to Microsoft users in a single minute since November 2025.
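The same per-minute bucketing can be reproduced outside the database, for example against an exported list of send timestamps. The following is a minimal sketch under that assumption; `busiest_minutes` is a hypothetical helper and the sample timestamps are made up purely to show the mechanics.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, List, Tuple


def busiest_minutes(
    sent_at: Iterable[datetime], threshold: int
) -> List[Tuple[datetime, int]]:
    """Return (minute, count) pairs whose count exceeds threshold, newest first."""
    # Truncate each timestamp to the minute, like DATE_TRUNC('minute', ...).
    per_minute = Counter(ts.replace(second=0, microsecond=0) for ts in sent_at)
    spikes = [(minute, n) for minute, n in per_minute.items() if n > threshold]
    return sorted(spikes, reverse=True)


# Made-up timestamps for illustration only: 3 sends in one minute, 1 in the next.
sample = [
    datetime(2026, 1, 1, 9, 0, 5),
    datetime(2026, 1, 1, 9, 0, 20),
    datetime(2026, 1, 1, 9, 0, 59),
    datetime(2026, 1, 1, 9, 1, 0),
]
spikes = busiest_minutes(sample, threshold=2)
print(spikes)
```

With a threshold of 50 and the real send log, this would yield the same kind of list as the four rows above.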

By now, I had received a boilerplate response from Microsoft saying my sending IPs were fine. I responded saying things were not fine and tried my best to sound professional even though I was panicked. My CX reps were doing a great job handling the fallout on Helpscout. Sendgrid was periodically retrying deferred emails. It would continue to do so for up to 72 hours. I still had no clue how long this ban would last. The uncertainty was weighing on me.

But the results of my SQL query and deliverability research convinced me that regardless of the cause, I needed a way to control my sending rate to Microsoft users. Coding is a great cure for uncertainty. It's predictable, controlled, and gives a sense of accomplishment. So I dived right into this task. I needed something simple and performant. Since Redis was already a part of my infrastructure, I decided to implement a simple Redis-backed throttler.

import time
import uuid
from abc import ABC
from typing import Tuple

# BaseRedisModel and Redis are defined elsewhere in the codebase (not shown).


class RedisRateLimiter(BaseRedisModel, ABC):
    def is_allowed(self, limit, window_seconds) -> Tuple[bool, int, float]:
        """
        Check if action is allowed under rate limit.

        Returns:
        - allowed: bool
        - remaining: int
        - retry_after: float
        """
        assert Redis is not None, 'Redis not configured'

        now = time.time()
        window_start = now - window_seconds

        # Batch the cleanup and both reads into one pipeline. The ZADD
        # further down runs separately, so concurrent workers can briefly
        # overshoot the limit.
        pipe = Redis.pipeline()
        # Remove old entries
        pipe.zremrangebyscore(
            self.cache_key,
            '-inf',
            window_start
        )
        # Count current entries
        pipe.zcard(self.cache_key)
        # Get oldest entry timestamp
        pipe.zrange(
            self.cache_key,
            0,
            0,
            withscores=True
        )
        results = pipe.execute()
        current_count = results[1]
        oldest_entry = results[2]

        if current_count < limit:
            # Allowed: record this send under a unique member name so
            # two sends with the same timestamp are both counted.
            Redis.zadd(
                self.cache_key,
                {f"{now}:{uuid.uuid4().hex}": now}
            )
            Redis.expire(
                self.cache_key,
                window_seconds
            )
            return True, limit - current_count - 1, 0.0

        # Rate limited
        if oldest_entry:
            retry_after = \
                (oldest_entry[0][1] + window_seconds) - now
        else:
            retry_after = window_seconds

        return False, 0, retry_after

    def count(self, window_seconds):
        """
        Get current count.
        """
        assert Redis is not None, 'Redis not configured'

        now = time.time()
        window_start = now - window_seconds
        count = Redis.zcount(
            self.cache_key,
            window_start,
            now
        )
        return count
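The idea behind the class above is a sliding-window log: keep one timestamp per send, discard those older than the window, and allow a new send only while fewer than `limit` remain. The sketch below shows the same logic in a single process without Redis, so it can be run and tested in isolation. `InMemoryRateLimiter` and its `clock` parameter are hypothetical names introduced here; unlike the Redis version, it does not coordinate across workers or machines.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class InMemoryRateLimiter:
    """Sliding-window log limiter for a single process (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._events: Deque[float] = deque()

    def is_allowed(self, limit: int, window_seconds: float) -> Tuple[bool, int, float]:
        now = self._clock()
        window_start = now - window_seconds
        # Drop timestamps that have slid out of the window.
        while self._events and self._events[0] <= window_start:
            self._events.popleft()
        if len(self._events) < limit:
            self._events.append(now)
            return True, limit - len(self._events), 0.0
        # Blocked until the oldest recorded event leaves the window.
        if self._events:
            retry_after = self._events[0] + window_seconds - now
        else:
            retry_after = window_seconds
        return False, 0, retry_after


# Fake clock so the example runs instantly and deterministically.
fake_now = [1000.0]
limiter = InMemoryRateLimiter(clock=lambda: fake_now[0])

first = limiter.is_allowed(limit=2, window_seconds=60)
second = limiter.is_allowed(limit=2, window_seconds=60)
third = limiter.is_allowed(limit=2, window_seconds=60)   # over the limit
fake_now[0] += 60                                        # a full window later
fourth = limiter.is_allowed(limit=2, window_seconds=60)
print(first, second, third, fourth)
```

Injecting the clock is a testing convenience. The Redis-backed class reads `time.time()` directly.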

I used the Redis limiter as follows:

class IPPoolRateLimiter(RedisRateLimiter):
    name = 'ip-pool-rate-limiter'

def send_email(recipient_email, ip_pool, ...):
    """
    Sends email.
    """
    def _rate_limit(name,
                    limit,
                    window_seconds=60):
        """
        Rate limit helper method.
        """
        limiter = IPPoolRateLimiter(
            f'{ip_pool.name}-{name}'
        )
        while True:
            is_allowed, _, retry_after = \
                limiter.is_allowed(
                    limit,
                    window_seconds
                )
            if is_allowed:
                break
            else:
                logger.warning(
                    f'Throttled email '
                    f'due to {name} limiter.'
                )
                time.sleep(retry_after)

    is_outlook = is_outlook_email(
        recipient_email
    )
    if is_outlook:
        # Outlook is very sensitive to spiky traffic
        _rate_limit('outlook', 10)
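The `is_outlook_email` helper is not shown in the post. One plausible shape for it, purely as an assumption, is a check on the domain part of the address against the four consumer brands mentioned earlier. The pattern below is illustrative and not exhaustive: it will not recognize custom domains hosted on Microsoft 365, for instance.

```python
import re

# Assumed pattern: one of the four brands named in the post, followed by a
# TLD such as .com or .fr, or a two-part suffix such as .co.uk.
_MICROSOFT_DOMAIN_RE = re.compile(
    r"(hotmail|live|msn|outlook)\.[a-z]{2,3}(\.[a-z]{2})?"
)


def is_outlook_email(recipient_email: str) -> bool:
    """Return True if the address's domain looks like a Microsoft consumer domain."""
    _, sep, domain = recipient_email.strip().lower().rpartition("@")
    if not sep:
        # No "@" at all, so there is no domain to inspect.
        return False
    # fullmatch anchors the pattern to the whole domain, so a local part
    # such as "olive" or a subdomain such as "live.example.com" is ignored.
    return _MICROSOFT_DOMAIN_RE.fullmatch(domain) is not None


print(is_outlook_email("jane@hotmail.com"), is_outlook_email("olive@gmail.com"))
```

Matching only the domain avoids false positives such as `olive@gmail.com`, which an unanchored substring match on `live` would catch.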

While I shipped and monitored this in production, I was surprised to hear back from a human being, Anthony, in response to my escalation email.

The connection and throttling limitation against your IP [redacted;redacted] has been set to a more appropriate level based on your reputation.

A few hours after receiving Anthony's response, Sendgrid started to deliver our emails to Hotmail, Live, MSN, and Outlook email addresses. Lucky for us, all this happened within 72 hours of the start of the incident, so no emails were dropped completely. Some emails were just delayed by ~72 hours.

Since implementing the throttling mechanism, my send rate to Microsoft's email addresses has not exceeded 10 emails per minute per IP. Users don't seem to care about the few seconds of delay this introduces.
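To put the cap in perspective, a quick back-of-the-envelope calculation helps. The sketch below uses only figures quoted in this post (the limit of 10 per minute and the 53-email minute from 2026-02-23); `drain_minutes` is a hypothetical helper, and the limit applies per limiter key as in the code above.

```python
import math


def drain_minutes(queued: int, per_minute_limit: int) -> int:
    """One-minute windows needed to send `queued` emails at a fixed cap."""
    return math.ceil(queued / per_minute_limit)


LIMIT_PER_MINUTE = 10  # the cap passed to _rate_limit('outlook', 10)

# The 53-email minute from 2026-02-23 would now be spread across 6 windows.
spike_windows = drain_minutes(53, LIMIT_PER_MINUTE)

# Upper bound on throttled sends per limiter key in a 30-day month.
monthly_ceiling = LIMIT_PER_MINUTE * 60 * 24 * 30
print(spike_windows, monthly_ceiling)
```

The monthly ceiling works out to 432,000 emails, which is above our total volume of ~350k across all providers. On average, then, the cap only smooths bursts and does not hold back overall throughput.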

It's hard to say for sure what caused the ban, because the 451 error does not offer much detail. In his response, Anthony also did not acknowledge any outage or change on Microsoft's end. My theory is that, during one of our batch newsletter send jobs, Microsoft received what it considered to be a spike of email traffic from both our IPs. Thinking it suspicious, it blocked both IPs for an undisclosed period of time.

Even though we sent a similar volume to Microsoft during a one-minute interval in November 2025, this time was different. This could be either due to a recent configuration change to be more sensitive or due to how "spiky" the sends felt this time compared to last time. Their algorithms are probably more sophisticated than my one-minute send count heuristic.

The problem was exacerbated for us because we send transactional and bulk emails through both IPs. The majority of user complaints were about not receiving transactional emails. I eventually want to warm up an IP solely for transactional emails. But our transactional email volume of ~39k emails per month doesn't justify such a move yet. Research suggests we need a send volume of ~100k emails per month to keep an IP warm. In retrospect, having the ability to send emails via a shared IP would have been handy.

Finally, Sendgrid is fantastic. I've used them for almost a decade and been happy with their service. But I feel they should be responsible for throttling sends to Microsoft at the IP level. Otherwise, I could hit their API a hundred times in a minute and get my sending IP blocked by Microsoft for an indefinite period of time...
