LiteLLM (YC W23): Founding Reliability Engineer – $200K-$270K and 0.5-1.0% equity

Original link: https://www.ycombinator.com/companies/litellm/jobs/unlCynJ-founding-reliability-performance-engineer

## LiteLLM: First Reliability Engineer - Summary

LiteLLM is a fast-growing open-source AI gateway that provides AI infrastructure for large companies such as NASA, Netflix, and Stripe, handling hundreds of millions of API calls daily at $7M in annual recurring revenue (ARR). They are hiring their first dedicated reliability engineer to keep the platform stable inside customers' AI stacks.

The role combines operational reliability (60%) and performance engineering (40%), tackling problems such as memory leaks, race conditions, and database bottlenecks at scale. Responsibilities include on-call support, incident response, performance optimization, and building observability tooling.

The ideal candidate has 2+ years of experience running Python services in production, a deep understanding of async Python, PostgreSQL, and Kubernetes, and the ability to debug complex issues. Proxy/gateway experience or time at an infrastructure-focused company such as Stripe or Cloudflare is a plus.

This is a high-impact opportunity to define reliability practices at a fast-growing startup, with significant open-source visibility and meaningful equity.


Original post

TLDR

LiteLLM is an open-source AI gateway (36K+ GitHub stars) that routes hundreds of millions of LLM API calls daily for companies like NASA, Adobe, Netflix, Stripe, and Nvidia. We're at $7M ARR, 10 people, YC W23.

When LiteLLM goes down, our customers' entire AI stack goes down. We need someone who makes sure that doesn't happen.

You'd be the first dedicated reliability hire. You'll own reliability, performance, and production stability end-to-end. Nobody will tell you how to do it.

What this job actually is

We'll be straight with you: this role is roughly 60% operational reliability and 40% deep performance engineering. On any given week you might be:

  • Hunting a memory leak in our async streaming handler that causes OOMs after 4 hours under load (a sketch of this kind of hunt follows this list)
  • Fixing a race condition where PodLockManager releases another pod's lock
  • Profiling why update_database() does 7 deep copies per request in the spend tracking hot path
  • Helping a Fortune 500 customer debug why their 20-pod deployment is exhausting Postgres connections
  • Building soak tests that catch degradation before a release goes out
  • Reviewing a PR that touches the request hot path and saying "this will add 50ms at P99, here's why"
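
For a concrete flavor of the first bullet, here is a minimal, hedged sketch of that kind of leak hunt using Python's built-in tracemalloc; the leaky handler is a stand-in, not LiteLLM's actual code:

```python
# Hedged sketch: diff heap snapshots to localize growth in a long-running
# async service. leaky_handler is a stand-in, not LiteLLM's actual code.
import asyncio
import tracemalloc

_retained = []  # simulates state that is never released per "request"

async def leaky_handler():
    _retained.append(bytearray(64 * 1024))  # 64 KiB retained per call

async def main():
    tracemalloc.start(10)  # keep 10 stack frames per allocation site
    before = tracemalloc.take_snapshot()

    for _ in range(500):
        await leaky_handler()

    after = tracemalloc.take_snapshot()
    # Rank allocation sites by net growth; the leak surfaces at the top.
    for stat in after.compare_to(before, "lineno")[:5]:
        print(stat)

asyncio.run(main())
```

In a real soak test, the same snapshot diff would run against the proxy under sustained load rather than a synthetic loop.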

If you're looking for a pure optimization role where you sit in a profiler all day — this isn't it. If you want to own production health for one of the most widely deployed AI infrastructure projects in the world — keep reading.

Why this matters

We route traffic for some of the largest AI deployments on the planet. One customer is scaling from 20M to 200M daily AI calls through our gateway. Another has 150K users hitting us daily. When we ship a bad release, it doesn't just break a dashboard — it breaks production AI systems at companies you've heard of.

The problems here are genuinely hard:

  • Memory management in long-running Python async services — our proxy handles thousands of concurrent streaming connections. HTTP client sessions, response iterators, and background tasks all need careful lifecycle management (a minimal lifecycle sketch follows this list).
  • Database at scale — spend logging, auth, and rate limiting all interact with Postgres. At 100K+ requests/day, naive patterns fall apart.
  • 100+ provider surface area — we translate between OpenAI, Anthropic, Bedrock, Vertex, and 100+ other APIs. Each has unique streaming behavior. A refactor that fixes one provider can break three others.
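
As an illustration of the lifecycle discipline the first bullet describes, here is a minimal sketch using httpx (assumed structure, not LiteLLM's actual implementation): one shared client with bounded pools, and streamed upstream responses that are released even if the consumer stops reading mid-stream:

```python
# Minimal sketch (not LiteLLM's actual code): one shared AsyncClient with
# bounded pools, and streamed upstream responses that are always released.
import httpx

client = httpx.AsyncClient(
    limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
    timeout=httpx.Timeout(30.0),
)

async def relay_stream(url: str):
    # `async with` releases the response and its pooled connection when
    # iteration finishes, the consumer disconnects, or an error is raised.
    async with client.stream("GET", url) as response:
        async for chunk in response.aiter_bytes():
            yield chunk

async def shutdown():
    # Without an explicit aclose(), pooled connections linger after exit.
    await client.aclose()
```

Forgetting any one of these (a per-request client, an unclosed response iterator, an orphaned background task) is exactly the kind of slow leak this role exists to catch.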

You won't run out of interesting problems.

What you'll own

Production reliability

  • On-call for critical issues (shared rotation with the team, not solo)
  • Incident response and blameless post-mortems
  • Customer escalation support for enterprise deployments
  • Making the proxy self-healing when DB/Redis is temporarily unavailable (one possible shape is sketched below)
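
For that last bullet, one plausible shape is to fail open and buffer writes until the dependency recovers. This is a hedged sketch with assumed interfaces; SpendLogger and store.write() are illustrative names, not LiteLLM's API:

```python
# Hedged sketch: degrade instead of failing requests when the backing
# store blips. SpendLogger and store.write() are illustrative names.
import asyncio
import logging

log = logging.getLogger("proxy")

class SpendLogger:
    def __init__(self, store):
        self.store = store  # assumed to expose `async def write(record)`
        self.pending: asyncio.Queue = asyncio.Queue()

    async def log(self, record: dict):
        try:
            await self.store.write(record)
        except Exception:
            # Fail open: the request still succeeds; retry the write later.
            log.warning("store unavailable, buffering spend record")
            await self.pending.put(record)

    async def flush_loop(self):
        # Background task: drain the buffer once the dependency recovers.
        while True:
            record = await self.pending.get()
            try:
                await self.store.write(record)
            except Exception:
                await self.pending.put(record)  # still down; back off
                await asyncio.sleep(5)
```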

Performance engineering

  • Memory leak detection and prevention (soak tests, CI integration)
  • Hot path optimization — our target is <10ms overhead at 5K+ RPS
  • P50/P95/P99 latency benchmarks that block releases on regression (a minimal gate is sketched after this list)
  • Profiling and fixing bottlenecks (Pydantic validation, connection pools, async task scheduling)
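
A minimal version of such a release gate might look like the sketch below. The endpoint and sample count are placeholders, and the 10ms budget mirrors the overhead target above:

```python
# Hedged sketch of a release-gating latency check. URL and sample size
# are placeholders; the budget mirrors the stated <10ms overhead target.
import asyncio
import statistics
import time

import httpx

URL = "http://localhost:4000/health"  # illustrative local proxy endpoint
P99_BUDGET_MS = 10.0

async def measure(client: httpx.AsyncClient, samples: list):
    start = time.perf_counter()
    await client.get(URL)
    samples.append((time.perf_counter() - start) * 1000.0)

async def main():
    samples: list = []
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(measure(client, samples) for _ in range(1000)))
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    print(f"P50={p50:.1f}ms  P95={p95:.1f}ms  P99={p99:.1f}ms")
    if p99 > P99_BUDGET_MS:
        raise SystemExit(f"P99 {p99:.1f}ms exceeds {P99_BUDGET_MS}ms budget")

asyncio.run(main())
```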

Observability & release safety

  • Structured logging, distributed tracing, correlation IDs (see the sketch after this list)
  • Prometheus metrics that are actually accurate and actionable
  • Building toward canary deployments and automated rollback
  • SLO definition and tracking for enterprise customers
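
To make the first two bullets concrete, here is a hedged sketch (metric names and buckets are illustrative, not LiteLLM's actual instrumentation) of a per-request correlation ID carried across awaits via contextvars, plus a Prometheus histogram with buckets fine enough to resolve a sub-10ms overhead target:

```python
# Hedged sketch: correlation IDs that survive `await`, and a latency
# histogram with buckets fine enough for a sub-10ms target. Names are
# illustrative, not LiteLLM's actual metrics.
import contextvars
import time
import uuid

from prometheus_client import Histogram

# contextvars propagate across awaits, so every log line and downstream
# call within one request sees the same ID without passing it manually.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

REQUEST_LATENCY = Histogram(
    "proxy_request_seconds",
    "End-to-end proxy request latency",
    buckets=(0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 1.0),
)

async def handle(process_request):
    correlation_id.set(uuid.uuid4().hex)
    start = time.perf_counter()
    try:
        return await process_request()
    finally:
        REQUEST_LATENCY.observe(time.perf_counter() - start)
```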

Who you are

Must have:

  • 2+ years of experience running Python services in production, with real exposure to debugging things that break at scale
  • Strong understanding of Python async internals — asyncio event loop, aiohttp/httpx session management, connection pooling
  • Experience debugging production memory leaks, OOMs, or latency degradation (bonus if you've used memray, py-spy, or tracemalloc)
  • Solid PostgreSQL knowledge — connection pool tuning, query optimization, understanding how DB operations on the request path degrade under load
  • Comfort with Kubernetes at an operational level — pod lifecycle, resource limits, health probes
  • You've been on-call before and you didn't hate it

Strong signals:

  • You've worked on a proxy, API gateway, load balancer, or middleware service where overhead itself is what you optimize
  • You've worked at Meta (Production Engineering), Cloudflare, Fastly, Datadog, Stripe, or a similar infrastructure company
  • You've been an early reliability/infra hire at a startup and built production practices from scratch
  • You've contributed to open-source infrastructure projects
  • You understand HTTP/2, streaming responses (SSE), and how async Python handles them under concurrency

Why LiteLLM

  • Scale & impact: Your work is in the critical path for hundreds of millions of AI API calls daily. NASA, Netflix, Adobe, Stripe depend on this.
  • Open source visibility: 36K GitHub stars. Your contributions are visible to the entire AI infrastructure community. Your GitHub profile will look incredible.
  • Ownership: First dedicated reliability hire. You define what reliability means here. No bureaucracy, no tickets — you see a problem, you fix it.
  • Trajectory: $7M ARR growing fast, 10-person team, YC W23. Meaningful equity at a stage where it can matter.