(comments)

Original link: https://news.ycombinator.com/item?id=41483675

The original creator and maintainer of Reclaim The Stack runs the self-hosted platform and did not drop monitoring. Observability is better than it was on Heroku, thanks to direct access to real-time resource consumption of all infrastructure and effectively unlimited log retention (capped at 12 months for GDPR reasons). Platform-related work takes roughly three days per month, mostly software upgrades. The team's initial hypothesis was that automated database operators would handle all the common high-availability, backup and disaster-recovery concerns; this has held true except for Redis, whose operator caused problems. To address this, they are replacing the Redis operator with their own Kubernetes templates. The other complex systems involved (Kubernetes, PostgreSQL, Elasticsearch, secrets management, the OS, local storage) have been manageable thanks to Talos Linux, talos-manager, sealed secrets plus CLI tooling, and home-grown best practices. With Talos Linux, upgrading Kubernetes is a single command with essentially zero downtime. Disaster recovery means rebuilding indexes from the source of truth (Postgres in this case) or restoring from off-site backups. Operators handle failover, data consistency and disk-space concerns. Overall, the cost savings so far would cover roughly four full-time DevOps hires, at the price of only a few extra days of monthly maintenance. The author acknowledges that not everyone should adopt Reclaim The Stack, but they have shared their experience and insights in detail online.

Original text

Original creator and maintainer of Reclaim the Stack here.

> you also removed monitoring of the platform

No, we did not. Monitoring: https://reclaim-the-stack.com/docs/platform-components/monit...

Log aggregation: https://reclaim-the-stack.com/docs/platform-components/log-a...

Observability is on the whole better than what we had at Heroku, since we now have direct access to realtime resource consumption of all infrastructure parts. We also have infinite log retention, which would have been prohibitively expensive using Heroku logging addons (though we cap retention at 12 months for GDPR reasons).
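
As an illustration of how such a cap can be enforced (a generic sketch, not necessarily our exact configuration), an index lifecycle policy that drops year-old log indices in Elasticsearch boils down to:

    import requests

    # Hypothetical endpoint and policy name; deletes indices once they
    # reach 365 days of age.
    policy = {
        "policy": {
            "phases": {
                "delete": {"min_age": "365d", "actions": {"delete": {}}}
            }
        }
    }
    requests.put(
        "http://elasticsearch:9200/_ilm/policy/logs-retention",
        json=policy,
    ).raise_for_status()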

> Who/What is going to be doing that on this new platform and how much does that cost?

My colleague and I, who created the tool together, manage infrastructure and OS upgrades and look into issues, etc. So far we've been in production for 1.5 years on this platform. On average we've spent perhaps 3 days per month on platform-related work (mostly software upgrades). The rest we spend on full-stack application development.

The hypothesis for migrating to Kubernetes was that the available database operators would be robust enough to automate all common high availability / backup / disaster recovery issues. This has proven to be true, apart from the Redis operator, which has been our only pain point from a software point of view so far. We are currently rolling out a replacement approach using our own Kubernetes templates instead of relying on an operator at all for Redis.
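
To make "our own Kubernetes templates instead of an operator" concrete, here is a deliberately minimal sketch of the shape such a template takes, applied with the official Kubernetes Python client for brevity (name, image, replica count and storage size are made up, not our production values):

    from kubernetes import client, config, utils

    config.load_kube_config()  # or load_incluster_config() inside the cluster

    # A plain StatefulSet manifest: no operator, no CRDs, just core resources.
    manifest = {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": "redis", "namespace": "default"},
        "spec": {
            "serviceName": "redis",
            "replicas": 2,
            "selector": {"matchLabels": {"app": "redis"}},
            "template": {
                "metadata": {"labels": {"app": "redis"}},
                "spec": {
                    "containers": [{
                        "name": "redis",
                        "image": "redis:7",
                        "ports": [{"containerPort": 6379}],
                        "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                    }],
                },
            },
            # One local-storage volume per replica.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": "10Gi"}},
                },
            }],
        },
    }

    utils.create_from_dict(client.ApiClient(), manifest)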

> Now you need to maintain k8s, postgresql, elasticsearch, redis, secret managements, OSs, storage... These are complex systems that require people understanding how they internally work

Thanks to Talos Linux (https://www.talos.dev/), maintaining K8s has been a non-issue.

Running databases via operators has been a non-issue, apart from Redis.

Secret management via sealed secrets + CLI tooling has been a non-issue (https://reclaim-the-stack.com/docs/platform-components/secre...).
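
For reference, the sealing step itself boils down to piping a plain Secret manifest through the kubeseal CLI; the resulting SealedSecret is safe to commit to git since only the in-cluster controller can decrypt it. A minimal sketch (file names are made up):

    import subprocess

    # Read a plain Kubernetes Secret manifest...
    with open("secret.yaml") as f:
        plain = f.read()

    # ...and encrypt it with the cluster's public key via kubeseal.
    sealed = subprocess.run(
        ["kubeseal", "--format", "yaml"],
        input=plain, capture_output=True, text=True, check=True,
    ).stdout

    with open("sealed-secret.yaml", "w") as f:
        f.write(sealed)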

OS management with Talos Linux has been a learning curve but not too bad. We built talos-manager to make bootstrapping new nodes into our cluster straightforward (https://reclaim-the-stack.com/docs/talos-manager/introductio...). The only remaining OS-related maintenance is OS upgrades, which require rebooting servers, but that's about it.

For storage we chose to go with simple local storage instead of complicated network-based storage (https://reclaim-the-stack.com/docs/platform-components/persi...). Our servers come with datacenter-grade NVMe drives. All our databases are replicated across multiple servers so we can gracefully deal with failures, should they occur.

> Who is going to upgrade kubernetes when they release a new version that has breaking changes?

Upgrading Kubernetes can generally be done with zero downtime and is handled by a single talosctl CLI command. Breaking changes in K8s imply changes to existing resource manifest schemas and are detected by tooling before upgrades occur. Given how stable Kubernetes resource schemas are and how averse the community is to pushing breaking changes, I don't expect this to cause major issues going forward. But of course software upgrades will always require due diligence and can sometimes be time consuming; K8s is no exception.
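
Concretely, the single command is talosctl's built-in Kubernetes upgrade (target version illustrative):

    talosctl upgrade-k8s --to 1.30.0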

> What happens when ElasticSearch decides to splitbrain and your search stops working?

Elasticsearch, since major version 7, should not enter split-brain if correctly deployed across 3 or more nodes. That said, in case of a complete disaster we could either rebuild our index from the source of truth (Postgres) or do disaster recovery from off-site backups.
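
Rebuilding from the source of truth is conceptually just a streaming reindex. A minimal sketch, with made-up connection details and schema:

    import psycopg2
    from elasticsearch import Elasticsearch, helpers

    pg = psycopg2.connect("dbname=app")
    es = Elasticsearch("http://localhost:9200")

    def actions():
        # Server-side cursor, so rows are streamed rather than loaded at once.
        with pg.cursor(name="reindex") as cur:
            cur.execute("SELECT id, title, body FROM documents")
            for id_, title, body in cur:
                yield {
                    "_index": "documents",
                    "_id": id_,
                    "_source": {"title": title, "body": body},
                }

    helpers.bulk(es, actions())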

It's not like using Elastic Cloud protects against these things in any meaningfully different way. However, the feedback loop of contacting support would be slower.

> When the DB goes down or you need to set up replication?

Operators handle failovers. If we were to lose all replicas in a major disaster event, we would have to recover from off-site backups. The same rules would apply to managed databases.

> What is monitoring replication lag?

For Postgres, which is our only critical data source, replication lag monitoring and alerting is built into the operator.

It should be straightforward to add this for Redis and Elasticsearch as well.
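
For Redis, for example, the raw numbers are already exposed via INFO replication. A sketch of such a check (host and threshold are made up):

    import redis

    r = redis.Redis(host="redis-master", port=6379)
    info = r.info("replication")
    master_offset = info["master_repl_offset"]

    for key, value in info.items():
        if key.startswith("slave"):  # redis-py parses slave0, slave1, ...
            lag = master_offset - value["offset"]  # bytes behind the master
            if lag > 1_000_000:
                print(f"replica {value['ip']} lagging by {lag} bytes")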

> Or even simply things like disks being close to full?

Disk space monitoring and alerting is built into our monitoring stack.
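
For illustration, with node_exporter metrics in any Prometheus-compatible store, that check is a single query (URL and threshold are made up):

    import requests

    # Fires for any filesystem with less than 10% space left.
    query = (
        "node_filesystem_avail_bytes{fstype!='tmpfs'}"
        " / node_filesystem_size_bytes < 0.10"
    )
    resp = requests.get(
        "http://prometheus:9090/api/v1/query",
        params={"query": query},
    )
    for result in resp.json()["data"]["result"]:
        print(result["metric"]["instance"], result["value"])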

At the end of the day I can only describe the facts of our experience. So far our cost savings would cover hiring about 4 full-time DevOps people, yet we have hired 0 new engineers and are managing fine with just a few days of additional platform maintenance per month.

That said, we're not trying to make the point that EVERYONE should Reclaim the Stack. We documented our thoughts about it here: https://reclaim-the-stack.com/docs/kubernetes-platform/intro...
