Linux Crisis Tools

Original link: https://www.brendangregg.com/blog/2024-03-24/linux-crisis-tools.html

This is a condensed summary of the article, focusing on the concept of essential Linux crisis tools. The author describes a hypothetical situation in which a Linux server hits a performance issue that causes an outage, and stresses the importance of having the necessary tools installed in advance so that problems can be diagnosed efficiently and downtime kept to a minimum. He recommends utilities such as procps, util-linux, sysstat, iproute2, numactl, tcpdump, linux-tools-common, and bcc for analyzing different aspects of the system during a crisis. He notes that some larger organizations maintain custom Linux images that include these tools, while smaller organizations may struggle to install them when needed because of slow installs and access restrictions. He closes by suggesting that popular Linux distributions include these tools by default. In essence, the author argues for pre-loading a set of key system analysis tools onto Linux servers to speed up problem resolution during unexpected incidents and reduce the impact on end users.


Original Text

When you have an outage caused by a performance issue, you don't want to lose precious time just to install the tools needed to diagnose it. Here is a list of "crisis tools" I recommend installing on your Linux servers by default (if they aren't already), along with the (Ubuntu) package names that they come from:

    Package                   Provides                                          Notes
    procps                    ps(1), vmstat(8), uptime(1), top(1)               basic stats
    util-linux                dmesg(1), lsblk(1), lscpu(1)                      system log, device info
    sysstat                   iostat(1), mpstat(1), pidstat(1), sar(1)          device stats
    iproute2                  ip(8), ss(8), nstat(8), tc(8)                     preferred net tools
    numactl                   numastat(8)                                       NUMA stats
    tcpdump                   tcpdump(8)                                        network sniffer
    linux-tools-common,       perf(1), turbostat(8)                             profiler and PMU stats
    linux-tools-$(uname -r)
    bpfcc-tools (bcc)         opensnoop(8), execsnoop(8), runqlat(8),           canned eBPF tools [1]
                              softirqs(8), hardirqs(8), ext4slower(8),
                              ext4dist(8), biotop(8), biosnoop(8),
                              biolatency(8), tcptop(8), tcplife(8), trace(8),
                              argdist(8), funccount(8), profile(8), etc.
    bpftrace                  bpftrace, basic versions of opensnoop(8),         eBPF scripting [1]
                              execsnoop(8), runqlat(8), biosnoop(8), etc.
    trace-cmd                 trace-cmd(1)                                      Ftrace CLI
    nicstat                   nicstat(1)                                        net device stats
    ethtool                   ethtool(8)                                        net device info
    tiptop                    tiptop(1)                                         PMU/PMC top
    cpuid                     cpuid(1)                                          CPU details
    msr-tools                 rdmsr(8), wrmsr(8)                                CPU digging

(This is based on Table 4.1 "Linux Crisis Tools" in SysPerf 2.)

    Some longer notes: [1] bcc and bpftrace have many overlapping tools: the bcc ones are more capable (e.g., CLI options), and the bpftrace ones can be edited on the fly. But that's not to say that one is better or faster than the other: They emit the same BPF bytecode and are equally fast once running. Also note that bcc is evolving and migrating tools from Python to libbpf C (with CO-RE and BTF) but we haven't reworked the package yet. In the future "bpfcc-tools" should get replaced with a much smaller "libbpf-tools" package that's just tool binaries.
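
As a small illustration of that difference (a sketch, assuming both packages are installed; on Ubuntu the bcc tools are installed with a -bpfcc suffix, and the PID below is just a placeholder):

    # bcc version of opensnoop: more CLI options, e.g. filtering by a (hypothetical) PID
    sudo opensnoop-bpfcc -p 1234

    # bpftrace equivalent: a one-liner you can edit on the fly
    sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'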

This list is a minimum. Some servers have accelerators and you'll want their analysis tools installed as well: e.g., on Intel GPU servers, the intel-gpu-tools package; on NVIDIA, nvidia-smi. Debugging tools, like gdb(1), can also be pre-installed for immediate use in a crisis.
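
For reference, on Ubuntu the whole list above can be pre-installed in one step (a sketch using the package names from the table; names differ on other distributions, and optional extras such as intel-gpu-tools or gdb can be appended as needed):

    sudo apt-get install -y procps util-linux sysstat iproute2 numactl tcpdump \
        linux-tools-common linux-tools-$(uname -r) bpfcc-tools bpftrace \
        trace-cmd nicstat ethtool tiptop cpuid msr-tools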

Essential analysis tools like these don't change that often, so this list may only need updating every few years. If you think I missed a package that is important today, please let me know (e.g., in the comments).

The main downside of adding these packages is their on-disk size. On cloud instances, adding Mbytes to the base server image can add seconds, or fractions of a second, to instance deployment time. Fortunately the packages I've listed are mostly quite small (and bcc will get smaller) and should cost little size and time. I have seen this size concern prevent debuginfo (totaling around 1 Gbyte) from being included by default.
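
If the size budget is a concern, the per-package footprint can be checked before baking it into an image (a sketch for Ubuntu; Installed-Size is reported in KiB):

    apt-cache show sysstat bpfcc-tools bpftrace trace-cmd | grep -E '^(Package|Installed-Size):'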

Can't I just install them later when needed?

Many problems can occur when trying to install software during a production crisis. I'll step through a made-up example that combines some of the things I've learned the hard way:

  • 4:00pm: Alert! Your company's site goes down. No, some people say it's still up. Is it up? It's up but too slow to be usable.
  • 4:01pm: You look at your monitoring dashboards and a group of backend servers are abnormal. Is that high disk I/O? What's causing that?
  • 4:02pm: You SSH to one server to dig deeper, but SSH takes forever to login.
  • 4:03pm: You get a login prompt and type "iostat -xz 1" for basic disk stats to begin with. There is a long pause, and finally "Command 'iostat' not found...Try: sudo apt install sysstat". Ugh. Given how slow the system is, installing this package could take several minutes. You run the install command.
  • 4:07pm: The package install has failed as it can't resolve the repositories. Something is wrong with the /etc/apt configuration. Since the server owners are now in the SRE chatroom to help with the outage, you ask: "how do you install system packages?" They respond "We never do. We only update our app." Ugh. You find a different server and copy its working /etc/apt config over.
  • 4:10pm: You need to run "apt-get update" first with the fixed config, but it's miserably slow.
  • 4:12pm: ...should it really be taking this long??
  • 4:13pm: apt returned "failed: Connection timed out." Maybe this system is too slow with the performance issue? Or can't it connect to the repos? You begin network debugging and ask the server team: "Do you use a firewall?" They say they don't know, ask the network security team.
  • 4:17pm: The network security team have responded: Yes, they have blocked any unexpected traffic, including HTTP/HTTPS/FTP outbound apt requests. Gah. "Can you edit the rules right now?" "It's not that easy." "What about turning off the firewall completely?" "Uh, in an emergency, sure."
  • 4:20pm: The firewall is disabled. You run apt-get update again. It's slow, but works! Then apt-get install, and...permission errors. What!? I'm root, this makes no sense. You share your error in the SRE chatroom and someone points out: Didn't the platform security team make the system immutable?
  • 4:24pm: The platform security team are now in the SRE chatroom explaining that some parts of the file system can be written to, but others, especially for executable binaries, are blocked. Gah! "How do we disable this?" "You can't, that's the point. You'd have to create new server images with it disabled."
  • 4:27pm: By now the SRE team has announced a major outage and informed the executive team, who want regular status updates and an ETA for when it will be fixed. Status: Haven't done much yet.
  • 4:30pm: You start running "cat /proc/diskstats" as a rudimentary iostat(1), but have to spend time reading the Linux source (admin-guide/iostats.rst) to make sense of it (a rough sketch of this appears after this list). It just confirms the disks are busy, which you knew anyway from the monitoring dashboard. You really need the disk and file system tracing tools, like biosnoop(8), but you can't install them either. Unless you can hack up rudimentary tracing tools as well...You "cd /sys/kernel/debug/tracing" and start looking for the Ftrace docs.
  • 4:55pm: New server images finally launch with all writable file systems. You login – gee it's fast – and "apt-get install sysstat". Before you can even run iostat there are messages in the chatroom: "Website's back up! Thanks! What did you do?" "We restarted the servers but we haven't fixed anything yet." You have the feeling that the outage will return exactly 10 minutes after you've fallen asleep tonight.
  • 12:50am: Ping! I knew this would happen. You get out of bed and open your work laptop. The site is down – it's been hacked – someone disabled the firewall and file system security.
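
For what it's worth, the rudimentary /proc/diskstats approach from the 4:30pm step looks roughly like this (a sketch, assuming a device named nvme0n1; column 13 of /proc/diskstats is the per-device "time spent doing I/Os" counter in milliseconds, per admin-guide/iostats.rst):

    # crude %busy estimate: delta of "time spent doing I/Os" (ms) over 1 second
    t1=$(awk '$3 == "nvme0n1" { print $13 }' /proc/diskstats); sleep 1
    t2=$(awk '$3 == "nvme0n1" { print $13 }' /proc/diskstats)
    echo "nvme0n1 was ~$((t2 - t1)) ms busy during the last 1000 ms"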

I've fortunately not experienced the 12:50am event, but the others are based on real world experiences. In my prior job this sequence would often take a different turn: a "traffic team" might initiate a cloud region failover by about the 15 minute mark, so I'd eventually get iostat installed, but by then these systems would be idle.

Default install

The above scenario explains why you ideally want to pre-install crisis tools so you can start debugging a production issue quickly during an outage. Some companies already do this, and have OS teams that create custom server images with everything included. But there are many sites still running default versions of Linux that learn this the hard way. I'd recommend Linux distros add these crisis tools to their enterprise Linux variants, so that companies large and small can hit the ground running when performance outages occur.
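
As a related sanity check, auditing which crisis tools are already present on a host can be as simple as the sketch below (command names are the Ubuntu ones; the bcc tools carry the -bpfcc suffix there):

    # report any crisis tools missing from this host
    for cmd in vmstat iostat mpstat pidstat sar ip ss tcpdump perf \
               opensnoop-bpfcc bpftrace trace-cmd nicstat ethtool tiptop cpuid rdmsr; do
        command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
    done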
