Linux commands that DevOps engineers (or sysadmins) should know.
ref:
https://peteris.rocks/blog/htop/
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
http://techblog.netflix.com/2015/08/netflix-at-velocity-2015-linux.html
Overview
$ top
$ sudo apt-get install htop
$ htop
# print a report every 1 second
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 1580104 171620 4287208    0    0     0    11    2    2  9  0 90  0  0
 0  0      0 1579832 171620 4287340    0    0     0     0 2871 2414 13  2 85  0  0
 0  0      0 1578688 171620 4287344    0    0     0    40 2311 1700 18  1 82  0  0
 1  0      0 1578640 171620 4287348    0    0     0    48 1302 1020  5  0 95  0  0
...
Check CPU
$ uptime
Load average: 0.03 0.11 0.19
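Load average: average load over the last 1, 5, and 15 minutes
On a single core, a load average of 1 means the CPU is 100% loaded
On a multi-core machine, load average is summed across all CPUs, so the value can exceed 1 without the machine being saturated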
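As a rough sketch of the point above, dividing the load average by the number of cores gives a per-core figure that is easier to compare across machines. The helper name `load_per_core` is hypothetical; `/proc/loadavg` and `nproc` are standard on Linux.

```shell
# load_per_core LOAD CORES
# Prints LOAD divided by CORES with two decimals.
# (hypothetical helper, not part of any standard tool)
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Usage on a Linux host: the first field of /proc/loadavg is the 1-minute load.
# load_per_core "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"
```

A result near or above 1.00 suggests the cores are, on average, fully busy or have work queued.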
$ sudo apt-get install sysstat
# utilization of each CPU
$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:38:49 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
07:38:50 PM  all   98.47    0.00    0.75    0.00    0.00    0.00    0.00    0.00    0.00    0.78
07:38:50 PM    0   96.04    0.00    2.97    0.00    0.00    0.00    0.00    0.00    0.00    0.99
07:38:50 PM    1   97.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00    2.00
07:38:50 PM    2   98.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00    1.00
...
# CPU utilization of each process
$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:41:02 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:03 PM     0         9    0.00    0.94    0.00    0.94     1  rcuos/0
07:41:03 PM     0      4214    5.66    5.66    0.00   11.32    15  mesos-slave
07:41:03 PM     0      4354    0.94    0.94    0.00    1.89     8  java
07:41:03 PM     0      6521 1596.23    1.89    0.00 1598.11    27  java
...
Check Memory
$ free -m
             total       used       free     shared    buffers     cached
Mem:          7983       6443       1540          0        167       4192
-/+ buffers/cache:       2083       5900
Swap:            0          0          0
Check Disk
$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          73.96    0.00    3.73    0.03    0.06   22.21
Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
xvda         0.00     0.23     0.21     0.18     4.52     2.08    34.37     0.00     9.98    13.80     5.42     2.44     0.09
xvdb         0.01     0.00     1.02     8.94   127.97   598.53   145.79     0.00     0.43     1.78     0.28     0.25     0.25
xvdc         0.01     0.00     1.02     8.86   127.79   595.94   146.50     0.00     0.45     1.82     0.30     0.27     0.26
Check Disk Usage
# show whole disk
$ df -h
# show every folder under the directory
$ du -h /data
# show only the total for one directory
$ du -hs /var/lib/influxdb/data
77.4G /var/lib/influxdb/data
# show the 10 largest files and directories in the current directory
$ du -hsx * | sort -rh | head -10
ref:
https://www.codecoffee.com/tipsforlinux/articles/22.html
https://www.cyberciti.biz/faq/how-do-i-find-the-largest-filesdirectories-on-a-linuxunixbsd-filesystem/
Check I/O
$ sudo apt-get install dstat iotop
# show which processes are doing the most I/O
$ dstat --top-io --top-bio
# use the --only option to show only processes or threads actually doing I/O
$ sudo iotop --only
ref:
https://www.cyberciti.biz/hardware/linux-iotop-simple-top-like-io-monitor/
Check whether the workload is CPU bound or I/O bound
$ iostat -c | head -3 ; iostat -c 1 20
ref:
https://serverfault.com/questions/72209/cpu-or-network-i-o-bound
https://askubuntu.com/questions/1540/how-can-i-find-out-if-a-process-is-cpu-memory-or-disk-bound
iotop does not work inside a container.
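When iotop is unavailable, one possible fallback is to read the per-process counters that the kernel exposes in /proc/PID/io (reading another user's process usually requires root). This is only a minimal sketch: the helper name `proc_io_bytes` is hypothetical, and it takes the file path as an argument so it can also be tried on a saved copy.

```shell
# proc_io_bytes FILE
# FILE is in /proc/PID/io format. Prints the cumulative number of bytes
# the process caused to be read from and written to storage.
# (hypothetical helper, not part of any standard tool)
proc_io_bytes() {
    awk -F': *' '
        BEGIN { r = 0; w = 0 }
        $1 == "read_bytes"  { r = $2 }
        $1 == "write_bytes" { w = $2 }
        END { print "read_bytes=" r, "write_bytes=" w }
    ' "$1"
}

# Usage: proc_io_bytes /proc/3001/io
```

The counters are totals since the process started, so sampling twice and subtracting gives a rate.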
Check Process
$ ps aux
$ pstree -a
# attach to a process to see which system calls it makes
# -t -- absolute timestamp
# -T -- print time spent in each syscall
# -s strsize -- limit length of printed strings to STRSIZE chars (default 32)
# -f -- follow forks
# -e -- filtering expression: `option=trace,abbrev,verbose,raw,signal,read,write,fault`
# -u username -- run command as username handling setuid and/or setgid
$ strace -t -T -f -s 2048 -p THE_PID
# find out which files nginx accesses
# you could try to find something related to the error message first:
# write(1, "Ign http://192.168.212.136 trusty Release\n", 62) = 62
# writev(12, [{"HTTP/1.1 500 Internal Server Error"..., 256}, {...}, {...}, {...}, 4]) = 276
$ strace -f -e trace=file service nginx start
# show the command and arguments used to start the process with PID 3001
$ tr '\0' '\n' < /proc/3001/cmdline
# only on macOS
$ top -c a -p 1537
ref:
https://mp.weixin.qq.com/s/Sf79W5dqUFx7rUYRrtx88Q
https://blogs.oracle.com/linux/strace-the-sysadmins-microscope-v2
https://zwischenzugs.com/2011/08/29/my-favourite-secret-weapon-strace/
Check Kernel Logs
# show the 15 most recent kernel messages
$ dmesg | tail -n 15
# show logs about killed processes
$ dmesg | grep -E -i -B50 'killed process'
ref:
https://stackoverflow.com/questions/726690/what-killed-my-process-and-why
Check Network
$ sar -n TCP,ETCP 1
Check DNS
Resolve a domain name using dig:
$ apt-get install curl dnsutils iputils-ping
# or
$ apk add --update bind-tools
$ dig +short october-api.default.svc.cluster.local
10.32.1.79
$ dig +short redis-broker.default.svc.cluster.local
10.60.32.20
10.60.33.15
$ dig +short redis-broker-0.redis-broker.default.svc.cluster.local
10.60.32.20
ref:
https://blog.longwin.com.tw/2013/03/dig-dns-query-debug-2013/
Resolve a domain name using nslookup:
$ apt-get install dnsutils
$ nslookup redis-broker.default.svc.cluster.local
Server: 10.3.240.10
Address 1: 10.3.240.10 kube-dns.kube-system.svc.cluster.local
Name: redis-broker.default.svc.cluster.local
Address 1: 10.0.69.46 redis-broker-0.redis-broker.default.svc.cluster.local
Find specific types of DNS records:
$ nslookup -q=TXT codetengu.com
Server: 1.1.1.1
Address: 1.1.1.1#53
Non-authoritative answer:
codetengu.com text = "zoho-verification=xxx.zmverify.zoho.com"
Authoritative answers can be found from:
nslookup could return multiple A records for a domain, which is commonly known as round-robin DNS.
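To check whether a name resolves to more than one address, one simple approach is to count the IPv4 addresses in `dig +short` output. A small sketch; `count_a_records` is a hypothetical helper that reads that output on stdin.

```shell
# count_a_records
# Reads `dig +short NAME` output on stdin and prints how many lines
# look like IPv4 addresses. Other lines (for example CNAME targets,
# which end in a dot) are ignored.
# (hypothetical helper, not part of any standard tool)
count_a_records() {
    grep -Ec '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# Usage: dig +short redis-broker.default.svc.cluster.local | count_a_records
```

A count greater than 1 means clients may be handed different addresses on different lookups.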
Check Nginx
# show the count of each status code
$ cat access.log | cut -d '"' -f3 | cut -d ' ' -f2 | sort | uniq -c | sort -rn
# show which URLs return the most 404s
$ awk '($9 ~ /404/)' access.log | awk '{print $7}' | sort | uniq -c | sort -rn
# show logs from 16:00 through 18:59 on 2016/10/01
$ grep "01/Oct/2016:1[6-8]" access.log
# show logs from 09:00 through 12:59 on 2016/10/01
$ grep -E "01/Oct/2016:(09|1[0-2])" access.log
ref:
http://stackoverflow.com/questions/7575267/extract-data-from-log-file-in-specified-range-of-time
http://superuser.com/questions/848971/unix-command-to-grep-a-time-range
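The status-code count above can also be done in a single awk pass. This sketch assumes the combined log format, where the status code is the 9th whitespace-separated field (the same assumption the `$9` filter above makes); `status_counts` is a hypothetical name.

```shell
# status_counts FILE
# Prints one "COUNT STATUS" line per status code, most frequent first.
# Assumes the status code is the 9th whitespace-separated field.
# (hypothetical helper, not part of any standard tool)
status_counts() {
    awk '{ n[$9]++ } END { for (s in n) print n[s], s }' "$1" | sort -rn
}

# Usage: status_counts access.log
```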
If the status code is 502 Bad Gateway,
it usually means the upstream server behind the load balancer / nginx is down or unreachable.
If it is a Kubernetes service,
the Service's spec.selector may not match any pod.
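The selector mismatch mentioned above comes down to a simple rule: with equality-based selectors, every key=value pair in the Service's spec.selector must appear among the pod's labels. The following is only an illustration of that rule in plain shell, not how Kubernetes implements it; the name `selector_matches` and the comma-separated label format are assumptions chosen for the example.

```shell
# selector_matches SELECTOR LABELS
# Both arguments are comma-separated key=value lists, e.g. "app=web,tier=api".
# Returns 0 if every pair in SELECTOR is present in LABELS, 1 otherwise.
# (hypothetical helper for illustration only)
selector_matches() {
    selector=$1
    labels=",$2,"          # pad with commas so whole pairs can be matched
    old_ifs=$IFS
    IFS=,
    for pair in $selector; do
        case $labels in
            *",$pair,"*) ;;                      # this pair is present
            *) IFS=$old_ifs; return 1 ;;         # missing pair: no match
        esac
    done
    IFS=$old_ifs
    return 0
}

# Usage:
# selector_matches "app=web" "app=web,version=2" && echo "pod is selected"
```

On a real cluster, the values to compare can be read with `kubectl get svc NAME -o jsonpath='{.spec.selector}'` and `kubectl get pods --show-labels`; an empty `kubectl get endpoints NAME` is a typical sign that nothing matched.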