Posts under sysadmin

Get network stats for RRD graphing

This snippet displays the Active, Passive and Established connection counts reported by "netstat --statistics" so they can be saved into RRD or other monitoring tools.
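
The snippet itself is not reproduced in this excerpt. As a minimal sketch of the idea, the function below pulls the three counters out of the netstat text with awk. It assumes the net-tools wording of the Tcp section ("active connection(s) openings", "passive connection openings", "connections established"); the function name tcp_conn_stats is made up for illustration.

```shell
# Sketch: extract three TCP counters from "netstat --statistics" output.
# Assumes net-tools wording; other netstat implementations may differ.

# Reads the netstat text on stdin, prints active:passive:established.
tcp_conn_stats() {
    awk '
        BEGIN { active = passive = established = 0 }
        /^[ \t]*[0-9]+ active connections? openings/  { active = $1 }
        /^[ \t]*[0-9]+ passive connections? openings/ { passive = $1 }
        /^[ \t]*[0-9]+ connections established/       { established = $1 }
        END { printf "%s:%s:%s\n", active, passive, established }
    '
}
```

Typical use would be "netstat --statistics | tcp_conn_stats". Prefixing the result with "N:" gives the value string that "rrdtool update" expects, provided the RRD file was created with three data sources in that order. The first two values are cumulative since boot, while the third is a current reading.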

Apache versus lighttpd

Both run on the same server: Apache/2.0.59 (port 80) and lighttpd 1.4.19 (port 8080). There are two tests: a dynamic file and a static file. To make things a little more realistic, the requests go from an EU client to a US server.

Serving a dynamic file

eu$ ab -n 1000 -c 10 "http://us.server/run-some-sql.php"

Server Software:        Apache
Server Port:            80
Document Length:        824 bytes
Time taken for tests:   36.051463 seconds
Total transferred:      1213118 bytes
HTML transferred:       847886 bytes
Requests per second:    27.74 [#/sec] (mean)
Time per request:       360.515 [ms] (mean)
Time per request:       36.051 [ms] (mean, across all concurrent requests)
Transfer rate:          32.84 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      158  158   0.3    158     162
Processing:   174  200  26.4    191     340
Waiting:      173  199  26.3    191     340
Total:        332  358  26.4    349     498

Server Software:        lighttpd/1.4.19
Server Port:            8080
Document Length:        921 bytes
Time taken for tests:   35.406200 seconds
Total transferred:      1202655 bytes
HTML transferred:       857071 bytes
Requests per second:    28.24 [#/sec] (mean)
Time per request:       354.062 [ms] (mean)
Time per request:       35.406 [ms] (mean, across all concurrent requests)
Transfer rate:          33.16 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      158  158   0.6    158     167
Processing:   172  193  29.0    183     383
Waiting:      172  192  29.0    183     383
Total:        330  351  29.1    341     541

Apache: 27.74 requests/sec
Lighttpd: 28.24 requests/sec

Serving a static file

eu$ ab -n 1000 -c 10 "http://us.server/img/some-image.gif"

Server Software:        Apache
Server Port:            80
Document Length:        14781 bytes
Time taken for tests:   63.858434 seconds
Total transferred:      15060000 bytes
HTML transferred:       14781000 bytes
Requests per second:    15.66 [#/sec] (mean)
Time per request:       638.584 [ms] (mean)
Time per request:       63.858 [ms] (mean, across all concurrent requests)
Transfer rate:          230.31 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      157  158   0.5    158     164
Processing:   476  478   4.6    478     549
Waiting:      158  159   4.0    159     228
Total:        634  636   4.6    636     707

Server Software:        lighttpd/1.4.19
Server Port:            8080
Document Length:        14781 bytes
Time taken for tests:   63.736261 seconds
Total transferred:      14992000 bytes
HTML transferred:       14781000 bytes
Requests per second:    15.69 [#/sec] (mean)
Time per request:       637.363 [ms] (mean)
Time per request:       63.736 [ms] (mean, across all concurrent requests)
Transfer rate:          229.70 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      157  158   0.4    158     160
Processing:   476  478   2.2    478     491
Waiting:      158  158   1.5    159     166
Total:        634  636   2.3    636     649

Apache: 15.66 requests/sec
Lighttpd: 15.69 requests/sec
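
One way to read these numbers: the connect time sits at about 158 ms in every run, so each request pays for transatlantic round trips before the server does any real work. At a fixed concurrency, throughput is then roughly the concurrency divided by the mean total time per request. A minimal sketch of that arithmetic, where latency_bound_rps is a hypothetical helper name and the inputs are the "-c" value and the mean "Total" from the Connection Times table:

```shell
# Sketch: estimate requests/sec from concurrency and mean time per request.
# $1 = concurrency (the ab -c value), $2 = mean total time per request in ms.
latency_bound_rps() {
    awk -v c="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", c * 1000 / ms }'
}
```

For example, "latency_bound_rps 10 358" (Apache, dynamic file) and "latency_bound_rps 10 636" (Apache, static file) land very close to the measured 27.74 and 15.66 requests/sec. That suggests this test mostly measures network latency rather than the difference between the two servers.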

Apache is very decent at a low concurrency level (about 10-20). Taking into account its stability, features and modules, it’s an excellent choice. Lighttpd can perform very well under high load, but it suffers from an issue with PHP (as of lighttpd 1.4.19 and PHP 5.1.6): its FastCGI backend became overloaded and returned 500 errors to clients. Bad lighty, or bad PHP! Hopefully this gets fixed in 1.5 or in some future version of PHP.

Counting TIME_WAIT with netstat

# netstat -tan | grep ':80 ' | awk '{print $6}' | sort | uniq -c
Sample Output:

     15 CLOSING
     26 ESTABLISHED
     31 FIN_WAIT1
      7 FIN_WAIT2
     14 LAST_ACK
      2 LISTEN
     24 SYN_RECV
   2428 TIME_WAIT
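
The same count can be done in a single awk pass, which is easier to reuse for other ports. This is only a sketch: count_tcp_states is a made-up name, and unlike the grep above (which matches ':80 ' in either address column) it only looks at the local address in column 4.

```shell
# Sketch: count TCP connection states for one local port.
# Reads "netstat -tan" output on stdin; $1 is the local port number.
count_tcp_states() {
    awk -v port="$1" '
        # column 4 is the local address, column 6 is the connection state
        $1 ~ /^tcp/ && $4 ~ (":" port "$") { count[$6]++ }
        END { for (state in count) printf "%7d %s\n", count[state], state }
    ' | sort -k 2
}
```

Usage would be "netstat -tan | count_tcp_states 80".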

What happens when you do "rm -rf /*"

Just for the fun of it. Here is what happens:

[root@s10 ~]# cd /
[root@s10 /]# dir
bin   dev  initrd  lost+found  misc  opt   sbin     srv  tmp  var
boot  etc  lib     media       mnt   proc  selinux  sys  usr
[root@s10 /]# rm -rf *
rm: cannot remove directory `boot': Device or resource busy
rm: cannot remove directory `dev/shm': Device or resource busy
rm: cannot remove `dev/pts/1': Operation not permitted
rm: `proc/asound/ICH' changed dev/ino: Operation not permitted
[root@s10 /]#
[root@s10 /]# dir
-bash: /usr/bin/dir: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory
[root@s10 /]# ll
-bash: ls: command not found
[root@s10 /]# reboot
-bash: /sbin/reboot: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory

Since the processes are still running, SSH still accepts connections, but you cannot sign in or run anything. Was it fun?!

vmstat – Get an overview of your server

Get an update every second:

[root@s14 trungson]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 3  0  41200  33324   2152 1489108    0    0     4    27    0     1  5  3 91  0
 2  0  41200  33500   2152 1489108    0    0     8     0 1838  3320  8  4 88  0
 1  0  41200  31452   2152 1489176    0    0    32     0 1787  3078  7  4 89  0
 2  0  41200  33260   2152 1489176    0    0     8     0 1788  2895  6  4 90  0
 1  0  41200  33068   2164 1489164    0    0    32   768 2038  3207  7  4 87  2
 2  0  41200  33132   2168 1489228    0    0    32     0 2082  4422 10  5 85  0
 2  0  41200  35628   2172 1489360    0    0   148     0 1924  3658  8  5 86  1
 0  0  41200  34596   2172 1489360    0    0    16     0 1904  3531  8  5 87  0
 4  0  41200  28636   2172 1489428    0    0   116     0 1922  3732  9  5 85  1
 0  0  41200  33036   2180 1489488    0    0     8   860 2127  3828  8  5 86  1
 1  0  41200  32844   2180 1489488    0    0    20     0 1784  3108  7  5 88  0
 0  0  41200  32780   2180 1489556    0    0    24     0 1850  3108  7  4 88  0
 2  0  41200  32844   2180 1489692    0    0   120     0 1915  3842  9  5 85  0
 2  0  41200  26508   2180 1489828    0    0    32   376 1976  3744  8  6 86  0

From the man page:

Procs
  r: The number of processes waiting for run time.
  b: The number of processes in uninterruptible sleep.
Memory
  swpd: the amount of virtual memory used.
  free: the amount of idle memory.
  buff: the amount of memory used as buffers.
  cache: the amount of memory used as cache.
  inact: the amount of inactive memory. (-a option)
  active: the amount of active memory. (-a option)
Swap
  si: Amount of memory swapped in from disk (/s).
  so: Amount of memory swapped to disk (/s).
IO
  bi: Blocks received from a block device (blocks/s).
  bo: Blocks sent to a block device (blocks/s).
System
  in: The number of interrupts per second, including the clock.
  cs: The number of context switches per second.
CPU
  These are percentages of total CPU time.
  us: Time spent running non-kernel code. (user time, including nice time)
  sy: Time spent running kernel code. (system time)
  id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
  wa: Time spent waiting for IO. Prior to Linux 2.5.41, shown as zero.
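
To turn a run like the one above into a single figure for a monitoring tool, the CPU columns can be averaged. The sketch below assumes the column layout shown here (us, sy, id, wa in columns 13 to 16) and skips the first sample, which vmstat reports as an average since boot. The name vmstat_cpu_average is made up for illustration.

```shell
# Sketch: average the CPU columns of "vmstat <interval> <count>" output.
# Assumes us/sy/id/wa are columns 13-16, as in the sample above.
vmstat_cpu_average() {
    awk '
        $1 ~ /^[0-9]+$/ {              # data rows only, headers are skipped
            seen++
            if (seen == 1) next        # first row is the average since boot
            us += $13; sy += $14; id += $15; wa += $16; rows++
        }
        END {
            if (rows > 0)
                printf "us=%.1f sy=%.1f id=%.1f wa=%.1f\n",
                       us / rows, sy / rows, id / rows, wa / rows
        }
    '
}
```

Usage would be something like "vmstat 1 10 | vmstat_cpu_average".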

Use ethtool or mii-tool to detect problems with an Ethernet card

[root@s2 adserver]# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
                      100baseT/Half 100baseT/Full
                      1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Full
Advertised auto-negotiation: Yes
Speed: Unknown! (0)
Duplex: Half
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000033 (51)
Link detected: yes

You can also change the interface settings with ethtool.

[root@s2 adserver]# mii-tool
eth0: negotiated 10baseT-FD, link ok
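
In the ethtool output above, the unknown speed and half duplex are the warning signs. A minimal sketch of an automated check follows; the function name link_health and the OK/WARN wording are illustrative choices, and the parsing assumes the "Key: value" layout shown above.

```shell
# Sketch: flag suspicious link settings in "ethtool <iface>" output.
# Reads the ethtool text on stdin and prints one OK or WARN line.
link_health() {
    awk -F': *' '
        { key = $1; sub(/^[ \t]+/, "", key) }   # strip leading indentation
        key == "Speed"         { speed = $2 }
        key == "Duplex"        { duplex = $2 }
        key == "Link detected" { link = $2 }
        END {
            if (link != "yes")
                print "WARN: no link detected"
            else if (speed ~ /^Unknown/ || duplex != "Full")
                print "WARN: speed=" speed " duplex=" duplex
            else
                print "OK: speed=" speed " duplex=" duplex
        }
    '
}
```

Usage would be "ethtool eth0 | link_health". For the output shown above it prints a WARN line, which matches the 10baseT negotiation that mii-tool reports.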

Linux CentOS – Kernel panic

This looks like a memory error triggered by the sim process. Does anyone have a better clue? The kernel version was 2.6.9-67.0.4.EL; we then rebooted and upgraded to 2.6.9-67.0.20.EL. Are there any kernel bugs I should be aware of?

Jul 13 04:03:13 host syslogd 1.4.1: restart.
Jul 16 08:00:01 host kernel: swap_free: Unused swap offset entry 00010000
Jul 16 08:00:01 host kernel: swap_free: Unused swap offset entry 00010000
Jul 16 08:45:01 host kernel: Unable to handle kernel paging request at virtual address 313a3921
Jul 16 08:45:01 host kernel:  printing eip:
Jul 16 08:45:01 host kernel: c015eebb
Jul 16 08:45:01 host kernel: *pde = 00000000
Jul 16 08:45:01 host kernel: Oops: 0000 [#1]
Jul 16 08:45:01 host kernel: Modules linked in: ip_vs_wrr ip_vs md5 ipv6 ipt_TOS iptable_mangle ip_conntrack_ftp ip_conntrack_irc ipt_REJECT ipt_LOG ipt_limit iptable_filter ipt_multiport ipt_state ip_conntrack ip_tables autofs4 sunrpc dm_mirror dm_mod button battery ac parport_pc parport 8139too mii ext3 jbd
Jul 16 08:45:01 host kernel: CPU:    0
Jul 16 08:45:01 host kernel: EIP:    0060:[<c015eebb>]    Not tainted VLI
Jul 16 08:45:01 host kernel: EFLAGS: 00010202   (2.6.9-67.0.4.EL)
Jul 16 08:45:01 host kernel: EIP is at find_vma+0x29/0x4d
Jul 16 08:45:01 host kernel: eax: 313a3919   ebx: 00c8479c   ecx: 313a3931   edx: c97ec6b4
Jul 16 08:45:01 host kernel: esi: de5b40a0   edi: c8929360   ebp: bff08518   esp: c85dcef4
Jul 16 08:45:01 host kernel: ds: 007b   es: 007b   ss: 0068
Jul 16 08:45:01 host kernel: Process sim (pid: 6909, threadinfo=c85dc000 task=c8929360)
Jul 16 08:45:01 host kernel: Stack: de5b40a0 de5b40d0 c011d901 00000000 00c8479c c85dcfc4 c032ebbf 00000007
Jul 16 08:45:01 host kernel:        0000000e 0000000b 00000000 00000000 00000000 00000000 00000000 00030001
Jul 16 08:45:01 host kernel:        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Jul 16 08:45:01 host kernel: Call Trace:
Jul 16 08:45:01 host kernel:  [] do_page_fault+0x114/0x4dc
Jul 16 08:45:01 host kernel:  [] do_page_fault+0x0/0x4dc
Jul 16 08:45:01 host kernel:  [] error_code+0x2f/0x38
Jul 16 08:45:01 host kernel:  [] schedule_tail+0xfd/0x106
Jul 16 08:45:01 host kernel:  [] do_page_fault+0x0/0x4dc
Jul 16 08:45:01 host kernel:  [] error_code+0x2f/0x38
Jul 16 08:45:01 host kernel: Code: 5d c3 56 89 c6 53 89 d3 31 d2 85 c0 74 3c 8b 50 08 85 d2 74 0a 39 5a 08 76 05 39 5a 04 76 2b 8b 4e 04 31 d2 85 c9 74 22 8d 41 e8 <39> 58 08 76 0c 39 58 04 89 c2 76 0c 8b 49 0c eb 03 8b 49 08 85
Jul 16 08:45:01 host kernel:  <0>Fatal exception: panic in 5 seconds
Jul 16 10:12:18 host syslogd 1.4.1: restart.

Load balancing FastCGI

Run this command on a worker:

spawn-fcgi -p 8081 -a 192.168.2.100 -f /usr/bin/php-cgi -u lighttpd -g lighttpd -C 5 -P /var/run/spawn-fcgi-8081.pid

Don’t forget to open up the right port (8081 in the example) and to monitor the processes (for example, restart them when they die).
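
The monitoring part can be as simple as a cron job that checks the PID file written by the -P option. The sketch below is one possible shape, not a tested production script: the function names and the "check" argument are made up, the spawn-fcgi options are copied from the command above, and it assumes it runs as root so that the kill -0 test and the -u/-g options work.

```shell
# Sketch of a cron-driven watchdog for the spawn-fcgi worker shown above.
# PIDFILE matches the -P option of the original command; adjust as needed.
PIDFILE="${PIDFILE:-/var/run/spawn-fcgi-8081.pid}"

# Succeeds if the PID file exists and names a live process.
fcgi_is_running() {
    [ -r "$1" ] || return 1
    pid=$(cat "$1")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Starts the worker again with the same options as the original command.
fcgi_start() {
    spawn-fcgi -p 8081 -a 192.168.2.100 -f /usr/bin/php-cgi \
        -u lighttpd -g lighttpd -C 5 -P "$PIDFILE"
}

# Only act when called with "check", so sourcing the file has no side effects.
if [ "${1:-}" = "check" ]; then
    fcgi_is_running "$PIDFILE" || fcgi_start
fi
```

Saved under a name of your choice, it could be run from cron every minute with the "check" argument. Note that it only notices when the process recorded in the PID file is gone; it does not detect a PHP backend that is alive but hung.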

Reference

http://www.cyberciti.biz/tips/lighttpd-mod_proxy-to-run-php-fastcgi-app-server.html

Bind to a socket

spawn-fcgi -s /tmp/php-fastcgi-ext.sock -f /usr/bin/php-cgi -u lighttpd -g lighttpd -C 5 -P /var/run/spawn-fcgi.pid

Bind to an IP:port

spawn-fcgi -p 8081 -a 192.168.2.100 -f /usr/bin/php-cgi -u lighttpd -g lighttpd -C 5 -P /var/run/spawn-fcgi-8081.pid

Reference

http://trac.lighttpd.net/trac/wiki/Docs%3AModFastCGI#load-balancing

You also need to start lighttpd on the worker (service lighttpd start) so the reporter can get status from this server directly (through port 80).

Using an external fcgi in lighttpd.conf to load balance only a specific file

fastcgi.server = (
"/index.php"=>
(
 ("socket"=>"/tmp/php-fastcgi.socket",
  "bin-path"=>"/usr/bin/php-cgi",
  "min-procs"=>2,
  "max-procs"=>4,
  "bin-environment"=>("PHP_FCGI_CHILDREN"=>"10","PHP_FCGI_MAX_REQUESTS"=>"5000")
 ),
 ("host"=>"192.168.2.100",
  "port"=>8081,
  "check-local"=>"disable",
  "disable-time"=>30
 )
),
".php"=>
(
 ("socket"=>"/tmp/php-fastcgi.socket",
  "bin-path"=>"/usr/bin/php-cgi",
  "min-procs"=>1,
  "max-procs"=>2,
  "bin-environment"=>("PHP_FCGI_CHILDREN"=>"5","PHP_FCGI_MAX_REQUESTS"=>"1000")
 )
)
)

Mysterious 500 – Internal Server Error

This is a very generic error, but it means there is some critical issue on the server. We once ran into it because our codebase was getting heavier and the default memory_limit=8M in php.ini was no longer enough.

Solution: increase memory_limit in php.ini to a higher value.

LVS-Tun & ISPs

LVS is a software load balancing solution. It’s open-source software, built directly into the Linux kernel, and it’s free.

The director (load balancer) can be in one DC while the real servers are in different DCs. The director only needs good bandwidth; a Pentium 4 or even a P3 is fine since it does Layer 4 switching (less overhead than Layer 7, e.g. HAProxy). Incoming traffic flows Client -> Director -> Worker. Returning traffic flows Worker -> Client. As you can see, the director can sustain a much higher throughput since it only handles incoming requests; the return packets come directly from the workers. We currently manage several LVS setups. One example: 3 directors and 12 real servers in over 5 different DCs across the US and Europe. It’s quite easy to set up and manage.
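
On the director, an LVS-Tun service is defined with ipvsadm, using the -i (IPIP tunnelling) forwarding method for each real server. The sketch below only prints the commands so they can be reviewed before being run as root. The function name, the wrr scheduler and the weight of 1 are assumptions for illustration, and the addresses in the usage note are documentation placeholders.

```shell
# Sketch: print (not run) the director-side ipvsadm commands for LVS-Tun.
# $1 = virtual IP, $2 = service port, remaining arguments = real server IPs.
lvs_tun_director_commands() {
    vip="$1"
    port="$2"
    shift 2
    # -A: add virtual service, -t: TCP service address, -s: scheduler
    echo "ipvsadm -A -t ${vip}:${port} -s wrr"
    for rip in "$@"; do
        # -a: add real server, -r: real server address,
        # -i: IPIP tunnelling (LVS-Tun), -w: weight
        echo "ipvsadm -a -t ${vip}:${port} -r ${rip}:${port} -i -w 1"
    done
}
```

For example, "lvs_tun_director_commands 192.0.2.10 80 198.51.100.5 203.0.113.7" prints one -A line and two -a lines. Each real server additionally needs the VIP configured on a tunnel interface so it can decapsulate the packets and reply as the VIP.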

Reference: LVS-Tun is an LVS original. It is based on LVS-DR and has the same high scalability/throughput of LVS-DR. LVS-Tun can be used with realservers that can tunnel (==IPIP encapsulation). The director encapsulates the request packet inside an IPIP packet before sending it to the realserver. The realserver must be able to decapsulate the IPIP packet. Initially only Linux could decapsulate IPIP packets, but recently FreeBSD and W2K can now do it too (hmm 2005, I think Microsoft has dropped support for IPIP). With LVS-DR, the realservers can have almost any OS.

Unlike LVS-DR, with LVS-Tun the realservers can be on a network remote from the director, and can each be on separate networks. Thus the realservers could be in different countries (e.g. a set of ftp mirror sites for a project). If this is the case, the realservers will be generating reply packets with VIP:port->CIP (where port is the LVS’ed service). Not being on the VIP network, the routers for the realservers will have to be programmed to accept outgoing packets with src_addr=VIP:port. Routers normally drop these packets as an anti-spoofing measure. If you aren’t in control of the routers, you’ll just have to inform the people who are, that packets from VIP:port are valid for your business. If they don’t want to help you with your business, then you should find another provider who will.

To detect whether an ISP allows LVS-TUN, run this test from the LVS documentation on a realserver:

realserver# traceroute -s VIRTUAL_IP -n CLIENT_IP
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *

Be patient and watch the director until you see something similar to the following:

director# tcpdump -ln host CLIENT_IP
tcpdump: listening on eth0
19:20:20.310162 CLIENT_IP > VIRTUAL_IP: icmp: CLIENT_IP udp port 33483 unreachable
19:22:40.639844 CLIENT_IP > VIRTUAL_IP: icmp: CLIENT_IP udp port 33511 unreachable
19:22:45.641061 CLIENT_IP > VIRTUAL_IP: icmp: CLIENT_IP udp port 33512 unreachable
19:23:30.664315 CLIENT_IP > VIRTUAL_IP: icmp: CLIENT_IP udp port 33521 unreachable

If you don’t see any response on the director, the realserver probably cannot get any packets out to the client because the ISP’s router is dropping them.
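
When testing several providers, the check on the director can be scripted by counting the ICMP "unreachable" replies that arrive from the client for the virtual IP. This is a sketch under assumptions: count_vip_replies is a made-up name, and the parsing accepts both the older tcpdump line format shown above and the newer one that prints "IP" after the timestamp.

```shell
# Sketch: count ICMP "unreachable" replies sent by the client to the VIP.
# Reads "tcpdump -ln host CLIENT_IP" output on stdin.
# $1 = client IP, $2 = virtual IP.
count_vip_replies() {
    awk -v client="$1" -v vip="$2" '
        {
            i = ($2 == "IP") ? 3 : 2   # newer tcpdump prints "IP" first
            line = tolower($0)
        }
        $i == client && $(i + 1) == ">" && $(i + 2) == (vip ":") &&
            line ~ /icmp/ && line ~ /unreachable/ { n++ }
        END { print n + 0 }
    '
}
```

A result of 0 after the traceroute has finished suggests the spoofed-source packets never left the realserver’s network.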

It is very important that ISPs see the demand for LVS-TUN setups so they can distinguish them from malicious network attacks. Security is good, but it cannot be so strict or rigid that it leaves no flexibility for business growth. If you have experience setting up LVS-TUN with other ISPs or web hosting companies, please let me know so I can add them to the list.

List of ISPs that support LVS-TUN (allow outgoing spoofed-yet-valid packets for the realservers):

  1. LayeredTech: works at the Savvis building in Dallas; their other DC (DataBank) blocks this. Currently working with LT to unblock. Update: LT is very accommodating to their clients; they excluded our load balancer’s IP address from the router filter list.
  2. Hivelocity: blocked but then unblocked, willing to make an exception.
  3. 1paket at Lambdanet in Germany
  4. SoftLayer: custom router setup
  5. WebNX in LA

List of ISPs that do NOT support LVS-TUN (they drop these packets and are not willing to make an exception):

  1. ThePlanet: denied. They are not willing to make an exception in their network filter for this type of packet, as it is against their AUP.