Friday, November 4, 2011

Solving httpd MaxClients and mod_status stuck "sending reply"

Every once in while, I get into a situation that I just have to get to the bottom off. A few days ago, one of network engineers discovered that a client's site would occassionally become unresponsive. This was very odd given that the site is run on an apache cluster powered by very very very beefy servers. Worse still, the client hasn't even started marketing the site and actually has an official launch date in less than 48 hours. I decided to approach this problem in the only way I know how, with surgical precision.

Background:
1. Site is an Ajax web application.
2. Running on Apache, pg-pool, postgresql setup.
3. Memcached also running.
4. Entire site is in PHP.

First, the symptoms.
1. Site returns error 500 or "site is taking too long to respond"
2. This happens even during off-peak hours (2am at night, etc).

Possible diagnosis:
1. httpd configuration for max clients is set too low.
2. number of database connections is set too low, causing httpd connection to queue up.

*All linux commands are in orange- Ex: ps aux
"Interesting.....we will need to order some tests"
  1.  httpd error logs: we need to find out if max clients was exceeded. You can quickly do this by running the following commands
    #  grep -i maxclient /var/log/httpd/error_log
    Tip: don't forget to substitute the actual location of your error log file for mine if it is different
    Results: [error] server reached MaxClients setting, consider raising the MaxClients setting.

    /etc/httpd/conf/httpd.conf: we need the max client setting
    Results: it is set quite high - 1024

    Tip: learn how to setup error logs here http://www.sitepoint.com/configuring-web-logs-apache/
     
  2. server status: We have mod_status for apache implemented and we checked it locally by going to http://localhost/server-status
    Results: We found a lot of process stuck in a state of W "sending reply"

    If you don't have mod_status or would like to learn how to use it, check out http://www.cyberciti.biz/faq/apache-server-status/
    Current Time: Thursday, 03-Nov-2011 23:38:16 PDT
    Restart Time: Wednesday, 02-Nov-2011 18:59:22 PDT
    Parent Server Generation: 0
    Server uptime: 1 day 4 hours 38 minutes 54 seconds
    Total accesses: 544995 - Total Traffic: 2.0 GB
    CPU Usage: u205.55 s45.23 cu2.41 cs0 - .245% CPU load
    5.28 requests/sec - 20.2 kB/second - 3910 B/request
    276 requests currently being processed, 18 idle workers
    WWWWW.WWWWWWWWWWWWWWWWWWWWW_WWW.WWWWWWWWWWWWWW.WWWWWWWWWWWWWW.WW
    WWWWWWWWWWWWWWWWWW...WWWWWWWWWW_WW.WWW.WW.W._..WWKWWW.....WWW.WW
    W.W.W.WCW_W.W......_W..W.W.WW._..._WWWW.W._._W._..W.W..W...._WWW
    .W.W..K.WW._W_WW..W.WWW.WWW_W_.._W.WWWWWWWW_WWWWWWWWWWWWWWWWWWWW
    WWWWWWWWWWWWWWWWWW..WWWWWWWWWWWWWWWW.WW.WWWWWWWWW.WWWW.WWWWWWWW.
    WWWWWWW...WW.WWWWWWWW.WWWWWWWWW.WWWWWWWWWW.W.WW..W.W..._...W..W.
    ................................................................
    ................................................................
    
    Scoreboard Key:
    "_" Waiting for Connection, "S" Starting up, "R" Reading Request,
    "W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
    "C" Closing connection, "L" Logging, "G" Gracefully finishing,
    "I" Idle cleanup of worker, "." Open slot with no current process














"Houston, we have a problem"

Wow, why are there so many connection stuck in W state? And it appears that the number of connections stuck in this state continued to grow in number.

"This must be stopped"

We found that 95% of these connections were connecting to a specific PHP script passing along certain perfectly normal GET variables. The other 5% were to another script. Unfortunately, server-status seems to truncate the request column

M
SSReqConnChildSlotClientVHostRequest
W
3064400.00.005.708.8.8.8yoursite.comGET /twitter/authsetfast.php?new=

"SS looks CRAZZZZZZZYYYY"
Now that's just not right, SS = 30644???? In case you don't know, SS is the duration of the connection in seconds. And it keeps growing with each refresh, meaning that this connection is NEVER released until the server is restarted.
Another symptom: High SS and growing SS in mod_status report.

We have more information now, so we need to add a few more possibilities to our diagnosis
1. PHP Sessions may be locking: It appears that if you have an ajax site that makes multiple connections, and each connection sets a value for a session variable specific to that brower, each connection has to wait for the previous one to close before they start "writing" to the session. For example, say we set $_SESSION['last_connect_time'] = $timestamp; on scriptA.php and scriptB.php. If scriptA.php is called and runs for a very long time (image processing for example), if scriptB.php is called before scriptA.php finishes, it will be queued by PHP until scriptA.php finishes.

Wow, but there is a simple solution. Please put session_write_close(); when you are done writing all the session variables in every script. This will allow the session to be available to the next script even if the current script has finished executing.

So, with scriptA.php, put session_write_close(); close to the top of script before it starts image processing.

2. Open Database connections : with postgresql and PHP, it is good practice to call pg_close($conn) when you are done with the connection. Don't assume that it will closed when the script is done executing.


"Time for Surgery"
So I decided to get the actual GET values being sent to the script during these W states by storing all $_GET to a database table everytime the script is called. I checked the access logs but it appears that the access logs are written after the connection ends, thus there were no records of the W state requests.

I restarted the server to clear out all the "stuck in W" connections and let the site run for a few hours. Sure enough, more "stuck in W" connections started to pop up. I checked the table I set up to store the $_GET variables and retrieved the $_GET for that connection.

I then proceeded to restart the httpd server and then called the script using the $_GET variables. And Voila, a "stuck in W" connection appeared with my ipaddress.

"Okay, we are making progress....we can actually recreate the problem"

At this point, I looked over the variables and there was one that really stood out. It was the user's id number, it was 0. This is not supposed to be possible. This would cause the query for the user's info to search the entire user table (over 30 million records using a column that was not indexed). Yikes!!!

The easy thing to do here would be to put an if clause in the script to prevent it from doing anything if user id is zero. But I wanted to find the actual class and method that performed the query. I needed to make sure that even if another programmer forgets to check that user id > 0, the method should be smart enough to not perform such a disastrous query. So happens is the script gets stuck waiting for the query to complete, and if another call is made for the same script using the same $_GET, it too will get stuck. The number of stuck connections will continue to grow as long as the same $_GET values are sent to the script.

Anyhow, I found it and made the changes. I restarted the httpd and ran the script again using the same $_GET variables and the connection completed in 143ms. No more "stuck in W".

Now, my server status looks like this


Current Time: Friday, 04-Nov-2011 21:51:56 PDT
Restart Time: Friday, 04-Nov-2011 16:14:46 PDT
Parent Server Generation: 0
Server uptime: 5 hours 37 minutes 9 seconds
Total accesses: 149405 - Total Traffic: 501.5 MB
CPU Usage: u8.97 s2.59 cu.17 cs0 - .058% CPU load
7.39 requests/sec - 25.4 kB/second - 3520 B/request
12 requests currently being processed, 19 idle workers
KKKKKK_____K._.__K__.__..__K_.__...C..W..K_._...................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
Scoreboard Key:
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"C" Closing connection, "L" Logging, "G" Gracefully finishing,
"I" Idle cleanup of worker, "." Open slot with no current process




The "K"s are fine, we set keepalive so the site can use the same connection for mulitple requests by a single browser.

I am going to call this the

"Perpertual Bush Syndrome"


or PBS for short because no one wants to be "stuck with W" forever :)

How to make a diagnosis of PBS

The following criteria must be met:
1. server status via mod_status must show SS (duration of connection is seconds) to be growing indefinitely.
2. number of "Stuck in W" requests keeps growing.

Presentation:
1. Webserver becomes inaccessible even during off-peak hours.
2. Error 500 may be seen by users browsing the site
3. maxclient error may be seen if the number of connections grows beyond the set max clients value.

Treatment:
1. Get the actual script that causes the PBS and track down where it is getting stuck. Simplest way is to use echo "I made it here"; exit(0); and paste at different points in the script as you walk your way down,

I hope this helps anyone who has problem with max clients and find that their web server suffers from the dreaded Perpertual Bush Syndrome

8 comments:

  1. This might seem a dumb question but how come you have so many so many .'s, _'s, W's and R's etc displaid. I have just asked my server company to allow me access to mine and it has 10.

    Is this maybe a possible cause of my server freezing problems?

    Great article by the way!

    ReplyDelete
  2. Hello!!!

    Great article and very interesting!! :-)
    I have something similar in fact:

    255-0 - 0/0/2 . 0.00 29822 47253834 0.0 0.00 0.00 85.44.206.154 127.0.0.1 GET /messages.php?id_srv=XXXX&id_device=XX&last_event=1&id_...

    as you can see, SS is almost 30000...
    All that started once I upgraded PHP libraries on live CentOS 5.5 Final webserver to have JSON support.
    In some php scripts, included this messages.php, I have at the top session_start();....due of your article, should I put at the bottom of all scripts that have session_start() just session_write_close()?
    Since I am using also cookies to remember some state settings if I back in the page from another, if use session_write_close() would I lose cookies or not?

    Thanks in advance
    Ciao
    Luigi

    ReplyDelete
  3. In my case, i m getting status W sending only for one specific IP address. if the requesting ip is changed it works perfect. I have also added the ip to whitelist. Any Idea how can i Fix this ?

    ReplyDelete
  4. Excellent diagnosis! That's helped me loads!

    ReplyDelete
  5. Good job, thanks to this post i found that this was why my server was freezed.

    But I'd like to find the goo tool to track this in complex php scripts like wordpress. Cause i can't put exit(); in it, it would be a hell.

    ReplyDelete
  6. Hello, I'm getting an odd issue like ones a day the servers start saying something like this on Server-status:
    1 requests currently being processed, 0 idle workers, keeps like that and starts sending white pages for some users like isn't able to serve all request the normal way it is is like:

    8 requests currently being processed, 17 idle workersW___C_..C_W___._.._.___C_.C_W_C._...............................
    ................................................................
    ................................................................
    ................................................................
    ................................................................
    ................................................................
    ................................................................
    ....................................................

    those anyone knows what could be causing this? Haven't been able to find any information about it only about the maxclients limit being reached what is not the case here.

    Thank You a lot.

    ReplyDelete
  7. I finally isolated my problem: my cURL calls (from PHP) were not being closed and didn't have a timeout set. All fixed now.

    ReplyDelete