blocked by sockets in time-wait??


Nope
05-25-2004, 10:29 PM
I am a bit baffled. Ok, I have a multithreaded HTTP/1.1 server and a
multitasking HTTP/1.0 server. I tried to put a stress test on them both to
see if any bugs turn up. You know, like memory leaks, seg faults and so
on. Now what happens has taken me a bit off guard.

The test program is a simple thing hammering the target server with a
repeated HTTP/0.9 GET request. Open connection, send the request, get the
answer (check for errors), close the connection, and so on. The test
program itself simulates several concurrent connections. Not that good for
testing the 1.1 keep alive features, I know.

Here is what happens: Both servers start out ok. Not as fast as I hoped for, but
more on that later. Both servers run into a full stop after around 16000 reconnects,
both at a rate around 700-780 (700 the old, 780 the new one) requests
per second and both never reach more than 4 threads/tasks at a time.
When I try to "get" a cgi script instead the speed is a lot lower
(~250req/sec) and the 1.1 server runs into that barrier even earlier at
~4000 (4096?), while the 1.0 one again reaches someplace around 16000.

The stop is complete. I can't reach the server anymore, not even over the
127.0.0.1 internal loopback. Then after 80-120 seconds all is back to
normal (without touching the system) while before even a restart of the
server couldn't get it back to life.
This is repeatable as long as I set the concurrent connections to over 1.
With a value of just one the server runs slower and for a very long time.
As long as the average rts is below 580 or so it goes on forever (well, I
stopped it at 1 million connections). After some little optimisations in the
code the average did go up to over 610 and then the server stopped
again. Sometimes as soon as 20000, sometime it reached over 100K, but
never beyond that.

While all system parameters stay within normal boundaries, the count of
time-wait sockets goes up quite fast. I have done a bit of research and my
Linux uses a fast time-wait recovery, meaning it closes the thing (against
the specs) right after it gets the final acknowledge from the other
computer on the close. Now, my second PC only has an older WinME with
tcp pumped up to 256 sockets instead of the normal 100, and I guess that
Windows might skip the final parts of the protocol if it runs itself out of
sockets. Linux also keeps the time-wait period quite low, at 60 seconds with
normal settings. After the wait, when the server reacts normally
again, all time-wait sockets are closed again.

So any thoughts about that? Do I have to look at something else? I'll dust
off an old third PC with a Linux on it and try if that works out better, but
that'll take a while.

Then the speed thingy. The server maxes out at around 780 connections,
but at that time the system is still 63% idle! There is absolutely no reason
on system side why it should stop there. So either the process scheduler
or the select/accept maxes out. I changed the nice value of the process to
max priority; that had no effect besides it now reacting faster when I start
the test. Any idea how I might get beyond that point with just one listen
socket?

RobSeace
05-26-2004, 12:10 PM
First of all, are you running the testing client app on a different
host from the server? If it's running on the same host, then I
might be less surprised about a ton of TIME_WAITs hosing
things up...

Assuming it's really on a different system, my next guess is that
it's a very poor system with a very small ephemeral port# range...
(Eg: one of those thoroughly brain-dead ones that were so popular
for a long time, with an ephemeral port range of just 1024 - 5000...)
In that case, your testing client will burn through ephemeral ports
so fast, that it'll cycle around to reuse previous ones, before your
TIME_WAITs have timed out... Try your test from a more sane
system, like a Linux box, with "/proc/sys/net/ipv4/ip_local_port_range"
set to some nice large range of values...
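
One way to see which ephemeral ports a client host actually hands out is to connect a few times and record the local port of each connection. The following is only a sketch in Python; sample_ephemeral_ports is a hypothetical helper name, and the server address is whatever listener is being tested.

```python
import socket


def sample_ephemeral_ports(server_addr, count):
    """Open `count` short-lived TCP connections and return the local
    (ephemeral) port the OS picked for each one."""
    ports = []
    for _ in range(count):
        with socket.create_connection(server_addr, timeout=5) as s:
            # getsockname() returns (local_ip, local_port) for IPv4.
            ports.append(s.getsockname()[1])
    return ports
```

If min() and max() of the result over a few thousand connections stay inside something like 1024 - 5000, the client is cycling through a small pool, which would fit the guess above.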

Though, given your comments about being unable to reach the
server even over loopback, I'm not sure that's really it, either...
When the server is in this state, I assume netstat shows a bunch
of TIME_WAIT sockets on your server host? If so, then your
server is beating the client to the punch on sending out the first
FIN... You could cheat, and wait for the client to close() first, and
that way the client will be stuck with the TIME_WAITs... ;-) Or, of
course, you could defeat TIME_WAIT with SO_LINGER, but I
think that's generally a very bad idea...
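
A rough Python sketch of the "wait for the client to close() first" trick: send the response, then keep reading until the peer's FIN shows up as an empty read, and only then close. The function name and the timeout value are assumptions for illustration. It can only help if the client is able to tell where the response ends without seeing the server close (for example from a Content-Length header), which may not hold for plain HTTP/0.9 or 1.0 requests.

```python
import socket


def send_then_wait_for_peer_close(conn, payload, timeout=2.0):
    """Send payload, then wait up to `timeout` seconds per read for the
    peer to close first. Returns True if the peer closed before we did."""
    conn.sendall(payload)
    conn.settimeout(timeout)
    peer_closed = False
    try:
        while True:
            # An empty read means the peer sent its FIN; anything else
            # is stray client data, which this sketch simply discards.
            if not conn.recv(4096):
                peer_closed = True
                break
    except OSError:
        # Timeout or connection error: give up and close actively.
        pass
    conn.close()
    return peer_closed
```

When it returns True the side that closed first was the client, so the TIME_WAIT should end up on the client host instead of the server.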

When your server becomes unreachable, how about other
processes/services on that same host? Ie: can you reach other
network services on the same host? I'm just curious if the whole
system's TCP/IP stack is locked up somehow, or if it's just your
server process...

Nope
05-26-2004, 03:31 PM
I've changed a lot since I posted that. Well, my server home is a Linux PC
with an AMD 1400/512MByte and my second PC is an XP2500+/512MByte
with WinME. Sadly I haven't been able to get it running Linux yet; it seems I have
some problems with the nForce mainboard drivers. I still have to check whether a
SuSE past V8.1 does the trick. I have, however, 4 other PCs under my
desk, all a lot older, the fastest being a K6-450/384MByte. I'll see if I can
wire that one up to play the tester part later.

It seems Linux has a max-tw-sockets-per-process of 4096 or so. When I
start both servers on different listen sockets, one keeps running when the
other stops. So it is definitely a per-process thing.

Of course my server closes the socket first. It has to. The specs of
HTTP/1.1 and 1.0 state that a request of 1.0 and 0.9 origin has to be
closed right after sending out the response. Can't change that. They
changed that behaviour in 1.1 where a request is automatically of type
keep-alive unless told otherwise. That was brought in exactly because of
the fact that sockets are a limited good on a server.

One thing I have seen by now. As the reconnects come in, the response
time goes down, I guess that's due to the fact that the scheduler starts to
give the server more running time and that the content is held completely
in the drive cache after a while. Always a couple of seconds before the
thing comes to a sudden break the response time is going below or near
0.003 seconds. I think this is the point where my WinME starts to
trip over itself and doesn't manage to keep up the proper closing
anymore.

As for the max speed of around 770. I tried several different code
sequences with and without for example select and semaphores. It doesn't
matter much. A select followed by an accept is exactly the same speed as
just waiting in the accept the whole time, while pre-creating threads and
waiting in a semaphore actually decreases the speed to 550-600.
And that's although the same sems are there in the select code too; they
just never block when a task reaches them. I have a select that's
followed by an "allow-next" semaphore and another semaphore before the
select that works as a blocking max-thread counter. So I try to change
some settings for the thread scheduler next.

allow-next set to 1
max-thread-counter set to 50 (config file)

do
-wait in select
-wait on allow-next>0
-allow-next -1
-create thread
-wait on max-thread-counter >0
-max-thread-counter -1
while not terminated

thread:
--accept
--allow-next +1
--process incoming data
....
--close socket
--max-thread-counter +1
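
The loop above could look roughly like this in Python. This is a sketch under assumptions, not the server's actual code: serve, handle, and stop are invented names, and the listener is switched to non-blocking so that a thread which finds nothing left to accept does not hang while holding the allow-next semaphore.

```python
import select
import socket
import threading


def serve(listener, handle, stop, max_threads=50):
    """Accept loop gated by two semaphores, one thread per connection."""
    allow_next = threading.Semaphore(1)              # "allow-next"
    thread_slots = threading.Semaphore(max_threads)  # "max-thread-counter"
    listener.setblocking(False)

    def worker():
        conn = None
        try:
            conn, _addr = listener.accept()
        except OSError:
            pass  # nothing pending any more, or the listener was closed
        finally:
            allow_next.release()                     # allow-next +1
        try:
            if conn is not None:
                conn.setblocking(True)
                with conn:                           # closes the socket on exit
                    handle(conn)                     # process incoming data
        finally:
            thread_slots.release()                   # max-thread-counter +1

    while not stop.is_set():
        # Short timeout so the loop notices `stop` even when idle.
        readable, _, _ = select.select([listener], [], [], 0.1)
        if not readable:
            continue
        allow_next.acquire()                         # wait on allow-next, then -1
        threading.Thread(target=worker, daemon=True).start()
        thread_slots.acquire()                       # wait on max-thread-counter, then -1
```

It would be started with a bound and listening socket, a handler taking the accepted connection, and a threading.Event used to stop it. Note that select can fire again before the previous thread has called accept, so some threads are created only to find the queue empty; that overhead is part of the design as described, not something this sketch removes.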

Funny thing that an error response can lead up to well over 1000 requests
per second. Perhaps it is the scheduler after all in some way.

The multi tasking one looks roughly the same with the difference that it
discards too fast incoming requests with a 503 error and a calculated try
again later value in the header. The single tasks normally send a rough
estimate of the needed response time back to the main task and those
values are used to kill deadlocked tasks as well as for the try-again later
message. When it maxes out its speed it also runs at 100% system
resources, so no need to look there for a speed up. The task creation just
needs too much time, even for such a small thing as an 11Kbyte server
core.

I am running with SO_LINGER set to 3 seconds in the multithreaded server
now, following the Apache lead. It still causes the server to run into a full
stop after several 10k connections, but at least it comes back again in a
matter of a few seconds, not over a minute like before. Seems I've finally
found the reason for that Linger setting in the other servers.
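
For illustration, setting a 3 second linger on a socket might look like this in Python. It assumes the Linux layout of struct linger (two C ints, l_onoff and l_linger); the helper names are made up.

```python
import socket
import struct

LINGER_FORMAT = "ii"  # struct linger { int l_onoff; int l_linger; } on Linux


def set_linger(sock, seconds):
    """Enable SO_LINGER with the given timeout in seconds."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack(LINGER_FORMAT, 1, seconds))


def get_linger(sock):
    """Return (l_onoff, l_linger) as currently set on the socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize(LINGER_FORMAT))
    return struct.unpack(LINGER_FORMAT, raw)
```

set_linger(conn, 3) would be called on each accepted connection before it is closed. A timeout of 0 is the variant that resets the connection on close and skips TIME_WAIT altogether.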

Nope
05-26-2004, 03:41 PM
Ah yes. When the server is blocked, select is not triggered anymore, the listen queue is empty, and incoming new connection requests seem to be blocked by the OS, just like what would happen when the listen queue is full or set to 0. Could it be that it's the DoS-attack defense shield??

RobSeace
05-26-2004, 07:06 PM
Yes, it sounds quite likely that it's some sort of anti-DOS thing...
Do you see any logged kernel messages in "dmesg" or anything?
Have you played around with the various "/proc/sys/net/ipv4/"
settings to see if it improved things at all?
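
A small Python sketch for dumping a few of those settings in one go, so they can be compared before and after a change. The list of names is just a guess at which ones matter here, and read_ipv4_settings is a hypothetical helper; entries the running kernel does not provide come back as None.

```python
from pathlib import Path

# Settings that seem relevant to this thread; adjust as needed.
DEFAULT_NAMES = (
    "ip_local_port_range",
    "tcp_fin_timeout",
    "tcp_max_tw_buckets",
    "tcp_max_syn_backlog",
    "tcp_syncookies",
)


def read_ipv4_settings(names=DEFAULT_NAMES, base="/proc/sys/net/ipv4"):
    """Return {name: list of value strings, or None if unreadable}."""
    settings = {}
    for name in names:
        try:
            settings[name] = Path(base, name).read_text().split()
        except OSError:
            settings[name] = None  # missing on this kernel or not readable
    return settings
```

Printing the result of read_ipv4_settings() on the server host gives a quick snapshot to post alongside the dmesg output.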

I really hate the SO_LINGER approach, because avoiding the
TIME_WAIT state is dangerous... But, if it works with a non-zero
linger time (ie: it seems to recover from its stuck state after that
given linger time elapses), then I think the problem is not so much
TIME_WAIT as more likely FIN_WAIT*... Because, if the remote
FIN arrived within the given linger time, everything should proceed
as normal, and the socket would still go TIME_WAIT; but, if it failed
to arrive, it'd sit in FIN_WAIT_1 or FIN_WAIT_2 until the linger time
expired, at which point, it'd RST the connection and just fully tear
down the socket... That indicates to me that it is indeed the Windoze
client host that's getting wacked out somehow, and starting to
drop/ignore FINs... (Does "netstat" show FIN_WAIT_* sockets
on the server system?) I don't have nearly as much problem with
this sort of use of SO_LINGER as I do with using it with a 0 linger
time, to just always RST connections, and always avoid the
TIME_WAIT state... At least in this case, most normal connections
should still get a TIME_WAIT, but badly behaved ones that can't
be shutdown properly in a reasonable amount of time get RST...
But, I still just dislike SO_LINGER on general principle, anyway... ;-)

You might want to try setting up a sniffer and see if you can spot
the change in behavior that triggers this when it starts happening...
If it's really some detectable change in behavior anyway, and not
just the Linux box doing some anti-DoS trickery on the lone host
it sees pounding on the same port over and over again... *shrug*
(I'd think that if it were some sort of anti-DoS thing, it would at
least log something somewhere about it... And, probably be
tunable, somehow... Hmmmm... You're not running any kind of
netfilter rules on the server host are you?)

Nope
05-27-2004, 01:30 AM
Currently I am outright confused. I am no longer able to repeat the error
as it was yesterday, not even with my older code base. On the first few
tries the old build did run just fine several times over the 1mil mark. Then
the test program crashed on the windows side. The server still reacted to
requests from a browser located on client and server. The windows dos
box is so dead that it can't restart the connection without a system
re-boot.

Now, after a reboot of both PCs the connection crashed again, with CGI
requests at ~4000 and otherwise at 8000-24000. But now the server program
remains reachable, at least for slow single requests. A newly started test
right afterward would stop after 30-300 reconnects. Netstat shows just
around 4000 sockets in time-wait, not one in any other state. I'd say it's
the Windows side that goes down the drain now.

I think it is really overdue to get rid of the windows wildcard and to get the
other LinuxPC running. I'll do that the next few days and then post again.

:?

RobSeace
05-27-2004, 11:38 AM
Yeah, with ~4000 TIME_WAIT sockets, you're covering the entire
range of 1024-5000, which I'm betting anything the Windoze host
is using as its ephemeral port# range (and, lots of other systems
for years and years have stupidly used, due to a silly typo in
early BSD (http://www.kohala.com/start/borman.97jan30.txt))... So, you'd have a TIME_WAIT socket for every
possible ephemeral port the Windoze box might choose, so when
it attempts to connect with one, it'd get treated as a wandering
duplicate packet from the previous connection that used that port#...
At that point, the Windoze box probably tries to assassinate the
TIME_WAIT socket on the Linux box, but I believe Linux protects
against TIME_WAIT assassination, and so would probably ignore
the RSTs from the Windoze box, and leave its TIME_WAIT intact
for the duration of the timeout... But, this is all just speculation on
my part; if you set up a sniffer and looked at the actual traffic that
occurs during this weirdness, you'd be able to tell for sure...
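
The arithmetic behind that guess can be sketched in a few lines of Python. The inputs are the figures mentioned in this thread (port range 1024-5000, a 60 second TIME_WAIT, roughly 700 requests per second), and the model is deliberately crude: one client host, one server port, every connection leaving a TIME_WAIT behind. The function name is made up.

```python
def port_reuse_budget(low, high, time_wait_seconds, connects_per_second):
    """Return (ports, sustainable_rate, seconds_until_wrap) for one
    client host talking to one server port."""
    ports = high - low + 1
    # Rate at which ports leave TIME_WAIT as fast as they enter it.
    sustainable_rate = ports / time_wait_seconds
    # Time until the client has cycled through every port once.
    seconds_until_wrap = ports / connects_per_second
    return ports, sustainable_rate, seconds_until_wrap


ports, rate, wrap = port_reuse_budget(1024, 5000, 60, 700)
print(ports, round(rate, 1), round(wrap, 2))  # prints: 3977 66.3 5.68
```

So under these assumptions the pool holds 3977 ports, which matches the ~4000 TIME_WAIT sockets netstat shows, and anything much above about 66 connections per second would bring the client back to a port whose old connection is still in TIME_WAIT on the server.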