/* As outlined in RFC 2525, section 2.17, we send a RST here because
* data was lost. To witness the awful effects of the old behavior of
* always doing a FIN, run an older 2.1.x kernel or 2.0.x, start a bulk
* GET in an FTP client, suspend the process, wait for the client to
* advertise a zero window, then kill -9 the FTP client, wheee...
* Note: timeout is always zero in such a case.
*/
Ok, so the RST is explained and well justified by the literature. But what are the “awful effects” of sending FIN instead? Can someone explain?
The RFC explains.
The other side - the server - will be stuck on `send(sock_fd, more_bytes...)`. If it's the 90s and your FTP server is single-threaded, that means the server will appear completely stuck. This won't resolve for several minutes, or possibly forever if the server-side TCP stack is buggy or lacks timeouts.
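To make the "stuck on send()" part concrete, here is a rough sketch of what the sender sees when the peer never reads. It is an illustration only: `send_until_blocked` is a made-up helper, the 4096-byte chunk size is arbitrary, and `SO_SNDTIMEO` is just one way to put a bound on a blocking send (the thread does not say any FTP server did this). Without the timeout, the last `send()` would simply never return.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical helper: keep sending to a peer that never reads until
 * send() fails. Returns the errno of the failing call and stores the
 * number of bytes accepted before that in *sent. timeout_ms must be
 * non-zero; a zero timeval means "block forever". */
static int send_until_blocked(int fd, int timeout_ms, size_t *sent)
{
    struct timeval tv;
    char chunk[4096];

    *sent = 0;
    memset(chunk, 0, sizeof(chunk));
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* Bound how long a blocking send() may wait for buffer space. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0)
        return errno;

    for (;;) {
        ssize_t n = send(fd, chunk, sizeof(chunk), 0);
        if (n < 0)
            return errno;      /* typically EAGAIN once the timeout fires */
        *sent += (size_t)n;    /* a partial send still counts */
    }
}
```

An `AF_UNIX` `socketpair()` whose other end is never read works as a stand-in for a TCP peer advertising a zero window: the buffers fill up, and the next `send()` waits for the timeout and then fails.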
The client's connection, even after the process is gone, will still be alive in the kernel. It will be in FIN-WAIT-2, waiting for the server to send its FIN. The remote sender will be stuck waiting for the zero-window state to pass, sending probes to figure out when the receive window opens (which will never happen). Things will be stuck in this state until one of the sides times out, which could be several minutes. Or if the implementations are buggy, it could be permanent.
From RFC 2525, 2.17
Failure to reset the connection can lead to permanently hung
connections, in which the remote endpoint takes no further action
to tear down the connection because it is waiting on the local TCP
to first take some action. This is particularly the case if the
local TCP also allows the advertised window to go to zero, and
fails to tear down the connection when the remote TCP engages in
"persist" probes (see example below).
https://datatracker.ietf.org/doc/html/rfc2525#page-50
In some protocols, end-of-file has semantic meaning that all data has been transferred, and TCP is set up such that you should be able to rely on that - if you can't rely on that difference, it is a bug in a TCP stack along the way.
FIN also has a sequence number, so you can wait to ACK it until you get the corresponding data if it is dropped or out of order.
A TCP RST, on the other hand, says the other side will not resend anything that wasn't ACKed: the connection is reset. Further, once an RST has been received, the downloading client usually cannot even read the data still sitting in its receive window - that might be hundreds of KB of missed data.
RST and FIN have very different semantics.
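From the application's side, the difference shows up in how the read loop ends. The following is a minimal sketch, not anyone's production code: `read_to_end` and `enum stream_end` are names invented for the illustration. A return of 0 from `recv()` is the peer's FIN (everything sent was delivered), while `ECONNRESET` means the stream may have been cut short.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical result type: how did the stream end? */
enum stream_end { STREAM_EOF, STREAM_RESET, STREAM_ERROR };

/* Read until the connection ends; *total receives the byte count. */
static enum stream_end read_to_end(int fd, size_t *total)
{
    char buf[4096];

    *total = 0;
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) {
            *total += (size_t)n;
            continue;
        }
        if (n == 0)
            return STREAM_EOF;     /* orderly FIN: nothing was lost */
        if (errno == EINTR)
            continue;              /* interrupted by a signal, retry */
        /* A reset means *total may be less than what the peer sent. */
        return errno == ECONNRESET ? STREAM_RESET : STREAM_ERROR;
    }
}
```

A protocol that treats end-of-file as "transfer complete" should only do so on `STREAM_EOF`; on `STREAM_RESET` the byte count cannot be trusted.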
Reading the post, if gunicorn is e.g. sending a 404 after seeing a POST to a path it doesn't know about before reading the body, the client will never get the 404 because gunicorn hasn't read the message body.
This case is partly why "Expect: 100-continue" exists, so it will be properly handled, even if it does introduce an extra round-trip lag in the POST.
It might be dangerous to have your protocol rely on a piece of TCP that is often incorrectly implemented.
FTP has different data and command connections so the server may not have an outstanding read to detect the data connection break.
But... it should still clean up both when the command connection dies.
Also, I think the state of the art hasn't really changed? If you don't want a reset, you need to read everything from the socket before you close. If you don't really care about a reset as long as it doesn't interrupt the reader, you can shut down in your direction and hand the socket off to something that will wait "long enough" before it closes. In an event-loop architecture, you can just put it in as a deferred task; in process-per-connection, you should probably send the socket to a dedicated lingering-closer process that doesn't interrupt your flow.
There's a part 2. It's only linked at the top for some reason, not at the bottom.
Part 2 says they tried shutdown and it didn't change anything.
For the case here the server should call shutdown with SHUT_WR after sending the data and then drain the incoming data before closing the socket.
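The shutdown-then-drain sequence can be sketched roughly like this. `drain_and_close` is a hypothetical name, and the loop has no time or byte limit, which a real server would want (the "long enough" mentioned earlier in the thread); treat it as an outline of the idea rather than a finished lingering close.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: finish sending, discard whatever the peer still
 * sends, then close. Returns the number of bytes discarded, or -1 if
 * shutdown() failed. Blocks until the peer closes its side. */
static long drain_and_close(int fd)
{
    char buf[4096];
    long drained = 0;

    /* Send our FIN: the peer sees EOF after the data already written. */
    if (shutdown(fd, SHUT_WR) < 0) {
        close(fd);
        return -1;
    }

    /* Read and discard, so close() finds no unread data and has no
     * reason to send a RST. */
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) {
            drained += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        break;  /* 0: peer's FIN arrived; <0: connection already broken */
    }

    close(fd);
    return drained;
}
```

In an event loop the same steps would be spread across readiness callbacks with a deadline instead of a blocking loop.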
In this situation we were discarding the HTTP response without reading it before closing, which kept Go from reusing the connection. I didn't dig quite as deep as this post's author, but I imagine the same RST behavior was happening under the hood.
Um, yes? That's how TCP has been universally implemented for more than 30 years. See [0], 2.17 for discussion.
That's not what's happening here. The server is closing the socket when there's data from the client that it hasn't read.
If the client does a read() after the RST gets there, any data in the receive window is gone, any packets that might have been dropped and need to be resent are gone, etc.
Here's a little reproducer: https://gist.github.com/jcalvinowens/da57edda9a01ca9f4c4088a...
$ gcc -O2 test.c -o test
$ strace -e socket,connect,write,accept,read,close ./test --rx
<...>
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
accept(3, NULL, NULL) = 4
close(3) = 0
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
<...>
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 3035
read(4, "", 4096) = 0
close(4) = 0
+++ exited with 0 +++
$ strace -e socket,connect,write,accept,read,close ./test --tx
<...>
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(31337), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 600000) = 600000
close(3) = 0
+++ exited with 0 +++
...versus:
$ gcc -O2 -DWRITE_TO_SOCKET_BEFORE_READ test.c -o test
$ strace -e socket,connect,write,accept,read,close ./test --rx
<...>
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
accept(3, NULL, NULL) = 4
close(3) = 0
write(4, "\250\3\0\0\0\0\0\0\250\3\0\0\0\0\0\0$\0\0\0\0\0\0\0$\0\0\0\0\0\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
<...>
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 997
read(4, 0x7ffd45c2d3c0, 4096) = -1 ECONNRESET (Connection reset by peer)
<...>
+++ exited with 1 +++
$ strace -e socket,connect,write,accept,read,close ./test --tx
<...>
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(31337), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 600000) = 600000
close(3)
+++ exited with 0 +++
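For anyone without the gist at hand, the core effect (closing a socket that still has unread data) can be approximated in a single process over loopback. This is an independent sketch, not the code from the gist; `close_with_unread_data` is a made-up name, and the result depends on the TCP stack. On Linux I would expect the final `recv()` to fail with `ECONNRESET` rather than return a clean EOF.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: the "server" closes while one byte from the
 * "client" is still unread. Returns 0 if the client then sees a clean
 * EOF, the errno of its recv() otherwise, or -1 if setup failed. */
static int close_with_unread_data(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    char byte = 'x';
    ssize_t n;
    int result = -1;
    int lfd, cfd, sfd = -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                      /* let the kernel pick a port */

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0 || cfd < 0) goto out;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;
    if (listen(lfd, 1) < 0) goto out;
    if (getsockname(lfd, (struct sockaddr *)&addr, &len) < 0) goto out;

    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;
    if (send(cfd, &byte, 1, 0) != 1) goto out;

    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0) goto out;
    /* MSG_PEEK waits until the byte has arrived but leaves it unread. */
    if (recv(sfd, &byte, 1, MSG_PEEK) != 1) goto out;
    close(sfd);                             /* close with unread data */
    sfd = -1;

    n = recv(cfd, &byte, 1, 0);             /* what does the client see? */
    result = (n < 0) ? errno : 0;
out:
    if (sfd >= 0) close(sfd);
    if (cfd >= 0) close(cfd);
    if (lfd >= 0) close(lfd);
    return result;
}
```

Replacing the `MSG_PEEK` call with a normal `recv()` that consumes the byte should turn the reset back into an ordinary FIN, which is the same contrast the two strace runs above show.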