As of December 2010, the performance of the SCTP implementation in Linux is severely lacking. One problem is the user-configured, fixed kernel buffer size, which rarely provides a good trade-off between memory and speed. This document describes my work towards fixing that problem.
Each socket's buffer space does not actually exist as a single address
range. Instead, a limit is set per socket on the aggregate size of all
allocated buffers. These limits are stored in the variables sk_sndbuf
and sk_rcvbuf.
The memory for each chunk is allocated by calling
__alloc_skb, alias alloc_skb, which returns a
struct sk_buff *. These structures are used to form
linked lists.
Buffer memory accounting is performed by calling the
sctp_set_owner_w, sctp_wfree,
sctp_skb_set_owner_r and sctp_sock_rfree
functions.
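The accounting idea can be sketched in user space. This is a simplified model, not kernel code: `struct model_sock`, `model_charge` and `model_uncharge` are hypothetical names, and the real functions operate on `struct sk_buff` and atomic counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of per-socket send buffer accounting.
 * The fields play the roles of sk_sndbuf and sk_wmem_alloc. */
struct model_sock {
	size_t sndbuf;     /* limit on the aggregate size of queued chunks */
	size_t wmem_alloc; /* bytes currently charged to the socket */
};

/* Charge a chunk to the socket, roughly what sctp_set_owner_w accounts for.
 * Returns false when the limit would be exceeded, i.e. a blocking sender
 * would have to wait. */
static bool model_charge(struct model_sock *sk, size_t len)
{
	if (sk->wmem_alloc + len > sk->sndbuf)
		return false;
	sk->wmem_alloc += len;
	return true;
}

/* Release a chunk, roughly what sctp_wfree accounts for. In the kernel
 * this is also the point where waiting senders are woken up. */
static void model_uncharge(struct model_sock *sk, size_t len)
{
	assert(len <= sk->wmem_alloc);
	sk->wmem_alloc -= len;
}
```

The point of the model is that no contiguous buffer exists: only the running total is compared against the limit.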
When a blocking sender tries to send data on a socket that has run out
of send buffer space, it needs to wait. This is accomplished by calling
sctp_wait_for_sndbuf. The waiting processes are awakened by
calls to sctp_wfree.
There already are a few sysctl variables for this:
net.sctp.sctp_mem = 93504 124672 187008
net.sctp.sctp_rmem = 4096 349500 3989504
net.sctp.sctp_wmem = 4096 16384 3989504
sysctl_sctp_mem defines global limits measured in pages.
sysctl_sctp_rmem and sysctl_sctp_wmem define per
socket limits measured in bytes. As of Linux 2.6.37, these
variables are not used for anything.
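Because the two kinds of limits use different units, a small helper makes the values comparable. The sketch below parses one sysctl triple and converts a page count to bytes; `parse_triple` and `pages_to_bytes` are hypothetical helpers, and the 4096 byte page size used in the usage note is an assumption that depends on the architecture.

```c
#include <assert.h>
#include <stdio.h>

/* Parse a "min default max" triple as printed by sysctl.
 * Returns 0 on success, -1 if fewer than three numbers were found. */
static int parse_triple(const char *text, long out[3])
{
	int n = sscanf(text, "%ld %ld %ld", &out[0], &out[1], &out[2]);

	return n == 3 ? 0 : -1;
}

/* Convert a page count (the unit of sctp_mem) to bytes.
 * page_size is supplied by the caller; it is not read from the system. */
static long long pages_to_bytes(long pages, long page_size)
{
	assert(pages >= 0 && page_size > 0);
	return (long long)pages * page_size;
}
```

With 4096 byte pages, the upper global limit of 187008 pages corresponds to 765984768 bytes, far above the per-socket maximum of 3989504 bytes.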
The nine sysctl variables defined for SCTP have similarly named counterparts in the TCP implementation. Only six of these are actually used:
sysctl_tcp_rmem[0], sysctl_tcp_wmem[0] and
sysctl_tcp_mem[1] are never
used for any purpose in Linux 2.6.37, contrary to claims made in the
documentation.
sysctl_tcp_rmem[1] and sysctl_tcp_wmem[1] are
used to set the initial values of sk_rcvbuf and
sk_sndbuf, upon socket creation.
sysctl_tcp_wmem[2] is used in two places:
[…] sysctl_tcp_wmem[2].
tcp_should_expand_sndbuf is called to determine if the send
buffer should be increased. If so, it is adjusted to twice the size of
the congestion window, clamped to sysctl_tcp_wmem[2].
tcp_should_expand_sndbuf checks for a few conditions, all
of which must be false:
[…] sysctl_tcp_mem[0] pages?
After initialization, sk_rcvbuf is adjusted in three places.
[…] sysctl_tcp_rmem[2].
[…] sysctl_tcp_rmem[2].
tcp_rcv_space_adjust is called to determine the ideal
receive buffer size.
tcp_rcv_space_adjust calculates the amount of data passed
into user-space over the course of one round-trip, times 2. The receive
buffer size is increased up to this value, if it is greater than the
current size.
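The receive-side measurement can be sketched as a small user-space model. This is an illustration of the idea only, not the kernel's tcp_rcv_space_adjust: `struct rcv_model` and `rcv_model_adjust` are hypothetical names, time is counted in abstract ticks, and `rcvbuf_max` stands in for sysctl_tcp_rmem[2].

```c
#include <assert.h>

/* Hypothetical model state; the names are illustrative. */
struct rcv_model {
	unsigned long start_time; /* start of the measurement period, in ticks */
	unsigned int delivered;   /* bytes passed to user space this period */
	unsigned int rcvbuf;      /* current receive buffer limit */
	unsigned int rcvbuf_max;  /* upper bound on the receive buffer limit */
};

/* Account for len bytes delivered to user space at time now.
 * rtt is the current round-trip estimate in ticks (0 means "none yet").
 * Returns the receive buffer limit after the call. */
static unsigned int rcv_model_adjust(struct rcv_model *m, unsigned long now,
				     unsigned long rtt, unsigned int len)
{
	unsigned int target;

	assert(m->rcvbuf <= m->rcvbuf_max);
	m->delivered += len;

	/* Without an RTT estimate, or before one RTT has passed, do nothing. */
	if (rtt == 0 || now - m->start_time < rtt)
		return m->rcvbuf;

	/* Twice the data delivered during one round-trip, clamped. */
	target = m->delivered * 2;
	if (target > m->rcvbuf_max)
		target = m->rcvbuf_max;

	/* The buffer only ever grows. */
	if (target > m->rcvbuf)
		m->rcvbuf = target;

	/* Start a new measurement period. */
	m->start_time = now;
	m->delivered = 0;
	return m->rcvbuf;
}
```

Note that the model never changes the limit while `rtt` is 0, so a receiver without an RTT estimate keeps its initial buffer size.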
sysctl_tcp_mem[2] is used in the function
tcp_too_many_orphans.
tcp_too_many_orphans is used to determine if the number of
orphaned sockets is past a system-wide limit, or the global TCP memory
usage is past the limit defined by sysctl_tcp_mem[2]. This
function is called in a couple of places to determine whether a socket's
resources should be freed early in the tear-down phase.
Unlike for TCP, the size of the SCTP send buffer decides the largest
message one can pass to sendmsg. Thus, a user may break some
applications by setting the buffer size too small. For example, the
lksctp-tools unit tests expect support for 32768 byte messages, while
the default send buffer size is 16384 bytes.
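The size constraint amounts to a single comparison. The sketch below is a user-space model of that check, assuming the failure is reported as EMSGSIZE; `model_check_msg_size` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Model of the message size check: a message larger than the send
 * buffer limit can never be queued, no matter how long the sender waits.
 * Returns 0 if the message may be queued, -EMSGSIZE otherwise. */
static int model_check_msg_size(size_t msg_len, size_t sndbuf)
{
	assert(sndbuf > 0);
	return msg_len > sndbuf ? -EMSGSIZE : 0;
}
```

With the numbers from the text, a 32768 byte message is rejected against a 16384 byte send buffer but accepted against a 122880 byte one.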
One problem with the algorithm for determining the receive buffer size is that it requires an RTT estimate. This estimate requires at least one outbound data packet. For connections with unidirectional data flow, like the data connections used in FTP (IIRC), the receiving end does not send any data back and thus never gets any RTT estimate. The receive buffer size will consequently remain constant.
Computer X has two ethernet interfaces. One interface is connected to Computer A, while the other interface is connected to a switch. Computer B is connected to the same switch. The tests are performed between Computer A and Computer B.
Data is sent from Computer A to Computer B. Computer B echoes all data it receives back to the originating host. The amount of data received back by Computer A is used for calculating the transfer rate.
The kernel buffer sizes are polled for every 4096 bytes sent.
All tests ran for 10 seconds. Delays of 10ms and 100ms were added on
Computer X using /sbin/tc qdisc add dev eth1 root netem delay 10ms
(with 100ms substituted for the longer delay).

SCTP with auto buffer tuning starts out with a 16384 byte send buffer, while SCTP from Linux 2.6.37 uses a constant 122880 byte buffer. This explains why auto buffer tuning sometimes appears to perform worse. The solution is to fix whatever is causing fluctuations in the adjustment of congestion and receive windows. A short-term fix is to set the default buffer size to 122880 bytes.
These are the buffer sizes as they develop on Computer B during the 10 second tests for 0ms, 10ms and 100ms frame delay. Where the lines end, the buffer sizes become stable.
[Plots, two each for 0 ms, 10 ms and 100 ms delay. Y = bytes, X = seconds.]
Disclaimer: This is my first time working on protocols below the transport layer, and I'm not entirely sure that what I'm writing makes any sense or is not blindingly obvious.
With this patch, transfer rates are much less affected by network latency. However, according to some quick calculations, an ideal slow-starting protocol running under similar conditions to these tests should be able to transfer more than 40 MB/s during the first 10 seconds at 40 ms RTT. Neither SCTP nor TCP comes close to this. Also, SCTP performs much worse than TCP at lower latencies, though the difference decreases sharply as latency increases.
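A calculation of this kind can be sketched as follows. This is not the calculation referred to above, only a rough model under stated assumptions: the window starts at 4380 bytes (the initial cwnd that sctp_transport_reset yields for a 1500 byte path MTU), doubles once per round-trip, and is capped by an assumed link rate of 100 MB/s, with no loss and no receiver limit. `ideal_slow_start_bytes` is a hypothetical helper.

```c
#include <assert.h>

/* Bytes delivered by an idealized slow-starting sender in duration_ms.
 * The window starts at init_cwnd, doubles every round-trip, and each
 * round carries at most what the link can transport in one RTT. */
static unsigned long long ideal_slow_start_bytes(unsigned int rtt_ms,
						 unsigned int duration_ms,
						 unsigned long long init_cwnd,
						 unsigned long long link_bytes_per_s)
{
	unsigned long long cap, cwnd = init_cwnd, total = 0;
	unsigned int rounds, i;

	assert(rtt_ms > 0);
	cap = link_bytes_per_s * rtt_ms / 1000; /* bytes the link carries per RTT */
	rounds = duration_ms / rtt_ms;          /* complete round-trips in the test */

	for (i = 0; i < rounds; i++) {
		total += cwnd < cap ? cwnd : cap;
		if (cwnd < cap)
			cwnd *= 2; /* exponential growth until the link is full */
	}
	return total;
}
```

Under these assumptions, `ideal_slow_start_bytes(40, 10000, 4380, 100000000ULL)` fills the link after about ten round-trips and averages roughly 96 MB/s over the 10 seconds, which is consistent with the "more than 40 MB/s" estimate.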
I suspect there are race conditions that cause both SCTP and TCP to leave the exponential phase early because reports of increased receiver windows do not arrive early enough. A possible fix would be to resume exponential growth when the receive window increases and there has been no packet loss. If this is in fact what is happening, it could explain the occasional extremely poor performance: if the exponential phase is left during the first dozen packets or so, it could take a long time to reach 100 MB/s.
As for the difference between TCP and SCTP at 0ms latency, I'm inclined to attribute this to the branch-heavy nature of the Linux SCTP code. There are many function calls and linked list operations involved in getting a single SCTP packet into the network interface buffer. It seems to me like the code is designed for beauty of structure rather than minimal CPU cache misses and branching. The profile of the SCTP code (as gathered by OProfile) is rather flat, meaning that the heavy work is not condensed into one function.
diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h |
index 505845d..e06d757 100644 |
--- a/include/net/sctp/sctp.h |
+++ b/include/net/sctp/sctp.h |
@@ -128,6 +128,10 @@ extern int sctp_register_pf(struct sctp_pf *, sa_family_t); |
int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb); |
int sctp_inet_listen(struct socket *sock, int backlog); |
void sctp_write_space(struct sock *sk); |
+void sctp_raise_sndbuf(struct sctp_transport *transport); |
+void sctp_raise_rcvbuf(struct sctp_association *asoc, int rcvmem); |
+unsigned int sctp_rcvbuf_adjust(struct sctp_association *asoc, |
+ unsigned int len); |
void sctp_data_ready(struct sock *sk, int len); |
unsigned int sctp_poll(struct file *file, struct socket *sock, |
poll_table *wait); |
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h |
index 69fef4f..fa160bb 100644 |
--- a/include/net/sctp/structs.h |
+++ b/include/net/sctp/structs.h |
@@ -274,6 +274,10 @@ extern struct sctp_globals { |
#define sctp_checksum_disable (sctp_globals.checksum_disable) |
#define sctp_rwnd_upd_shift (sctp_globals.rwnd_update_shift) |
|
| These variables are now needed in several translation units, so I thought it was best to place them next to the global struct in <net/sctp/structs.h> |
+extern long sysctl_sctp_mem[3]; |
+extern int sysctl_sctp_rmem[3]; |
+extern int sysctl_sctp_wmem[3]; |
+ |
/* SCTP Socket type: UDP or TCP style. */ |
typedef enum { |
SCTP_SOCKET_UDP = 0, |
@@ -1754,6 +1758,17 @@ struct sctp_association { |
*/ |
__u32 rwnd_press; |
|
| The corresponding variables in the TCP implementation are also kept in a struct called rcvq_space. I don't think this is the best name, but in this case I value consistency over clarity. |
+ /* Timer used for calculating the appropriate receive buffer space. */ |
+ struct { |
+ /* The starting time of the current measurement period. */ |
+ unsigned long start_time; |
+ |
+ /* The number of bytes delivered to user space in the current |
+ * measurement period. |
+ */ |
+ int delivered; |
+ } rcvq_space; |
+ |
/* This is the sndbuf size in use for the association. |
* This corresponds to the sndbuf size for the association, |
* as specified in the sk->sndbuf. |
diff --git a/net/sctp/associola.c b/net/sctp/associola.c |
index 5f1fb8b..7ad88f5 100644 |
--- a/net/sctp/associola.c |
+++ b/net/sctp/associola.c |
@@ -1462,7 +1462,9 @@ void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned len) |
/* Decrease asoc's rwnd by len. */ |
void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned len) |
{ |
| The socket is now referenced so often in this function that I decided to put it into a separate variable. |
+ struct sock *sk = asoc->base.sk; |
int rx_count; |
| The desired value of sk->sk_rcvbuf is called rcvmem in the TCP implementation. We do the same here. |
+ int rcvmem; |
int over = 0; |
|
SCTP_ASSERT(asoc->rwnd, "rwnd zero", return); |
@@ -1471,14 +1473,19 @@ void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned len) |
if (asoc->ep->rcvbuf_policy) |
rx_count = atomic_read(&asoc->rmem_alloc); |
else |
- rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc); |
+ rx_count = atomic_read(&sk->sk_rmem_alloc); |
+ |
| When we receive some data that would overflow the receive buffer, we try to increase the size of the receive buffer before resorting to choking the sender. |
+ if (rx_count >= sk->sk_rcvbuf) { |
+ rcvmem = min_t(int, rx_count + len, sysctl_sctp_rmem[2]); |
+ sctp_raise_rcvbuf(asoc, rcvmem); |
+ } |
|
/* If we've reached or overflowed our receive buffer, announce |
* a 0 rwnd if rwnd would still be positive. Store the |
* the pottential pressure overflow so that the window can be restored |
* back to original value. |
*/ |
- if (rx_count >= asoc->base.sk->sk_rcvbuf) |
+ if (rx_count >= sk->sk_rcvbuf) |
over = 1; |
|
if (asoc->rwnd >= len) { |
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c |
index 2cc46f0..9c0629a 100644 |
--- a/net/sctp/sm_make_chunk.c |
+++ b/net/sctp/sm_make_chunk.c |
@@ -2352,7 +2352,7 @@ int sctp_process_init(struct sctp_association *asoc, sctp_cid_t cid, |
*/ |
list_for_each_entry(transport, &asoc->peer.transport_addr_list, |
transports) { |
| The peer's initial advertised receive window is rarely an indication of the transport's bandwidth-delay product. The send buffer is typically set to twice the size of the congestion window, so I decided to set the slow-start threshold to half the maximum size of the send buffer. The SCTP RFC allows any value here. This change is made in two places. |
- transport->ssthresh = asoc->peer.i.a_rwnd; |
+ transport->ssthresh = sysctl_sctp_wmem[2] / 2; |
} |
|
/* Set up the TSN tracking pieces. */ |
diff --git a/net/sctp/socket.c b/net/sctp/socket.c |
index fff0926..b79d670 100644 |
--- a/net/sctp/socket.c |
+++ b/net/sctp/socket.c |
@@ -1946,6 +1946,7 @@ SCTP_STATIC int sctp_recvmsg(struct kiocb *iocb, struct sock *sk, |
struct sctp_ulpevent *event = NULL; |
struct sctp_sock *sp = sctp_sk(sk); |
struct sk_buff *skb; |
+ int rcvbuf_increment; |
int copied; |
int err = 0; |
int skb_len; |
@@ -2016,8 +2017,10 @@ SCTP_STATIC int sctp_recvmsg(struct kiocb *iocb, struct sock *sk, |
* rwnd by that amount. If all the data in the skb is read, |
* rwnd is updated when the event is freed. |
*/ |
| When data is delivered to user space, report the amount to sctp_rcvbuf_adjust. If the receive buffer size is updated, the difference is returned and stored in rcvbuf_increment. This amount is then added to the rwnd increment. |
- if (!sctp_ulpevent_is_notification(event)) |
- sctp_assoc_rwnd_increase(event->asoc, copied); |
+ if (!sctp_ulpevent_is_notification(event)) { |
+ rcvbuf_increment = sctp_rcvbuf_adjust(event->asoc, len); |
+ sctp_assoc_rwnd_increase(event->asoc, copied + rcvbuf_increment); |
+ } |
goto out; |
} else if ((event->msg_flags & MSG_NOTIFICATION) || |
(event->msg_flags & MSG_EOR)) |
@@ -3769,6 +3772,9 @@ SCTP_STATIC int sctp_init_sock(struct sock *sk) |
|
SCTP_DBG_OBJCNT_INC(sock); |
|
| This is where we initialize the socket buffer sizes to the configured defaults. |
+ sk->sk_sndbuf = sysctl_sctp_wmem[1]; |
+ sk->sk_rcvbuf = sysctl_sctp_rmem[1]; |
+ |
local_bh_disable(); |
percpu_counter_inc(&sctp_sockets_allocated); |
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1); |
@@ -6253,6 +6259,83 @@ void sctp_write_space(struct sock *sk) |
} |
} |
|
| This function raises the socket send buffer size to twice the size of the congestion window, subject to certain limitations. These are the same limitations used in the TCP implementation. |
+void sctp_raise_sndbuf(struct sctp_transport *transport) |
+{ |
+ struct sctp_association *asoc = transport->asoc; |
+ struct sock *sk = asoc->base.sk; |
+ int sndmem; |
+ |
+ sndmem = min_t(int, transport->cwnd * 2, sysctl_sctp_wmem[2]); |
+ |
+ if (sk->sk_sndbuf >= sndmem) |
+ return; |
+ |
+ if (sk->sk_userlocks & SOCK_SNDBUF_LOCK) |
+ return; |
+ |
+ if (sctp_memory_pressure) |
+ return; |
+ |
+ if (atomic_long_read(&sctp_memory_allocated) >= sysctl_sctp_mem[0]) |
+ return; |
+ |
+ if (transport->flight_size >= transport->cwnd) |
+ return; |
+ |
+ sk->sk_sndbuf = sndmem; |
+} |
+ |
| This function raises the socket receive buffer size to a specified value, subject to limitations similar to those in the send buffer code. |
+void sctp_raise_rcvbuf(struct sctp_association *asoc, int rcvmem) |
+{ |
+ struct sock *sk = asoc->base.sk; |
+ |
+ if (sk->sk_rcvbuf >= rcvmem) |
+ return; |
+ |
+ if (sk->sk_userlocks & SOCK_RCVBUF_LOCK) |
+ return; |
+ |
+ if (sctp_memory_pressure) |
+ return; |
+ |
+ if (atomic_long_read(&sctp_memory_allocated) >= sysctl_sctp_mem[0]) |
+ return; |
+ |
+ sk->sk_rcvbuf = rcvmem; |
+} |
+ |
| This function measures how much data is copied into user space over the course of one RTT, times 2. If the receive buffer is smaller than this amount, it is increased. |
+unsigned int sctp_rcvbuf_adjust(struct sctp_association *asoc, unsigned int len) |
+{ |
+ struct sctp_transport *transport = asoc->peer.last_data_from; |
+ struct sock *sk = asoc->base.sk; |
+ unsigned long time; |
+ unsigned increment = 0; |
+ int rcvmem = 0; |
+ |
+ if (!asoc->rcvq_space.start_time) |
+ goto new_measure; |
+ |
+ asoc->rcvq_space.delivered += len; |
+ |
+ time = jiffies - asoc->rcvq_space.start_time; |
+ |
+ if (time < transport->rtt || !transport->rtt) |
+ return 0; |
+ |
+ rcvmem = min_t(int, asoc->rcvq_space.delivered * 2, sysctl_sctp_rmem[2]); |
+ |
+ if (sk->sk_rcvbuf < rcvmem) { |
+ increment = rcvmem - sk->sk_rcvbuf; |
+ sctp_raise_rcvbuf(asoc, rcvmem); |
+ } |
+ |
+new_measure: |
+ asoc->rcvq_space.start_time = jiffies; |
+ asoc->rcvq_space.delivered = 0; |
+ |
+ return increment; |
+} |
+ |
/* Is there any sndbuf space available on the socket? |
* |
* Note that sk_wmem_alloc is the sum of the send buffers on all of the |
diff --git a/net/sctp/transport.c b/net/sctp/transport.c |
index d3ae493..4c1768e 100644 |
--- a/net/sctp/transport.c |
+++ b/net/sctp/transport.c |
@@ -466,6 +466,8 @@ void sctp_transport_raise_cwnd(struct sctp_transport *transport, |
|
transport->cwnd = cwnd; |
transport->partial_bytes_acked = pba; |
+ |
| cwnd was (possibly) just updated, so we need to call sctp_raise_sndbuf to maintain the send buffer at twice the size of the congestion window. |
+ sctp_raise_sndbuf(transport); |
} |
|
/* This routine is used to lower the transport's cwnd when congestion is |
@@ -621,7 +623,7 @@ void sctp_transport_reset(struct sctp_transport *t) |
*/ |
t->cwnd = min(4*asoc->pathmtu, max_t(__u32, 2*asoc->pathmtu, 4380)); |
t->burst_limited = 0; |
| This is the same change to the initial ssthresh as described above. |
- t->ssthresh = asoc->peer.i.a_rwnd; |
+ t->ssthresh = sysctl_sctp_wmem[2] / 2; |
t->rto = asoc->rto_initial; |
t->rtt = 0; |
t->srtt = 0; |
diff --git a/net/sctp/ulpevent.c b/net/sctp/ulpevent.c |
index aa72e89..4395e52b 100644 |
--- a/net/sctp/ulpevent.c |
+++ b/net/sctp/ulpevent.c |
@@ -987,7 +987,7 @@ static void sctp_ulpevent_receive_data(struct sctp_ulpevent *event, |
static void sctp_ulpevent_release_data(struct sctp_ulpevent *event) |
{ |
struct sk_buff *skb, *frag; |
- unsigned int len; |
+ unsigned int len, rcvbuf_increment; |
|
/* Current stack structures assume that the rcv buffer is |
* per socket. For UDP style sockets this is not true as |
@@ -1012,7 +1012,8 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event) |
} |
|
done: |
| This is the second place where we measure the amount of data passed to user space. See above for details. |
- sctp_assoc_rwnd_increase(event->asoc, len); |
+ rcvbuf_increment = sctp_rcvbuf_adjust(event->asoc, len); |
+ sctp_assoc_rwnd_increase(event->asoc, len + rcvbuf_increment); |
sctp_ulpevent_release_owner(event); |
} |
TCP memory documentation, created by Ian McDonald when trying to implement memory management for Net:DCCP.