Implementing SCTP Pluggable Congestion Control for Linux

As of March 2011, the Linux kernel's SCTP implementation (LKSCTP) supports only a single congestion control algorithm, similar to TCP Reno. This document describes my work towards implementing pluggable congestion control, so that algorithms like TCP BIC and TCP CUBIC can optionally be used instead.

The key factors preventing us from using the TCP plugins directly in LKSCTP are:

  1. SCTP performs congestion control per transport rather than per socket
  2. SCTP's congestion window is measured in bytes, whereas TCP's is measured in segments (packets)

How does LKSCTP do congestion control today?

Asked another way: what causes the cwnd variable to be updated?

Most of this is handled by sctp_transport_raise_cwnd() and sctp_transport_lower_cwnd().
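
Both follow the byte-counting rules of RFC 2960, Section 7.2. As a reference point for the rest of this document, here is a simplified sketch of what sctp_transport_raise_cwnd() does when a SACK advances the cumulative TSN; this is not the verbatim kernel code, which also checks that the window is actually being used, handles fast recovery, and prints debug output:

/* Simplified sketch of the existing byte-based logic in
 * sctp_transport_raise_cwnd() (net/sctp/transport.c). */
static void raise_cwnd_sketch(struct sctp_transport *t, __u32 bytes_acked)
{
	__u32 pmtu = t->asoc->pathmtu;

	if (t->cwnd <= t->ssthresh) {
		/* Slow start: grow by at most one MTU per SACK
		 * (RFC 2960, Section 7.2.1). */
		t->cwnd += min(bytes_acked, pmtu);
	} else {
		/* Congestion avoidance: one MTU per cwnd bytes acked
		 * (RFC 2960, Section 7.2.2). */
		t->partial_bytes_acked += bytes_acked;
		if (t->partial_bytes_acked >= t->cwnd &&
		    t->flight_size >= t->cwnd) {
			t->partial_bytes_acked -= t->cwnd;
			t->cwnd += pmtu;
		}
	}
}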

What's the interface of a TCP congestion control plugin?

Function     Description
-----------  --------------------------------------------------------------
ssthresh     recalculate the slow start threshold after a congestion event
cong_avoid   grow the congestion window after ACKing some data
min_cwnd     return the lower bound of the congestion window
set_state    called to inform the plugin about changes in congestion state
cwnd_event   called to inform the plugin about congestion events
undo_cwnd    return the new value of cwnd after loss
pkts_acked   hook for packet ACK accounting
get_info     get info for inet_diag
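
These hooks are collected in struct tcp_congestion_ops in include/net/tcp.h. The sketch below is quoted from memory for the kernel versions discussed here, so treat the exact comments and field order as approximate; the important points are that every hook takes a struct sock * (factor 1 above), and that cong_avoid() operates on a window counted in segments (factor 2):

struct tcp_congestion_ops {
	struct list_head	list;
	unsigned long		flags;

	/* initialize private data (optional) */
	void (*init)(struct sock *sk);
	/* cleanup private data (optional) */
	void (*release)(struct sock *sk);

	/* return slow start threshold (required) */
	u32 (*ssthresh)(struct sock *sk);
	/* lower bound for congestion window (optional) */
	u32 (*min_cwnd)(const struct sock *sk);
	/* do new cwnd calculation (required) */
	void (*cong_avoid)(struct sock *sk, u32 ack, u32 in_flight);
	/* call before changing ca_state (optional) */
	void (*set_state)(struct sock *sk, u8 new_state);
	/* call when cwnd event occurs (optional) */
	void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
	/* new value of cwnd after loss (optional) */
	u32 (*undo_cwnd)(struct sock *sk);
	/* hook for packet ack accounting (optional) */
	void (*pkts_acked)(struct sock *sk, u32 num_acked, s32 rtt_us);
	/* get info for inet_diag (optional) */
	void (*get_info)(struct sock *sk, u32 ext, struct sk_buff *skb);

	char		name[TCP_CA_NAME_MAX];
	struct module	*owner;
};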

How do these interfaces map to each other?

The existing sctp_transport_lower_cwnd() function can be patched to use the plugins' ssthresh() functions when recalculating ssthresh. set_state() must be called before ssthresh() to inform the plugin how the congestion was detected.

sctp_transport_raise_cwnd() can be replaced almost directly with a call to the plugin's cong_avoid() function.
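
As a rough sketch of the mapping, using the names introduced by the patch below and showing only the T3-rtx branch of the lower_cwnd path (the full versions, with all reasons handled, live in net/sctp/congestion.c in the patch):

/* Sketch only: how the raise/lower entry points delegate to a plugin. */
static void raise_cwnd(struct sctp_transport *t,
		       __u32 sack_ctsn, __u32 bytes_acked)
{
	/* Replaces sctp_transport_raise_cwnd() almost one-to-one. */
	t->ca_ops->cong_avoid(t, sack_ctsn, bytes_acked);
}

static void lower_cwnd_t3_rtx(struct sctp_transport *t)
{
	/* Tell the plugin how congestion was detected, then let it
	 * pick the new slow start threshold. */
	sctp_set_ca_state(t, SCTP_CA_Loss);	/* calls ca_ops->set_state() */
	t->ssthresh = t->ca_ops->ssthresh(t);
	t->cwnd = t->asoc->pathmtu;		/* RFC 2960: cwnd = 1*MTU */
}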

What needs to be done to LKSCTP?

After the TCP plugins have been imported:

Results

In LKSCTP, the receiver window has a default size of 65,535 bytes, as defined by SCTP_DEFAULT_MAXWINDOW. This value is also used to initialize ssthresh. To make sure this value wasn't the limiting factor, and to allow comparison with unpatched versions of Linux, I capped the bandwidth at a very low limit.
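
(For scale: a 65,535-byte window caps throughput at roughly rwnd / RTT. Even at an assumed round-trip time of 200 ms, which is my assumption rather than a figure from the test setup, that works out to 65,535 B / 0.2 s ≈ 320 KB/s, far above the rates measured below, so capping the link well under this keeps the window from being the deciding factor.)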

The test consisted of transmitting 1,000 chunks of 1,452 bytes each. Timing information was extracted from the event log retrieved from the kernel where possible. The traffic went through an intermediate host to add delay and limit bandwidth. Variations in execution time were on the order of 1,000 ms. The table below is sorted from slowest to fastest.

Implementation               Time (ms)   Bytes transmitted     CWND        CWND
                                         (incl. retransmits)   increases   decreases
SCTP: Unpatched (v2.6.35)      731,213
SCTP: Reno                     148,915   1,751,324                    33          22
SCTP: Unpatched (v2.6.38)      147,600
SCTP: Hybla                    139,698   1,406,988                    35          25
TCP: Reno                      121,322                               363         365
SCTP: Cubic                    107,246   2,065,476                     9           6
TCP: Cubic                      81,700                               285         286

[Figures: the first 30 seconds of each transfer. Y = bytes, X = seconds.]

Interpretation of the results

In the unpatched Linux kernel (v2.6.38), the transfer rate for SCTP was 9.60 KB/s. Using the CUBIC congestion control plugin, a transfer rate of 13.22 KB/s was attained; a 38% improvement.
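
These figures follow from the payload size and the times in the table: 1,000 chunks × 1,452 bytes = 1,452,000 bytes, so 1,452,000 B / 147.6 s ≈ 9,840 B/s ≈ 9.60 KiB/s for the unpatched v2.6.38 kernel, and 1,452,000 B / 107.2 s ≈ 13,540 B/s ≈ 13.22 KiB/s for CUBIC (so "KB" here means 1,024 bytes); 13.22 / 9.60 ≈ 1.38.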

The terrible performance of SCTP in Linux v2.6.35 is caused by a bug introduced into the kernel's implementation of abs() in 2009. This was fixed in January 2011.

The emergent behavior of otherwise similar plugins turned out to be very different: while the SCTP CUBIC plugin quickly stabilizes, the corresponding TCP plugin oscillates for the duration of the transfer. The TCP Reno plugin oscillates similarly, while the corresponding SCTP plugin repeats a pattern of unsustainable behavior.

I suspect the odd behavior of the SCTP Reno and Hybla plugins is caused by poor interaction with fast retransmit.

The patch

This patch includes event tracking code (the "mortehu_debug_" functions and fields) and a couple of shortcuts, neither of which should be included in any final submission to Linux. Only the Reno, CUBIC, HTCP and Hybla plugins have been imported.

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index e64f4c6..1f1d935 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -289,6 +289,28 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
return (struct tcp_request_sock *)req;
}
Data structures for logging important events in the TCP code.
+#define MORTEHU_DEBUG_MAX 1024
+
+enum tcp_mortehu_debug_event_type
+{
+ mortehu_tcp_state = 1,
+ mortehu_tcp_acked = 2,
+ mortehu_tcp_new_cwnd = 3,
+ mortehu_tcp_congestion = 4,
+ mortehu_tcp_sack = 5,
+ mortehu_tcp_tx_packet = 6,
+ mortehu_tcp_tx_chunk = 7,
+ mortehu_tcp_gap_ack = 8,
+ mortehu_tcp_ca_state = 9
+};
+
+struct tcp_mortehu_debug_event
+{
+ __u32 time;
+ __u32 type;
+ __u32 params[4];
+};
+
struct tcp_sock {
/* inet_connection_sock has to be the first member of tcp_sock */
struct inet_connection_sock inet_conn;
@@ -460,6 +482,9 @@ struct tcp_sock {
* contains related tcp_cookie_transactions fields.
*/
struct tcp_cookie_values *cookie_values;
+
+ struct tcp_mortehu_debug_event *mortehu_debug;
+ __u32 mortehu_debug_count;
};
static inline struct tcp_sock *tcp_sk(const struct sock *sk)
@@ -467,6 +492,29 @@ static inline struct tcp_sock *tcp_sk(const struct sock *sk)
return (struct tcp_sock *)sk;
}
+static inline void mortehu_tcp_debug_add(struct sock *sk, __u32 type, __u32 p0, __u32 p1, __u32 p2, __u32 p3)
+{
+ __u32 i;
+ struct tcp_sock *sock;
+
+ sock = tcp_sk(sk);
+
+ if (!sock || !sock->mortehu_debug)
+ return;
+
+ if (sock->mortehu_debug_count == MORTEHU_DEBUG_MAX)
+ return;
+
+ i = sock->mortehu_debug_count++;
+
+ sock->mortehu_debug[i].time = jiffies;
+ sock->mortehu_debug[i].type = type;
+ sock->mortehu_debug[i].params[0] = p0;
+ sock->mortehu_debug[i].params[1] = p1;
+ sock->mortehu_debug[i].params[2] = p2;
+ sock->mortehu_debug[i].params[3] = p3;
+}
+
struct tcp_timewait_sock {
struct inet_timewait_sock tw_sk;
u32 tw_rcv_nxt;
diff --git a/include/net/sctp/congestion.h b/include/net/sctp/congestion.h
new file mode 100644
index 0000000..cc8c195
--- /dev/null
+++ b/include/net/sctp/congestion.h
@@ -0,0 +1,93 @@
The all-new congestion control interface, mostly copied from include/net/tcp.h
+#ifndef __sctp_congestion_h__
+#define __sctp_congestion_h__
+
+#include <net/sctp/sctp.h>
+
+#define sctp_time_stamp ((__u32)(jiffies))
+
+/* Events passed to congestion control interface */
+enum sctp_ca_event {
+ CA_EVENT_TX_START, /* first transmit when no packets in flight */
+ CA_EVENT_CWND_RESTART, /* congestion window restart */
+ CA_EVENT_COMPLETE_CWR, /* end of congestion recovery */
+ CA_EVENT_FRTO, /* fast recovery timeout */
+ CA_EVENT_LOSS, /* loss timeout */
+ CA_EVENT_FAST_ACK, /* in sequence ack */
+ CA_EVENT_SLOW_ACK, /* other ack */
+};
+
+/*
+ * Interface for adding new SCTP congestion control handlers
+ */
+#define SCTP_CA_NAME_MAX 16
+#define SCTP_CA_MAX 128
+#define SCTP_CA_BUF_MAX (SCTP_CA_NAME_MAX*SCTP_CA_MAX)
+
+#define SCTP_CONG_NON_RESTRICTED 0x1
+#define SCTP_CONG_RTT_STAMP 0x2
+
+struct sctp_congestion_ops {
+ struct list_head list;
+ unsigned long flags;
+
+ /* initialize private data (optional) */
+ void (*init)(struct sctp_transport *tp);
+ /* cleanup private data (optional) */
+ void (*release)(struct sctp_transport *tp);
+
+ /* return slow start threshold (required) */
+ u32 (*ssthresh)(struct sctp_transport *tp);
+ /* lower bound for congestion window (optional) */
+ u32 (*min_cwnd)(const struct sctp_transport *tp);
+ /* do new cwnd calculation (required) */
+ void (*cong_avoid)(struct sctp_transport *tp, __u32 sack_ctsn, __u32 bytes_acked);
+ /* call before changing ca_state (optional) */
+ void (*set_state)(struct sctp_transport *tp, u8 new_state);
+ /* call when cwnd event occurs (optional) */
+ void (*cwnd_event)(struct sctp_transport *tp, enum sctp_ca_event ev);
+ /* new value of cwnd after loss (optional) */
+ u32 (*undo_cwnd)(struct sctp_transport *tp);
+ /* hook for packet ack accounting (optional) */
+ void (*pkts_acked)(struct sctp_transport *tp, u32 num_acked, s32 rtt_us);
+ /* get info for inet_diag (optional) */
+ void (*get_info)(struct sctp_transport *tp, u32 ext, struct sk_buff *skb);
+
+ char name[SCTP_CA_NAME_MAX];
+ struct module *owner;
+};
+
+int sctp_register_congestion_control(struct sctp_congestion_ops *type);
+void sctp_unregister_congestion_control(struct sctp_congestion_ops *type);
+
+void sctp_init_congestion_control(struct sctp_transport *tp);
+void sctp_cleanup_congestion_control(struct sctp_transport *tp);
+int sctp_set_default_congestion_control(const char *name);
+void sctp_get_default_congestion_control(char *name);
+void sctp_get_available_congestion_control(char *buf, size_t len);
+void sctp_get_allowed_congestion_control(char *buf, size_t len);
+int sctp_set_allowed_congestion_control(char *allowed);
+int sctp_set_congestion_control(struct sctp_transport *tp, const char *name);
+void sctp_slow_start(struct sctp_transport *transport, __u32 bytes_acked);
+void sctp_cong_avoid_ai(struct sctp_transport *transport, u32 w);
+void sctp_cong_raise_cwnd(struct sctp_transport *transport, __u32 sack_ctsn, __u32 bytes_acked);
+void sctp_cong_lower_cwnd(struct sctp_transport *transport, sctp_lower_cwnd_t reason);
+
+extern struct sctp_congestion_ops sctp_init_congestion_ops;
+extern struct sctp_congestion_ops sctp_reno;
+
+u32 sctp_reno_ssthresh(struct sctp_transport *tp);
+u32 sctp_reno_min_cwnd(const struct sctp_transport *tp);
+void sctp_reno_cong_avoid(struct sctp_transport *sk, __u32 sack_ctsn, __u32 bytes_acked);
+void sctp_reno_cong_detected(struct sctp_transport *sk, sctp_lower_cwnd_t reason);
+int sctp_is_cwnd_limited(const struct sctp_transport *tp);
+
+static inline void sctp_set_ca_state(struct sctp_transport *tp, const u8 ca_state)
+{
+ if (ca_state != tp->ca_state)
+ mortehu_debug_add (tp->asoc->base.sk, mortehu_ca_state, ca_state, 0, 0, 0);
+ if (tp->ca_ops->set_state)
+ tp->ca_ops->set_state(tp, ca_state);
+ tp->ca_state = ca_state;
+}
+
+#endif /* __sctp_congestion_h__ */
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index cc9185c..ebdb664 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -115,7 +115,6 @@ struct sctp_hashbucket {
struct hlist_head chain;
} __attribute__((__aligned__(8)));
-
/* The SCTP globals structure. */
extern struct sctp_globals {
/* RFC2960 Section 14. Suggested SCTP Protocol Parameter Values
@@ -272,6 +271,28 @@ extern struct sctp_globals {
#define sctp_checksum_disable (sctp_globals.checksum_disable)
#define sctp_rwnd_upd_shift (sctp_globals.rwnd_update_shift)
Data structures for logging important events in the SCTP code.
+#define MORTEHU_DEBUG_MAX 1024
+
+enum sctp_mortehu_debug_event_type
+{
+ mortehu_state = 1,
+ mortehu_acked = 2,
+ mortehu_new_cwnd = 3,
+ mortehu_congestion = 4,
+ mortehu_sack = 5,
+ mortehu_tx_packet = 6,
+ mortehu_tx_chunk = 7,
+ mortehu_gap_ack = 8,
+ mortehu_ca_state = 9
+};
+
+struct sctp_mortehu_debug_event
+{
+ __u32 time;
+ __u32 type;
+ __u32 params[4];
+};
+
/* SCTP Socket type: UDP or TCP style. */
typedef enum {
SCTP_SOCKET_UDP = 0,
@@ -341,6 +362,9 @@ struct sctp_sock {
atomic_t pd_mode;
/* Receive to here while partial delivery is in effect. */
struct sk_buff_head pd_lobby;
+
+ struct sctp_mortehu_debug_event *mortehu_debug;
+ __u32 mortehu_debug_count;
};
static inline struct sctp_sock *sctp_sk(const struct sock *sk)
@@ -353,6 +377,26 @@ static inline struct sock *sctp_opt2sk(const struct sctp_sock *sp)
return (struct sock *)sp;
}
+static inline void mortehu_debug_add(struct sock *sk, __u32 type, __u32 p0, __u32 p1, __u32 p2, __u32 p3)
+{
+ __u32 i;
+ struct sctp_sock *sock;
+
+ sock = sctp_sk(sk);
+
+ if (sock->mortehu_debug_count == MORTEHU_DEBUG_MAX)
+ return;
+
+ i = sock->mortehu_debug_count++;
+
+ sock->mortehu_debug[i].time = jiffies;
+ sock->mortehu_debug[i].type = type;
+ sock->mortehu_debug[i].params[0] = p0;
+ sock->mortehu_debug[i].params[1] = p1;
+ sock->mortehu_debug[i].params[2] = p2;
+ sock->mortehu_debug[i].params[3] = p3;
+}
+
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
struct sctp6_sock {
struct sctp_sock sctp;
@@ -848,6 +892,19 @@ static inline int sctp_packet_empty(struct sctp_packet *packet)
return packet->size == packet->overhead;
}
+enum sctp_ca_state {
+ SCTP_CA_Open = 0,
+#define SCTPF_CA_Open (1<<SCTP_CA_Open)
+ SCTP_CA_Disorder = 1,
+#define SCTPF_CA_Disorder (1<<SCTP_CA_Disorder)
+ SCTP_CA_CWR = 2,
+#define SCTPF_CA_CWR (1<<SCTP_CA_CWR)
+ SCTP_CA_Recovery = 3,
+#define SCTPF_CA_Recovery (1<<SCTP_CA_Recovery)
+ SCTP_CA_Loss = 4
+#define SCTPF_CA_Loss (1<<SCTP_CA_Loss)
+};
+
/* This represents a remote transport address.
* For local transport addresses, we just use union sctp_addr.
*
@@ -1053,8 +1110,22 @@ struct sctp_transport {
/* 64-bit random number sent with heartbeat. */
__u64 hb_nonce;
Additions to the transport structure to hold transient congestion control data.
+
+ /* Congestion avoidance operations. */
+ struct sctp_congestion_ops *ca_ops;
+
+ /* Congestion avoidance state. */
+ enum sctp_ca_state ca_state;
+
+ /* Congestion avoidance private data. */
+ u32 ca_priv[16];
};
+static inline void *sctp_ca(const struct sctp_transport *tp)
+{
+ return (void *)tp->ca_priv;
+}
+
struct sctp_transport *sctp_transport_new(const union sctp_addr *,
gfp_t);
void sctp_transport_set_owner(struct sctp_transport *,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cda30ea..0cc3a9e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -712,6 +712,8 @@ static inline void tcp_set_ca_state(struct sock *sk, const u8 ca_state)
{
struct inet_connection_sock *icsk = inet_csk(sk);
+ if (ca_state != icsk->icsk_ca_state)
+ mortehu_tcp_debug_add (sk, mortehu_tcp_ca_state, ca_state, 0, 0, 0);
if (icsk->icsk_ca_ops->set_state)
icsk->icsk_ca_ops->set_state(sk, ca_state);
icsk->icsk_ca_state = ca_state;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b22d450..710def8 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2618,6 +2618,19 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
case TCP_USER_TIMEOUT:
val = jiffies_to_msecs(icsk->icsk_user_timeout);
break;
The interface through which user space applications can fetch the TCP debug event log.
+
+ case 0x31337:
+ {
+ if (copy_to_user(optval, tp->mortehu_debug, tp->mortehu_debug_count * sizeof(*tp->mortehu_debug)))
+ return -EFAULT;
+
+ val = tp->mortehu_debug_count;
+ tp->mortehu_debug_count = 0;
+
+ return val;
+ }
+ break;
+
default:
return -ENOPROTOOPT;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index da782e7..87f421c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3697,6 +3697,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
if (before(tp->frto_highmark, tp->snd_una))
tp->frto_highmark = 0;
+ mortehu_tcp_debug_add (sk, mortehu_tcp_acked, 0, ack, 0, 0);
+
if (tcp_ack_is_dubious(sk, flag)) {
/* Advance CWND, if state allows this. */
if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
@@ -3709,6 +3711,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
tcp_cong_avoid(sk, ack, prior_in_flight);
}
+ mortehu_tcp_debug_add (sk, mortehu_tcp_new_cwnd, tcp_sk(sk)->snd_cwnd, tcp_sk(sk)->snd_ssthresh, tcp_sk(sk)->bytes_acked, 0);
+
if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
dst_confirm(__sk_dst_get(sk));
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f7e6c2c..3ad7453 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1830,6 +1833,9 @@ static int tcp_v4_init_sock(struct sock *sk)
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
+ tp->mortehu_debug = kmalloc(sizeof(*tp->mortehu_debug) * MORTEHU_DEBUG_MAX, GFP_KERNEL);
+ tp->mortehu_debug_count = 0;
+
skb_queue_head_init(&tp->out_of_order_queue);
tcp_init_xmit_timers(sk);
tcp_prequeue_init(tp);
@@ -1885,6 +1891,8 @@ static int tcp_v4_init_sock(struct sock *sk)
percpu_counter_inc(&tcp_sockets_allocated);
local_bh_enable();
+ mortehu_tcp_debug_add (sk, mortehu_tcp_new_cwnd, tp->snd_cwnd, tp->snd_ssthresh, tp->bytes_acked, 0);
+
return 0;
}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 2b0c186..2e0e9e3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1950,6 +1950,9 @@ static int tcp_v6_init_sock(struct sock *sk)
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
+ tp->mortehu_debug = kmalloc(sizeof(*tp->mortehu_debug) * MORTEHU_DEBUG_MAX, GFP_KERNEL);
+ tp->mortehu_debug_count = 0;
+
skb_queue_head_init(&tp->out_of_order_queue);
tcp_init_xmit_timers(sk);
tcp_prequeue_init(tp);
diff --git a/net/sctp/Kconfig b/net/sctp/Kconfig
index 126b014..7e33306 100644
--- a/net/sctp/Kconfig
+++ b/net/sctp/Kconfig
@@ -97,4 +97,81 @@ config SCTP_HMAC_MD5
endchoice
New configuration options for the build process, to allow users to choose what plugins to compile, and which one should be the default.
+menuconfig SCTP_CONG_ADVANCED
+ bool "SCTP: advanced congestion control"
+ ---help---
+ Support for selection of various SCTP congestion control
+ modules.
+
+ Nearly all users can safely say no here, and a safe default
+ selection will be made.
+
+ If unsure, say N.
+
+if SCTP_CONG_ADVANCED
+
+config SCTP_CONG_CUBIC
+ tristate "CUBIC SCTP"
+ default y
+ ---help---
+ This is a port of version 2.0 of BIC-TCP which uses a cubic growth
+ function among other techniques.
+
+config SCTP_CONG_HYBLA
+ tristate "HYBLA SCTP"
+ default y
+ ---help---
+ HYBLA congestion control.
+
+config SCTP_CONG_HTCP
+ tristate "HTCP SCTP"
+ default y
+ ---help---
+ HTCP congestion control.
+
+choice
+ prompt "Default SCTP congestion control"
+ default DEFAULT_SCTP_CUBIC
+ help
+ Select the SCTP congestion control that will be used by default
+ for all connections.
+
+ config DEFAULT_SCTP_CUBIC
+ bool "Cubic" if SCTP_CONG_CUBIC=y
+
+ config DEFAULT_SCTP_HYBLA
+ bool "Hybla" if SCTP_CONG_HYBLA=y
+
+ config DEFAULT_SCTP_HTCP
+ bool "HTCP" if SCTP_CONG_HTCP=y
+
+ config DEFAULT_SCTP_RENO
+ bool "Reno"
+
+endchoice
+
+endif # SCTP_CONG_ADVANCED
+
+config SCTP_CONG_CUBIC
+ tristate
+ depends on !SCTP_CONG_ADVANCED
+ default y
+
+config SCTP_CONG_HYBLA
+ tristate
+ depends on !SCTP_CONG_ADVANCED
+ default y
+
+config SCTP_CONG_HTCP
+ tristate
+ depends on !SCTP_CONG_ADVANCED
+ default y
+
+config DEFAULT_SCTP_CONG
+ string
+ default "cubic" if DEFAULT_SCTP_CUBIC
+ default "hybla" if DEFAULT_SCTP_HYBLA
+ default "htcp" if DEFAULT_SCTP_HTCP
+ default "reno"
+
endif # IP_SCTP
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 5c30b7a..7930673 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -10,7 +10,8 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
transport.o chunk.o sm_make_chunk.o ulpevent.o \
inqueue.o outqueue.o ulpqueue.o command.o \
tsnmap.o bind_addr.o socket.o primitive.o \
- output.o input.o debug.o ssnmap.o auth.o
+ output.o input.o debug.o ssnmap.o auth.o \
+ congestion.o
sctp_probe-y := probe.o
@@ -19,3 +20,7 @@ sctp-$(CONFIG_PROC_FS) += proc.o
sctp-$(CONFIG_SYSCTL) += sysctl.o
sctp-$(subst m,y,$(CONFIG_IPV6)) += ipv6.o
+
+obj-$(CONFIG_SCTP_CONG_CUBIC) += cong_cubic.o
+obj-$(CONFIG_SCTP_CONG_HYBLA) += cong_hybla.o
+obj-$(CONFIG_SCTP_CONG_HTCP) += cong_htcp.o
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 6b04287..b00a9fb 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -60,6 +60,7 @@
#include <net/ipv6.h>
#include <net/sctp/sctp.h>
#include <net/sctp/sm.h>
+#include <net/sctp/congestion.h>
/* Forward declarations for internal functions. */
static void sctp_assoc_bh_rcv(struct work_struct *work);
@@ -766,6 +767,9 @@ struct sctp_transport *sctp_assoc_add_peer(struct sctp_association *asoc,
asoc->peer.retran_path = peer;
}
+ peer->ca_ops = &sctp_init_congestion_ops;
+ sctp_init_congestion_control(peer);
+
return peer;
}
diff --git a/net/sctp/cong_cubic.c b/net/sctp/cong_cubic.c
new file mode 100644
index 0000000..010cf76
--- /dev/null
+++ b/net/sctp/cong_cubic.c
@@ -0,0 +1,454 @@
+/*
+ * Port of the CUBIC congestion control algorithm from TCP to SCTP.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/math64.h>
+#include <net/sctp/sctp.h>
+#include <net/sctp/sm.h>
+#include <net/sctp/congestion.h>
+
+#define BICSCTP_BETA_SCALE 1024 /* Scale factor beta calculation
+ * max_cwnd = cwnd * beta
+ */
+#define BICSCTP_HZ 10 /* BIC HZ 2^10 = 1024 */
+
+/* Two methods of hybrid slow start */
+#define HYSTART_ACK_TRAIN 0x1
+#define HYSTART_DELAY 0x2
+
+/* Number of delay samples for detecting the increase of delay */
+#define HYSTART_MIN_SAMPLES 8
+#define HYSTART_DELAY_MIN (2U<<3)
+#define HYSTART_DELAY_MAX (16U<<3)
+#define HYSTART_DELAY_THRESH(x) clamp(x, HYSTART_DELAY_MIN, HYSTART_DELAY_MAX)
+
+static int fast_convergence __read_mostly = 1;
+static int beta __read_mostly = 717; /* = 717/1024 (BICSCTP_BETA_SCALE) */
+static int initial_ssthresh __read_mostly;
+static int bic_scale __read_mostly = 41;
+static int tcp_friendliness __read_mostly = 1;
+
+static int hystart __read_mostly = 1;
+static int hystart_detect __read_mostly = HYSTART_ACK_TRAIN | HYSTART_DELAY;
+static int hystart_low_window __read_mostly = 16;
+
+static u32 cube_rtt_scale __read_mostly;
+static u32 beta_scale __read_mostly;
+static u64 cube_factor __read_mostly;
+
+/* Note parameters that are used for precomputing scale factors are read-only */
+module_param(fast_convergence, int, 0644);
+MODULE_PARM_DESC(fast_convergence, "turn on/off fast convergence");
+module_param(beta, int, 0644);
+MODULE_PARM_DESC(beta, "beta for multiplicative increase");
+module_param(initial_ssthresh, int, 0644);
+MODULE_PARM_DESC(initial_ssthresh, "initial value of slow start threshold");
+module_param(bic_scale, int, 0444);
+MODULE_PARM_DESC(bic_scale, "scale (scaled by 1024) value for bic function (bic_scale/1024)");
+module_param(tcp_friendliness, int, 0644);
+MODULE_PARM_DESC(tcp_friendliness, "turn on/off tcp friendliness");
+module_param(hystart, int, 0644);
+MODULE_PARM_DESC(hystart, "turn on/off hybrid slow start algorithm");
+module_param(hystart_detect, int, 0644);
+MODULE_PARM_DESC(hystart_detect, "hybrid slow start detection mechanisms"
+ " 1: packet-train 2: delay 3: both packet-train and delay");
+module_param(hystart_low_window, int, 0644);
+MODULE_PARM_DESC(hystart_low_window, "lower bound cwnd for hybrid slow start");
+
+/* BIC SCTP Parameters */
+struct bicsctp {
+ u32 cnt; /* increase cwnd by 1 after ACKs */
+ u32 last_max_cwnd; /* last maximum cwnd */
+ u32 loss_cwnd; /* congestion window at last loss */
+ u32 last_cwnd; /* the last cwnd */
+ u32 last_time; /* time when updated last_cwnd */
+ u32 bic_origin_point;/* origin point of bic function */
+ u32 bic_K; /* time to origin point from the beginning of the current epoch */
+ u32 delay_min; /* min delay */
+ u32 epoch_start; /* beginning of an epoch */
+ u32 ack_cnt; /* number of acks */
+ u32 sctp_cwnd; /* estimated sctp cwnd */
+#define ACK_RATIO_SHIFT 4
+ u16 delayed_ack; /* estimate the ratio of Packets/ACKs << 4 */
+ u8 sample_cnt; /* number of samples to decide curr_rtt */
+ u8 found; /* the exit point is found? */
+ u32 round_start; /* beginning of each round */
+ u32 end_seq; /* end_seq of the round */
+ u32 last_jiffies; /* last time when the ACK spacing is close */
+ u32 curr_rtt; /* the minimum rtt of current round */
+};
+
+static inline void bicsctp_reset(struct bicsctp *ca)
+{
+ ca->cnt = 0;
+ ca->last_max_cwnd = 0;
+ ca->loss_cwnd = 0;
+ ca->last_cwnd = 0;
+ ca->last_time = 0;
+ ca->bic_origin_point = 0;
+ ca->bic_K = 0;
+ ca->delay_min = 0;
+ ca->epoch_start = 0;
+ ca->delayed_ack = 2 << ACK_RATIO_SHIFT;
+ ca->ack_cnt = 0;
+ ca->sctp_cwnd = 0;
+ ca->found = 0;
+}
+
+static inline void bicsctp_hystart_reset(struct sctp_transport *tp)
+{
+ struct bicsctp *ca = sctp_ca(tp);
+
+ ca->round_start = ca->last_jiffies = jiffies;
+ ca->end_seq = tp->asoc->ctsn_ack_point; /*tp->snd_nxt;*/
+ ca->curr_rtt = 0;
+ ca->sample_cnt = 0;
+}
+
+static void bicsctp_init(struct sctp_transport *tp)
+{
+ struct bicsctp *ca = sctp_ca(tp);
+
+ if (!ca)
+ printk(KERN_EMERG "ca is NULL\n");
+ if (!tp->asoc)
+ printk(KERN_EMERG "tp->asoc is NULL\n");
+
+ bicsctp_reset(ca);
+
+ if (hystart)
+ bicsctp_hystart_reset(tp);
+
+ if (!hystart && initial_ssthresh)
+ tp->ssthresh = initial_ssthresh;
+}
+
+/* calculate the cubic root of x using a table lookup followed by one
+ * Newton-Raphson iteration.
+ * Avg err ~= 0.195%
+ */
+static u32 cubic_root(u64 a)
+{
+ u32 x, b, shift;
+ /*
+ * cbrt(x) MSB values for x MSB values in [0..63].
+ * Precomputed then refined by hand - Willy Tarreau
+ *
+ * For x in [0..63],
+ * v = cbrt(x << 18) - 1
+ * cbrt(x) = (v[x] + 10) >> 6
+ */
+ static const u8 v[] = {
+ /* 0x00 */ 0, 54, 54, 54, 118, 118, 118, 118,
+ /* 0x08 */ 123, 129, 134, 138, 143, 147, 151, 156,
+ /* 0x10 */ 157, 161, 164, 168, 170, 173, 176, 179,
+ /* 0x18 */ 181, 185, 187, 190, 192, 194, 197, 199,
+ /* 0x20 */ 200, 202, 204, 206, 209, 211, 213, 215,
+ /* 0x28 */ 217, 219, 221, 222, 224, 225, 227, 229,
+ /* 0x30 */ 231, 232, 234, 236, 237, 239, 240, 242,
+ /* 0x38 */ 244, 245, 246, 248, 250, 251, 252, 254,
+ };
+
+ b = fls64(a);
+ if (b < 7) {
+ /* a in [0..63] */
+ return ((u32)v[(u32)a] + 35) >> 6;
+ }
+
+ b = ((b * 84) >> 8) - 1;
+ shift = (a >> (b * 3));
+
+ x = ((u32)(((u32)v[shift] + 10) << b)) >> 6;
+
+ /*
+ * Newton-Raphson iteration
+ * 2
+ * x = ( 2 * x + a / x ) / 3
+ * k+1 k k
+ */
+ x = (2 * x + (u32)div64_u64(a, (u64)x * (u64)(x - 1)));
+ x = ((x * 341) >> 10);
+ return x;
+}
+
+/*
+ * Compute congestion window to use.
+ */
+static inline void bicsctp_update(struct bicsctp *ca, u32 cwnd)
+{
+ u64 offs;
+ u32 delta, t, bic_target, max_cnt;
+
+ ca->ack_cnt++; /* count the number of ACKs */
+
+ if (ca->last_cwnd == cwnd &&
+ (s32)(sctp_time_stamp - ca->last_time) <= HZ / 32)
+ return;
+
+ ca->last_cwnd = cwnd;
+ ca->last_time = sctp_time_stamp;
+
+ if (ca->epoch_start == 0) {
+ ca->epoch_start = sctp_time_stamp; /* record the beginning of an epoch */
+ ca->ack_cnt = 1; /* start counting */
+ ca->sctp_cwnd = cwnd; /* syn with cubic */
+
+ if (ca->last_max_cwnd <= cwnd) {
+ ca->bic_K = 0;
+ ca->bic_origin_point = cwnd;
+ } else {
+ /* Compute new K based on
+ * (wmax-cwnd) * (srtt>>3 / HZ) / c * 2^(3*BICSCTP_HZ)
+ */
+ ca->bic_K = cubic_root(cube_factor
+ * (ca->last_max_cwnd - cwnd));
+ ca->bic_origin_point = ca->last_max_cwnd;
+ }
+ }
+
+ /* cubic function - calc*/
+ /* calculate c * time^3 / rtt,
+ * while considering overflow in calculation of time^3
+ * (so time^3 is done by using 64 bit)
+ * and without the support of division of 64bit numbers
+ * (so all divisions are done by using 32 bit)
+ * also NOTE the unit of those variables
+ * time = (t - K) / 2^BICSCTP_HZ
+ * c = bic_scale >> 10
+ * rtt = (srtt >> 3) / HZ
+ * !!! The following code does not have overflow problems,
+ * if the cwnd < 1 million packets !!!
+ */
+
+ /* change the unit from HZ to BICSCTP_HZ */
+ t = ((sctp_time_stamp + (ca->delay_min>>3) - ca->epoch_start)
+ << BICSCTP_HZ) / HZ;
+
+ if (t < ca->bic_K) /* t - K */
+ offs = ca->bic_K - t;
+ else
+ offs = t - ca->bic_K;
+
+ /* c/rtt * (t-K)^3 */
+ delta = (cube_rtt_scale * offs * offs * offs) >> (10+3*BICSCTP_HZ);
+ if (t < ca->bic_K) /* below origin*/
+ bic_target = ca->bic_origin_point - delta;
+ else /* above origin*/
+ bic_target = ca->bic_origin_point + delta;
+
+ /* cubic function - calc bicsctp_cnt*/
+ if (bic_target > cwnd) {
+ ca->cnt = cwnd / (bic_target - cwnd);
+ } else {
+ ca->cnt = 100 * cwnd; /* very small increment*/
+ }
+
+ /* TCP Friendly */
+ if (tcp_friendliness) {
+ u32 scale = beta_scale;
+ delta = (cwnd * scale) >> 3;
+ while (ca->ack_cnt > delta) { /* update sctp cwnd */
+ ca->ack_cnt -= delta;
+ ca->sctp_cwnd++;
+ }
+
+ if (ca->sctp_cwnd > cwnd){ /* if bic is slower than sctp */
+ delta = ca->sctp_cwnd - cwnd;
+ max_cnt = cwnd / delta;
+ if (ca->cnt > max_cnt)
+ ca->cnt = max_cnt;
+ }
+ }
+
+ ca->cnt = (ca->cnt << ACK_RATIO_SHIFT) / ca->delayed_ack;
+ if (ca->cnt == 0) /* cannot be zero */
+ ca->cnt = 1;
+}
+
+/* static void bicsctp_cong_avoid(struct sctp_transport *sk, u32 ack, u32 in_flight) */
+static void bicsctp_cong_avoid(struct sctp_transport *transport,
+ __u32 sack_ctsn, __u32 bytes_acked)
+{
+ struct bicsctp *ca = sctp_ca(transport);
+ __u32 cwnd, ssthresh;
+
+ if (!sctp_is_cwnd_limited(transport))
+ return;
+
+ cwnd = transport->cwnd;
+ ssthresh = transport->ssthresh;
+
+ if (cwnd <= ssthresh) {
+ /* if (hystart && after(ack, ca->end_seq)) */
+ if (hystart && !TSN_lte(sack_ctsn, ca->end_seq))
+ bicsctp_hystart_reset(transport);
+ sctp_slow_start(transport, bytes_acked);
+ } else {
+ bicsctp_update(ca, cwnd);
+ sctp_cong_avoid_ai(transport, bytes_acked);
+ }
+
+}
+
+static u32 bicsctp_recalc_ssthresh(struct sctp_transport *tp)
+{
+ struct bicsctp *ca = sctp_ca(tp);
+ struct sctp_association *asoc = tp->asoc;
+
+ ca->epoch_start = 0; /* end of epoch */
+
+ /* Wmax and fast convergence */
+ if (tp->cwnd < ca->last_max_cwnd && fast_convergence)
+ ca->last_max_cwnd = (tp->cwnd * (BICSCTP_BETA_SCALE + beta))
+ / (2 * BICSCTP_BETA_SCALE);
+ else
+ ca->last_max_cwnd = tp->cwnd;
+
+ ca->loss_cwnd = tp->cwnd;
+
+ return max((tp->cwnd * beta) / BICSCTP_BETA_SCALE, 2 * asoc->pathmtu);
+}
+
+static u32 bicsctp_undo_cwnd(struct sctp_transport *tp)
+{
+ struct bicsctp *ca = sctp_ca(tp);
+
+ return max(tp->cwnd, ca->last_max_cwnd);
+}
+
+static void bicsctp_state(struct sctp_transport *tp, u8 new_state)
+{
+ if (new_state == SCTP_CA_Loss) {
+ bicsctp_reset(sctp_ca(tp));
+ bicsctp_hystart_reset(tp);
+ }
+}
+
+static void hystart_update(struct sctp_transport *tp, u32 delay)
+{
+ struct bicsctp *ca = sctp_ca(tp);
+
+ if (!(ca->found & hystart_detect)) {
+ u32 curr_jiffies = jiffies;
+
+ /* first detection parameter - ack-train detection */
+ if (curr_jiffies - ca->last_jiffies <= msecs_to_jiffies(2)) {
+ ca->last_jiffies = curr_jiffies;
+ if (curr_jiffies - ca->round_start >= ca->delay_min>>4)
+ ca->found |= HYSTART_ACK_TRAIN;
+ }
+
+ /* obtain the minimum delay of more than sampling packets */
+ if (ca->sample_cnt < HYSTART_MIN_SAMPLES) {
+ if (ca->curr_rtt == 0 || ca->curr_rtt > delay)
+ ca->curr_rtt = delay;
+
+ ca->sample_cnt++;
+ } else {
+ if (ca->curr_rtt > ca->delay_min +
+ HYSTART_DELAY_THRESH(ca->delay_min>>4))
+ ca->found |= HYSTART_DELAY;
+ }
+ /*
+ * Either one of two conditions are met,
+ * we exit from slow start immediately.
+ */
+ if (ca->found & hystart_detect)
+ tp->ssthresh = tp->cwnd;
+ }
+}
+
+/* Track delayed acknowledgment ratio using sliding window
+ * ratio = (15*ratio + sample) / 16
+ */
+static void bicsctp_acked(struct sctp_transport *tp, u32 cnt, s32 rtt_us)
+{
+ struct bicsctp *ca = sctp_ca(tp);
+ u32 delay;
+
+ if (tp->ca_state == SCTP_CA_Open) {
+ cnt -= ca->delayed_ack >> ACK_RATIO_SHIFT;
+ ca->delayed_ack += cnt;
+ }
+
+ /* Some calls are for duplicates without timestamps */
+ if (rtt_us < 0)
+ return;
+
+ /* Discard delay samples right after fast recovery */
+ if ((s32)(sctp_time_stamp - ca->epoch_start) < HZ)
+ return;
+
+ delay = usecs_to_jiffies(rtt_us) << 3;
+ if (delay == 0)
+ delay = 1;
+
+ /* first time call or link delay decreases */
+ if (ca->delay_min == 0 || ca->delay_min > delay)
+ ca->delay_min = delay;
+
+ /* hystart triggers when cwnd is larger than some threshold */
+ if (hystart && tp->cwnd <= tp->ssthresh &&
+ tp->cwnd >= hystart_low_window)
+ hystart_update(tp, delay);
+}
+
+static struct sctp_congestion_ops cubicsctp = {
+ .init = bicsctp_init,
+ .ssthresh = bicsctp_recalc_ssthresh,
+ .cong_avoid = bicsctp_cong_avoid,
+ .set_state = bicsctp_state,
+ .undo_cwnd = bicsctp_undo_cwnd,
+ .pkts_acked = bicsctp_acked,
+ .owner = THIS_MODULE,
+ .name = "cubic",
+};
+
+static int __init cubicsctp_register(void)
+{
+ BUILD_BUG_ON(sizeof(struct bicsctp) > ICSK_CA_PRIV_SIZE);
+
+ /* Precompute a bunch of the scaling factors that are used per-packet
+ * based on SRTT of 100ms
+ */
+
+ beta_scale = 8*(BICSCTP_BETA_SCALE+beta)/ 3 / (BICSCTP_BETA_SCALE - beta);
+
+ cube_rtt_scale = (bic_scale * 10); /* 1024*c/rtt */
+
+ /* calculate the "K" for (wmax-cwnd) = c/rtt * K^3
+ * so K = cubic_root( (wmax-cwnd)*rtt/c )
+ * the unit of K is BICSCTP_HZ=2^10, not HZ
+ *
+ * c = bic_scale >> 10
+ * rtt = 100ms
+ *
+ * the following code has been designed and tested for
+ * cwnd < 1 million packets
+ * RTT < 100 seconds
+ * HZ < 1,000,00 (corresponding to 10 nano-second)
+ */
+
+ /* 1/c * 2^2*BICSCTP_HZ * srtt */
+ cube_factor = 1ull << (10+3*BICSCTP_HZ); /* 2^40 */
+
+ /* divide by bic_scale and by constant Srtt (100ms) */
+ do_div(cube_factor, bic_scale * 10);
+
+ return sctp_register_congestion_control(&cubicsctp);
+}
+
+static void __exit cubicsctp_unregister(void)
+{
+ sctp_unregister_congestion_control(&cubicsctp);
+}
+
+module_init(cubicsctp_register);
+module_exit(cubicsctp_unregister);
+
+MODULE_AUTHOR("Morten Hustveit");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("CUBIC SCTP");
+MODULE_VERSION("2.3");
diff --git a/net/sctp/cong_htcp.c b/net/sctp/cong_htcp.c
new file mode 100644
index 0000000..9ea25e6
--- /dev/null
+++ b/net/sctp/cong_htcp.c
@@ -0,0 +1,313 @@
+/*
+ * H-TCP congestion control. The algorithm is detailed in:
+ * R.N.Shorten, D.J.Leith:
+ * "H-TCP: TCP for high-speed and long-distance networks"
+ * Proc. PFLDnet, Argonne, 2004.
+ * http://www.hamilton.ie/net/htcp3.pdf
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <net/sctp/congestion.h>
+
+#define ALPHA_BASE (1<<7) /* 1.0 with shift << 7 */
+#define BETA_MIN (1<<6) /* 0.5 with shift << 7 */
+#define BETA_MAX 102 /* 0.8 with shift << 7 */
+
+static int use_rtt_scaling __read_mostly = 1;
+module_param(use_rtt_scaling, int, 0644);
+MODULE_PARM_DESC(use_rtt_scaling, "turn on/off RTT scaling");
+
+static int use_bandwidth_switch __read_mostly = 1;
+module_param(use_bandwidth_switch, int, 0644);
+MODULE_PARM_DESC(use_bandwidth_switch, "turn on/off bandwidth switcher");
+
+struct htcp {
+ u64 alpha; /* Fixed point arith, << 7 */
+ u16 beta; /* Fixed point arith, << 7 */
+ u8 modeswitch; /* Delay modeswitch
+ until we had at least one congestion event */
+ u16 pkts_acked;
+ u32 packetcount;
+ u32 minRTT;
+ u32 maxRTT;
+ u32 last_cong; /* Time since last congestion event end */
+ u32 undo_last_cong;
+
+ u32 undo_maxRTT;
+ u32 undo_old_maxB;
+
+ /* Bandwidth estimation */
+ u32 minB;
+ u32 maxB;
+ u32 old_maxB;
+ u32 Bi;
+ u32 lasttime;
+};
+
+static inline u32 htcp_cong_time(const struct htcp *ca)
+{
+ return jiffies - ca->last_cong;
+}
+
+static inline u32 htcp_ccount(const struct htcp *ca)
+{
+ return htcp_cong_time(ca) / ca->minRTT;
+}
+
+static inline void htcp_reset(struct htcp *ca)
+{
+ ca->undo_last_cong = ca->last_cong;
+ ca->undo_maxRTT = ca->maxRTT;
+ ca->undo_old_maxB = ca->old_maxB;
+
+ ca->last_cong = jiffies;
+}
+
+static u32 htcp_cwnd_undo(struct sctp_transport *tp)
+{
+ struct htcp *ca = sctp_ca(tp);
+
+ if (ca->undo_last_cong) {
+ ca->last_cong = ca->undo_last_cong;
+ ca->maxRTT = ca->undo_maxRTT;
+ ca->old_maxB = ca->undo_old_maxB;
+ ca->undo_last_cong = 0;
+ }
+
+ return max(tp->cwnd, (tp->ssthresh << 7) / ca->beta);
+}
+
+static inline void measure_rtt(struct sctp_transport *tp, u32 srtt)
+{
+ struct htcp *ca = sctp_ca(tp);
+
+ /* keep track of minimum RTT seen so far, minRTT is zero at first */
+ if (ca->minRTT > srtt || !ca->minRTT)
+ ca->minRTT = srtt;
+
+ /* max RTT */
+ if (tp->ca_state == SCTP_CA_Open) {
+ if (ca->maxRTT < ca->minRTT)
+ ca->maxRTT = ca->minRTT;
+ if (ca->maxRTT < srtt &&
+ srtt <= ca->maxRTT + msecs_to_jiffies(20))
+ ca->maxRTT = srtt;
+ }
+}
+
+static void measure_achieved_throughput(struct sctp_transport *tp, u32 pkts_acked, s32 rtt)
+{
+ struct htcp *ca = sctp_ca(tp);
+ u32 now = jiffies;
+
+ if (tp->ca_state == SCTP_CA_Open)
+ ca->pkts_acked = pkts_acked;
+
+ if (rtt > 0)
+ measure_rtt(tp, usecs_to_jiffies(rtt));
+
+ if (!use_bandwidth_switch)
+ return;
+
+ /* achieved throughput calculations */
+ if (!((1 << tp->ca_state) & (SCTPF_CA_Open | SCTPF_CA_Disorder))) {
+ ca->packetcount = 0;
+ ca->lasttime = now;
+ return;
+ }
+
+ ca->packetcount += pkts_acked;
+
+ if (ca->packetcount >= tp->cwnd - (ca->alpha >> 7 ? : 1) &&
+ now - ca->lasttime >= ca->minRTT &&
+ ca->minRTT > 0) {
+ __u32 cur_Bi = ca->packetcount * HZ / (now - ca->lasttime);
+
+ if (htcp_ccount(ca) <= 3) {
+ /* just after backoff */
+ ca->minB = ca->maxB = ca->Bi = cur_Bi;
+ } else {
+ ca->Bi = (3 * ca->Bi + cur_Bi) / 4;
+ if (ca->Bi > ca->maxB)
+ ca->maxB = ca->Bi;
+ if (ca->minB > ca->maxB)
+ ca->minB = ca->maxB;
+ }
+ ca->packetcount = 0;
+ ca->lasttime = now;
+ }
+}
+
+static inline int between(__u32 seq1, __u32 seq2, __u32 seq3)
+{
+ return seq3 - seq2 >= seq1 - seq2;
+}
+
+static inline void htcp_beta_update(struct htcp *ca, u32 minRTT, u32 maxRTT)
+{
+ if (use_bandwidth_switch) {
+ u32 maxB = ca->maxB;
+ u32 old_maxB = ca->old_maxB;
+ ca->old_maxB = ca->maxB;
+
+ if (!between(5 * maxB, 4 * old_maxB, 6 * old_maxB)) {
+ ca->beta = BETA_MIN;
+ ca->modeswitch = 0;
+ return;
+ }
+ }
+
+ if (ca->modeswitch && minRTT > msecs_to_jiffies(10) && maxRTT) {
+ ca->beta = (minRTT << 7) / maxRTT;
+ if (ca->beta < BETA_MIN)
+ ca->beta = BETA_MIN;
+ else if (ca->beta > BETA_MAX)
+ ca->beta = BETA_MAX;
+ } else {
+ ca->beta = BETA_MIN;
+ ca->modeswitch = 1;
+ }
+}
+
+static inline void htcp_alpha_update(struct htcp *ca)
+{
+ u32 minRTT = ca->minRTT;
+ u32 factor = 1;
+ u32 diff = htcp_cong_time(ca);
+
+ if (diff > HZ) {
+ diff -= HZ;
+ factor = 1 + (10 * diff + ((diff / 2) * (diff / 2) / HZ)) / HZ;
+ }
+
+ if (use_rtt_scaling && minRTT) {
+ u32 scale = (HZ << 3) / (10 * minRTT);
+
+ /* clamping ratio to interval [0.5,10]<<3 */
+ scale = min(max(scale, 1U << 2), 10U << 3);
+ factor = (factor << 3) / scale;
+ if (!factor)
+ factor = 1;
+ }
+
+ ca->alpha = 2 * factor * ((1 << 7) - ca->beta);
+ if (!ca->alpha)
+ ca->alpha = ALPHA_BASE;
+}
+
+/*
+ * After we have the rtt data to calculate beta, we'd still prefer to wait one
+ * rtt before we adjust our beta to ensure we are working from a consistent
+ * data.
+ *
+ * This function should be called when we hit a congestion event since only at
+ * that point do we really have a real sense of maxRTT (the queues en route
+ * were getting just too full now).
+ */
+static void htcp_param_update(struct sctp_transport *tp)
+{
+ struct htcp *ca = sctp_ca(tp);
+ u32 minRTT = ca->minRTT;
+ u32 maxRTT = ca->maxRTT;
+
+ htcp_beta_update(ca, minRTT, maxRTT);
+ htcp_alpha_update(ca);
+
+ /* add slowly fading memory for maxRTT to accommodate routing changes */
+ if (minRTT > 0 && maxRTT > minRTT)
+ ca->maxRTT = minRTT + ((maxRTT - minRTT) * 95) / 100;
+}
+
+static u32 htcp_recalc_ssthresh(struct sctp_transport *tp)
+{
+ const struct htcp *ca = sctp_ca(tp);
+
+ htcp_param_update(tp);
+ return max((tp->cwnd * ca->beta) >> 7, 2U);
+}
+
+static void htcp_cong_avoid(struct sctp_transport *tp, __u32 sack_ctsn, __u32 bytes_acked)
+{
+ struct htcp *ca = sctp_ca(tp);
+
+ if (!sctp_is_cwnd_limited(tp))
+ return;
+
+ if (tp->cwnd <= tp->ssthresh)
+ sctp_slow_start(tp, bytes_acked);
+ else {
+ /* In dangerous area, increase slowly.
+ * In theory this is tp->cwnd += alpha / tp->cwnd
+ */
+ if ((tp->partial_bytes_acked * ca->alpha)>>7 >= tp->cwnd) {
+ tp->cwnd += min(bytes_acked, tp->asoc->pathmtu);
+ tp->partial_bytes_acked = 0;
+ htcp_alpha_update(ca);
+ } else
+ tp->partial_bytes_acked += ca->pkts_acked;
+
+ ca->pkts_acked = 1;
+ }
+}
+
+static void htcp_init(struct sctp_transport *tp)
+{
+ struct htcp *ca = sctp_ca(tp);
+
+ memset(ca, 0, sizeof(struct htcp));
+ ca->alpha = ALPHA_BASE;
+ ca->beta = BETA_MIN;
+ ca->pkts_acked = 1;
+ ca->last_cong = jiffies;
+}
+
+static void htcp_state(struct sctp_transport *tp, u8 new_state)
+{
+ switch (new_state) {
+ case SCTP_CA_Open:
+ {
+ struct htcp *ca = sctp_ca(tp);
+ if (ca->undo_last_cong) {
+ ca->last_cong = jiffies;
+ ca->undo_last_cong = 0;
+ }
+ }
+ break;
+ case SCTP_CA_CWR:
+ case SCTP_CA_Recovery:
+ case SCTP_CA_Loss:
+ htcp_reset(sctp_ca(tp));
+ break;
+ }
+}
+
+static struct sctp_congestion_ops htcp __read_mostly = {
+ .init = htcp_init,
+ .ssthresh = htcp_recalc_ssthresh,
+ .cong_avoid = htcp_cong_avoid,
+ .set_state = htcp_state,
+ .undo_cwnd = htcp_cwnd_undo,
+ .pkts_acked = measure_achieved_throughput,
+ .owner = THIS_MODULE,
+ .name = "htcp",
+};
+
+static int __init htcp_register(void)
+{
+ BUILD_BUG_ON(sizeof(struct htcp) > ICSK_CA_PRIV_SIZE);
+ BUILD_BUG_ON(BETA_MIN >= BETA_MAX);
+ return sctp_register_congestion_control(&htcp);
+}
+
+static void __exit htcp_unregister(void)
+{
+ sctp_unregister_congestion_control(&htcp);
+}
+
+module_init(htcp_register);
+module_exit(htcp_unregister);
+
+MODULE_AUTHOR("Baruch Even");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("H-TCP");
diff --git a/net/sctp/cong_hybla.c b/net/sctp/cong_hybla.c
new file mode 100644
index 0000000..4e37f0b
--- /dev/null
+++ b/net/sctp/cong_hybla.c
@@ -0,0 +1,187 @@
+/*
+ * SCTP HYBLA
+ *
+ * SCTP-HYBLA Congestion control algorithm, based on:
+ * C.Caini, R.Firrincieli, "TCP-Hybla: A TCP Enhancement
+ * for Heterogeneous Networks",
+ * International Journal on satellite Communications,
+ * September 2004
+ * Daniele Lacamera
+ * root at danielinux.net
+ */
+
+#include <linux/module.h>
+#include <net/sctp/congestion.h>
+
+/* Tcp Hybla structure. */
+struct hybla {
+ u8 hybla_en;
+ u64 snd_cwnd_cents; /* Keeps increment values when it is <1, <<7 */
+ u64 rho; /* Rho parameter, integer part */
+ u64 rho2; /* Rho * Rho, integer part */
+ u64 rho_3ls; /* Rho parameter, <<3 */
+ u64 rho2_7ls; /* Rho^2, <<7 */
+ u64 minrtt; /* Minimum smoothed round trip time value seen */
+};
+
+/* Hybla reference round trip time (default= 1/40 sec = 25 ms),
+ expressed in jiffies */
+static int rtt0 = 25;
+module_param(rtt0, int, 0644);
+MODULE_PARM_DESC(rtt0, "reference round trip time (ms)");
+
+
+/* This is called to refresh values for hybla parameters */
+static inline void hybla_recalc_param (struct sctp_transport *tp)
+{
+ struct hybla *ca = sctp_ca(tp);
+
+ ca->rho_3ls = max_t(u32, tp->srtt / msecs_to_jiffies(rtt0), tp->asoc->pathmtu << 3);
+ ca->rho = ca->rho_3ls >> 3;
+ ca->rho2_7ls = (ca->rho_3ls * ca->rho_3ls) << 1;
+ ca->rho2 = ca->rho2_7ls >>7;
+}
+
+static void hybla_init(struct sctp_transport *tp)
+{
+ struct hybla *ca = sctp_ca(tp);
+
+ ca->rho = 0;
+ ca->rho2 = 0;
+ ca->rho_3ls = 0;
+ ca->rho2_7ls = 0;
+ ca->snd_cwnd_cents = 0;
+ ca->hybla_en = 1;
+ tp->cwnd = 2 * tp->asoc->pathmtu;
+
+ /* 1st Rho measurement based on initial srtt */
+ hybla_recalc_param(tp);
+
+ /* set minimum rtt as this is the 1st ever seen */
+ ca->minrtt = tp->srtt;
+ tp->cwnd = ca->rho;
+}
+
+static void hybla_state(struct sctp_transport *tp, u8 ca_state)
+{
+ struct hybla *ca = sctp_ca(tp);
+ ca->hybla_en = (ca_state == SCTP_CA_Open);
+}
+
+static inline u32 hybla_fraction(u32 odds)
+{
+ static const u32 fractions[] = {
+ 128, 139, 152, 165, 181, 197, 215, 234,
+ };
+
+ return (odds < ARRAY_SIZE(fractions)) ? fractions[odds] : 128;
+}
+
+/* SCTP Hybla main routine.
+ * This is the algorithm behavior:
+ * o Recalc Hybla parameters if min_rtt has changed
+ * o Give cwnd a new value based on the model proposed
+ * o remember increments <1
+ */
+static void hybla_cong_avoid(struct sctp_transport *tp, __u32 sack_ctsn, __u32 bytes_acked)
+{
+ struct hybla *ca = sctp_ca(tp);
+ u32 increment, odd, rho_fractions;
+ int is_slowstart = 0;
+
+ /* Recalculate rho only if this srtt is the lowest */
+ if (tp->srtt < ca->minrtt){
+ hybla_recalc_param(tp);
+ ca->minrtt = tp->srtt;
+ }
+
+ if (!sctp_is_cwnd_limited(tp))
+ return;
+
+ if (!ca->hybla_en) {
+ sctp_reno_cong_avoid(tp, sack_ctsn, bytes_acked);
+ return;
+ }
+
+ if (ca->rho == 0)
+ hybla_recalc_param(tp);
+
+ rho_fractions = ca->rho_3ls - (ca->rho << 3);
+
+ if (tp->cwnd < tp->ssthresh) {
+ /*
+ * slow start
+ * INC = 2^RHO - 1
+ * This is done by splitting the rho parameter
+ * into 2 parts: an integer part and a fraction part.
+ * Increment<<7 is estimated by doing:
+ * [2^(int+fract)]<<7
+ * that is equal to:
+ * (2^int) * [(2^fract) <<7]
+ * 2^int is straightly computed as 1<<int,
+ * while we will use hybla_fraction() to
+ * calculate 2^fract in a <<7 value.
+ */
+ is_slowstart = 1;
+ increment = ((1 << min_t(u64, ca->rho, tp->asoc->pathmtu << 3)) *
+ hybla_fraction(rho_fractions)) - 128;
+ } else {
+ /*
+ * congestion avoidance
+ * INC = RHO^2 / W
+ * as long as increment is estimated as (rho<<7)/window
+ * it already is <<7 and we can easily count its fractions.
+ */
+ increment = ca->rho2_7ls / tp->cwnd;
+ if (increment < 128)
+ tp->partial_bytes_acked += bytes_acked;
+ }
+
+ odd = increment % 128;
+ tp->cwnd += increment >> 7;
+ ca->snd_cwnd_cents += odd;
+
+ /* check when fractions goes >=128 and increase cwnd by 1. */
+ while (ca->snd_cwnd_cents >= 128) {
+ tp->cwnd += min(bytes_acked, tp->asoc->pathmtu);
+ ca->snd_cwnd_cents -= 128;
+ tp->partial_bytes_acked = 0;
+ }
+ /* check when cwnd has not been incremented for a while */
+ if (increment == 0 && odd == 0 && tp->partial_bytes_acked >= tp->cwnd) {
+ tp->cwnd += tp->asoc->pathmtu;
+ tp->partial_bytes_acked = 0;
+ }
+ /* clamp down slowstart cwnd to ssthresh value. */
+ if (is_slowstart)
+ tp->cwnd = min(tp->cwnd, tp->ssthresh);
+}
+
+static struct sctp_congestion_ops sctp_hybla __read_mostly = {
+ .init = hybla_init,
+ .ssthresh = sctp_reno_ssthresh,
+ .min_cwnd = sctp_reno_min_cwnd,
+ .cong_avoid = hybla_cong_avoid,
+ .set_state = hybla_state,
+
+ .owner = THIS_MODULE,
+ .name = "hybla"
+};
+
+static int __init hybla_register(void)
+{
+ BUILD_BUG_ON(sizeof(struct hybla) > ICSK_CA_PRIV_SIZE);
+ return sctp_register_congestion_control(&sctp_hybla);
+}
+
+static void __exit hybla_unregister(void)
+{
+ sctp_unregister_congestion_control(&sctp_hybla);
+}
+
+module_init(hybla_register);
+module_exit(hybla_unregister);
+
+MODULE_AUTHOR("Daniele Lacamera");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("SCTP Hybla");
diff --git a/net/sctp/congestion.c b/net/sctp/congestion.c
new file mode 100644
index 0000000..13cabac
--- /dev/null
+++ b/net/sctp/congestion.c
@@ -0,0 +1,576 @@
+/* SCTP kernel implementation
+ * (C) Copyright IBM Corp. 2001, 2004
+ * Copyright (c) 1999-2000 Cisco, Inc.
+ * Copyright (c) 1999-2001 Motorola, Inc.
+ * Copyright (c) 2001-2003 Intel Corp.
+ * Copyright (C) 2005 Stephen Hemminger <shemminger@osdl.org>
+ * Copyright (C) 2011 Morten Hustveit <mortehu@ping.uio.no>
+ *
+ * This SCTP implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This SCTP implementation is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ * ************************
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING. If not, write to
+ * the Free Software Foundation, 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Please send any bug reports or fixes you make to the
+ * email address(es):
+ * lksctp developers <lksctp-developers@lists.sourceforge.net>
+ *
+ * Or submit a bug report through the following website:
+ * http://www.sf.net/projects/lksctp
+ *
+ * Any bugs reported given to us we will try to fix... any fixes shared will
+ * be incorporated into the next SCTP release.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/gfp.h>
+#include <linux/rculist.h>
+#include <net/sctp/congestion.h>
+#include <net/sctp/sm.h>
+
+static DEFINE_SPINLOCK(sctp_cong_list_lock);
+static LIST_HEAD(sctp_cong_list);
+
+/* Simple linear search, don't expect many entries! */
+static struct sctp_congestion_ops *sctp_ca_find(const char *name)
+{
+ struct sctp_congestion_ops *e;
+
+ list_for_each_entry_rcu(e, &sctp_cong_list, list) {
+ if (strcmp(e->name, name) == 0)
+ return e;
+ }
+
+ return NULL;
+}
+
+/*
+ * Attach new congestion control algorithm to the list
+ * of available options.
+ */
+int sctp_register_congestion_control(struct sctp_congestion_ops *ca)
+{
+ int ret = 0;
+
+ /* all algorithms must implement ssthresh and cong_avoid ops */
+ if (!ca->ssthresh || !ca->cong_avoid) {
+ printk(KERN_ERR "SCTP %s does not implement required ops\n",
+ ca->name);
+ return -EINVAL;
+ }
+
+ spin_lock(&sctp_cong_list_lock);
+ if (sctp_ca_find(ca->name)) {
+ printk(KERN_NOTICE "SCTP %s already registered\n", ca->name);
+ ret = -EEXIST;
+ } else {
+ list_add_tail_rcu(&ca->list, &sctp_cong_list);
+ printk(KERN_INFO "SCTP %s registered\n", ca->name);
+ }
+ spin_unlock(&sctp_cong_list_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(sctp_register_congestion_control);
+
+/*
+ * Remove congestion control algorithm, called from
+ * the module's remove function. Module ref counts are used
+ * to ensure that this can't be done till all sockets using
+ * that method are closed.
+ */
+void sctp_unregister_congestion_control(struct sctp_congestion_ops *ca)
+{
+ spin_lock(&sctp_cong_list_lock);
+ list_del_rcu(&ca->list);
+ spin_unlock(&sctp_cong_list_lock);
+}
+EXPORT_SYMBOL_GPL(sctp_unregister_congestion_control);
+
+/* Assign choice of congestion control. */
+void sctp_init_congestion_control(struct sctp_transport *tp)
+{
+ struct sctp_congestion_ops *ca;
+
+ /* if no choice made yet assign the current value set as default */
+ if (tp->ca_ops == &sctp_init_congestion_ops) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(ca, &sctp_cong_list, list) {
+ if (try_module_get(ca->owner)) {
+ tp->ca_ops = ca;
+ break;
+ }
+
+ /* fallback to next available */
+ }
+ rcu_read_unlock();
+ }
+
+ if (tp->ca_ops->init)
+ tp->ca_ops->init(tp);
+}
+
+/* Manage refcounts on socket close. */
+void sctp_cleanup_congestion_control(struct sctp_transport *tp)
+{
+ if (tp->ca_ops->release)
+ tp->ca_ops->release(tp);
+ module_put(tp->ca_ops->owner);
+}
+
+/* Used by sysctl to change default congestion control */
+int sctp_set_default_congestion_control(const char *name)
+{
+ struct sctp_congestion_ops *ca;
+ int ret = -ENOENT;
+
+ printk("Setting congestion control to '%s'", name);
+
+ spin_lock(&sctp_cong_list_lock);
+ ca = sctp_ca_find(name);
+#ifdef CONFIG_MODULES
+ if (!ca && capable(CAP_NET_ADMIN)) {
+ spin_unlock(&sctp_cong_list_lock);
+
+ request_module("sctp_%s", name);
+ spin_lock(&sctp_cong_list_lock);
+ ca = sctp_ca_find(name);
+ }
+#endif
+
+ if (ca) {
+ ca->flags |= SCTP_CONG_NON_RESTRICTED; /* default is always allowed */
+ list_move(&ca->list, &sctp_cong_list);
+ ret = 0;
+ }
+ spin_unlock(&sctp_cong_list_lock);
+
+ return ret;
+}
+
+/* Set default value from kernel configuration at bootup */
+static int __init sctp_congestion_default(void)
+{
+ return sctp_set_default_congestion_control(CONFIG_DEFAULT_SCTP_CONG);
+}
+late_initcall(sctp_congestion_default);
+
+/* Build string with list of available congestion control values */
+void sctp_get_available_congestion_control(char *buf, size_t maxlen)
+{
+ struct sctp_congestion_ops *ca;
+ size_t offs = 0;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(ca, &sctp_cong_list, list) {
+ offs += snprintf(buf + offs, maxlen - offs,
+ "%s%s",
+ offs == 0 ? "" : " ", ca->name);
+
+ }
+ rcu_read_unlock();
+}
+
+/* Get current default congestion control */
+void sctp_get_default_congestion_control(char *name)
+{
+ struct sctp_congestion_ops *ca;
+ /* We will always have reno... */
+ BUG_ON(list_empty(&sctp_cong_list));
+
+ rcu_read_lock();
+ ca = list_entry(sctp_cong_list.next, struct sctp_congestion_ops, list);
+ strncpy(name, ca->name, SCTP_CA_NAME_MAX);
+ rcu_read_unlock();
+}
+
+/* Built list of non-restricted congestion control values */
+void sctp_get_allowed_congestion_control(char *buf, size_t maxlen)
+{
+ struct sctp_congestion_ops *ca;
+ size_t offs = 0;
+
+ *buf = '\0';
+ rcu_read_lock();
+ list_for_each_entry_rcu(ca, &sctp_cong_list, list) {
+ if (!(ca->flags & SCTP_CONG_NON_RESTRICTED))
+ continue;
+ offs += snprintf(buf + offs, maxlen - offs,
+ "%s%s",
+ offs == 0 ? "" : " ", ca->name);
+
+ }
+ rcu_read_unlock();
+}
+
+/* Change list of non-restricted congestion control */
+int sctp_set_allowed_congestion_control(char *val)
+{
+ struct sctp_congestion_ops *ca;
+ char *saved_clone, *clone, *name;
+ int ret = 0;
+
+ saved_clone = clone = kstrdup(val, GFP_USER);
+ if (!clone)
+ return -ENOMEM;
+
+ spin_lock(&sctp_cong_list_lock);
+ /* pass 1 check for bad entries */
+ while ((name = strsep(&clone, " ")) && *name) {
+ ca = sctp_ca_find(name);
+ if (!ca) {
+ ret = -ENOENT;
+ goto out;
+ }
+ }
+
+ /* pass 2 clear old values */
+ list_for_each_entry_rcu(ca, &sctp_cong_list, list)
+ ca->flags &= ~SCTP_CONG_NON_RESTRICTED;
+
+ /* pass 3 mark as allowed */
+ while ((name = strsep(&val, " ")) && *name) {
+ ca = sctp_ca_find(name);
+ WARN_ON(!ca);
+ if (ca)
+ ca->flags |= SCTP_CONG_NON_RESTRICTED;
+ }
+out:
+ spin_unlock(&sctp_cong_list_lock);
+ kfree(saved_clone);
+
+ return ret;
+}
+
+/* Change congestion control for socket */
+int sctp_set_congestion_control(struct sctp_transport *tp, const char *name)
+{
+ struct sock *sk = tp->asoc->base.sk;
+ struct sctp_congestion_ops *ca;
+ int err = 0;
+
+ rcu_read_lock();
+ ca = sctp_ca_find(name);
+
+ /* no change asking for existing value */
+ if (ca == tp->ca_ops)
+ goto out;
+
+#ifdef CONFIG_MODULES
+ /* not found attempt to autoload module */
+ if (!ca && capable(CAP_NET_ADMIN)) {
+ rcu_read_unlock();
+ request_module("sctp_%s", name);
+ rcu_read_lock();
+ ca = sctp_ca_find(name);
+ }
+#endif
+ if (!ca)
+ err = -ENOENT;
+
+ else if (!((ca->flags & SCTP_CONG_NON_RESTRICTED) || capable(CAP_NET_ADMIN)))
+ err = -EPERM;
+
+ else if (!try_module_get(ca->owner))
+ err = -EBUSY;
+
+ else {
+ sctp_cleanup_congestion_control(tp);
+ tp->ca_ops = ca;
+
+ if (sk->sk_state != SCTP_SS_CLOSING && tp->ca_ops->init)
+ tp->ca_ops->init(tp);
+ }
+ out:
+ rcu_read_unlock();
+ return err;
+}
+
+/* RFC2861 Check whether we are limited by application or congestion window
+ * This is the inverse of cwnd check in sctp_tso_should_defer
+ */
+int sctp_is_cwnd_limited(const struct sctp_transport *tp)
+{
+ static const int sysctl_sctp_tso_win_divisor = 3;
+ struct sock *sk = tp->asoc->base.sk;
+ u32 left;
+
+ if (tp->flight_size >= tp->cwnd)
+ return 1;
+
+ /* XXX: What is this? -- mortehu */
+
+ left = tp->cwnd - tp->flight_size;
+ if (sk_can_gso(tp->asoc->base.sk) &&
+ left * sysctl_sctp_tso_win_divisor < tp->cwnd && /* && XXX: What is TSO frame? -- mortehu */
+ left < sk->sk_gso_max_size)
+ return 1;
+ return left <= tp->burst_limited;
+}
+EXPORT_SYMBOL_GPL(sctp_is_cwnd_limited);
+
+void sctp_cong_raise_cwnd(struct sctp_transport *transport, __u32 sack_ctsn, __u32 bytes_acked)
+{
+ mortehu_debug_add(transport->asoc->base.sk, mortehu_acked, sack_ctsn, bytes_acked, 0, 0);
+
+ transport->ca_ops->cong_avoid(transport, sack_ctsn, bytes_acked);
+
+ mortehu_debug_add(transport->asoc->base.sk, mortehu_new_cwnd, transport->cwnd, transport->ssthresh, transport->partial_bytes_acked, 0);
+}
+
+void sctp_cong_lower_cwnd(struct sctp_transport *transport, sctp_lower_cwnd_t reason)
+{
+ struct sctp_association *asoc = transport->asoc;
+
+ switch (reason) {
+ case SCTP_LOWER_CWND_T3_RTX:
+ mortehu_debug_add(transport->asoc->base.sk, mortehu_congestion, reason, 0, 0, 0);
+ sctp_set_ca_state(transport, SCTP_CA_Loss);
+
+ /* RFC 2960 Section 7.2.3, sctpimpguide
+ * When the T3-rtx timer expires on an address, SCTP should
+ * perform slow start by:
+ * ssthresh = max(cwnd/2, 4*MTU)
+ * cwnd = 1*MTU
+ * partial_bytes_acked = 0
+ */
+ transport->ssthresh = transport->ca_ops->ssthresh(transport);
+ transport->cwnd = asoc->pathmtu;
+
+ /* T3-rtx also clears fast recovery */
+ asoc->fast_recovery = 0;
+ break;
+
+ case SCTP_LOWER_CWND_FAST_RTX:
+ /* RFC 2960 7.2.4 Adjust the ssthresh and cwnd of the
+ * destination address(es) to which the missing DATA chunks
+ * were last sent, according to the formula described in
+ * Section 7.2.3.
+ *
+ * RFC 2960 7.2.3, sctpimpguide Upon detection of packet
+ * losses from SACK (see Section 7.2.4), An endpoint
+ * should do the following:
+ * ssthresh = max(cwnd/2, 4*MTU)
+ * cwnd = ssthresh
+ * partial_bytes_acked = 0
+ */
+ if (asoc->fast_recovery)
+ return;
+
+ mortehu_debug_add(transport->asoc->base.sk, mortehu_congestion, reason, 0, 0, 0);
+ sctp_set_ca_state(transport, SCTP_CA_Recovery);
+
+ /* Mark Fast recovery */
+ asoc->fast_recovery = 1;
+ asoc->fast_recovery_exit = asoc->next_tsn - 1;
+
+ transport->ssthresh = transport->ca_ops->ssthresh(transport);
+ transport->cwnd = transport->ssthresh;
+ break;
+
+ case SCTP_LOWER_CWND_ECNE:
+ mortehu_debug_add(transport->asoc->base.sk, mortehu_congestion, reason, 0, 0, 0);
+ /* RFC 2481 Section 6.1.2.
+ * If the sender receives an ECN-Echo ACK packet
+ * then the sender knows that congestion was encountered in the
+ * network on the path from the sender to the receiver. The
+ * indication of congestion should be treated just as a
+ * congestion loss in non-ECN Capable TCP. That is, the TCP
+ * source halves the congestion window "cwnd" and reduces the
+ * slow start threshold "ssthresh".
+ * A critical condition is that TCP does not react to
+ * congestion indications more than once every window of
+ * data (or more loosely more than once every round-trip time).
+ */
+ if (time_after(jiffies, transport->last_time_ecne_reduced +
+ transport->rtt)) {
+ transport->ssthresh = transport->ca_ops->ssthresh(transport);
+ transport->cwnd = transport->ssthresh;
+ transport->last_time_ecne_reduced = jiffies;
+ sctp_set_ca_state(transport, SCTP_CA_CWR);
+ }
+ break;
+
+ case SCTP_LOWER_CWND_INACTIVE:
+ mortehu_debug_add(transport->asoc->base.sk, mortehu_congestion, reason, 0, 0, 0);
+ sctp_set_ca_state(transport, SCTP_CA_Open);
+
+ /* RFC 2960 Section 7.2.1, sctpimpguide
+ * When the endpoint does not transmit data on a given
+ * transport address, the cwnd of the transport address
+ * should be adjusted to max(cwnd/2, 4*MTU) per RTO.
+ * NOTE: Although the draft recommends that this check needs
+ * to be done every RTO interval, we do it every heartbeat
+ * interval.
+ */
+ transport->ssthresh = transport->ca_ops->ssthresh(transport);
+ break;
+ }
+
+ transport->partial_bytes_acked = 0;
+ SCTP_DEBUG_PRINTK("%s: transport: %p reason: %d cwnd: "
+ "%d ssthresh: %d\n", __func__,
+ transport, reason,
+ transport->cwnd, transport->ssthresh);
+
+ mortehu_debug_add(transport->asoc->base.sk, mortehu_new_cwnd, transport->cwnd, transport->ssthresh, transport->partial_bytes_acked, 0);
+}
+
+void sctp_slow_start(struct sctp_transport *transport, __u32 bytes_acked)
+{
+ struct sctp_association *asoc = transport->asoc;
+ __u32 cwnd, ssthresh, flight_size, pmtu;
+
+ cwnd = transport->cwnd;
+ flight_size = transport->flight_size;
+ ssthresh = transport->ssthresh;
+ pmtu = transport->asoc->pathmtu;
+
+ if (asoc->fast_recovery)
+ return;
+
+ if (bytes_acked > pmtu)
+ cwnd += pmtu;
+ else
+ cwnd += bytes_acked;
+ SCTP_DEBUG_PRINTK("%s: SLOW START: transport: %p, "
+ "bytes_acked: %d, cwnd: %d, ssthresh: %d, "
+ "flight_size: %d, pba: %d\n",
+ __func__,
+ transport, bytes_acked, cwnd,
+ ssthresh, flight_size, transport->partial_bytes_acked);
+
+ transport->cwnd = cwnd;
+}
+EXPORT_SYMBOL_GPL(sctp_slow_start);
+
+/* In theory this is tp->cwnd += 1 / tp->cwnd (or alternative w) */
+void sctp_cong_avoid_ai(struct sctp_transport *transport, __u32 bytes_acked)
+{
+ __u32 cwnd, ssthresh, flight_size, pba, pmtu;
+
+ cwnd = transport->cwnd;
+ flight_size = transport->flight_size;
+ ssthresh = transport->ssthresh;
+ pba = transport->partial_bytes_acked;
+ pmtu = transport->asoc->pathmtu;
+
+ /* RFC 2960 7.2.2 Whenever cwnd is greater than ssthresh,
+ * upon each SACK arrival that advances the Cumulative TSN Ack
+ * Point, increase partial_bytes_acked by the total number of
+ * bytes of all new chunks acknowledged in that SACK including
+ * chunks acknowledged by the new Cumulative TSN Ack and by
+ * Gap Ack Blocks.
+ *
+ * When partial_bytes_acked is equal to or greater than cwnd
+ * and before the arrival of the SACK the sender had cwnd or
+ * more bytes of data outstanding (i.e., before arrival of the
+ * SACK, flightsize was greater than or equal to cwnd),
+ * increase cwnd by MTU, and reset partial_bytes_acked to
+ * (partial_bytes_acked - cwnd).
+ */
+ pba += bytes_acked;
+ if (pba >= cwnd) {
+ cwnd += pmtu;
+ pba = ((cwnd < pba) ? (pba - cwnd) : 0);
+ }
+ SCTP_DEBUG_PRINTK("%s: CONGESTION AVOIDANCE: "
+ "transport: %p, bytes_acked: %d, cwnd: %d, "
+ "ssthresh: %d, flight_size: %d, pba: %d\n",
+ __func__,
+ transport, bytes_acked, cwnd,
+ ssthresh, flight_size, pba);
+
+ transport->cwnd = cwnd;
+ transport->partial_bytes_acked = pba;
+}
+
+EXPORT_SYMBOL_GPL(sctp_cong_avoid_ai);
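To make the RFC 2960 7.2.2 arithmetic above concrete, here is an illustrative stand-alone calculation (plain user-space C, not part of the patch) of a single additive-increase step, using made-up numbers based on a 1452-byte path MTU:

/* Stand-alone illustration of the additive-increase step above, using
 * made-up numbers; this is not kernel code. */
#include <stdio.h>

int main(void)
{
	unsigned int pmtu = 1452;			/* path MTU */
	unsigned int cwnd = 4 * pmtu;			/* 5808 */
	unsigned int pba  = 3 * pmtu;			/* 4356 */
	unsigned int bytes_acked = 2 * pmtu;		/* two chunks SACKed */

	pba += bytes_acked;				/* 7260, now >= cwnd */
	if (pba >= cwnd) {
		cwnd += pmtu;				/* 5808 -> 7260 */
		pba = (cwnd < pba) ? pba - cwnd : 0;	/* resets to 0 */
	}

	printf("cwnd=%u pba=%u\n", cwnd, pba);		/* cwnd=7260 pba=0 */
	return 0;
}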
+
+/* This routine updates the transport's cwnd and partial_bytes_acked
+ * parameters based on the bytes acked in the received SACK.
+ */
+void sctp_reno_cong_avoid(struct sctp_transport *transport, __u32 sack_ctsn, __u32 bytes_acked)
+{
+ struct sctp_association *asoc = transport->asoc;
+ __u32 cwnd, ssthresh, pba, pmtu;
+
+ /* See if we need to exit Fast Recovery first */
+ if (asoc->fast_recovery &&
+ TSN_lte(asoc->fast_recovery_exit, sack_ctsn))
+ {
+ mortehu_debug_add(transport->asoc->base.sk, mortehu_congestion, SCTP_LOWER_CWND_FAST_RTX, -1, 0, 0);
+ asoc->fast_recovery = 0;
+ }
+
+ /* The appropriate cwnd increase algorithm is performed if, and only
+ * if the cumulative TSN would have advanced and the congestion window is
+ * being fully utilized.
+ */
+ if (TSN_lte(sack_ctsn, transport->asoc->ctsn_ack_point) ||
+ !sctp_is_cwnd_limited(transport))
+ return;
+
+ cwnd = transport->cwnd;
+ ssthresh = transport->ssthresh;
+ pba = transport->partial_bytes_acked;
+ pmtu = transport->asoc->pathmtu;
+
+ if (cwnd <= ssthresh)
+ sctp_slow_start(transport, bytes_acked);
+ else
+ sctp_cong_avoid_ai(transport, bytes_acked);
+}
+EXPORT_SYMBOL_GPL(sctp_reno_cong_avoid);
+
+/* Slow start threshold is half the congestion window (min 4*MTU) */
+u32 sctp_reno_ssthresh(struct sctp_transport *tp)
+{
+ struct sctp_association *asoc = tp->asoc;
+ return max(tp->cwnd / 2, 4 * asoc->pathmtu);
+}
+EXPORT_SYMBOL_GPL(sctp_reno_ssthresh);
+
+/* Lower bound on congestion window with halving. */
+u32 sctp_reno_min_cwnd(const struct sctp_transport *tp)
+{
+ return tp->ssthresh / 2;
+}
+EXPORT_SYMBOL_GPL(sctp_reno_min_cwnd);
+
+struct sctp_congestion_ops sctp_reno = {
+ .flags = SCTP_CONG_NON_RESTRICTED,
+ .name = "reno",
+ .owner = THIS_MODULE,
+ .ssthresh = sctp_reno_ssthresh,
+ .cong_avoid = sctp_reno_cong_avoid,
+ .min_cwnd = sctp_reno_min_cwnd,
+};
+
+/* Initial congestion control, used before a congestion control algorithm
+ * has been selected; really Reno under another name, so we can tell the
+ * difference during sctp_set_default_congestion_control.
+ */
+struct sctp_congestion_ops sctp_init_congestion_ops = {
+ .name = "",
+ .owner = THIS_MODULE,
+ .ssthresh = sctp_reno_ssthresh,
+ .cong_avoid = sctp_reno_cong_avoid,
+ .min_cwnd = sctp_reno_min_cwnd,
+};
+EXPORT_SYMBOL_GPL(sctp_init_congestion_ops);
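To show how a new algorithm would hook into this, here is a minimal sketch of a hypothetical out-of-tree plugin module ("sctp_example"), modelled on the sctp_reno definition above. It assumes that an sctp_unregister_congestion_control() counterpart to sctp_register_congestion_control() exists (mirroring the TCP API) and that registration returns 0 on success; neither is shown in this excerpt.

/* Hypothetical plugin module sketch; the module name "sctp_example" follows
 * the request_module("sctp_%s", name) autoload convention above. */
#include <linux/module.h>
#include <net/sctp/sctp.h>
#include <net/sctp/congestion.h>

static u32 example_ssthresh(struct sctp_transport *tp)
{
	/* Same rule as Reno: half the window, but never below 4 MTUs. */
	return max(tp->cwnd / 2, 4 * tp->asoc->pathmtu);
}

static void example_cong_avoid(struct sctp_transport *tp, __u32 sack_ctsn,
			       __u32 bytes_acked)
{
	/* Reuse the exported Reno helper; a real plugin would grow the
	 * window according to its own algorithm here. */
	sctp_reno_cong_avoid(tp, sack_ctsn, bytes_acked);
}

static u32 example_min_cwnd(const struct sctp_transport *tp)
{
	return tp->ssthresh / 2;
}

static struct sctp_congestion_ops sctp_example = {
	.flags		= SCTP_CONG_NON_RESTRICTED,
	.name		= "example",
	.owner		= THIS_MODULE,
	.ssthresh	= example_ssthresh,
	.cong_avoid	= example_cong_avoid,
	.min_cwnd	= example_min_cwnd,
};

static int __init sctp_example_register(void)
{
	return sctp_register_congestion_control(&sctp_example);
}

static void __exit sctp_example_unregister(void)
{
	/* Assumed counterpart of sctp_register_congestion_control(). */
	sctp_unregister_congestion_control(&sctp_example);
}

module_init(sctp_example_register);
module_exit(sctp_example_unregister);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Example SCTP congestion control plugin");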
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 60600d3..b6e3005 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -62,6 +62,7 @@
#include <net/sctp/sctp.h>
#include <net/sctp/sm.h>
#include <net/sctp/checksum.h>
+#include <net/sctp/congestion.h>
/* Forward declarations for private helpers. */
static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
@@ -349,6 +350,7 @@ int sctp_packet_transmit(struct sctp_packet *packet)
struct sock *sk;
int err = 0;
int padding; /* How much padding do we need? */
+ int xmit_result;
__u8 has_data = 0;
struct dst_entry *dst = tp->dst;
unsigned char *auth = NULL; /* pointer to auth in skb data */
@@ -459,6 +461,10 @@ int sctp_packet_transmit(struct sctp_packet *packet)
memcpy(skb_put(nskb, chunk->skb->len),
chunk->skb->data, chunk->skb->len);
+ mortehu_debug_add(tp->asoc->base.sk, mortehu_tx_chunk,
+ chunk->has_tsn ? ntohl(chunk->subh.data_hdr->tsn) : 0,
+ chunk->skb->len, chunk->chunk_hdr->type, 0);
+
SCTP_DEBUG_PRINTK("%s %p[%s] %s 0x%x, %s %d, %s %d, %s %d\n",
"*** Chunk", chunk,
sctp_cname(SCTP_ST_CHUNK(
@@ -561,11 +567,15 @@ int sctp_packet_transmit(struct sctp_packet *packet)
}
}
+ mortehu_debug_add(tp->asoc->base.sk, mortehu_tx_packet, nskb->len, 0, 0, 0);
SCTP_DEBUG_PRINTK("***sctp_transmit_packet*** skb len %d\n",
nskb->len);
nskb->local_df = packet->ipfragok;
- (*tp->af_specific->sctp_xmit)(nskb, tp);
+ xmit_result = (*tp->af_specific->sctp_xmit)(nskb, tp);
+
+ if (unlikely(xmit_result > 0))
+ sctp_cong_lower_cwnd(tp, SCTP_LOWER_CWND_ECNE);
out:
sctp_packet_reset(packet);
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 26dc005..1cd0414 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -57,6 +57,7 @@
#include <net/sctp/sctp.h>
#include <net/sctp/sm.h>
+#include <net/sctp/congestion.h>
/* Declare internal functions here. */
static int sctp_acked(struct sctp_sackhdr *sack, __u32 tsn);
@@ -485,7 +486,7 @@ void sctp_retransmit(struct sctp_outq *q, struct sctp_transport *transport,
switch(reason) {
case SCTP_RTXR_T3_RTX:
SCTP_INC_STATS(SCTP_MIB_T3_RETRANSMITS);
- sctp_transport_lower_cwnd(transport, SCTP_LOWER_CWND_T3_RTX);
+ sctp_cong_lower_cwnd(transport, SCTP_LOWER_CWND_T3_RTX);
/* Update the retran path if the T3-rtx timer has expired for
* the current retran path.
*/
@@ -496,7 +497,7 @@ void sctp_retransmit(struct sctp_outq *q, struct sctp_transport *transport,
break;
case SCTP_RTXR_FAST_RTX:
SCTP_INC_STATS(SCTP_MIB_FAST_RETRANSMITS);
- sctp_transport_lower_cwnd(transport, SCTP_LOWER_CWND_FAST_RTX);
+ sctp_cong_lower_cwnd(transport, SCTP_LOWER_CWND_FAST_RTX);
q->fast_rtx = 1;
break;
case SCTP_RTXR_PMTUD:
@@ -1602,8 +1603,7 @@ static void sctp_check_transmitted(struct sctp_outq *q,
SCTP_RECEIVED_SACK);
}
- sctp_transport_raise_cwnd(transport, sack_ctsn,
- bytes_acked);
+ sctp_cong_raise_cwnd(transport, sack_ctsn, bytes_acked);
transport->flight_size -= bytes_acked;
if (transport->flight_size == 0)
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 152976e..40406fc 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -63,6 +63,7 @@
#include <net/ipv6.h>
#include <net/route.h>
#include <net/sctp/sctp.h>
+#include <net/sctp/congestion.h>
#include <net/addrconf.h>
#include <net/inet_common.h>
#include <net/inet_ecn.h>
@@ -1253,6 +1254,8 @@ SCTP_STATIC __init int sctp_init(void)
pr_info("Hash tables configured (established %d bind %d)\n",
sctp_assoc_hashsize, sctp_port_hashsize);
+ sctp_register_congestion_control(&sctp_reno);
+
/* Disable ADDIP by default. */
sctp_addip_enable = 0;
sctp_addip_noauth = 0;
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index de98665..f9fc33f 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -2645,10 +2645,13 @@ __u32 sctp_generate_tag(const struct sctp_endpoint *ep)
/* Select an initial TSN to send during startup. */
__u32 sctp_generate_tsn(const struct sctp_endpoint *ep)
{
Starting the TSNs at zero makes it much easier to read the debug logs.
+ return 0; /* Confirmed random by fair coin toss */
+#if 0
__u32 retval;
get_random_bytes(&retval, sizeof(__u32));
return retval;
+#endif
}
/*
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index b21b218..047f49a 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -57,6 +57,7 @@
#include <net/sock.h>
#include <net/sctp/sctp.h>
#include <net/sctp/sm.h>
+#include <net/sctp/congestion.h>
static int sctp_cmd_interpreter(sctp_event_t event_type,
sctp_subtype_t subtype,
@@ -128,8 +129,7 @@ static struct sctp_chunk *sctp_do_ecn_ecne_work(struct sctp_association *asoc,
/* Update the congestion variables. */
if (transport)
- sctp_transport_lower_cwnd(transport,
- SCTP_LOWER_CWND_ECNE);
+ sctp_cong_lower_cwnd(transport, SCTP_LOWER_CWND_ECNE);
asoc->last_cwr_tsn = lowest_tsn;
}
@@ -253,12 +253,14 @@ void sctp_generate_t3_rtx_event(unsigned long peer)
/* Check whether a task is in the sock. */
sctp_bh_lock_sock(asoc->base.sk);
+
if (sock_owned_by_user(asoc->base.sk)) {
SCTP_DEBUG_PRINTK("%s:Sock is busy.\n", __func__);
/* Try again later. */
if (!mod_timer(&transport->T3_rtx_timer, jiffies + (HZ/20)))
sctp_transport_hold(transport);
+
goto out_unlock;
}
@@ -754,6 +756,8 @@ static void sctp_cmd_new_state(sctp_cmd_seq_t *cmds,
asoc->state = state;
+ mortehu_debug_add(sk, mortehu_state, state, 0, 0, 0);
+
SCTP_DEBUG_PRINTK("sctp_cmd_new_state: asoc %p[%s]\n",
asoc, sctp_state_tbl[state]);
@@ -1557,7 +1561,7 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
case SCTP_CMD_TRANSPORT_IDLE:
t = cmd->obj.transport;
- sctp_transport_lower_cwnd(t, SCTP_LOWER_CWND_INACTIVE);
+ sctp_cong_lower_cwnd(t, SCTP_LOWER_CWND_INACTIVE);
break;
case SCTP_CMD_TRANSPORT_HB_SENT:
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 4b4eb7c..60fd693 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -3100,6 +3100,17 @@ sctp_disposition_t sctp_sf_eat_sack_6_2(const struct sctp_endpoint *ep,
chunk->subh.sack_hdr = sackh;
ctsn = ntohl(sackh->cum_tsn_ack);
+ if (sackh->num_gap_ack_blocks)
+ {
+ sctp_sack_variable_t *frags = sackh->variable;
+ int i;
+
+ for (i = 0; i < ntohs(sackh->num_gap_ack_blocks); ++i)
+ mortehu_debug_add(asoc->base.sk, mortehu_gap_ack, ctsn + ntohs(frags[i].gab.end), 0, 0, 0);
+ }
+
+ mortehu_debug_add(asoc->base.sk, mortehu_sack, ctsn, ntohl(sackh->a_rwnd), ntohs(sackh->num_gap_ack_blocks), ntohs(sackh->num_dup_tsns));
+
/* i) If Cumulative TSN Ack is less than the Cumulative TSN
* Ack Point, then drop the SACK. Since Cumulative TSN
* Ack is monotonically increasing, a SACK whose
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 3951a10..1b842a2 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3765,6 +3765,9 @@ SCTP_STATIC int sctp_init_sock(struct sock *sk)
sp->ep = ep;
sp->hmac = NULL;
+ sp->mortehu_debug = kmalloc(sizeof(*sp->mortehu_debug) * MORTEHU_DEBUG_MAX, GFP_KERNEL);
+ sp->mortehu_debug_count = 0;
+
SCTP_DBG_OBJCNT_INC(sock);
local_bh_disable();
@@ -5415,6 +5418,19 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
case SCTP_GET_ASSOC_NUMBER:
retval = sctp_getsockopt_assoc_number(sk, len, optval, optlen);
break;
This adds the interface through which user-space applications can fetch the SCTP debug event log; a hypothetical user-space sketch follows below.
+
+ case 0x31337:
+ {
+ if (copy_to_user(optval, sctp_sk(sk)->mortehu_debug, sctp_sk(sk)->mortehu_debug_count * sizeof(*sctp_sk(sk)->mortehu_debug)))
+ retval = -EFAULT;
+ else
+ {
+ retval = sctp_sk(sk)->mortehu_debug_count;
+ sctp_sk(sk)->mortehu_debug_count = 0;
+ }
+ }
+ break;
+
default:
retval = -ENOPROTOOPT;
break;
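As a rough illustration of the debug interface above, a user-space program could drain the event log like this. The record layout and MORTEHU_DEBUG_MAX are not visible in this excerpt, so the buffer size is a guess and the records are treated as opaque bytes; using IPPROTO_SCTP as the socket level is likewise an assumption.

/* Hypothetical user-space sketch: fetch the per-socket debug event log
 * added above.  The 0x31337 option number comes from the hunk; the socket
 * level and buffer size are assumptions, and records are left opaque. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

static unsigned char log_buf[1 << 20];	/* assumed to be large enough */

int dump_sctp_debug_log(int fd)
{
	socklen_t len = sizeof(log_buf);
	int count;

	/* As implemented above, the handler returns the number of records
	 * copied and then resets the in-kernel counter. */
	count = getsockopt(fd, IPPROTO_SCTP, 0x31337, log_buf, &len);
	if (count < 0) {
		perror("getsockopt");
		return -1;
	}

	printf("fetched %d debug records\n", count);
	return count;
}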
diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
index 50cb57f..ce42985 100644
--- a/net/sctp/sysctl.c
+++ b/net/sctp/sysctl.c
@@ -43,6 +43,7 @@
#include <net/sctp/structs.h>
#include <net/sctp/sctp.h>
+#include <net/sctp/congestion.h>
#include <linux/sysctl.h>
static int zero = 0;
@@ -58,6 +59,24 @@ extern long sysctl_sctp_mem[3];
extern int sysctl_sctp_rmem[3];
extern int sysctl_sctp_wmem[3];
This allows changing the default congestion control algorithm from user space without rebooting; a user-space sketch follows below.
+static int proc_sctp_congestion_control(ctl_table *ctl, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ char val[SCTP_CA_NAME_MAX];
+ ctl_table tbl = {
+ .data = val,
+ .maxlen = SCTP_CA_NAME_MAX,
+ };
+ int ret;
+
+ sctp_get_default_congestion_control(val);
+
+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
+ if (write && ret == 0)
+ ret = sctp_set_default_congestion_control(val);
+ return ret;
+}
+
static ctl_table sctp_table[] = {
{
.procname = "rto_initial",
@@ -251,6 +270,12 @@ static ctl_table sctp_table[] = {
.extra1 = &one,
.extra2 = &rwnd_scale_max,
},
+ {
+ .procname = "sctp_congestion_control",
+ .mode = 0644,
+ .maxlen = SCTP_CA_NAME_MAX,
+ .proc_handler = proc_sctp_congestion_control,
+ },
{ /* sentinel */ }
};
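A minimal sketch of driving this knob from user space, assuming the file appears as /proc/sys/net/sctp/sctp_congestion_control (inferred from the procname above, not confirmed by this excerpt):

/* Hypothetical user-space sketch: switch the default SCTP congestion
 * control algorithm at runtime. */
#include <stdio.h>

static int set_sctp_congestion_control(const char *name)
{
	FILE *f = fopen("/proc/sys/net/sctp/sctp_congestion_control", "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", name);
	return fclose(f);
}

int main(void)
{
	/* "cubic" only takes effect if a matching sctp_cubic plugin exists. */
	return set_sctp_congestion_control("cubic") ? 1 : 0;
}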
diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index d3ae493..f7a3263 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -55,6 +55,7 @@
#include <linux/random.h>
#include <net/sctp/sctp.h>
#include <net/sctp/sm.h>
+#include <net/sctp/congestion.h>
/* 1st Level Abstractions. */
@@ -168,6 +169,8 @@ static void sctp_transport_destroy(struct sctp_transport *transport)
if (transport->asoc)
sctp_association_put(transport->asoc);
+ sctp_cleanup_congestion_control(transport);
+
sctp_packet_free(&transport->packet);
dst_release(transport->dst);
@@ -379,188 +382,6 @@ void sctp_transport_update_rto(struct sctp_transport *tp, __u32 rtt)
tp, rtt, tp->srtt, tp->rttvar, tp->rto);
}
The two functions removed below have been moved to net/sctp/congestion.c and renamed to sctp_cong_raise_cwnd() and sctp_cong_lower_cwnd().
-/* This routine updates the transport's cwnd and partial_bytes_acked
- * parameters based on the bytes acked in the received SACK.
- */
-void sctp_transport_raise_cwnd(struct sctp_transport *transport,
- __u32 sack_ctsn, __u32 bytes_acked)
-{
- struct sctp_association *asoc = transport->asoc;
- __u32 cwnd, ssthresh, flight_size, pba, pmtu;
-
- cwnd = transport->cwnd;
- flight_size = transport->flight_size;
-
- /* See if we need to exit Fast Recovery first */
- if (asoc->fast_recovery &&
- TSN_lte(asoc->fast_recovery_exit, sack_ctsn))
- asoc->fast_recovery = 0;
-
- /* The appropriate cwnd increase algorithm is performed if, and only
- * if the cumulative TSN whould advanced and the congestion window is
- * being fully utilized.
- */
- if (TSN_lte(sack_ctsn, transport->asoc->ctsn_ack_point) ||
- (flight_size < cwnd))
- return;
-
- ssthresh = transport->ssthresh;
- pba = transport->partial_bytes_acked;
- pmtu = transport->asoc->pathmtu;
-
- if (cwnd <= ssthresh) {
- /* RFC 4960 7.2.1
- * o When cwnd is less than or equal to ssthresh, an SCTP
- * endpoint MUST use the slow-start algorithm to increase
- * cwnd only if the current congestion window is being fully
- * utilized, an incoming SACK advances the Cumulative TSN
- * Ack Point, and the data sender is not in Fast Recovery.
- * Only when these three conditions are met can the cwnd be
- * increased; otherwise, the cwnd MUST not be increased.
- * If these conditions are met, then cwnd MUST be increased
- * by, at most, the lesser of 1) the total size of the
- * previously outstanding DATA chunk(s) acknowledged, and
- * 2) the destination's path MTU. This upper bound protects
- * against the ACK-Splitting attack outlined in [SAVAGE99].
- */
- if (asoc->fast_recovery)
- return;
-
- if (bytes_acked > pmtu)
- cwnd += pmtu;
- else
- cwnd += bytes_acked;
- SCTP_DEBUG_PRINTK("%s: SLOW START: transport: %p, "
- "bytes_acked: %d, cwnd: %d, ssthresh: %d, "
- "flight_size: %d, pba: %d\n",
- __func__,
- transport, bytes_acked, cwnd,
- ssthresh, flight_size, pba);
- } else {
- /* RFC 2960 7.2.2 Whenever cwnd is greater than ssthresh,
- * upon each SACK arrival that advances the Cumulative TSN Ack
- * Point, increase partial_bytes_acked by the total number of
- * bytes of all new chunks acknowledged in that SACK including
- * chunks acknowledged by the new Cumulative TSN Ack and by
- * Gap Ack Blocks.
- *
- * When partial_bytes_acked is equal to or greater than cwnd
- * and before the arrival of the SACK the sender had cwnd or
- * more bytes of data outstanding (i.e., before arrival of the
- * SACK, flightsize was greater than or equal to cwnd),
- * increase cwnd by MTU, and reset partial_bytes_acked to
- * (partial_bytes_acked - cwnd).
- */
- pba += bytes_acked;
- if (pba >= cwnd) {
- cwnd += pmtu;
- pba = ((cwnd < pba) ? (pba - cwnd) : 0);
- }
- SCTP_DEBUG_PRINTK("%s: CONGESTION AVOIDANCE: "
- "transport: %p, bytes_acked: %d, cwnd: %d, "
- "ssthresh: %d, flight_size: %d, pba: %d\n",
- __func__,
- transport, bytes_acked, cwnd,
- ssthresh, flight_size, pba);
- }
-
- transport->cwnd = cwnd;
- transport->partial_bytes_acked = pba;
-}
-
-/* This routine is used to lower the transport's cwnd when congestion is
- * detected.
- */
-void sctp_transport_lower_cwnd(struct sctp_transport *transport,
- sctp_lower_cwnd_t reason)
-{
- struct sctp_association *asoc = transport->asoc;
-
- switch (reason) {
- case SCTP_LOWER_CWND_T3_RTX:
- /* RFC 2960 Section 7.2.3, sctpimpguide
- * When the T3-rtx timer expires on an address, SCTP should
- * perform slow start by:
- * ssthresh = max(cwnd/2, 4*MTU)
- * cwnd = 1*MTU
- * partial_bytes_acked = 0
- */
- transport->ssthresh = max(transport->cwnd/2,
- 4*asoc->pathmtu);
- transport->cwnd = asoc->pathmtu;
-
- /* T3-rtx also clears fast recovery */
- asoc->fast_recovery = 0;
- break;
-
- case SCTP_LOWER_CWND_FAST_RTX:
- /* RFC 2960 7.2.4 Adjust the ssthresh and cwnd of the
- * destination address(es) to which the missing DATA chunks
- * were last sent, according to the formula described in
- * Section 7.2.3.
- *
- * RFC 2960 7.2.3, sctpimpguide Upon detection of packet
- * losses from SACK (see Section 7.2.4), An endpoint
- * should do the following:
- * ssthresh = max(cwnd/2, 4*MTU)
- * cwnd = ssthresh
- * partial_bytes_acked = 0
- */
- if (asoc->fast_recovery)
- return;
-
- /* Mark Fast recovery */
- asoc->fast_recovery = 1;
- asoc->fast_recovery_exit = asoc->next_tsn - 1;
-
- transport->ssthresh = max(transport->cwnd/2,
- 4*asoc->pathmtu);
- transport->cwnd = transport->ssthresh;
- break;
-
- case SCTP_LOWER_CWND_ECNE:
- /* RFC 2481 Section 6.1.2.
- * If the sender receives an ECN-Echo ACK packet
- * then the sender knows that congestion was encountered in the
- * network on the path from the sender to the receiver. The
- * indication of congestion should be treated just as a
- * congestion loss in non-ECN Capable TCP. That is, the TCP
- * source halves the congestion window "cwnd" and reduces the
- * slow start threshold "ssthresh".
- * A critical condition is that TCP does not react to
- * congestion indications more than once every window of
- * data (or more loosely more than once every round-trip time).
- */
- if (time_after(jiffies, transport->last_time_ecne_reduced +
- transport->rtt)) {
- transport->ssthresh = max(transport->cwnd/2,
- 4*asoc->pathmtu);
- transport->cwnd = transport->ssthresh;
- transport->last_time_ecne_reduced = jiffies;
- }
- break;
-
- case SCTP_LOWER_CWND_INACTIVE:
- /* RFC 2960 Section 7.2.1, sctpimpguide
- * When the endpoint does not transmit data on a given
- * transport address, the cwnd of the transport address
- * should be adjusted to max(cwnd/2, 4*MTU) per RTO.
- * NOTE: Although the draft recommends that this check needs
- * to be done every RTO interval, we do it every hearbeat
- * interval.
- */
- transport->cwnd = max(transport->cwnd/2,
- 4*asoc->pathmtu);
- break;
- }
-
- transport->partial_bytes_acked = 0;
- SCTP_DEBUG_PRINTK("%s: transport: %p reason: %d cwnd: "
- "%d ssthresh: %d\n", __func__,
- transport, reason,
- transport->cwnd, transport->ssthresh);
-}
-
/* Apply Max.Burst limit to the congestion window:
* sctpimpguide-05 2.14.2
* D) When the time comes for the sender to
@@ -641,4 +462,6 @@ void sctp_transport_reset(struct sctp_transport *t)
t->cacc.cycling_changeover = 0;
t->cacc.next_tsn_at_change = 0;
t->cacc.cacc_saw_newack = 0;
+
+ mortehu_debug_add(t->asoc->base.sk, mortehu_new_cwnd, t->cwnd, 0, 0, 0);
}