From 62ab22278308a40bcb7f4079e9719ab8b7fe11b5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ilpo=20J=C3=A4rvinen?= Date: Thu, 8 May 2008 01:09:11 -0700 Subject: tcp FRTO: SACK variant is errorneously used with NewReno MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Note: there's actually another bug in FRTO's SACK variant, which is the causing failure in NewReno case because of the error that's fixed here. I'll fix the SACK case separately (it's a separate bug really, though related, but in order to fix that I need to audit tp->snd_nxt usage a bit). There were two places where SACK variant of FRTO is getting incorrectly used even if SACK wasn't negotiated by the TCP flow. This leads to incorrect setting of frto_highmark with NewReno if a previous recovery was interrupted by another RTO. An eventual fallback to conventional recovery then incorrectly considers one or couple of segments as forward transmissions though they weren't, which then are not LOST marked during fallback making them "non-retransmittable" until the next RTO. In a bad case, those segments are really lost and are the only one left in the window. Thus TCP needs another RTO to continue. The next FRTO, however, could again repeat the same events making the progress of the TCP flow extremely slow. In order for these events to occur at all, FRTO must occur again in FRTOs step 3 while the key segments must be lost as well, which is not too likely in practice. It seems to most frequently with some small devices such as network printers that *seem* to accept TCP segments only in-order. In cases were key segments weren't lost, things get automatically resolved because those wrongly marked segments don't need to be retransmitted in order to continue. I found a reproducer after digging up relevant reports (few reports in total, none at netdev or lkml I know of), some cases seemed to indicate middlebox issues which seems now to be a false assumption some people had made. Bugzilla #10063 _might_ be related. Damon L. Chesser had a reproducable case and was kind enough to tcpdump it for me. With the tcpdump log it was quite trivial to figure out. Signed-off-by: Ilpo Järvinen Signed-off-by: David S. Miller --- net/ipv4/tcp_input.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) (limited to 'net') diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 8ac15a604e0..26c936930e9 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -114,8 +114,6 @@ int sysctl_tcp_abc __read_mostly; #define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED) #define FLAG_ANY_PROGRESS (FLAG_FORWARD_PROGRESS|FLAG_SND_UNA_ADVANCED) -#define IsSackFrto() (sysctl_tcp_frto == 0x2) - #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH) #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH)) @@ -1686,6 +1684,11 @@ static inline void tcp_reset_reno_sack(struct tcp_sock *tp) tp->sacked_out = 0; } +static int tcp_is_sackfrto(const struct tcp_sock *tp) +{ + return (sysctl_tcp_frto == 0x2) && !tcp_is_reno(tp); +} + /* F-RTO can only be used if TCP has never retransmitted anything other than * head (SACK enhanced variant from Appendix B of RFC4138 is more robust here) */ @@ -1702,7 +1705,7 @@ int tcp_use_frto(struct sock *sk) if (icsk->icsk_mtup.probe_size) return 0; - if (IsSackFrto()) + if (tcp_is_sackfrto(tp)) return 1; /* Avoid expensive walking of rexmit queue if possible */ @@ -1792,7 +1795,7 @@ void tcp_enter_frto(struct sock *sk) /* Earlier loss recovery underway (see RFC4138; Appendix B). * The last condition is necessary at least in tp->frto_counter case. */ - if (IsSackFrto() && (tp->frto_counter || + if (tcp_is_sackfrto(tp) && (tp->frto_counter || ((1 << icsk->icsk_ca_state) & (TCPF_CA_Recovery|TCPF_CA_Loss))) && after(tp->high_seq, tp->snd_una)) { tp->frto_highmark = tp->high_seq; @@ -3124,7 +3127,7 @@ static int tcp_process_frto(struct sock *sk, int flag) return 1; } - if (!IsSackFrto() || tcp_is_reno(tp)) { + if (!tcp_is_sackfrto(tp)) { /* RFC4138 shortcoming in step 2; should also have case c): * ACK isn't duplicate nor advances window, e.g., opposite dir * data, winupdate -- cgit v1.2.3 From c67fa02799bccca3d2e16582493da6d57812ec01 Mon Sep 17 00:00:00 2001 From: "J.H.M. Dassen (Ray)" Date: Thu, 8 May 2008 01:11:04 -0700 Subject: net/ipv4: correct RFC 1122 section reference in comment RFC 1122 does not have a section 3.1.2.2. The requirement to silently discard datagrams with a bad checksum is in section 3.2.1.2 instead. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10611 Signed-off-by: J.H.M. Dassen (Ray) Signed-off-by: Andrew Morton Signed-off-by: David S. Miller --- net/ipv4/ip_input.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index 7b4bad6d572..ff77a4a7f9e 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -397,7 +397,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, iph = ip_hdr(skb); /* - * RFC1122: 3.1.2.2 MUST silently discard any IP frame that fails the checksum. + * RFC1122: 3.2.1.2 MUST silently discard any IP frame that fails the checksum. * * Is the datagram acceptable? * -- cgit v1.2.3 From ef75d49f116bccbb80bccd423ecf3cb86c4509a5 Mon Sep 17 00:00:00 2001 From: Patrick McHardy Date: Thu, 8 May 2008 01:15:21 -0700 Subject: netfilter: nf_conntrack_sip: restrict RTP expect flushing on error to last request Some Inovaphone PBXs exhibit very stange behaviour: when dialing for example "123", the device sends INVITE requests for "1", "12" and "123" back to back. The first requests will elicit error responses from the receiver, causing the SIP helper to flush the RTP expectations even though we might still see a positive response. Note the sequence number of the last INVITE request that contained a media description and only flush the expectations when receiving a negative response for that sequence number. Signed-off-by: Patrick McHardy Signed-off-by: David S. Miller --- net/netfilter/nf_conntrack_sip.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) (limited to 'net') diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c index 9f490006956..2f9bbc058b4 100644 --- a/net/netfilter/nf_conntrack_sip.c +++ b/net/netfilter/nf_conntrack_sip.c @@ -870,6 +870,7 @@ static int process_sdp(struct sk_buff *skb, { enum ip_conntrack_info ctinfo; struct nf_conn *ct = nf_ct_get(skb, &ctinfo); + struct nf_conn_help *help = nfct_help(ct); unsigned int matchoff, matchlen; unsigned int mediaoff, medialen; unsigned int sdpoff; @@ -959,6 +960,9 @@ static int process_sdp(struct sk_buff *skb, if (nf_nat_sdp_session && ct->status & IPS_NAT_MASK) ret = nf_nat_sdp_session(skb, dptr, sdpoff, datalen, &rtp_addr); + if (ret == NF_ACCEPT && i > 0) + help->help.ct_sip_info.invite_cseq = cseq; + return ret; } static int process_invite_response(struct sk_buff *skb, @@ -967,14 +971,14 @@ static int process_invite_response(struct sk_buff *skb, { enum ip_conntrack_info ctinfo; struct nf_conn *ct = nf_ct_get(skb, &ctinfo); + struct nf_conn_help *help = nfct_help(ct); if ((code >= 100 && code <= 199) || (code >= 200 && code <= 299)) return process_sdp(skb, dptr, datalen, cseq); - else { + else if (help->help.ct_sip_info.invite_cseq == cseq) flush_expectations(ct, true); - return NF_ACCEPT; - } + return NF_ACCEPT; } static int process_update_response(struct sk_buff *skb, @@ -983,14 +987,14 @@ static int process_update_response(struct sk_buff *skb, { enum ip_conntrack_info ctinfo; struct nf_conn *ct = nf_ct_get(skb, &ctinfo); + struct nf_conn_help *help = nfct_help(ct); if ((code >= 100 && code <= 199) || (code >= 200 && code <= 299)) return process_sdp(skb, dptr, datalen, cseq); - else { + else if (help->help.ct_sip_info.invite_cseq == cseq) flush_expectations(ct, true); - return NF_ACCEPT; - } + return NF_ACCEPT; } static int process_prack_response(struct sk_buff *skb, @@ -999,14 +1003,14 @@ static int process_prack_response(struct sk_buff *skb, { enum ip_conntrack_info ctinfo; struct nf_conn *ct = nf_ct_get(skb, &ctinfo); + struct nf_conn_help *help = nfct_help(ct); if ((code >= 100 && code <= 199) || (code >= 200 && code <= 299)) return process_sdp(skb, dptr, datalen, cseq); - else { + else if (help->help.ct_sip_info.invite_cseq == cseq) flush_expectations(ct, true); - return NF_ACCEPT; - } + return NF_ACCEPT; } static int process_bye_request(struct sk_buff *skb, -- cgit v1.2.3 From f3261aff35cbc811fee0e23eaea277f1b7286eca Mon Sep 17 00:00:00 2001 From: Patrick McHardy Date: Thu, 8 May 2008 01:16:04 -0700 Subject: netfilter: Kconfig: default DCCP/SCTP conntrack support to the protocol config values When conntrack and DCCP/SCTP protocols are enabled, chances are good that people also want DCCP/SCTP conntrack and NAT support. Signed-off-by: Patrick McHardy Signed-off-by: David S. Miller --- net/netfilter/Kconfig | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'net') diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index c1fc0f1a641..aa8d80c35e2 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -90,6 +90,7 @@ config NF_CT_PROTO_DCCP tristate 'DCCP protocol connection tracking support (EXPERIMENTAL)' depends on EXPERIMENTAL && NF_CONNTRACK depends on NETFILTER_ADVANCED + default IP_DCCP help With this option enabled, the layer 3 independent connection tracking code will be able to do state tracking on DCCP connections. @@ -104,6 +105,7 @@ config NF_CT_PROTO_SCTP tristate 'SCTP protocol connection tracking support (EXPERIMENTAL)' depends on EXPERIMENTAL && NF_CONNTRACK depends on NETFILTER_ADVANCED + default IP_SCTP help With this option enabled, the layer 3 independent connection tracking code will be able to do state tracking on SCTP connections. @@ -532,6 +534,7 @@ config NETFILTER_XT_MATCH_DCCP tristate '"dccp" protocol match support' depends on NETFILTER_XTABLES depends on NETFILTER_ADVANCED + default IP_DCCP help With this option enabled, you will be able to use the iptables `dccp' match in order to match on DCCP source/destination ports @@ -725,6 +728,7 @@ config NETFILTER_XT_MATCH_SCTP tristate '"sctp" protocol match support (EXPERIMENTAL)' depends on NETFILTER_XTABLES && EXPERIMENTAL depends on NETFILTER_ADVANCED + default IP_SCTP help With this option enabled, you will be able to use the `sctp' match in order to match on SCTP source/destination ports -- cgit v1.2.3 From aca51397d01474f80cab8fc978559b45f2e453ad Mon Sep 17 00:00:00 2001 From: Pavel Emelyanov Date: Thu, 8 May 2008 01:24:25 -0700 Subject: netns: Fix arbitrary net_device-s corruptions on net_ns stop. When a net namespace is destroyed, some devices (those, not killed on ns stop explicitly) are moved back to init_net. The problem, is that this net_ns change has one point of failure - the __dev_alloc_name() may be called if a name collision occurs (and this is easy to trigger). This allocator performs a likely-to-fail GFP_ATOMIC allocation to find a suitable number. Other possible conditions that may cause error (for device being ns local or not registered) are always false in this case. So, when this call fails, the device is unregistered. But this is *not* the right thing to do, since after this the device may be released (and kfree-ed) improperly. E. g. bridges require more actions (sysfs update, timer disarming, etc.), some other devices want to remove their private areas from lists, etc. I. e. arbitrary use-after-free cases may occur. The proposed fix is the following: since the only reason for the dev_change_net_namespace to fail is the name generation, we may give it a unique fall-back name w/o %d-s in it - the dev one, since ifindexes are still unique. So make this change, raise the failure-case printk loglevel to EMERG and replace the unregister_netdevice call with BUG(). [ Use snprintf() -DaveM ] Signed-off-by: Pavel Emelyanov Signed-off-by: David S. Miller --- net/core/dev.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'net') diff --git a/net/core/dev.c b/net/core/dev.c index d334446a8ea..4addaf0df96 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4480,17 +4480,19 @@ static void __net_exit default_device_exit(struct net *net) rtnl_lock(); for_each_netdev_safe(net, dev, next) { int err; + char fb_name[IFNAMSIZ]; /* Ignore unmoveable devices (i.e. loopback) */ if (dev->features & NETIF_F_NETNS_LOCAL) continue; /* Push remaing network devices to init_net */ - err = dev_change_net_namespace(dev, &init_net, "dev%d"); + snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex); + err = dev_change_net_namespace(dev, &init_net, fb_name); if (err) { - printk(KERN_WARNING "%s: failed to move %s to init_net: %d\n", + printk(KERN_EMERG "%s: failed to move %s to init_net: %d\n", __func__, dev->name, err); - unregister_netdevice(dev); + BUG(); } } rtnl_unlock(); -- cgit v1.2.3 From c2ab7ac225e29006b7117d6a9fe8f3be8d98b0c2 Mon Sep 17 00:00:00 2001 From: Oliver Hartkopp Date: Thu, 8 May 2008 02:49:55 -0700 Subject: can: Fix can_send() handling on dev_queue_xmit() failures The tx packet counting and the local loopback of CAN frames should only happen in the case that the CAN frame has been enqueued to the netdevice tx queue successfully. Thanks to Andre Naujoks for reporting this issue. Signed-off-by: Oliver Hartkopp Signed-off-by: Urs Thuermann Signed-off-by: David S. Miller --- net/can/af_can.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) (limited to 'net') diff --git a/net/can/af_can.c b/net/can/af_can.c index 2759b76f731..7e8ca283645 100644 --- a/net/can/af_can.c +++ b/net/can/af_can.c @@ -208,6 +208,7 @@ static int can_create(struct net *net, struct socket *sock, int protocol) */ int can_send(struct sk_buff *skb, int loop) { + struct sk_buff *newskb = NULL; int err; if (skb->dev->type != ARPHRD_CAN) { @@ -244,8 +245,7 @@ int can_send(struct sk_buff *skb, int loop) * If the interface is not capable to do loopback * itself, we do it here. */ - struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC); - + newskb = skb_clone(skb, GFP_ATOMIC); if (!newskb) { kfree_skb(skb); return -ENOMEM; @@ -254,7 +254,6 @@ int can_send(struct sk_buff *skb, int loop) newskb->sk = skb->sk; newskb->ip_summed = CHECKSUM_UNNECESSARY; newskb->pkt_type = PACKET_BROADCAST; - netif_rx(newskb); } } else { /* indication for the CAN driver: no loopback required */ @@ -266,11 +265,20 @@ int can_send(struct sk_buff *skb, int loop) if (err > 0) err = net_xmit_errno(err); + if (err) { + if (newskb) + kfree_skb(newskb); + return err; + } + + if (newskb) + netif_rx(newskb); + /* update statistics */ can_stats.tx_frames++; can_stats.tx_frames_delta++; - return err; + return 0; } EXPORT_SYMBOL(can_send); -- cgit v1.2.3 From e46b66bc42b6b1430b04cc5c207ecb2b2f4553dc Mon Sep 17 00:00:00 2001 From: Ben Hutchings Date: Thu, 8 May 2008 02:53:17 -0700 Subject: net: Added ASSERT_RTNL() to dev_open() and dev_close(). dev_open() and dev_close() must be called holding the RTNL, since they call device functions and netdevice notifiers that are promised the RTNL. Signed-off-by: Ben Hutchings Signed-off-by: David S. Miller --- net/core/dev.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'net') diff --git a/net/core/dev.c b/net/core/dev.c index 4addaf0df96..a1607bc0cd4 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -994,6 +994,8 @@ int dev_open(struct net_device *dev) { int ret = 0; + ASSERT_RTNL(); + /* * Is it already up? */ @@ -1060,6 +1062,8 @@ int dev_open(struct net_device *dev) */ int dev_close(struct net_device *dev) { + ASSERT_RTNL(); + might_sleep(); if (!(dev->flags & IFF_UP)) -- cgit v1.2.3