path: root/net/ipv4
AgeCommit message (Collapse)Author
2008-06-27net/inet_lro: remove setting skb->ip_summed when not LRO-ableEli Cohen
When an SKB cannot be chained to a session, the current code attempts to "restore" its ip_summed field from lro_mgr->ip_summed. However, lro_mgr->ip_summed does not hold the original value; in fact, we'd better not touch skb->ip_summed since it is not modified by the code in the path leading to a failure to chain it. Also use a clearer comment to describe the ip_summed field of struct net_lro_mgr. Issue raised by Or Gerlitz <ogerlitz@voltaire.com> Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27inet fragments: fix race between inet_frag_find and inet_frag_secret_rebuildPavel Emelyanov
The problem is that while we work w/o the inet_frags.lock even read-locked the secret rebuild timer may occur (on another CPU, since BHs are still disabled in the inet_frag_find) and change the rnd seed for ipv4/6 fragments. It was caused by my patch fd9e63544cac30a34c951f0ec958038f0529e244 ([INET]: Omit double hash calculations in xxx_frag_intern) late in the 2.6.24 kernel, so this should probably be queued to -stable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27tcp: /proc/net/tcp rto,ato values not scaled properly (v2)Stephen Hemminger
I found another case where we are sending information to userspace in the wrong HZ scale. This should have been fixed back in 2.5 :-( This means an ABI change but as it stands there is no way for an application like ss to get the right value. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
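To illustrate the scaling problem described above, the following sketch converts a timer value from kernel ticks (HZ per second) into the userspace tick unit (USER_HZ per second). It is a simplified stand-in and not the kernel's own conversion routine; the name ticks_to_user_hz and the fixed rate of 100 are assumptions made for the example, and the message does not say which helper the patch actually uses.

```c
#include <stdint.h>

/* Assumed userspace tick rate for this sketch; 100 is the conventional value. */
#define SKETCH_USER_HZ 100ULL

/*
 * Hypothetical helper: rescale a duration counted in kernel ticks
 * (kernel_hz ticks per second) to userspace ticks.
 * Overflow for very large tick counts is ignored in this sketch.
 */
static uint64_t ticks_to_user_hz(uint64_t ticks, uint64_t kernel_hz)
{
    return ticks * SKETCH_USER_HZ / kernel_hz;
}
```

A 200 ms timeout is 200 ticks at HZ=1000 but only 50 ticks at HZ=250. Exported unscaled, the two kernels would report different numbers for the same timeout; after rescaling, both come out as 20.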
2008-06-27tcp: calculate tcp_mem based on low memory instead of all memoryMiquel van Smoorenburg
The tcp_mem array, which contains limits on the total amount of memory used by TCP sockets, is calculated based on nr_all_pages. On a 32-bit x86 system, we should base this on the number of lowmem pages instead. Signed-off-by: Miquel van Smoorenburg <miquels@cistron.nl> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-17xfrm: fix fragmentation for ipv4 xfrm tunnelSteffen Klassert
When generating the ip header for the transformed packet we just copy the frag_off field of the ip header from the original packet to the ip header of the new generated packet. If we receive a packet as a chain of fragments, all but the last of the new generated packets have the IP_MF flag set. We have to mask the frag_off field to only keep the IP_DF flag from the original packet. This got lost with git commit 36cf9acf93e8561d9faec24849e57688a81eb9c5 ("[IPSEC]: Separate inner/outer mode processing on output") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
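The masking step described above can be sketched as follows. This is an illustration under stated assumptions, not the patched kernel code: the constants SK_IP_DF, SK_IP_MF and SK_IP_OFFSET mirror the standard IPv4 flag layout but are named locally, and outer_frag_off is a hypothetical helper.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* IPv4 flags/offset layout of the 16-bit frag_off field (host order). */
#define SK_IP_DF     0x4000u  /* don't fragment */
#define SK_IP_MF     0x2000u  /* more fragments */
#define SK_IP_OFFSET 0x1FFFu  /* fragment offset, in 8-byte units */

/*
 * Hypothetical helper: derive the outer header's frag_off from the inner
 * header's frag_off (both in network byte order). Only the DF bit is
 * carried over; MF and the fragment offset are dropped.
 */
static uint16_t outer_frag_off(uint16_t inner_frag_off)
{
    return inner_frag_off & htons(SK_IP_DF);
}
```

With a plain copy, an inner fragment that has MF set would produce an outer packet that also claims more fragments follow. With the mask, the outer header keeps DF at most.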
2008-06-17netfilter: nf_nat: fix RCU racesPatrick McHardy
Fix three ct_extend/NAT extension related races: - When cleaning up the extension area and removing it from the bysource hash, the nat->ct pointer must not be set to NULL since it may still be used in a RCU read side - When replacing a NAT extension area in the bysource hash, the nat->ct pointer must be assigned before performing the replacement - When reallocating extension storage in ct_extend, the old memory must not be freed immediately since it may still be used by a RCU read side Possibly fixes https://bugzilla.redhat.com/show_bug.cgi?id=449315 and/or http://bugzilla.kernel.org/show_bug.cgi?id=10875 Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-16ipv4: Remove unused definitions in net/ipv4/tcp_ipv4.c.Rami Rosen
1) Remove ICMP_MIN_LENGTH, as it is unused. 2) Remove unneeded tcp_v4_send_check() declaration. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-16raw: Restore /proc/net/raw correct behaviorEric Dumazet
I just noticed "cat /proc/net/raw" was buggy, missing '\n' separators. I believe this was introduced by commit 8cd850efa4948d57a2ed836911cfd1ab299e89c6 ([RAW]: Cleanup IPv4 raw_seq_show.) This trivial patch restores correct behavior, and applies to current Linus tree (should also be applied to stable tree as well.) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-16tcp: Revert reset of deferred accept changes in 2.6.26David S. Miller
Ingo's system is still seeing strange behavior, and he reports that it goes away if the rest of the deferred accept changes are reverted too. Therefore this reverts e4c78840284f3f51b1896cf3936d60a6033c4d2c ("[TCP]: TCP_DEFER_ACCEPT updates - dont retxmt synack") and 539fae89bebd16ebeafd57a87169bc56eb530d76 ("[TCP]: TCP_DEFER_ACCEPT updates - defer timeout conflicts with max_thresh"). Just like the other revert, these ideas can be revisited for 2.6.27 Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: tcp: Revert 'process defer accept as established' changes. ipv6: Fix duplicate initialization of rawv6_prot.destroy bnx2x: Updating the Maintainer net: Eliminate flush_scheduled_work() calls while RTNL is held. drivers/net/r6040.c: correct bad use of round_jiffies() fec_mpc52xx: MPC52xx_MESSAGES_DEFAULT: 2nd NETIF_MSG_IFDOWN => IFUP ipg: fix receivemode IPG_RM_RECEIVEMULTICAST{,HASH} in ipg_nic_set_multicast_list() netfilter: nf_conntrack: fix ctnetlink related crash in nf_nat_setup_info() netfilter: Make nflog quiet when no one listen in userspace. ipv6: Fail with appropriate error code when setting not-applicable sockopt. ipv6: Check IPV6_MULTICAST_LOOP option value. ipv6: Check the hop limit setting in ancillary data. ipv6 route: Fix route lifetime in netlink message. ipv6 mcast: Check address family of gf_group in getsockopt(MS_FILTER). dccp: Bug in initial acknowledgment number assignment dccp ccid-3: X truncated due to type conversion dccp ccid-3: TFRC reverse-lookup Bug-Fix dccp ccid-2: Bug-Fix - Ack Vectors need to be ignored on request sockets dccp: Fix sparse warnings dccp ccid-3: Bug-Fix - Zero RTT is possible
2008-06-12tcp: Revert 'process defer accept as established' changes.David S. Miller
This reverts two changesets, ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and the follow-on bug fix 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38 ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").

This change causes several problems, first reported by Ingo Molnar as a distcc-over-loopback regression where connections were getting stuck.

Ilpo Järvinen first spotted the locking problems. The new function added by this code, tcp_defer_accept_check(), only has the child socket locked, yet it is modifying state of the parent listening socket. Fixing that is non-trivial at best, because we can't simply just grab the parent listening socket lock at this point, because it would create an ABBA deadlock. The normal ordering is parent listening socket --> child socket, but this code path would require the reverse lock ordering.

Next is a problem noticed by Vitaliy Gusev, he noted:

----------------------------------------
>--- a/net/ipv4/tcp_timer.c
>+++ b/net/ipv4/tcp_timer.c
>@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
>                 goto death;
>         }
>
>+        if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
>+                tcp_send_active_reset(sk, GFP_ATOMIC);
>+                goto death;

Here socket sk is not attached to listening socket's request queue. tcp_done() will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should release this sk) as socket is not DEAD. Therefore socket sk will be lost for freeing.
----------------------------------------

Finally, Alexey Kuznetsov argues that there might not even be any real value or advantage to these new semantics even if we fix all of the bugs:

----------------------------------------
Hiding from accept() sockets with only out-of-order data only is the only thing which is impossible with old approach. Is this really so valuable? My opinion: no, this is nothing but a new loophole to consume memory without control.
----------------------------------------

So revert this thing for now.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (42 commits) net: Fix routing tables with id > 255 for legacy software sky2: Hold RTNL while calling dev_close() s2io iomem annotations atl1: fix suspend regression qeth: start dev queue after tx drop error qeth: Prepare-function to call s390dbf was wrong qeth: reduce number of kernel messages qeth: Use ccw_device_get_id(). qeth: layer 3 Oops in ip event handler virtio: use callback on empty in virtio_net virtio: virtio_net free transmit skbs in a timer virtio: Fix typo in virtio_net_hdr comments virtio_net: Fix skb->csum_start computation ehea: set mac address fix sfc: Recover from RX queue flush failure add missing lance_* exports ixgbe: fix typo forcedeth: msi interrupts ipsec: pfkey should ignore events when no listeners pppoe: Unshare skb before anything else ...
2008-06-10net: Fix routing tables with id > 255 for legacy softwareKrzysztof Piotr Oledzki
Most legacy software does not like tables > 255, as rtm_table is u8, so tb_id is sent &0xff and it is possible to mismatch, for example, table 510 with table 254 (main). This patch introduces RT_TABLE_COMPAT=252 so the code uses it if tb_id > 255. It makes such old applications happy; new ones are still able to use RTA_TABLE to get a proper table id. Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
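The truncation described above is easy to reproduce outside the kernel. The sketch below contrasts the old behaviour (table id masked to 8 bits) with the compatibility value. RT_TABLE_COMPAT=252 and the 510-versus-254 example come from the message; the helper names rtm_table_legacy and rtm_table_compat are hypothetical, and the exact shape of the kernel change is assumed.

```c
#include <stdint.h>

#define SK_RT_TABLE_COMPAT 252u  /* value quoted in the commit message */
#define SK_RT_TABLE_MAIN   254u  /* the "main" table */

/* Old behaviour: the 32-bit table id is squeezed into the u8 rtm_table. */
static uint8_t rtm_table_legacy(uint32_t tb_id)
{
    return (uint8_t)(tb_id & 0xff);
}

/* Assumed new behaviour: ids that do not fit are reported as COMPAT. */
static uint8_t rtm_table_compat(uint32_t tb_id)
{
    return tb_id < 256 ? (uint8_t)tb_id : (uint8_t)SK_RT_TABLE_COMPAT;
}
```

Table 510 is 0x1FE, so masking it yields 0xFE, which is 254 and thus indistinguishable from the main table. With the compatibility value, a legacy reader sees 252 instead, while a newer reader can take the full id from RTA_TABLE.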
2008-06-10inet{6}_request_sock: Init ->opt and ->pktopts in the constructorArnaldo Carvalho de Melo
Wei Yongjun noticed that we may call reqsk_free on request sock objects where the opt fields may not be initialized, fix it by introducing inet_reqsk_alloc where we initialize ->opt to NULL and set ->pktopts to NULL in inet6_reqsk_alloc. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-05asn1: additional sanity checking during BER decodingChris Wright
- Don't trust a length which is greater than the working buffer. An invalid length could cause overflow when calculating buffer size for decoding oid. - An oid length of zero is invalid and allows for an off-by-one error when decoding oid because the first subid actually encodes first 2 subids. - A primitive encoding may not have an indefinite length. Thanks to Wei Wang from McAfee for report. Cc: Steven French <sfrench@us.ibm.com> Cc: stable@kernel.org Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
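Two of the checks listed above (a length larger than the working buffer, and an indefinite length on a primitive encoding) can be sketched in a standalone BER length reader. This is an illustrative sketch and not the kernel's decoder; ber_primitive_length is a hypothetical helper.

```c
#include <stddef.h>

/*
 * Hypothetical helper: parse the length octets of a primitive BER element.
 * p points at the first length octet and avail is the number of bytes left
 * in the working buffer, starting at p.
 * Returns the number of length octets consumed, or -1 if the encoding is
 * rejected. On success *len holds the content length.
 */
static int ber_primitive_length(const unsigned char *p, size_t avail, size_t *len)
{
    size_t used, value = 0;

    if (avail == 0)
        return -1;
    if (p[0] == 0x80)
        return -1;                 /* indefinite length: invalid for primitive */

    if (p[0] < 0x80) {             /* short form: the length is in this octet */
        value = p[0];
        used = 1;
    } else {                       /* long form: low 7 bits = number of octets */
        size_t n = p[0] & 0x7f;
        size_t i;

        if (n > sizeof(size_t) || n >= avail)
            return -1;             /* would overflow or run past the buffer */
        for (i = 1; i <= n; i++)
            value = (value << 8) | p[i];
        used = 1 + n;
    }

    if (value > avail - used)
        return -1;                 /* claimed length exceeds working buffer */

    *len = value;
    return (int)used;
}
```

A caller decoding an OID would additionally reject a content length of zero, as the third point in the message requires.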
2008-06-04tcp: Fix for race due to temporary drop of the socket lock in skb_splice_bits.Octavian Purdila
skb_splice_bits temporarily drops the socket lock while iterating over the socket queue in order to break a reverse locking condition which happens with sendfile. This, however, opens a window of opportunity for tcp_collapse() to aggregate skbs and thus potentially free the current skb used in skb_splice_bits and tcp_read_sock. This patch fixes the problem by (re-)getting the same "logical skb" after the lock has been temporarily dropped. Based on idea and initial patch from Evgeniy Polyakov. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Acked-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-04tcp: Increment OUTRSTS in tcp_send_active_reset()Sridhar Samudrala
TCP "resets sent" counter is not incremented when a TCP Reset is sent via tcp_send_active_reset(). Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-04raw: Raw socket leak.Denis V. Lunev
The program below just leaks the raw kernel socket:

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
        int fd = socket(PF_INET, SOCK_RAW, IPPROTO_UDP);
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        inet_aton("127.0.0.1", &addr.sin_addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(2048);
        /* MSG_MORE corks the packet; the program exits without uncorking it */
        sendto(fd, "a", 1, MSG_MORE, (struct sockaddr *)&addr, sizeof(addr));
        return 0;
}

A corked packet is allocated via sock_wmalloc, which holds the owner socket, so one should uncork it and flush all pending data on close. Do this in the same way as in UDP. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-04Merge branch 'net-2.6-misc-20080605a' of git://git.linux-ipv6.org/gitroot/yoshfuji/linux-2.6-fixDavid S. Miller
2008-06-04tcp: fix skb vs fack_count out-of-sync conditionIlpo Järvinen
This bug is able to corrupt fackets_out in very rare cases. In order for this to cause corruption: 1) DSACK in the middle of previous SACK block must be generated. 2) In order to take that particular branch, part or all of the DSACKed segment must already be SACKed so that we have that in cache in the first place. 3) The new info must be top enough so that fackets_out will be updated on this iteration. ...then fack_count is updated while skb wasn't, then we walk again that particular segment thus updating fack_count twice for a single skb and finally that value is assigned to fackets_out by tcp_sacktag_one. It is safe to call tcp_sacktag_one just once for a segment (at DSACK), no need to call again for plain SACK. Potential problems from the miscount are limited to premature entry to recovery and to an inflated reordering metric (which could even cancel each other out in the luckiest scenarios :-)). Both are quite insignificant in the worst case too, and there also exists code to reset them (fackets_out once sacked_out becomes zero and reordering metric on RTO). This has been reported by a number of people; because it occurred quite rarely, it has been very evasive. Andy Furniss was able to get it to occur a couple of times so that a bit more info was collected about the problem using a debug patch, though it still required a lot of checking around. Thanks also to others who have tried to help here. This is listed as Bugzilla #10346. The bug was introduced by me in commit 68f8353b48 ([TCP]: Rewrite SACK block processing & sack_recv_cache use), I probably thought back then that there's need to scan that entry twice or didn't dare to make it go through it just once there. Going through twice would have required restoring fack_count after the walk but as noted above, I chose to drop the additional walk step altogether here. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-05[IPV6]: inet_sk(sk)->cork.opt leakDenis V. Lunev
IPv6 UDP sockets with an IPv4-mapped address actually use udp_sendmsg to send the data. In this case ip_flush_pending_frames should be called instead of ip6_flush_pending_frames. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-06-05[IPV4] TUNNEL4: Fix incoming packet length check for inter-protocol tunnel.YOSHIFUJI Hideaki
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-06-04tcp: Fix inconsistency source (CA_Open only when !tcp_left_out(tp))Ilpo Järvinen
It is possible that this skip path causes TCP to end up in an invalid state where ca_state was left to CA_Open while some segments already came into sacked_out. If next valid ACK doesn't contain new SACK information TCP fails to enter into tcp_fastretrans_alert(). Thus at least high_seq is set incorrectly to a too high seqno because some new data segments could be sent in between (and also, limited transmit is not being correctly invoked there). Reordering in both directions can easily cause this situation to occur. I guess we would want to use tcp_moderate_cwnd(tp) there as well as it may be possible to use this to trigger oversized burst to network by sending an old ACK with huge amount of SACK info, but I'm a bit unsure about its effects (mainly to FlightSize), so to be on the safe side I just currently fixed it minimally to keep TCP's state consistent (obviously, such nasty ACKs have been possible this far). Though it seems that FlightSize is already underestimated by some amount, so probably on the long term we might want to trigger recovery there too, if appropriate, to make FlightSize calculation to resemble reality at the time when the losses were discovered (but such change scares me too much now and requires some more thinking anyway how to do that as it likely involves some code shuffling). This bug was found by Brian Vowell while running my TCP debug patch to find cause of another TCP issue (fackets_out miscount). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-03route: Remove unused ifa_anycast fieldThomas Graf
The field was supposed to allow the creation of an anycast route by assigning an anycast address to an address prefix. It was never implemented so this field is unused and serves no purpose. Remove it. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-03route: Mark unused routing attributes as suchThomas Graf
Also removes an unused policy entry for an attribute which is only used in kernel->user direction. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-03route: Mark unused route cache flags as such.Thomas Graf
Also removes an obsolete check for the unused flag RTCF_MASQ. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-21net: The world is not perfect patch.Rami Rosen
Unless there is any objection, I suggest considering the following patch, which simply removes the code for -DI_WISH_WORLD_WERE_PERFECT in the three methods which use it. The compilation errors we get when using -DI_WISH_WORLD_WERE_PERFECT show that this code has not been built or used for a really long time. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-21net/ipv4/arp.c: Use common hex_asc helpersDenis Cheng
Here the local hexbuf is a duplicate of the global const char hex_asc from lib/hexdump.c, except for the case of the hex letters: const char hexbuf[] = "0123456789ABCDEF"; const char hex_asc[] = "0123456789abcdef"; and since it is used here to print HW addresses, the case is not significant. Thanks to Harvey Harrison for introducing the hex_asc_hi/hex_asc_lo helpers. Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
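As a userspace illustration of the helpers named above, the sketch below formats a hardware address using a lowercase digit table. The table and the two macros are local re-creations written for this example (the kernel's definitions may differ in detail), and format_hwaddr is a hypothetical function.

```c
#include <stddef.h>

/* Local stand-ins for the kernel's hex_asc table and nibble helpers. */
static const char hex_asc[] = "0123456789abcdef";
#define hex_asc_lo(x) hex_asc[((x) & 0x0f)]
#define hex_asc_hi(x) hex_asc[((x) & 0xf0) >> 4]

/*
 * Hypothetical helper: write addr as colon-separated hex into out.
 * out needs room for 3 * len bytes when len >= 1 (two digits per byte,
 * len - 1 colons, one NUL), or one byte when len is 0.
 */
static void format_hwaddr(const unsigned char *addr, size_t len, char *out)
{
    size_t i;

    for (i = 0; i < len; i++) {
        *out++ = hex_asc_hi(addr[i]);
        *out++ = hex_asc_lo(addr[i]);
        if (i + 1 < len)
            *out++ = ':';
    }
    *out = '\0';
}
```

The only visible difference from an uppercase table is that the address 00:1A:2B is printed as 00:1a:2b.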
2008-05-21tcp: TCP connection times out if ICMP frag needed is delayedSridhar Samudrala
We are seeing an issue with TCP in handling an ICMP frag needed message that is received after net.ipv4.tcp_retries1 retransmits. The default value of retries1 is 3. So if the path mtu changes and ICMP frag needed is lost for the first 3 retransmits or if it gets delayed until 3 retransmits are done, TCP doesn't update MSS correctly and continues to retransmit the original message until it times out after tcp_retries2 retransmits. I am seeing this issue even with the latest 2.6.25.4 kernel. In tcp_retransmit_timer(), when retransmits counter exceeds tcp_retries1 value, the dst cache entry of the socket is reset. At this time, if we receive an ICMP frag needed message, the dst entry gets updated with the new MTU, but the TCP socket's dst_cache entry remains NULL. So the next time when we try to retransmit after the ICMP frag needed is received, tcp_retransmit_skb() gets called. Here the cur_mss value is calculated at the start of the routine with a NULL sk_dst_cache. Instead we should call tcp_current_mss after the rebuild_header that caches the dst entry with the updated mtu. Also the rebuild_header should be called before tcp_fragment so that skb is fragmented if the mss goes down. Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-20ipsec: Use the correct ip_local_out functionHerbert Xu
Because the IPsec output function xfrm_output_resume does its own dst_output call it should always call __ip_local_output instead of ip_local_output as the latter may invoke dst_output directly. Otherwise the return values from nf_hook and dst_output may clash as they both use the value 1 but for different purposes. When that clash occurs this can cause a packet to be used after it has been freed which usually leads to a crash. Because the offending value is only returned from dst_output with qdiscs such as HTB, this bug is normally not visible. Thanks to Marco Berizzi for his perseverance in tracking this down. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (73 commits) net: Fix typo in net/core/sock.c. ppp: Do not free not yet unregistered net device. netfilter: xt_iprange: module aliases for xt_iprange netfilter: ctnetlink: dump conntrack ID in event messages irda: Fix a misalign access issue. (v2) sctp: Fix use of uninitialized pointer cipso: Relax too much careful cipso hash function. tcp FRTO: work-around inorder receivers tcp FRTO: Fix fallback to conventional recovery New maintainer for Intel ethernet adapters DM9000: Use delayed work to update MII PHY state DM9000: Update and fix driver debugging messages DM9000: Add __devinit and __devexit attributes to probe and remove sky2: fix simple define thinko [netdrvr] sfc: sfc: Add self-test support [netdrvr] sfc: Increment rx_reset when reported as driver event [netdrvr] sfc: Remove unused macro EFX_XAUI_RETRAIN_MAX [netdrvr] sfc: Fix code formatting [netdrvr] sfc: Remove kernel-doc comments for removed members of struct efx_nic [netdrvr] sfc: Remove garbage from comment ...
2008-05-13cipso: Relax too much careful cipso hash function.Pavel Emelyanov
The cipso_v4_cache is allocated to contain CIPSO_V4_CACHE_BUCKETS buckets. The CIPSO_V4_CACHE_BUCKETS = 1 << CIPSO_V4_CACHE_BUCKETBITS, where CIPSO_V4_CACHE_BUCKETBITS = 7. The bucket-selection function for this hash is calculated like this: bkt = hash & (CIPSO_V4_CACHE_BUCKETBITS - 1); ^^^ i.e. picking only 4 buckets of possible 128 :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Paul Moore <paul.moore@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
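The effect of the wrong mask can be checked with a few lines of standalone C. The constants are the ones quoted in the message; bucket_buggy, bucket_fixed and count_buckets are hypothetical names, and the corrected mask (buckets - 1) is the assumed shape of the fix.

```c
#include <stdint.h>

#define CIPSO_V4_CACHE_BUCKETBITS 7
#define CIPSO_V4_CACHE_BUCKETS    (1 << CIPSO_V4_CACHE_BUCKETBITS)

/* Selection as quoted in the message: mask is 7 - 1 = 6 (binary 110). */
static uint32_t bucket_buggy(uint32_t hash)
{
    return hash & (CIPSO_V4_CACHE_BUCKETBITS - 1);
}

/* Assumed correction: mask is 128 - 1 = 127, covering every bucket. */
static uint32_t bucket_fixed(uint32_t hash)
{
    return hash & (CIPSO_V4_CACHE_BUCKETS - 1);
}

/* Count how many distinct buckets are hit by the hashes 0 .. limit - 1. */
static unsigned count_buckets(uint32_t (*pick)(uint32_t), uint32_t limit)
{
    unsigned char seen[CIPSO_V4_CACHE_BUCKETS] = { 0 };
    unsigned n = 0;
    uint32_t h;

    for (h = 0; h < limit; h++) {
        uint32_t b = pick(h);

        if (!seen[b]) {
            seen[b] = 1;
            n++;
        }
    }
    return n;
}
```

With the mask 6, only bits 1 and 2 of the hash survive, so the only reachable buckets are 0, 2, 4 and 6. That matches the "4 buckets of possible 128" remark in the message.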
2008-05-13tcp FRTO: work-around inorder receiversIlpo Järvinen
If receiver consumes segments successfully only in-order, FRTO fallback to conventional recovery produces RTO loop because FRTO's forward transmissions will always get dropped and need to be resent, yet by default they're not marked as lost (which are the only segments we will retransmit in CA_Loss). The price to pay for this is occasionally unnecessarily retransmitting the forward transmission(s). SACK blocks help a bit to avoid this, so it's mainly a concern for NewReno case though SACK is not fully immune either. This change has a side-effect of fixing SACKFRTO problem where it didn't have snd_nxt of the RTO time available anymore when fallback became necessary (this problem would have only occurred when RTO would occur for two or more segments and ECE arrives in step 3; no need to figure out how to fix that unless the TODO item of selective behavior is considered in future). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Damon L. Chesser <damon@damtek.com> Tested-by: Damon L. Chesser <damon@damtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-13tcp FRTO: Fix fallback to conventional recoveryIlpo Järvinen
It seems that commit 009a2e3e4ec ("[TCP] FRTO: Improve interoperability with other undo_marker users") ran into another land-mine which caused fallback to conventional recovery to break: 1. Cumulative ACK arrives after FRTO retransmission 2. tcp_try_to_open sees zero retrans_out, clears retrans_stamp which should be kept like in CA_Loss state it would be 3. undo_marker change allowed tcp_packet_delayed to return true because of the cleared retrans_stamp once FRTO is terminated causing LossUndo to occur, which means all loss markings FRTO made are reverted. This means that the conventional recovery basically recovered one loss per RTT, which is not that efficient. It was quite unobvious that the undo_marker change broke something like this; I had quite a long session to track it down because of the non-intuitiveness of the bug (luckily I had a trivial reproducer at hand and I was also able to learn to use kprobes in the process as well :-)). Together with the NewReno+FRTO fix and the FRTO in-order workaround, this fixes Damon's problems; this and the first-mentioned fix are enough to fix Bugzilla #10063. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Damon L. Chesser <damon@damtek.com> Tested-by: Damon L. Chesser <damon@damtek.com> Tested-by: Sebastian Hyrwall <zibbe@cisko.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-12net: Allow netdevices to specify needed head/tailroomJohannes Berg
This patch adds needed_headroom/needed_tailroom members to struct net_device and updates many places that allocate skbs to use them. Not all of them can be converted though, and I'm sure I missed some (I mostly grepped for LL_RESERVED_SPACE) Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (32 commits) net: Added ASSERT_RTNL() to dev_open() and dev_close(). can: Fix can_send() handling on dev_queue_xmit() failures netns: Fix arbitrary net_device-s corruptions on net_ns stop. netfilter: Kconfig: default DCCP/SCTP conntrack support to the protocol config values netfilter: nf_conntrack_sip: restrict RTP expect flushing on error to last request macvlan: Fix memleak on device removal/crash on module removal net/ipv4: correct RFC 1122 section reference in comment tcp FRTO: SACK variant is errorneously used with NewReno e1000e: don't return half-read eeprom on error ucc_geth: Don't use RX clock as TX clock. cxgb3: Use CAP_SYS_RAWIO for firmware pcnet32: delete non NAPI code from driver. fs_enet: Fix a memory leak in fs_enet_mdio_probe [netdrvr] eexpress: IPv6 fails - multicast problems 3c59x: use netstats in net_device structure 3c980-TX needs EXTRA_PREAMBLE fix warning in drivers/net/appletalk/cops.c e1000e: Add support for BM PHYs on ICH9 uli526x: fix endianness issues in the setup frame uli526x: initialize the hardware prior to requesting interrupts ...
2008-05-08net/ipv4: correct RFC 1122 section reference in commentJ.H.M. Dassen (Ray)
RFC 1122 does not have a section 3.1.2.2. The requirement to silently discard datagrams with a bad checksum is in section 3.2.1.2 instead. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10611 Signed-off-by: J.H.M. Dassen (Ray) <jdassen@debian.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08tcp FRTO: SACK variant is errorneously used with NewRenoIlpo Järvinen
Note: there's actually another bug in FRTO's SACK variant, which is causing failure in the NewReno case because of the error that's fixed here. I'll fix the SACK case separately (it's a separate bug really, though related, but in order to fix that I need to audit tp->snd_nxt usage a bit). There were two places where SACK variant of FRTO is getting incorrectly used even if SACK wasn't negotiated by the TCP flow. This leads to incorrect setting of frto_highmark with NewReno if a previous recovery was interrupted by another RTO. An eventual fallback to conventional recovery then incorrectly considers one or a couple of segments as forward transmissions though they weren't, which then are not LOST marked during fallback making them "non-retransmittable" until the next RTO. In a bad case, those segments are really lost and are the only ones left in the window. Thus TCP needs another RTO to continue. The next FRTO, however, could again repeat the same events making the progress of the TCP flow extremely slow. In order for these events to occur at all, FRTO must occur again in FRTO's step 3 while the key segments must be lost as well, which is not too likely in practice. It seems to occur most frequently with some small devices such as network printers that *seem* to accept TCP segments only in-order. In cases where key segments weren't lost, things get automatically resolved because those wrongly marked segments don't need to be retransmitted in order to continue. I found a reproducer after digging up relevant reports (few reports in total, none at netdev or lkml I know of), some cases seemed to indicate middlebox issues which seems now to be a false assumption some people had made. Bugzilla #10063 _might_ be related. Damon L. Chesser <damon@damtek.com> had a reproducible case and was kind enough to tcpdump it for me. With the tcpdump log it was quite trivial to figure out. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: net_cls_act: act_simple dont ignore realloc code iwlwifi: make IWLWIFI a tristate Revert "atm: Do not free already unregistered net device." dccp: return -EINVAL on invalid feature length irda: fix !PNP support for drivers/net/irda/smsc-ircc2.c irda: fix !PNP support in drivers/net/irda/nsc-ircc.c net_cls_act: Make act_simple use of netlink policy. ip: Use inline function dst_metric() instead of direct access to dst->metric[] ip: Make use of the inline function dst_metric_locked() atm: Bad locking on br2684_devs modifications. atm: Do not free already unregistered net device. mac80211: Do not free net device after it is unregistered. bridge: Consolidate error paths in br_add_bridge(). bridge: Net device leak in br_add_bridge(). niu: Fix probing regression for maramba on-board chips. lapbeth: Release ->ethdev when unregistering device. xfrm: convert empty xfrm_audit_* macros to functions net: Fix useless comment reference loop. sch_htb: remove from event queue in htb_parent_to_leaf()
2008-05-04ip: Use inline function dst_metric() instead of direct access to dst->metric[]Satoru SATOH
There are functions that refer to the value of dst->metric[THE_METRIC-1] directly without using the inline function "dst_metric" defined in net/dst.h. The following patch changes them to use the inline function consistently. Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
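The accessor pattern described above can be sketched with stand-in types. Nothing here is the kernel's definition: struct sketch_dst, its metrics field, the reduced attribute enum and sketch_dst_metric are all hypothetical, and only the "index is metric number minus one" convention is taken from the message.

```c
#include <stdint.h>

/* Reduced stand-in for the RTAX_* numbering; real kernels define more. */
enum {
    SK_RTAX_UNSPEC,
    SK_RTAX_LOCK,
    SK_RTAX_MTU,
    SK_RTAX_WINDOW,
    SK_RTAX_END
};
#define SK_RTAX_MAX (SK_RTAX_END - 1)

/* Hypothetical destination entry holding one slot per metric. */
struct sketch_dst {
    uint32_t metrics[SK_RTAX_MAX];
};

/* Accessor that hides the off-by-one indexing from callers. */
static inline uint32_t sketch_dst_metric(const struct sketch_dst *dst, int metric)
{
    return dst->metrics[metric - 1];
}
```

Callers then write sketch_dst_metric(dst, SK_RTAX_MTU) rather than repeating the "- 1" at every use, which is the consistency the patch is after.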
2008-05-04ip: Make use of the inline function dst_metric_locked()Satoru SATOH
Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits) rose: Wrong list_lock argument in rose_node seqops netns: Fix reassembly timer to use the right namespace netns: Fix device renaming for sysfs bnx2: Update version to 1.7.5. bnx2: Update RV2P firmware for 5709. bnx2: Zero out context memory for 5709. bnx2: Fix register test on 5709. bnx2: Fix remote PHY initial link state. bnx2: Refine remote PHY locking. bridge: forwarding table information for >256 devices tg3: Update version to 3.92 tg3: Add link state reporting to UMP firmware tg3: Fix ethtool loopback test for 5761 BX devices tg3: Fix 5761 NVRAM sizes tg3: Use constant 500KHz MI clock on adapters with a CPMU hci_usb.h: fix hard-to-trigger race dccp: ccid2.c, ccid3.c use clamp(), clamp_t() net: remove NR_CPUS arrays in net/core/dev.c net: use get/put_unaligned_* helpers bluetooth: use get/put_unaligned_* helpers ...
2008-05-02net: use get/put_unaligned_* helpersHarvey Harrison
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-02ipv4: assign PDE->data before gluing PDE into /proc treeDenis V. Lunev
The check for PDE->data != NULL becomes useless after the replacement of proc_net_fops_create with proc_create_data. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-02netfilter: assign PDE->data before gluing PDE into /proc treeDenis V. Lunev
Simply replace proc_create and the subsequent data assignment with proc_create_data. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-01rename div64_64 to div64_u64Roman Zippel
Rename div64_64 to div64_u64 to make it consistent with the other divide functions, so it clearly includes the type of the divide. Move its definition to math64.h as currently no architecture overrides the generic implementation. They can still override it of course, but the duplicated declarations are avoided. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Avi Kivity <avi@qumranet.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Howells <dhowells@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01net: fix returning void-valued expression warningsHarvey Harrison
drivers/net/8390.c:37:2: warning: returning void-valued expression
drivers/net/bnx2.c:1635:3: warning: returning void-valued expression
drivers/net/xen-netfront.c:1806:2: warning: returning void-valued expression
net/ipv4/tcp_hybla.c:105:3: warning: returning void-valued expression
net/ipv4/tcp_vegas.c:171:3: warning: returning void-valued expression
net/ipv4/tcp_veno.c:123:3: warning: returning void-valued expression
net/sysctl_net.c:85:2: warning: returning void-valued expression
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
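The kind of construct behind this warning, and the usual way to silence it, can be sketched as follows. The function names are hypothetical and are not taken from the files listed above; the flagged form is kept in a comment so the sketch itself compiles without the warning.

```c
/* Counts how often the fallback ran, so the behaviour can be observed. */
static int demo_fallback_calls;

/* Hypothetical void helper standing in for a fallback routine. */
static void demo_fallback(void)
{
    demo_fallback_calls++;
}

/*
 * Flagged form: a void function returning a void-valued expression.
 *
 *     if (use_fallback)
 *         return demo_fallback();
 *
 * Rewritten form below: call the helper, then return with no expression.
 */
static void demo_cong_avoid(int use_fallback)
{
    if (use_fallback) {
        demo_fallback();
        return;
    }
    /* The normal (non-fallback) path would continue here. */
}
```

Both forms behave the same at run time; the rewrite only removes an expression from a return statement in a function whose return type is void.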
2008-04-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (53 commits)
  tcp: Overflow bug in Vegas
  [IPv4] UFO: prevent generation of chained skb destined to UFO device
  iwlwifi: move the selects to the tristate drivers
  ipv4: annotate a few functions __init in ipconfig.c
  atm: ambassador: vcc_sf semaphore to mutex
  MAINTAINERS: The socketcan-core list is subscribers-only.
  netfilter: nf_conntrack: padding breaks conntrack hash on ARM
  ipv4: Update MTU to all related cache entries in ip_rt_frag_needed()
  sch_sfq: use del_timer_sync() in sfq_destroy()
  net: Add compat support for getsockopt (MCAST_MSFILTER)
  net: Several cleanups for the setsockopt compat support.
  ipvs: fix oops in backup for fwmark conn templates
  bridge: kernel panic when unloading bridge module
  bridge: fix error handling in br_add_if()
  netfilter: {nfnetlink,ip,ip6}_queue: fix skb_over_panic when enlarging packets
  netfilter: x_tables: fix net namespace leak when reading /proc/net/xxx_tables_names
  netfilter: xt_TCPOPTSTRIP: signed tcphoff for ipv6_skip_exthdr() retval
  tcp: Limit cwnd growth when deferring for GSO
  tcp: Allow send-limited cwnd to grow up to max_burst when gso disabled
  [netdrvr] gianfar: Determine TBIPA value dynamically
  ...
2008-04-30tcp: Overflow bug in VegasLachlan Andrew
From: Lachlan Andrew <lachlan.andrew@gmail.com> There is an overflow bug in net/ipv4/tcp_vegas.c for large BDPs (e.g. 400Mbit/s, 400ms). The multiplication (old_wnd * vegas->baseRTT) << V_PARAM_SHIFT overflows a u32. [ Fix tcp_veno.c too, it has similar calculations. -DaveM ] Signed-off-by: David S. Miller <davem@davemloft.net>
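The overflow can be reproduced in user-space C. This is an illustrative sketch, not the kernel code or the actual patch: the function names are hypothetical, V_PARAM_SHIFT is assumed to be 1, the base RTT is assumed to be in microseconds, and the window is assumed to be counted in 1500-byte packets. Under those assumptions the commit's example of 400Mbit/s at 400ms gives a window of roughly 13333 packets and a base RTT of 400000.

```c
#include <stdint.h>

#define V_PARAM_SHIFT 1 /* assumed value, for illustration only */

/* 32-bit arithmetic: the product wraps modulo 2^32 before the division. */
static inline uint32_t demo_target_cwnd_u32(uint32_t old_wnd,
                                            uint32_t base_rtt_us,
                                            uint32_t rtt_us)
{
    return ((old_wnd * base_rtt_us) << V_PARAM_SHIFT) / rtt_us;
}

/* Widening the first operand to 64 bits makes the multiplication and
 * the shift happen in 64-bit arithmetic, so they cannot wrap here. */
static inline uint64_t demo_target_cwnd_u64(uint32_t old_wnd,
                                            uint32_t base_rtt_us,
                                            uint32_t rtt_us)
{
    return (((uint64_t)old_wnd * base_rtt_us) << V_PARAM_SHIFT) / rtt_us;
}
```

With old_wnd = 13333 and both RTT arguments equal to 400000, the product 13333 * 400000 = 5333200000 exceeds 2^32 = 4294967296. The 64-bit version returns 26666 (the window shifted left by one), while the 32-bit version returns 5191 because the product has already wrapped.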
2008-04-29[IPv4] UFO: prevent generation of chained skb destined to UFO deviceKostya B
Problem: ip_append_data() could wrongly generate a chained skb for devices which support UFO. When sk_write_queue is not empty (e.g. MSG_MORE), __instead__ of appending data into the next nr_frag of the queued skb, a new chained skb is created. I would normally assume a UFO device should get data in nr_frags and not in frag_list. Later, udp4_hwcsum_outgoing() resets csum to NONE and skb_gso_segment() oopses.
Proposal:
1. Even if the length is less than the MTU, employ ip_ufo_append_data() and append data to the __existing__ skb in the sk_write_queue.
2. Fix ip_ufo_append_data(), which wrongly peeked at and later enqueued the same skb. Enqueuing is now always performed, because on error the subsequent ip_flush_pending_frames() releases the queued skb.
Signed-off-by: Kostya B <bkostya@hotmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>