aboutsummaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2008-04-14[NETFILTER]: nf_conntrack: use bool type in struct nf_conntrack_l4protoJan Engelhardt
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_conntrack: use bool type in struct nf_conntrack_l3protoJan Engelhardt
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_conntrack: add tuplehash l3num/protonum accessorsPatrick McHardy
Add accessors for l3num and protonum and get rid of some overly long expressions. Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_nat: kill helper and seq_adjust hooksPatrick McHardy
Connection tracking helpers (specifically FTP) need to be called before NAT sequence numbers adjustments are performed to be able to compare them against previously seen ones. We've introduced two new hooks around 2.6.11 to maintain this ordering when NAT modules were changed to get called from conntrack helpers directly. The cost of netfilter hooks is quite high and sequence number adjustments are only rarely needed however. Add a RCU-protected sequence number adjustment function pointer and call it from IPv4 conntrack after calling the helper. Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_nat: don't add NAT extension for confirmed conntracksPatrick McHardy
Adding extensions to confirmed conntracks is not allowed to avoid races on reallocation. Don't setup NAT for confirmed conntracks in case NAT module is loaded late. The has one side-effect, the connections existing before the NAT module was loaded won't enter the bysource hash. The only case where this actually makes a difference is in case of SNAT to a multirange where the IP before NAT is also part of the range. Since old connections don't enter the bysource hash the first new connection from the IP will have a new address selected. This shouldn't matter at all. Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_nat: remove obsolete check for ICMP redirectsPatrick McHardy
Locally generated ICMP packets have a reference to the conntrack entry of the original packet manually attached by icmp_send(). Therefore the check for locally originated untracked ICMP redirects can never be true. Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_nat: add SCTP protocol supportPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_nat: add DCCP protocol supportPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: Add partial checksum validation helperPatrick McHardy
Move the UDP-Lite conntrack checksum validation to a generic helper similar to nf_checksum() and make it fall back to nf_checksum() in case the full packet is to be checksummed and hardware checksums are available. This is to be used by DCCP conntrack, which also needs to verify partial checksums. Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_nat: add UDP-Lite supportPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_nat: remove unused name from struct nf_nat_protocolPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_conntrack_netlink: clean up NAT protocol parsingPatrick McHardy
Move responsibility for setting the IP_NAT_RANGE_PROTO_SPECIFIED flag to the NAT protocol, properly propagate errors and get rid of ugly return value convention. Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_nat: move NAT ctnetlink helpers to nf_nat_proto_commonPatrick McHardy
Move to nf_nat_proto_common and rename to nf_nat_proto_... since they're also used by protocols that don't have port numbers. Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_nat: fix random mode not to overwrite port roverPatrick McHardy
The port rover should not get overwritten when using random mode, otherwise other rules will also use more or less random ports. Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: nf_nat: add helpers for common NAT protocol operationsPatrick McHardy
Add generic ->in_range and ->unique_tuple ops to avoid duplicating them again and again for future NAT modules and save a few bytes of text: net/ipv4/netfilter/nf_nat_proto_tcp.c: tcp_in_range | -62 (removed) tcp_unique_tuple | -259 # 271 -> 12, # inlines: 1 -> 0, size inlines: 7 -> 0 2 functions changed, 321 bytes removed net/ipv4/netfilter/nf_nat_proto_udp.c: udp_in_range | -62 (removed) udp_unique_tuple | -259 # 271 -> 12, # inlines: 1 -> 0, size inlines: 7 -> 0 2 functions changed, 321 bytes removed net/ipv4/netfilter/nf_nat_proto_gre.c: gre_in_range | -62 (removed) 1 function changed, 62 bytes removed vmlinux: 5 functions changed, 704 bytes removed Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: {ip,ip6,arp}_tables: return EAGAIN for invalid SO_GET_ENTRIES sizePatrick McHardy
Rule dumping is performed in two steps: first userspace gets the ruleset size using getsockopt(SO_GET_INFO) and allocates memory, then it calls getsockopt(SO_GET_ENTRIES) to actually dump the ruleset. When another process changes the ruleset in between the sizes from the first getsockopt call doesn't match anymore and the kernel aborts. Unfortunately it returns EAGAIN, as for multiple other possible errors, so userspace can't distinguish this case from real errors. Return EAGAIN so userspace can retry the operation. Fixes (with current iptables SVN version) netfilter bugzilla #104. Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: Explicitly initialize .priority in arptable_filterJan Engelhardt
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: remove arpt_(un)register_target indirection macrosJan Engelhardt
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: remove arpt_target indirection macroJan Engelhardt
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: remove arpt_table indirection macroJan Engelhardt
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: annotate rest of nf_nat_* with constJan Engelhardt
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: annotate {arp,ip,ip6,x}tables with constJan Engelhardt
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: annotate xtables targets with const and remove castsJan Engelhardt
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: Use non-deprecated __RW_LOCK_UNLOCKED macroRobert P. J. Day
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: ip_tables: per-netns FILTER/MANGLE/RAW tables for realAlexey Dobriyan
Commit 9335f047fe61587ec82ff12fbb1220bcfdd32006 aka "[NETFILTER]: ip_tables: per-netns FILTER, MANGLE, RAW" added per-netns _view_ of iptables rules. They were shown to user, but ignored by filtering code. Now that it's possible to at least ping loopback, per-netns tables can affect filtering decisions. netns is taken in case of PRE_ROUTING, LOCAL_IN -- from in device, POST_ROUTING, LOCAL_OUT -- from out device, FORWARD -- from in device which should be equal to out device's netns. This code is relatively new, so BUG_ON was plugged. Wrappers were added to a) keep code the same from CONFIG_NET_NS=n users (overwhelming majority), b) consolidate code in one place -- similar changes will be done in ipv6 and arp netfilter code. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: {ip,ip6}t_LOG: print MARK value in log outputPatrick McHardy
Dump the mark value in log messages similar to nfnetlink_log. This is useful for debugging complex setups where marks are used for routing or traffic classification. Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[NETFILTER]: ipt_CLUSTERIP: fix race between clusterip_config_find_get and ↵Pavel Emelyanov
_entry_put Consider we are putting a clusterip_config entry with the "entries" count == 1, and on the other CPU there's a clusterip_config_find_get in progress: CPU1: CPU2: clusterip_config_entry_put: clusterip_config_find_get: if (atomic_dec_and_test(&c->entries)) { /* true */ read_lock_bh(&clusterip_lock); c = __clusterip_config_find(clusterip); /* found - it's still in list */ ... atomic_inc(&c->entries); read_unlock_bh(&clusterip_lock); write_lock_bh(&clusterip_lock); list_del(&c->list); write_unlock_bh(&clusterip_lock); ... dev_put(c->dev); Oops! We have an entry returned by the clusterip_config_find_get, which is a) not in list b) has a stale dev pointer. The problems will happen when the CPU2 will release the entry - it will remove it from the list for the 2nd time, thus spoiling it, and will put a stale dev pointer. The fix is to make atomic_dec_and_test under the clusterip_lock. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-04-14[SKB]: __skb_append = __skb_queue_after Gerrit Renker
This expresses __skb_append in terms of __skb_queue_after, exploiting that __skb_append(old, new, list) = __skb_queue_after(list, old, new). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13[TCP]: Remove owner from tcp_seq_afinfo.Denis V. Lunev
Move it to tcp_seq_afinfo->seq_fops as should be. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13[TCP]: Place file operations directly into tcp_seq_afinfo.Denis V. Lunev
No need to have separate never-used variable. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13[TCP]: Cleanup /proc/tcp[6] creation/removal.Denis V. Lunev
Replace seq_open with seq_open_net and remove tcp_seq_release completely. seq_release_net will do this job just fine. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13[TCP]: Move seq_ops from tcp_iter_state to tcp_seq_afinfo.Denis V. Lunev
No need to create seq_operations for each instance of 'netstat'. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13[TCP]: No need to check afinfo != NULL in tcp_proc_(un)register.Denis V. Lunev
tcp_proc_register/tcp_proc_unregister are called with a static pointer only. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13[TCP]: Replace struct net on tcp_iter_state with seq_net_private.Denis V. Lunev
Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-13[INET]: sk_reuse is valboolGerrit Renker
sk_reuse is declared as "unsigned char", but is set as type valbool in net/core/sock.c. There is no other place in net/ where sk->sk_reuse is set to a value > 1, so the test "sk_reuse > 1" can not be true. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-11Merge branch 'docs' of git://git.lwn.net/linux-2.6Linus Torvalds
* 'docs' of git://git.lwn.net/linux-2.6: Add additional examples in Documentation/spinlocks.txt Move sched-rt-group.txt to scheduler/ Documentation: move rpc-cache.txt to filesystems/ Documentation: move nfsroot.txt to filesystems/ Spell out behavior of atomic_dec_and_lock() in kerneldoc Fix a typo in highres.txt Fixes to the seq_file document Fill out information on patch tags in SubmittingPatches Add the seq_file documentation
2008-04-11Documentation: move nfsroot.txt to filesystems/J. Bruce Fields
Documentation/ is a little large, and filesystems/ seems an obvious place for this file. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2008-04-10[NETNS][IPV6] tcp - assign the netns for timewait socketsDaniel Lezcano
Copy the network namespace from the socket to the timewait socket. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Acked-by: Mark Lord <mlord@pobox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10IPV4: use xor rather than multiple ands for route compareStephen Hemminger
The comparison in ip_route_input is a hot path, by recoding the C "and" as bit operations, fewer conditional branches get generated so the code should be faster. Maybe someday Gcc will be smart enough to do this? Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10ipv4: fib_trie leaf free optimizationStephen Hemminger
Avoid unneeded test in the case where object to be freed has to be a leaf. Don't need to use the generic tnode_free() function, instead just setup leaf to be freed. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10ipv4: fib_trie remove unused argumentStephen Hemminger
The trie pointer is passed down to flush_list and flush_leaf but never used. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10[Syncookies]: Add support for TCP options via timestamps.Florian Westphal
Allow the use of SACK and window scaling when syncookies are used and the client supports tcp timestamps. Options are encoded into the timestamp sent in the syn-ack and restored from the timestamp echo when the ack is received. Based on earlier work by Glenn Griffin. This patch avoids increasing the size of structs by encoding TCP options into the least significant bits of the timestamp and by not using any 'timestamp offset'. The downside is that the timestamp sent in the packet after the synack will increase by several seconds. changes since v1: don't duplicate timestamp echo decoding function, put it into ipv4/syncookie.c and have ipv6/syncookies.c use it. Feedback from Glenn Griffin: fix line indented with spaces, kill redundant if () Reviewed-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10IPV4: fib_trie use vmalloc for large tnodesStephen Hemminger
Use vmalloc rather than alloc_pages to avoid wasting memory. The problem is that tnode structure has a power of 2 sized array, plus a header. So the current code wastes almost half the memory allocated because it always needs the next bigger size to hold that small header. This is similar to an earlier patch by Eric, but instead of a list and lock, I used a workqueue to handle the fact that vfree can't be done in interrupt context. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10IPV4: route rekey timer can be deferrableStephen Hemminger
No urgency on the rehash interval timer, so mark it as deferrable. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10IPV4: route use jhash3Stephen Hemminger
Since route hash is a triple, use jhash_3words rather doing the mixing directly. This should be as fast and give better distribution. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10IPV4: route inline changesStephen Hemminger
Don't mark functions that are large as inline, let compiler decide. Also, use inline rather than __inline__. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-10[IPV4]: Fix byte value boundary check in do_ip_getsockopt().David S. Miller
This fixes kernel bugzilla 10371. As reported by M.Piechaczek@osmosys.tv, if we try to grab a char sized socket option value, as in: unsigned char ttl = 255; socklen_t len = sizeof(ttl); setsockopt(socket, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, &len); getsockopt(socket, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, &len); The ttl returned will be wrong on big-endian, and on both little- endian and big-endian the next three bytes in userspace are written with garbage. It's because of this test in do_ip_getsockopt(): if (len < sizeof(int) && len > 0 && val>=0 && val<255) { It should allow a 'val' of 255 to pass here, but it doesn't so it copies a full 'int' back to userspace. On little-endian that will write the correct value into the location but it spams on the next three bytes in userspace. On big endian it writes the wrong value into the location and spams the next three bytes. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-09[NETFILTER]: nf_nat: autoload IPv4 connection trackingJan Engelhardt
Without this patch, the generic L3 tracker would kick in if nf_conntrack_ipv4 was not loaded before nf_nat, which would lead to translation problems with ICMP errors. NAT does not make sense without IPv4 connection tracking anyway, so just add a call to need_ipv4_conntrack(). Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-07[TCP]: Don't allow FRTO to take place while MTU is being probedIlpo Järvinen
MTU probe can cause some remedies for FRTO because the normal packet ordering may be violated allowing FRTO to make a wrong decision (it might not be that serious threat for anything though). Thus it's safer to not run FRTO while MTU probe is underway. It seems that the basic FRTO variant should also look for an skb at probe_seq.start to check if that's retransmitted one but I didn't implement it now (plain seqno in window check isn't robust against wraparounds). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-07[TCP]: tcp_simple_retransmit can cause S+LIlpo Järvinen
This fixes Bugzilla #10384 tcp_simple_retransmit does L increment without any checking whatsoever for overflowing S+L when Reno is in use. The simplest scenario I can currently think of is rather complex in practice (there might be some more straightforward cases though). Ie., if mss is reduced during mtu probing, it may end up marking everything lost and if some duplicate ACKs arrived prior to that sacked_out will be non-zero as well, leading to S+L > packets_out, tcp_clean_rtx_queue on the next cumulative ACK or tcp_fastretrans_alert on the next duplicate ACK will fix the S counter. More straightforward (but questionable) solution would be to just call tcp_reset_reno_sack() in tcp_simple_retransmit but it would negatively impact the probe's retransmission, ie., the retransmissions would not occur if some duplicate ACKs had arrived. So I had to add reno sacked_out reseting to CA_Loss state when the first cumulative ACK arrives (this stale sacked_out might actually be the explanation for the reports of left_out overflows in kernel prior to 2.6.23 and S+L overflow reports of 2.6.24). However, this alone won't be enough to fix kernel before 2.6.24 because it is building on top of the commit 1b6d427bb7e ([TCP]: Reduce sacked_out with reno when purging write_queue) to keep the sacked_out from overflowing. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>