aboutsummaryrefslogtreecommitdiff
path: root/net/dccp
AgeCommit message (Collapse)Author
2005-10-31[DCCP]: Set socket owner iff packet is not dataHerbert Xu
Here is a complimentary insurance policy for those feeling a bit insecure. You don't have to accept this. However, if you do, you can't blame me for it :) > 1) dccp_transmit_skb sets the owner for all packets except data packets. We can actually verify this by looking at pkt_type. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-31[DCCP]: Simplify skb_set_owner_w semanticsHerbert Xu
While we're at it let's reorganise the set_owner_w calls a little so that: 1) dccp_transmit_skb sets the owner for all packets except data packets. 2) Add dccp_skb_entail to set owner for packets queued for retransmission. 3) Make dccp_transmit_skb static. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-28[PATCH] gfp_t: net/*Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-20[DCCP]: Clear the IPCB areaHerbert Xu
Turns out the problem has nothing to do with use-after-free or double-free. It's just that we're not clearing the CB area and DCCP unlike TCP uses a CB format that's incompatible with IP. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Ian McDonald <imcdnzl@gmail.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-20[DCCP]: Make dccp_write_xmit always free the packetHerbert Xu
icmp_send doesn't use skb->sk at all so even if skb->sk has already been freed it can't cause crash there (it would've crashed somewhere else first, e.g., ip_queue_xmit). I found a double-free on an skb that could explain this though. dccp_sendmsg and dccp_write_xmit are a little confused as to what should free the packet when something goes wrong. Sometimes they both go for the ball and end up in each other's way. This patch makes dccp_write_xmit always free the packet no matter what. This makes sense since dccp_transmit_skb which in turn comes from the fact that ip_queue_xmit always frees the packet. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-20[DCCP]: Use skb_set_owner_w in dccp_transmit_skb when skb->sk is NULLHerbert Xu
David S. Miller <davem@davemloft.net> wrote: > One thing you can probably do for this bug is to mark data packets > explicitly somehow, perhaps in the SKB control block DCCP already > uses for other data. Put some boolean in there, set it true for > data packets. Then change the test in dccp_transmit_skb() as > appropriate to test the boolean flag instead of "skb_cloned(skb)". I agree. In fact we already have that flag, it's called skb->sk. So here is patch to test that instead of skb_cloned(). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Ian McDonald <imcdnzl@gmail.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-10[DCCP]: Transition from PARTOPEN to OPEN when receiving DATA packetsArnaldo Carvalho de Melo
Noticed by Andrea Bittau, that provided a patch that was modified to not transition from RESPOND to OPEN when receiving DATA packets. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-10[CCID]: Check if ccid is NULL in the hc_[tr]x_exit functionsArnaldo Carvalho de Melo
For consistency with ccid_exit and to fix a bug when IP_DCCP_UNLOAD_HACK is enabled as the control sock is not associated to any CCID. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-08[PATCH] gfp flags annotations - part 1Al Viro
- added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-03[INET]: speedup inet (tcp/dccp) lookupsEric Dumazet
Arnaldo and I agreed it could be applied now, because I have other pending patches depending on this one (Thank you Arnaldo) (The other important patch moves skc_refcnt in a separate cache line, so that the SMP/NUMA performance doesnt suffer from cache line ping pongs) 1) First some performance data : -------------------------------- tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established() The most time critical code is : sk_for_each(sk, node, &head->chain) { if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif)) goto hit; /* You sunk my battleship! */ } The sk_for_each() does use prefetch() hints but only the begining of "struct sock" is prefetched. As INET_MATCH first comparison uses inet_sk(__sk)->daddr, wich is far away from the begining of "struct sock", it has to bring into CPU cache cold cache line. Each iteration has to use at least 2 cache lines. This can be problematic if some chains are very long. 2) The goal ----------- The idea I had is to change things so that INET_MATCH() may return FALSE in 99% of cases only using the data already in the CPU cache, using one cache line per iteration. 3) Description of the patch --------------------------- Adds a new 'unsigned int skc_hash' field in 'struct sock_common', filling a 32 bits hole on 64 bits platform. struct sock_common { unsigned short skc_family; volatile unsigned char skc_state; unsigned char skc_reuse; int skc_bound_dev_if; struct hlist_node skc_node; struct hlist_node skc_bind_node; atomic_t skc_refcnt; + unsigned int skc_hash; struct proto *skc_prot; }; Store in this 32 bits field the full hash, not masked by (ehash_size - 1) Using this full hash as the first comparison done in INET_MATCH permits us immediatly skip the element without touching a second cache line in case of a miss. Suppress the sk_hashent/tw_hashent fields since skc_hash (aliased to sk_hash and tw_hash) already contains the slot number if we mask with (ehash_size - 1) File include/net/inet_hashtables.h 64 bits platforms : #define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\ (((__sk)->sk_hash == (__hash)) ((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie)) && \ ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports)) && \ (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) 32bits platforms: #define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\ (((__sk)->sk_hash == (__hash)) && \ (inet_sk(__sk)->daddr == (__saddr)) && \ (inet_sk(__sk)->rcv_saddr == (__daddr)) && \ (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) - Adds a prefetch(head->chain.first) in __inet_lookup_established()/__tcp_v4_check_established() and __inet6_lookup_established()/__tcp_v6_check_established() and __dccp_v4_check_established() to bring into cache the first element of the list, before the {read|write}_lock(&head->lock); Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-18[DCCP]: Introduce CCID getsockopt for the CCIDsArnaldo Carvalho de Melo
Allocation for the optnames is similar to the DCCP options, with a range for rx and tx half connection CCIDs. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-18[DCCP]: Don't use necessarily the same CCID for tx and rxArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-18[CCID3]: Introduce include/linux/tfrc.hArnaldo Carvalho de Melo
Moving the TFRC sender and receiver variables to separate structs, so that we can copy these structs to userspace thru getsockopt, dccp_diag, etc. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-18[DCCP]: Move the ack vector code to net/dccp/ackvec.[ch]Arnaldo Carvalho de Melo
Isolating it, that will be used when we introduce a CCID2 (TCP-Like) implementation. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-16[DCCP]: Introduce DCCP_SOCKOPT_SERVICEArnaldo Carvalho de Melo
As discussed in the dccp@vger mailing list: Now applications have to use setsockopt(DCCP_SOCKOPT_SERVICE, service[s]), prior to calling listen() and connect(). An array of unsigned ints can be passed meaning that the listening sock accepts connection requests for several services. With this we can ditch struct sockaddr_dccp and use only sockaddr_in (and sockaddr_in6 in the future). Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-16[DCCP]: More precisely set reset_code when sending RESET packetsArnaldo Carvalho de Melo
Moving the setting of DCCP_SKB_CB(skb)->dccpd_reset_code to the places where events happen that trigger sending a RESET packet. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-13[DCCP]: Handle SYNC packets in dccp_rcv_state_processArnaldo Carvalho de Melo
Eliciting a SYNCACK in response, we were handling SYNC packets only in the DCCP_OPEN state, in dccp_rcv_established. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-13[DCCP]: Check if already in the CLOSING state in dccp_rcv_closereqArnaldo Carvalho de Melo
It is possible to receive more than one CLOSEREQ packet if the CLOSE packet sent in response is somehow lost, change the state to DCCP_CLOSING only on the first CLOSEREQ packet received. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-12[CCID3]: Listen socks doesn't have a private CCID blockArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-09[CCID3] Cleanup ccid3 debug callsArnaldo Carvalho de Melo
Also use some BUG_ON where appropriate and use LIMIT_NETDEBUG for the unlikely cases where we, at this stage, want to know about, that in my tests hasn't appeared in the radar. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09[DCCP] Only call the HC _exit() routines in dccp_v4_destroy_sockArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09[CCID3] Initialize ccid3hctx_t_ipi to 250msArnaldo Carvalho de Melo
To match more closely what is described in RFC 3448. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: Ian McDonald <iam4@cs.waikato.ac.nz>
2005-09-09[CCID3] Introduce ccid3_hc_[rt]x_sk() for overal consistencyArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09[DCCP] Introduce dccp_timestampArnaldo Carvalho de Melo
To start the timestamps with 0.0ms, easing the integer maths in the CCIDs, this probably will be reworked to use the to be introduced struct timeval_offset infrastructure out of skb_get_timestamp, etc. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09[CCID3] Initialize more fields in ccid3_hc_rx_initArnaldo Carvalho de Melo
The initialization of ccid3hcrx_rtt to 5ms is just a bandaid, I'll continue auditing the CCID3 HC rx codebase to fix this properly, probably I'll add a feedback timer as suggested in the CCID3 draft. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09[CCID3] Make the ccid3hcrx_rtt calc look more like the ccid3hctx_rtt oneArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09[CCID3] Use ELAPSED_TIME in the HC TX RTT estimationArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09[DCCP] Give precedence to the biggest ELAPSED_TIMEArnaldo Carvalho de Melo
We can get this value in an TIMESTAMP_ECHO and/or in an ELAPSED_TIME option, if receiving both give precendence to the biggest one. In my tests they are very close if not equal at all times, so we may well think about removing the code in CCID3 that inserts this option and leaving this to the core, and perhaps even use just TIMESTAMP_ECHO including the elapsed time. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09[CCID3] Calculate ccid3hcrx_x_recv using usecs_divArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09[CCID] Only call the HC insert_options methods when requestedArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-09-09[CCID3] Avoid unsigned integer overflows in usecs_divArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-08-29[CCID3]: Call sk->sk_write_space(sk) when receiving a feedback packetArnaldo Carvalho de Melo
This makes the send rate calculations behave way more closely to what is specified, with the jitter previously seen on x and x_recv disappearing completely on non lossy setups. This resembles the tcp_data_snd_check code, that possibly we'll end up using in DCCP as well, perhaps moving this code to inet_connection_sock. For now I'm doing the simplest implementation tho. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[DCCP]: Introduce DCCP_SOCKOPT_PACKET_SIZEArnaldo Carvalho de Melo
So that applications can set dccp_sock->dccps_pkt_size, that in turn is used in the CCID3 half connection init routines to set ccid3hc[tr]x_s and use it in its rate calculations. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[CCID3]: Move ccid3_hc_rx_detect_loss to packet_history.cArnaldo Carvalho de Melo
Renaming it to dccp_rx_hist_detect_loss. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[CCID3]: Move ccid3_hc_rx_add_hist to packet_history.cArnaldo Carvalho de Melo
Renaming it to dccp_rx_hist_add_packet. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[DCCP]: Move the calc_X routines to dccp_tfrc_libArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[DCCP]: Introduce dccp_tfrc_lib module with net/dccp/ccids/lib/*.cArnaldo Carvalho de Melo
I'll now take a look at the other proposed TFRC DCCP CCIDs to find more code that is now in ccid3.c and move to this module, the loss event rate, calc_X, etc most probably will be moved there. The main goal of these changes is to pave the way for the implementation of more TFRC based DCCP CCIDs and to shrink ccid3.c, reducing its complexity and helping in getting it rock solid. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[DCCP]: Just move packet_history.[ch] to net/dccp/ccids/lib/Arnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[CCID3]: Move the loss interval code to loss_interval.[ch]Arnaldo Carvalho de Melo
And put this into net/dccp/ccids/lib/, where packet_history.[ch] will also be moved and then we'll have a tfrc_lib.ko module that will be used by dccp_ccid3.ko and other CCIDs that are variations of TFRC (RFC 3448). Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[CCID3]: Move the CCID3 defines to ccid3.hArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[CCID3]: Introduce usecs_divArnaldo Carvalho de Melo
To avoid open coding this all over the place. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[CCID3]: Reorganise timeval handlingArnaldo Carvalho de Melo
Introducing functions to add to or subtract from a timeval variable and renaming now_delta to timeval_new_delta that calls do_gettimeofday and then timeval_delta, that should be used when there are several deltas made relative to the current time or setting variables to it, so as to avoid calling do_gettimeofday excessively. I'm leaving these "timeval_" prefixed funcions internal to DCCP for a while till we're sure there are no subtle bugs in it. It also is more correct as it checks if the number of usecs added to or subtracted from a tv_usec field is more than 2 seconds. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[CCID3]: Reflow to mostly fit under 80 columnsArnaldo Carvalho de Melo
No code changes. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[DCCP]: Introduce dccp_wait_for_ccid and use it in dccp_write_xmitArnaldo Carvalho de Melo
This is not quite what I think we should have long term but improves performance for now, so lets use it till we get CCID3 working well, then we can think about using sk_write_queue, perhaps using some ideas from Juwen Lai's old stack for 2.4.20. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[DCCP]: Make the Debug Menu available when DCCP is statically linked tooArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[NET]: use __read_mostly on kmem_cache_t , DEFINE_SNMP_STAT pointersEric Dumazet
This patch puts mostly read only data in the right section (read_mostly), to help sharing of these data between CPUS without memory ping pongs. On one of my production machine, tcp_statistics was sitting in a heavily modified cache line, so *every* SNMP update had to force a reload. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[DCCP]: Initial dccp_poll implementationArnaldo Carvalho de Melo
Tested with a patched netcat, no horror stories so far 8) Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[DCCP]: Call the HC exit routines at dccp_v4_destroy_sockArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[DCCP]: Introduce dccp_get_infoArnaldo Carvalho de Melo
And also hc_tx and hc_rx get_info functions for the CCIDs to fill in information that is specific to them. For now reusing struct tcp_info, later I'll try to figure out a better solution, for now its really nice to get this kind of info: [root@qemu ~]# ./ss -danemi State Recv-Q Send-Q Local Addr:Port Peer Addr:Port LISTEN 0 0 *:5001 *:* ino:628 sk:c1340040 mem:(r0,w0,f0,t0) cwnd:0 ssthresh:0 ESTAB 0 0 172.20.0.2:5001 172.20.0.1:32785 ino:629 sk:c13409a0 mem:(r0,w0,f0,t0) ts rto:1000 rtt:0.004/0 cwnd:0 ssthresh:0 rcv_rtt:61.377 This, for instance, shows that we're not congestion controlling ACKs, as the above output is in the ttcp receiving host, and ttcp is a one way app, i.e. the received never calls sendmsg, so ccid_hc_tx_send_packet is never called, so the TX half connection stays in TFRC_SSTATE_NO_SENT state and hctx_rtt is never calculated, stays with the value set in ccid3_hc_tx_init, 4us, as show above in milliseconds (0.004ms), upcoming patches will fix this. rcv_rtt seems sane tho, matching ping results :-) Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[CCID3]: Calculate the RTT in the RX half connectionArnaldo Carvalho de Melo
Using TIMESTAMP_ECHO and ELAPSED_TIME options received. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>