aboutsummaryrefslogtreecommitdiff
path: root/net/sched/sch_api.c
AgeCommit message (Collapse)Author
2008-07-30pkt_sched: Fix OOPS on ingress qdisc add.David S. Miller
Bug report from Steven Jan Springl: Issuing the following command causes a kernel oops: tc qdisc add dev eth0 handle ffff: ingress The problem mostly stems from all of the special case handling of ingress qdiscs. So, to fix this, do the grafting operation the same way we do for TX qdiscs. Which means that dev_activate() and dev_deactivate() now do the "qdisc_sleeping <--> qdisc" transitions on dev->rx_queue too. Future simplifications are possible now, mainly because it is impossible for dev_queue->{qdisc,qdisc_sleeping} to be NULL. There are NULL checks all over to handle the ingress qdisc special case that used to exist before this commit. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-22pkt_sched: make qdisc_class_hash_alloc() staticAdrian Bunk
This patch makes the needlessly global qdisc_class_hash_alloc() static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-20net_sched: Add size table for qdiscsJussi Kivilinna
Add size table functions for qdiscs and calculate packet size in qdisc_enqueue(). Based on patch by Patrick McHardy http://marc.info/?l=linux-netdev&m=115201979221729&w=2 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-18pkt_sched: Manage qdisc list inside of root qdisc.David S. Miller
Idea is from Patrick McHardy. Instead of managing the list of qdiscs on the device level, manage it in the root qdisc of a netdev_queue. This solves all kinds of visibility issues during qdisc destruction. The way to iterate over all qdiscs of a netdev_queue is to visit the netdev_queue->qdisc, and then traverse it's list. The only special case is to ignore builting qdiscs at the root when dumping or doing a qdisc_lookup(). That was not needed previously because builtin qdiscs were not added to the device's qdisc_list. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17pkt_sched: Add multiqueue handling to qdisc_graft().David S. Miller
Move the destruction of the old queue into qdisc_graft(). When operating on a root qdisc (ie. "parent == NULL"), apply the operation to all queues. The caller has grabbed a single implicit reference for this graft, therefore when we apply the change to more than one queue we must grab additional qdisc references. Otherwise, we are operating on a class of a specific parent qdisc, and therefore no multiqueue handling is necessary. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17pkt_sched: Make qdisc grafting locking more specific.David S. Miller
Lock the root of the qdisc being operated upon. All explicit references to qdisc_tree_lock() are now gone. The only remaining uses are via the sch_tree_{lock,unlock}() and tcf_tree_{lock,unlock}() macros. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17netdevice: Move qdisc_list back into net_device proper.David S. Miller
And give it it's own lock. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17pkt_sched: Schedule qdiscs instead of netdev_queue.David S. Miller
When we have shared qdiscs, packets come out of the qdiscs for multiple transmit queues. Therefore it doesn't make any sense to schedule the transmit queue when logically we cannot know ahead of time the TX queue of the SKB that the qdisc->dequeue() will give us. Just for sanity I added a BUG check to make sure we never get into a state where the noop_qdisc is scheduled. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17pkt_sched: Add and use qdisc_root() and qdisc_root_lock().David S. Miller
When code wants to lock the qdisc tree state, the logic operation it's doing is locking the top-level qdisc that sits of the root of the netdev_queue. Add qdisc_root_lock() to represent this and convert the easiest cases. In order for this to work out in all cases, we have to hook up the noop_qdisc to a dummy netdev_queue. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-17netdev: Allocate multiple queues for TX.David S. Miller
alloc_netdev_mq() now allocates an array of netdev_queue structures for TX, based upon the queue_count argument. Furthermore, all accesses to the TX queues are now vectored through the netdev_get_tx_queue() and netdev_for_each_tx_queue() interfaces. This makes it easy to grep the tree for all things that want to get to a TX queue of a net device. Problem spots which are not really multiqueue aware yet, and only work with one queue, can easily be spotted by grepping for all netdev_get_tx_queue() calls that pass in a zero index. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08netdev: Make netif_schedule() routines work with netdev_queue objects.David S. Miller
Only plain netif_schedule() remains taking a net_device, mostly as a compatability item while we transition the rest of these interfaces. Everything else calls netif_schedule_queue() or __netif_schedule(), both of which take a netdev_queue pointer. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08pkt_sched: Kill stats_lock member of struct Qdisc.David S. Miller
It is always equal to qdisc->dev_queue->lock Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08netdev: Kill qdisc_ingress, use netdev->rx_queue.qdisc instead.David S. Miller
Now that our qdisc management is bi-directional, per-queue, and fully orthogonal, there is no reason to have a special ingress qdisc pointer in struct net_device. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08netdev: Move rest of qdisc state into struct netdev_queueDavid S. Miller
Now qdisc, qdisc_sleeping, and qdisc_list also live there. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08netdev: The ingress_lock member is no longer needed.David S. Miller
Every qdisc is assosciated with a queue, and in the case of ingress qdiscs that will now be netdev->rx_queue so using that queue's lock is the thing to do. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08netdev: Move queue_lock into struct netdev_queue.David S. Miller
The lock is now an attribute of the device queue. One thing to notice is that "suspicious" places emerge which will need specific training about multiple queue handling. They are so marked with explicit "netdev->rx_queue" and "netdev->tx_queue" references. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08pkt_sched: Remove 'dev' member of struct Qdisc.David S. Miller
It can be obtained via the netdev_queue. So create a helper routine, qdisc_dev(), to make the transformations nicer looking. Now, qdisc_alloc() now no longer needs a net_device pointer argument. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08netdev: Create netdev_queue abstraction.David S. Miller
A netdev_queue is an entity managed by a qdisc. Currently there is one RX and one TX queue, and a netdev_queue merely contains a backpointer to the net_device. The Qdisc struct is augmented with a netdev_queue pointer as well. Eventually the 'dev' Qdisc member will go away and we will have the resulting hierarchy: net_device --> netdev_queue --> Qdisc Also, qdisc_alloc() and qdisc_create_dflt() now take a netdev_queue pointer argument. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08pkt_sched: Remove comment reference to old style TX locking.David S. Miller
We haven't had netdev->tbusy in many years :) Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-05net-sched: add dynamically sized qdisc class hash helpersPatrick McHardy
Currently all qdiscs which allow to create classes uses a fixed sized hash table with size 16 to hash the classes. This causes a large bottleneck when using thousands of classes and unbound filters. Add helpers for dynamically sized class hashes to fix this. The following patches will convert the qdiscs to use them. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-01net-sched: change tcf_destroy_chain() to clear start of filter listPatrick McHardy
Pass double tcf_proto pointers to tcf_destroy_chain() to make it clear the start of the filter list for more consistency. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-17Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
2008-04-14[NET_SCHED] sch_api: fix qdisc_tree_decrease_qlen() loopJarek Poplawski
TC_H_MAJ(parentid) for root classes is the same as for ingress, and if ingress qdisc is created qdisc_lookup() returns its pointer (without ingress NULL is returned). After this all qdisc_lookups give the same, and we get endless loop. (I don't know how this could hide for so long - it should trigger with every leaf class deleted if it's qdisc isn't empty.) After this fix qdisc_lookup() is omitted both for ingress and root parents, but looking for root is only wasting a little time here... Many thanks to Enrico Demarin for finding a test for catching this bug, which probably bothered quite a lot of admins. Reported-by: Enrico Demarin <enrico@superclick.com>, Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-26[NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.YOSHIFUJI Hideaki
Introduce per-sock inlines: sock_net(), sock_net_set() and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set(). Without CONFIG_NET_NS, no namespace other than &init_net exists. Let's explicitly define them to help compiler optimizations. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-01-28[NET_SCHED]: sch_api: introduce constant for rate table sizePatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET_SCHED]: Use NLA_PUT_STRING for string dumpingPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET_SCHED]: Convert packet schedulers from rtnetlink to new netlink APIPatrick McHardy
Convert packet schedulers to use the netlink API. Unfortunately a gradual conversion is not possible without breaking compilation in the middle or adding lots of casts, so this patch converts them all in one step. The patch has been mostly generated automatically with some minor edits to at least allow seperate conversion of classifiers and actions. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET_SCHED]: Move EXPORT_SYMBOL next to exported symbolPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: Make rtnetlink infrastructure network namespace aware (v3)Denis V. Lunev
After this patch none of the netlink callback support anything except the initial network namespace but the rtnetlink infrastructure now handles multiple network namespaces. Changes from v2: - IPv6 addrlabel processing Changes from v1: - no need for special rtnl_unlock handling - fixed IPv6 ndisc Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: Modify all rtnetlink methods to only work in the initial namespace (v2)Denis V. Lunev
Before I can enable rtnetlink to work in all network namespaces I need to be certain that something won't break. So this patch deliberately disables all of the rtnletlink methods in everything except the initial network namespace. After the methods have been audited this extra check can be disabled. Changes from v1: - added IPv6 addrlabel protection Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2008-01-28[NET]: Move Qdisc_class_ops and Qdisc_ops in appropriate sections.Eric Dumazet
Qdisc_class_ops are const, and Qdisc_ops are mostly read. Using "const" and "__read_mostly" qualifiers helps to reduce false sharing. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET_SCHED]: Show timer resolution instead of clock resolution in ↵Patrick McHardy
/proc/net/psched The fourth parameter of /proc/net/psched is supposed to show the timer resultion and is used by HTB userspace to calculate the necessary burst rate. Currently we show the clock resolution, which results in a too low burst rate when the two differ. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Make the device list and device lookups per namespace.Eric W. Biederman
This patch makes most of the generic device layer network namespace safe. This patch makes dev_base_head a network namespace variable, and then it picks up a few associated variables. The functions: dev_getbyhwaddr dev_getfirsthwbytype dev_get_by_flags dev_get_by_name __dev_get_by_name dev_get_by_index __dev_get_by_index dev_ioctl dev_ethtool dev_load wireless_process_ioctl were modified to take a network namespace argument, and deal with it. vlan_ioctl_set and brioctl_set were modified so their hooks will receive a network namespace argument. So basically anthing in the core of the network stack that was affected to by the change of dev_base was modified to handle multiple network namespaces. The rest of the network stack was simply modified to explicitly use &init_net the initial network namespace. This can be fixed when those components of the network stack are modified to handle multiple network namespaces. For now the ifindex generator is left global. Fundametally ifindex numbers are per namespace, or else we will have corner case problems with migration when we get that far. At the same time there are assumptions in the network stack that the ifindex of a network device won't change. Making the ifindex number global seems a good compromise until the network stack can cope with ifindex changes when you change namespaces, and the like. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[NET]: Make /proc/net per network namespaceEric W. Biederman
This patch makes /proc/net per network namespace. It modifies the global variables proc_net and proc_net_stat to be per network namespace. The proc_net file helpers are modified to take a network namespace argument, and all of their callers are fixed to pass &init_net for that argument. This ensures that all of the /proc/net files are only visible and usable in the initial network namespace until the code behind them has been updated to be handle multiple network namespaces. Making /proc/net per namespace is necessary as at least some files in /proc/net depend upon the set of network devices which is per network namespace, and even more files in /proc/net have contents that are relevant to a single network namespace. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-31[NET]: Fix sch_api to properly set sch->parent on the root.Patrick McHardy
Fix sch_api to correctly set sch->parent for both ingress and egress qdiscs in qdisc_create(). Signed-off-by: Patrick McHardy <trash@kaber.net> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-15[NET_SCHED]: act_api: qdisc internal reclassify supportPatrick McHardy
The behaviour of NET_CLS_POLICE for TC_POLICE_RECLASSIFY was to return it to the qdisc, which could handle it internally or ignore it. With NET_CLS_ACT however, tc_classify starts over at the first classifier and never returns it to the qdisc. This makes it impossible to support qdisc-internal reclassification, which in turn makes it impossible to remove the old NET_CLS_POLICE code without breaking compatibility since we have two qdiscs (CBQ and ATM) that support this. This patch adds a tc_classify_compat function that handles reclassification the old way and changes CBQ and ATM to use it. This again is of course not fully backwards compatible with the previous NET_CLS_ACT behaviour. Unfortunately there is no way to fully maintain compatibility *and* support qdisc internal reclassification with NET_CLS_ACT, but this seems like the better choice over keeping the two incompatible options around forever. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-14[NET_SCHED]: Revert "avoid transmit softirq on watchdog wakeup" optimizationPatrick McHardy
As noticed by Ranko Zivojnovic <ranko@spidernet.net>, calling qdisc_run from the timer handler can result in deadlock: > CPU#0 > > qdisc_watchdog() fires and gets dev->queue_lock > qdisc_run()...qdisc_restart()... > -> releases dev->queue_lock and enters dev_hard_start_xmit() > > CPU#1 > > tc del qdisc dev ... > qdisc_graft()...dev_graft_qdisc()...dev_deactivate()... > -> grabs dev->queue_lock ... > > qdisc_reset()...{cbq,hfsc,htb,netem,tbf}_reset()...qdisc_watchdog_cancel()... > -> hrtimer_cancel() - waiting for the qdisc_watchdog() to exit, while still > holding dev->queue_lock > > CPU#0 > > dev_hard_start_xmit() returns ... > -> wants to get dev->queue_lock(!) > > DEADLOCK! The entire optimization is a bit questionable IMO, it moves potentially large parts of NET_TX_SOFTIRQ work to TIMER_SOFTIRQ/HRTIMER_SOFTIRQ, which kind of defeats the separation of them. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Ranko Zivojnovic <ranko@spidernet.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10[NET_SCHED]: Remove unnecessary includesPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10[NET_SCHED]: Remove CONFIG_NET_ESTIMATOR optionPatrick McHardy
The generic estimator is always built in anways and all the config options does is prevent including a minimal amount of code for setting it up. Additionally the option is already automatically selected for most cases. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03[NET]: Rework dev_base via list_head (v3)Pavel Emelianov
Cleanup of dev_base list use, with the aim to simplify making device list per-namespace. In almost every occasion, use of dev_base variable and dev->next pointer could be easily replaced by for_each_netdev loop. A few most complicated places were converted to using first_netdev()/next_netdev(). Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[NET_SCHED]: ingress: switch back to using ingress_lockPatrick McHardy
Switch ingress queueing back to use ingress_lock. qdisc_lock_tree now locks both the ingress and egress qdiscs on the device. All changes to data that might be used on both ingress and egress needs to be protected by using qdisc_lock_tree instead of manually taking dev->queue_lock. Additionally the qdisc stats_lock needs to be initialized to ingress_lock for ingress qdiscs. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[NET_SCHED]: Eliminate qdisc_tree_lockPatrick McHardy
Since we're now holding the rtnl during the entire dump operation, we can remove qdisc_tree_lock, whose only purpose is to protect dump callbacks from concurrent changes to the qdisc tree. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[NET_SCHED]: qdisc: remove unnecessary memory barriersPatrick McHardy
We're holding dev->queue_lock in qdisc_watchdog_schedule and qdisc_watchdog_cancel, no need for the barriers. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[NET_SCHED]: Unline tcf_destroyPatrick McHardy
Uninline tcf_destroy and add a helper function to destroy an entire filter chain. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[NET_SCHED] qdisc: avoid transmit softirq on watchdog wakeupStephen Hemminger
If possible, avoid having to do a transmit softirq when a qdisc watchdog decides to re-enable. The watchdog routine runs off a timer, so it is already in the same effective context as the softirq. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[NETEM]: avoid excessive requeuesStephen Hemminger
The netem code would call getnstimeofday() and dequeue/requeue after every packet, even if it was waiting. Avoid this overhead by using the throttled flag. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[PKT_SCHED] qdisc: Use rtnl registration interfaceThomas Graf
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[NETLINK]: Use nlmsg_trim() where appropriateArnaldo Carvalho de Melo
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[SK_BUFF]: Convert skb->tail to sk_buff_data_tArnaldo Carvalho de Melo
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes on 64bit architectures, allowing us to combine the 4 bytes hole left by the layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4 64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN... :-) Many calculations that previously required that skb->{transport,network, mac}_header be first converted to a pointer now can be done directly, being meaningful as offsets or pointers. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[NET_SCHED]: Fix warningPatrick McHardy
net/sched/sch_api.c: In function 'psched_show': net/sched/sch_api.c:1219: warning: format '%08x' expects type 'unsigned int', but argument 6 has type 's64' Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>