HISTORY & INTRODUCTION
HFSC - Hierarchical Fair Service Curve was first presented at SIG-
COMM'97. Developed as a part of ALTQ (ALTernative Queuing) on NetBSD,
found its way quickly to other BSD systems, and then a few years ago
became part of the linux kernel. Still, it's not the most popular
scheduling algorithm - especially if compared to HTB - and it's not
well documented from enduser's perspective. This introduction aims to
explain how HFSC works without going to deep into math side of things
(although some if it will be inevitable).
In short HFSC aims to:
1) guarantee precise bandwidth and delay allocation for all leaf
classes (realtime criterion)
2) allocate excess bandwidth fairly as specified by class hierar-
chy (linkshare & upperlimit criterion)
3) minimize any discrepancy between the service curve and the
actual amount of service provided during linksharing
The main "selling" point of HFSC is feature (1), which is achieved by
using nonlinear service curves (more about what it actually is later).
This is particularly useful in VoIP or games, where not only guarantee
of consistent bandwidth is important, but initial delay of a data
stream as well. Note that it matters only for leaf classes (where the
actual queues are) - thus class hierarchy is ignored in realtime case.
Feature (2) is well, obvious - any algorithm featuring class hierarchy
(such as HTB or CBQ) strives to achieve that. HFSC does that well,
although you might end with unusual situations, if you define service
curves carelessly - see section CORNER CASES for examples.
Feature (3) is mentioned due to the nature of the problem. There may be
situations where it's either not possible to guarantee service of all
curves at the same time, and/or it's impossible to do so fairly. Both
will be explained later. Note that this is mainly related to interior
(aka aggregate) classes, as the leafs are already handled by (1). Still
- it's perfectly possible to create a leaf class w/o realtime service,
and in such case - the caveats will naturally extend to leaf classes as
For the remaining part of the document, we'll use following shortcuts:
RT - realtime
LS - linkshare
UL - upperlimit
SC - service curve
BASICS OF HFSC
To understand how HFSC works, we must first introduce a service curve.
Overall, it's a nondecreasing function of some time unit, returning
vice curve is violated, maybe it isn't if we count excess band-
width received during earlier active period(s)
Let's define the criterion as follows:
(1) For each t1, there must exist t0 in set B, so S(t1-t0) <= w(t0,t1)
Here 'w' denotes the amount of service received during some time period
between t0 and t1. B is a set of all times, where a session becomes
active after idling period (further denoted as 'becoming backlogged').
For a clearer picture, imagine two situations:
a) our session was active during two periods, with a small time
gap between them
b) as in (a), but with a larger gap
Consider (a) - if the service received during both periods meets (1),
then all is good. But what if it doesn't do so during the 2nd period ?
If the amount of service received during the 1st period is bigger than
the service curve, then it might compensate for smaller service during
the 2nd period and the gap - if the gap is small enough.
If the gap is larger (b) - then it's less likely to happen (unless the
excess bandwidth allocated during the 1st part was really large).
Still, the larger the gap - the less interesting is what happened in
the past (e.g. 10 minutes ago) - what matters is the current traffic
that just started.
From HFSC's perspective, more interesting is answering the following
question: when should we start transferring packets, so a service curve
of a class is not violated. Or rephrasing it: How much X() amount of
service should a session receive by time t, so the service curve is not
violated. Function X() defined as below is the basic building block of
HFSC, used in: eligible, deadline, virtual-time and fit-time curves. Of
course, X() is based on equation (1) and is defined recursively:
o At the 1st backlogged period beginning function X is initial-
ized to generic service curve assigned to a class
o At any subsequent backlogged period, X() is:
min(X() from previous period ; w(t0)+S(t-t0) for t>=t0),
... where t0 denotes the beginning of the current backlogged
HFSC uses either linear, or two-piece linear service curves. In case of
linear or two-piece linear convex functions (first slope < second
slope), min() in X's definition reduces to the 2nd argument. But in
case of two-piece concave functions, the 1st argument might quickly
become lesser for some t>=t0. Note, that for some backlogged period,
X() is defined only from that period's beginning. We also define
X^(-1)(w) as smallest t>=t0, for which X(t) = w. We have to define it
this way, as X() is usually not an injection.
Based on RT service curve.
V() In linkshare criterion, arbitrates which packet to send next.
Note that V() is function of a virtual time - see LINKSHARE
CRITERION section for details. Virtual time 'vt' corresponds to
packets' heads (vt = V^(-1)(w)). Based on LS service curve.
F() An extension to linkshare criterion, used to limit at which
speed linkshare criterion is allowed to dequeue. Fit-time 'ft'
corresponds to packets' heads as well (ft = F^(-1)(w)). Based
on UL service curve.
Be sure to make clean distinction between session's RT, LS and UL ser-
vice curves and the above "utility" functions.
RT criterion ignores class hierarchy and guarantees precise bandwidth
and delay allocation. We say that packet is eligible for sending, when
current real time is bigger than eligible time. From all packets eligi-
ble, the one most suited for sending, is the one with the smallest
deadline time. Sounds simply, but consider following example:
Interface 10mbit, two classes, both with two-piece linear service
o 1st class - 2mbit for 100ms, then 7mbit (convex - 1st slope <
o 2nd class - 7mbit for 100ms, then 2mbit (concave - 1st slope >
Assume for a moment, that we only use D() for both finding eligible
packets, and choosing the most fitting one, thus eligible time would be
computed as D^(-1)(w) and deadline time would be computed as
D^(-1)(w+l). If the 2nd class starts sending packets 1 second after the
1st class, it's of course impossible to guarantee 14mbit, as the inter-
face capability is only 10mbit. The only workaround in this scenario
is to allow the 1st class to send the packets earlier that would nor-
mally be allowed. That's where separate E() comes to help. Putting all
the math aside (see HFSC paper for details), E() for RT concave service
curve is just like D(), but for the RT convex service curve - it's con-
structed using only RT service curve's 2nd slope (in our example -
The effect of such E() - packets will be sent earlier, and at the same
time D() will be updated - so current deadline time calculated from it
will be bigger. Thus, when the 2nd class starts sending packets later,
both the 1st and the 2nd class will be eligible, but the 2nd session's
deadline time will be smaller and its packets will be sent first. When
the 1st class becomes idle at some later point, the 2nd class will be
able to "buffer" up again for later active period of the 1st class.
A short remark - in a situation, where the total amount of bandwidth
available on the interface is bigger than the allocated total realtime
to LS one - see UPPERLIMIT CRITERION section).
Anyway - careless specification of LS and RT service curves can lead to
potentially undesired situations (see CORNER CASES for examples). This
wasn't the case in HFSC paper where LS and RT service curves couldn't
be specified separately.
LS criterion's task is to distribute bandwidth according to specified
class hierarchy. Contrary to RT criterion, there're no comparisons
between current real time and virtual time - the decision is based
solely on direct comparison of virtual times of all active subclasses -
the one with the smallest vt wins and gets scheduled. One immediate
conclusion from this fact is that absolute values don't matter - only
ratios between them (so for example, two children classes with simple
linear 1mbit service curves will get the same treatment from LS crite-
rion's perspective, as if they were 5mbit). The other conclusion is,
that in perfectly fluid system with linear curves, all virtual times
across whole class hierarchy would be equal.
Why is VC defined in term of virtual time (and what is it) ?
Imagine an example: class A with two children - A1 and A2, both with
let's say 10mbit SCs. If A2 is idle, A1 receives all the bandwidth of A
(and update its V() in the process). When A2 becomes active, A1's vir-
tual time is already far bigger than A2's one. Considering the type of
decision made by LS criterion, A1 would become idle for a lot of time.
We can workaround this situation by adjusting virtual time of the class
becoming active - we do that by getting such time "up to date". HFSC
uses a mean of the smallest and the biggest virtual time of currently
active children fit for sending. As it's not real time anymore (exclud-
ing trivial case of situation where all classes become active at the
same time, and never become idle), it's called virtual time.
Such approach has its price though. The problem is analogous to what
was presented in previous section and is caused by non-linearity of
1) either it's impossible to guarantee service curves and satisfy
fairness during certain time periods:
Recall the example from RT section, slightly modified (with 3mbit
slopes instead of 2mbit ones):
o 1st class - 3mbit for 100ms, then 7mbit (convex - 1st slope <
o 2nd class - 7mbit for 100ms, then 3mbit (concave - 1st slope >
They sum up nicely to 10mbit - interface's capacity. But if we
This is similar to the above case, but a bit more subtle. We will
consider two subtrees, arbitrated by their common (root here) par-
R (root) - 10mbit
A - 7mbit, then 3mbit
A1 - 5mbit, then 2mbit
A2 - 2mbit, then 1mbit
B - 3mbit, then 7mbit
R arbitrates between left subtree (A) and right (B). Assume that A2
and B are constantly backlogged, and at some later point A1 becomes
backlogged (when all other classes are in their 2nd linear part).
What happens now ? B (choice made by R) will always get 7 mbit as R
is only (obviously) concerned with the ratio between its direct
children. Thus A subtree gets 3mbit, but its children would want
(at the point when A1 became backlogged) 5mbit + 1mbit. That's of
course impossible, as they can only get 3mbit due to interface lim-
In the left subtree - we have the same situation as previously
(fair split between A1 and A2, but violated guarantees), but in the
whole tree - there's no fairness (B got 7mbit, but A1 and A2 have
to fit together in 3mbit) and there's no guarantees for all classes
(only B got what it wanted). Even if we violated fairness in the A
subtree and set A2's service curve to 0, A1 would still not get the
UL criterion is an extensions to LS one, that permits sending packets
only if current real time is bigger than fit-time ('ft'). So the modi-
fied LS criterion becomes: choose the smallest virtual time from all
active children, such that fit-time < current real time also holds.
Fit-time is calculated from F(), which is based on UL service curve. As
you can see, it's role is kinda similar to E() used in RT criterion.
Also, for obvious reasons - you can't specify UL service curve without
Main purpose of UL service curve is to limit HFSC to bandwidth avail-
able on the upstream router (think adsl home modem/router, and linux
server as nat/firewall/etc. with 100mbit+ connection to mentioned
modem/router). Typically, it's used to create a single class directly
under root, setting linear UL service curve to available bandwidth -
and then creating your class structure from that class downwards. Of
course, you're free to add UL service (linear or not) curve to any
class with LS criterion.
Important part about UL service curve is, that whenever at some point
in time a class doesn't qualify for linksharing due to its fit-time,
the next time it does qualify, it will update its virtual time to the
B - ls 2.5mbit
C - ls 2.5mbit, ul 2.5mbit
If B was idle, while A and C were constantly backlogged, they would
normally (as far as LS criterion is concerned) divide bandwidth in 2:1
ratio. But due to UL service curve in place, C would get at most
2.5mbit, and A would get the remaining 7.5mbit. The longer the back-
logged period, the more virtual times of A and C would drift apart. If
B became backlogged at some later point in time, its virtual time would
be set to (A's vt + C's vt)/2, thus blocking A from sending any traf-
fic, until B's virtual time catches up with A.
SEPARATE LS / RT SCs
Another difference from original HFSC paper, is that RT and LS SCs can
be specified separately. Moreover - leaf classes are allowed to have
only either RT SC or LS SC. For interior classes, only LS SCs make
sense - Any RT SC will be ignored.
Separate service curves for LS and RT criteria can lead to certain
traps, that come from "fighting" between ideal linksharing and enforced
realtime guarantees. Those situations didn't exist in original HFSC
paper, where specifying separate LS / RT service curves was not dis-
Consider interface with capacity 10mbit, with following leaf classes:
A - ls 5.0mbit, rt 8mbit
B - ls 2.5mbit
C - ls 2.5mbit
Imagine A and C are constantly backlogged. As B is idle, A and C would
divide bandwidth in 2:1 ratio, considering LS service curve (so in the-
ory - 6.66 and 3.33). Alas RT criterion takes priority, so A will get
8mbit and LS will be able to compensate class C for only 2 mbit - this
will cause discrepancy between virtual times of A and C.
Assume this situation lasts for a lot of time with no idle periods, and
suddenly B becomes active. B's virtual time will be updated to
(A's vt + C's vt)/2, effectively landing in the middle between A's and
C's virtual time. The effect - B, having no RT guarantees, will be pun-
ished and will not be allowed to transfer until C's virtual time
If the interface had higher capacity - for example 100mbit, this exam-
ple would behave perfectly fine though.
Let's look a bit closer at the above example - it "cleverly" invali-
dates one of the basic things LS criterion tries to achieve - equality
of all virtual times across class hierarchy. Leaf classes without RT
service curves are literally left to their own fate (governed by messed
up virtual times).
Also - it doesn't make much sense. Class A will always be guaranteed up
You can quickly "workaround" it by making sure each leaf class has RT
service curve assigned (thus guaranteeing all of them will get some
bandwidth), but it doesn't make it any more valid.
Keep in mind - if you use nonlinear curves and irregularities explained
above happen only in the first segment, then there's little wrong with
"overusing" RT curve a bit:
A - ls 5.0mbit, rt 9mbit/30ms, then 1mbit
B - ls 2.5mbit
C - ls 2.5mbit
Here, the vt of A will "spike" in the initial period, but then A will
never get more than 1mbit, until B & C catch up. Then everything will
be back to normal.
LINUX AND TIMER RESOLUTION
In certain situations, the scheduler can throttle itself and setup so
called watchdog to wakeup dequeue function at some time later. In case
of HFSC it happens when for example no packet is eligible for schedul-
ing, and UL service curve is used to limit the speed at which LS crite-
rion is allowed to dequeue packets. It's called throttling, and accu-
racy of it is dependent on how the kernel is compiled.
There're 3 important options in modern kernels, as far as timers' reso-
lution goes: 'tickless system', 'high resolution timer support' and
If you have 'tickless system' enabled, then the timer interrupt will
trigger as slowly as possible, but each time a scheduler throttles
itself (or any other part of the kernel needs better accuracy), the
rate will be increased as needed / possible. The ceiling is either
'timer frequency' if 'high resolution timer support' is not available
or not compiled in, or it's hardware dependent and can go far beyond
the highest 'timer frequency' setting available.
If 'tickless system' is not enabled, the timer will trigger at a fixed
rate specified by 'timer frequency' - regardless if high resolution
timers are or aren't available.
This is important to keep those settings in mind, as in scenario like:
no tickless, no HR timers, frequency set to 100hz - throttling accuracy
would be at 10ms. It doesn't automatically mean you would be limited to
~0.8mbit/s (assuming packets at ~1KB) - as long as your queues are pre-
pared to cover for timer inaccuracy. Of course, in case of e.g. locally
generated udp traffic - appropriate socket size is needed as well.
Short example to make it more understandable (assume hardcore
anti-schedule settings - HZ=100, no HR timers, no tickless):
tc qdisc add dev eth0 root handle 1:0 hfsc default 1
tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 10mbit
Assuming packet of ~1KB size and HZ=100, that averages to ~0.8mbit -
anything beyond it (e.g. the above example with specified rate over 10x
tc qdisc add dev eth0 parent root handle 1:0 hfsc default 1
tc class add dev eth0 paretn 1:0 classid 1:1 hfsc rt m2 300mbit
nc -u dst.host.com 54321 </dev/zero
nc -l -p 54321 >/dev/null
...will yield following effects over period of ~10 seconds (taken from
319: 42124229 0 HPET_MSI-edge hpet2 (before)
319: 42436214 0 HPET_MSI-edge hpet2 (after 10s.)
That's roughly 31000/s. Now compare it with HZ=1000 setting. The obvi-
ous drawback of it is that cpu load can be rather extensive with ser-
vicing that many timer interrupts. Example with 300mbit RT service
curve on 1gbit link is particularly ugly, as it requires a lot of
throttling with minuscule delays.
Also note that it's just an example showing capability of current hard-
ware. The above example (essentially 300mbit TBF emulator) is point-
less on internal interface to begin with - you will pretty much always
want regular LS service curve there, and in such scenario HFSC simply
doesn't throttle at all.
300mbit RT service curve (selected columns from mpstat -P ALL 1):
10:56:43 PM CPU %sys %irq %soft %idle
10:56:44 PM all 20.10 6.53 34.67 37.19
10:56:44 PM 0 35.00 0.00 63.00 0.00
10:56:44 PM 1 4.95 12.87 6.93 73.27
So, in rare case you need those speeds with only RT service curve, or
with UL service curve - remember about drawbacks.
CAVEAT: RANDOM ONLINE EXAMPLES
For reasons unknown (though well guessed), many examples you can google
love to overuse UL criterion and stuff it in every node possible. This
makes no sense and works against what HFSC tries to do (and does pretty
damn well). Use UL where it makes sense - on the uppermost node to
match upstream router's uplink capacity. Or - in special cases, such as
testing (limit certain subtree to some speed) or customers that must
never get more than certain speed. In the last case you can usually
achieve the same by just using RT criterion without LS+UL on leaf
As for router case - remember it's good to differentiate between "traf-
fic to router" (remote console, web config, etc.) and "outgoing traf-
fic", so for example:
tc qdisc add dev eth0 root handle 1:0 hfsc default 0x8002
tc class add dev eth0 parent 1:0 classid 1:999 hfsc rt m2 50mbit
tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 2mbit ul m2 2mbit
Manpage created by Michal Soltys (sol...@ziu.info)
iproute2 31 October 2011 HFSC(7)
Man Pages Copyright Respective Owners. Site Copyright (C) 1994 - 2017
All Rights Reserved.