tc-hfsc

HISTORY & INTRODUCTION
       HFSC  -  Hierarchical  Fair  Service  Curve was first presented at SIG-
       COMM'97. Developed as a part of ALTQ (ALTernative Queuing)  on  NetBSD,
       found  its  way  quickly to other BSD systems, and then a few years ago
       became part of the linux kernel.  Still,  it's  not  the  most  popular
       scheduling  algorithm  -  especially  if compared to HTB - and it's not
       well documented from enduser's perspective. This introduction  aims  to
       explain  how  HFSC works without going to deep into math side of things
       (although some if it will be inevitable).

       In short HFSC aims to:

           1)  guarantee precise bandwidth and delay allocation for  all  leaf
               classes (realtime criterion)

           2)  allocate  excess bandwidth fairly as specified by class hierar-
               chy (linkshare & upperlimit criterion)

           3)  minimize any discrepancy between  the  service  curve  and  the
               actual amount of service provided during linksharing

       The  main  "selling" point of HFSC is feature (1), which is achieved by
       using nonlinear service curves (more about what it actually is  later).
       This  is particularly useful in VoIP or games, where not only guarantee
       of consistent bandwidth is important,  but  initial  delay  of  a  data
       stream  as  well. Note that it matters only for leaf classes (where the
       actual queues are) - thus class hierarchy is ignored in realtime case.

       Feature (2) is well, obvious - any algorithm featuring class  hierarchy
       (such  as  HTB  or  CBQ)  strives to achieve that. HFSC does that well,
       although you might end with unusual situations, if you  define  service
       curves carelessly - see section CORNER CASES for examples.

       Feature (3) is mentioned due to the nature of the problem. There may be
       situations where it's either not possible to guarantee service  of  all
       curves  at  the same time, and/or it's impossible to do so fairly. Both
       will be explained later. Note that this is mainly related  to  interior
       (aka aggregate) classes, as the leafs are already handled by (1). Still
       - it's perfectly possible to create a leaf class w/o realtime  service,
       and in such case - the caveats will naturally extend to leaf classes as
       well.


ABBREVIATIONS
       For the remaining part of the document, we'll use following shortcuts:

           RT - realtime
           LS - linkshare
           UL - upperlimit
           SC - service curve

BASICS OF HFSC
       To understand how HFSC works, we must first introduce a service  curve.
       Overall,  it's  a  nondecreasing  function of some time unit, returning
               vice curve is violated, maybe it isn't if we count excess band-
               width received during earlier active period(s)

       Let's define the criterion as follows:

           (1) For each t1, there must exist t0 in set B, so S(t1-t0) <= w(t0,t1)

       Here 'w' denotes the amount of service received during some time period
       between t0 and t1. B is a set of all times,  where  a  session  becomes
       active  after idling period (further denoted as 'becoming backlogged').
       For a clearer picture, imagine two situations:

           a)  our session was active during two periods, with  a  small  time
               gap between them

           b)  as in (a), but with a larger gap

       Consider  (a)  - if the service received during both periods meets (1),
       then all is good. But what if it doesn't do so during the 2nd period  ?
       If  the amount of service received during the 1st period is bigger than
       the service curve, then it might compensate for smaller service  during
       the 2nd period and the gap - if the gap is small enough.

       If  the gap is larger (b) - then it's less likely to happen (unless the
       excess bandwidth allocated during  the  1st  part  was  really  large).
       Still,  the  larger  the gap - the less interesting is what happened in
       the past (e.g. 10 minutes ago) - what matters is  the  current  traffic
       that just started.

       From  HFSC's  perspective,  more interesting is answering the following
       question: when should we start transferring packets, so a service curve
       of  a  class  is not violated. Or rephrasing it: How much X() amount of
       service should a session receive by time t, so the service curve is not
       violated.  Function X() defined as below is the basic building block of
       HFSC, used in: eligible, deadline, virtual-time and fit-time curves. Of
       course, X() is based on equation (1) and is defined recursively:


           o   At  the  1st backlogged period beginning function X is initial-
               ized to generic service curve assigned to a class

           o   At any subsequent backlogged period, X() is:
               min(X() from previous period ; w(t0)+S(t-t0) for t>=t0),
               ... where t0 denotes the beginning of  the  current  backlogged
               period.

       HFSC uses either linear, or two-piece linear service curves. In case of
       linear or two-piece linear  convex  functions  (first  slope  <  second
       slope),  min()  in  X's  definition reduces to the 2nd argument. But in
       case of two-piece concave functions, the  1st  argument  might  quickly
       become  lesser  for  some t>=t0. Note, that for some backlogged period,
       X() is defined only  from  that  period's  beginning.  We  also  define
       X^(-1)(w)  as  smallest t>=t0, for which X(t) = w. We have to define it
       this way, as X() is usually not an injection.
               Based on RT service curve.

           V() In  linkshare  criterion, arbitrates which packet to send next.
               Note that V() is function of a virtual  time  -  see  LINKSHARE
               CRITERION section for details. Virtual time 'vt' corresponds to
               packets' heads (vt = V^(-1)(w)). Based on LS service curve.

           F() An extension to linkshare criterion, used  to  limit  at  which
               speed  linkshare criterion is allowed to dequeue. Fit-time 'ft'
               corresponds to packets' heads as well (ft =  F^(-1)(w)).  Based
               on UL service curve.

       Be  sure to make clean distinction between session's RT, LS and UL ser-
       vice curves and the above "utility" functions.

REALTIME CRITERION
       RT criterion ignores class hierarchy and guarantees  precise  bandwidth
       and  delay allocation. We say that packet is eligible for sending, when
       current real time is bigger than eligible time. From all packets eligi-
       ble,  the  one  most  suited  for sending, is the one with the smallest
       deadline time. Sounds simply, but consider following example:

       Interface 10mbit, two  classes,  both  with  two-piece  linear  service
       curves:

           o   1st  class  - 2mbit for 100ms, then 7mbit (convex - 1st slope <
               2nd slope)

           o   2nd class - 7mbit for 100ms, then 2mbit (concave - 1st slope  >
               2nd slope)

       Assume  for  a  moment,  that we only use D() for both finding eligible
       packets, and choosing the most fitting one, thus eligible time would be
       computed   as   D^(-1)(w)  and  deadline  time  would  be  computed  as
       D^(-1)(w+l). If the 2nd class starts sending packets 1 second after the
       1st class, it's of course impossible to guarantee 14mbit, as the inter-
       face capability is only 10mbit.  The only workaround in  this  scenario
       is  to  allow the 1st class to send the packets earlier that would nor-
       mally be allowed. That's where separate E() comes to help. Putting  all
       the math aside (see HFSC paper for details), E() for RT concave service
       curve is just like D(), but for the RT convex service curve - it's con-
       structed  using  only  RT  service  curve's 2nd slope (in our example -
       7mbit).

       The effect of such E() - packets will be sent earlier, and at the  same
       time  D() will be updated - so current deadline time calculated from it
       will be bigger. Thus, when the 2nd class starts sending packets  later,
       both  the 1st and the 2nd class will be eligible, but the 2nd session's
       deadline time will be smaller and its packets will be sent first.  When
       the  1st  class becomes idle at some later point, the 2nd class will be
       able to "buffer" up again for later active period of the 1st class.

       A short remark - in a situation, where the total  amount  of  bandwidth
       available  on the interface is bigger than the allocated total realtime
       to LS one - see UPPERLIMIT CRITERION section).

       Anyway - careless specification of LS and RT service curves can lead to
       potentially  undesired situations (see CORNER CASES for examples). This
       wasn't the case in HFSC paper where LS and RT service  curves  couldn't
       be specified separately.


LINKSHARING CRITERION
       LS  criterion's  task is to distribute bandwidth according to specified
       class hierarchy. Contrary to  RT  criterion,  there're  no  comparisons
       between  current  real  time  and  virtual time - the decision is based
       solely on direct comparison of virtual times of all active subclasses -
       the  one  with  the  smallest vt wins and gets scheduled. One immediate
       conclusion from this fact is that absolute values don't matter  -  only
       ratios  between  them (so for example, two children classes with simple
       linear 1mbit service curves will get the same treatment from LS  crite-
       rion's  perspective,  as  if they were 5mbit). The other conclusion is,
       that in perfectly fluid system with linear curves,  all  virtual  times
       across whole class hierarchy would be equal.

       Why is VC defined in term of virtual time (and what is it) ?

       Imagine  an  example:  class A with two children - A1 and A2, both with
       let's say 10mbit SCs. If A2 is idle, A1 receives all the bandwidth of A
       (and  update its V() in the process). When A2 becomes active, A1's vir-
       tual time is already far bigger than A2's one. Considering the type  of
       decision  made by LS criterion, A1 would become idle for a lot of time.
       We can workaround this situation by adjusting virtual time of the class
       becoming  active  -  we do that by getting such time "up to date". HFSC
       uses a mean of the smallest and the biggest virtual time  of  currently
       active children fit for sending. As it's not real time anymore (exclud-
       ing trivial case of situation where all classes become  active  at  the
       same time, and never become idle), it's called virtual time.

       Such  approach  has  its price though. The problem is analogous to what
       was presented in previous section and is  caused  by  non-linearity  of
       service curves:

       1)  either  it's  impossible  to  guarantee  service curves and satisfy
           fairness during certain time periods:

           Recall the example from RT section, slightly modified  (with  3mbit
           slopes instead of 2mbit ones):


           o   1st  class  - 3mbit for 100ms, then 7mbit (convex - 1st slope <
               2nd slope)

           o   2nd class - 7mbit for 100ms, then 3mbit (concave - 1st slope  >
               2nd slope)


           They  sum  up  nicely  to  10mbit - interface's capacity. But if we

           This  is  similar to the above case, but a bit more subtle. We will
           consider two subtrees, arbitrated by their common (root here)  par-
           ent:

           R (root) - 10mbit

           A  - 7mbit, then 3mbit
           A1 - 5mbit, then 2mbit
           A2 - 2mbit, then 1mbit

           B  - 3mbit, then 7mbit

           R arbitrates between left subtree (A) and right (B). Assume that A2
           and B are constantly backlogged, and at some later point A1 becomes
           backlogged (when all other classes are in their 2nd linear part).

           What happens now ? B (choice made by R) will always get 7 mbit as R
           is only (obviously) concerned with the  ratio  between  its  direct
           children.  Thus  A  subtree gets 3mbit, but its children would want
           (at the point when A1 became backlogged) 5mbit + 1mbit.  That's  of
           course impossible, as they can only get 3mbit due to interface lim-
           itation.

           In the left subtree - we have  the  same  situation  as  previously
           (fair split between A1 and A2, but violated guarantees), but in the
           whole tree - there's no fairness (B got 7mbit, but A1 and  A2  have
           to fit together in 3mbit) and there's no guarantees for all classes
           (only B got what it wanted). Even if we violated fairness in the  A
           subtree and set A2's service curve to 0, A1 would still not get the
           required bandwidth.

UPPERLIMIT CRITERION
       UL criterion is an extensions to LS one, that permits  sending  packets
       only  if current real time is bigger than fit-time ('ft'). So the modi-
       fied LS criterion becomes: choose the smallest virtual  time  from  all
       active  children,  such  that  fit-time < current real time also holds.
       Fit-time is calculated from F(), which is based on UL service curve. As
       you  can  see,  it's role is kinda similar to E() used in RT criterion.
       Also, for obvious reasons - you can't specify UL service curve  without
       LS one.

       Main  purpose  of UL service curve is to limit HFSC to bandwidth avail-
       able on the upstream router (think adsl home  modem/router,  and  linux
       server  as  nat/firewall/etc.  with  100mbit+  connection  to mentioned
       modem/router).  Typically, it's used to create a single class  directly
       under  root,  setting  linear UL service curve to available bandwidth -
       and then creating your class structure from that  class  downwards.  Of
       course,  you're  free  to  add  UL service (linear or not) curve to any
       class with LS criterion.

       Important part about UL service curve is, that whenever at  some  point
       in  time  a  class doesn't qualify for linksharing due to its fit-time,
       the next time it does qualify, it will update its virtual time  to  the
       B - ls 2.5mbit
       C - ls 2.5mbit, ul 2.5mbit

       If B was idle, while A and C were  constantly  backlogged,  they  would
       normally  (as far as LS criterion is concerned) divide bandwidth in 2:1
       ratio. But due to UL service curve  in  place,  C  would  get  at  most
       2.5mbit,  and  A  would get the remaining 7.5mbit. The longer the back-
       logged period, the more virtual times of A and C would drift apart.  If
       B became backlogged at some later point in time, its virtual time would
       be set to (A's vt + C's vt)/2, thus blocking A from sending  any  traf-
       fic, until B's virtual time catches up with A.

SEPARATE LS / RT SCs
       Another  difference from original HFSC paper, is that RT and LS SCs can
       be specified separately. Moreover - leaf classes are  allowed  to  have
       only  either  RT  SC  or  LS SC. For interior classes, only LS SCs make
       sense - Any RT SC will be ignored.

CORNER CASES
       Separate service curves for LS and RT  criteria  can  lead  to  certain
       traps, that come from "fighting" between ideal linksharing and enforced
       realtime guarantees. Those situations didn't  exist  in  original  HFSC
       paper,  where  specifying  separate LS / RT service curves was not dis-
       cussed.

       Consider interface with capacity 10mbit, with following leaf classes:

       A - ls 5.0mbit, rt 8mbit
       B - ls 2.5mbit
       C - ls 2.5mbit

       Imagine A and C are constantly backlogged. As B is idle, A and C  would
       divide bandwidth in 2:1 ratio, considering LS service curve (so in the-
       ory - 6.66 and 3.33). Alas RT criterion takes priority, so A  will  get
       8mbit  and LS will be able to compensate class C for only 2 mbit - this
       will cause discrepancy between virtual times of A and C.

       Assume this situation lasts for a lot of time with no idle periods, and
       suddenly  B  becomes  active.  B's  virtual  time  will  be  updated to
       (A's vt + C's vt)/2, effectively landing in the middle between A's  and
       C's virtual time. The effect - B, having no RT guarantees, will be pun-
       ished and will not be  allowed  to  transfer  until  C's  virtual  time
       catches up.

       If  the interface had higher capacity - for example 100mbit, this exam-
       ple would behave perfectly fine though.

       Let's look a bit closer at the above example -  it  "cleverly"  invali-
       dates  one of the basic things LS criterion tries to achieve - equality
       of all virtual times across class hierarchy. Leaf  classes  without  RT
       service curves are literally left to their own fate (governed by messed
       up virtual times).

       Also - it doesn't make much sense. Class A will always be guaranteed up
       You  can  quickly "workaround" it by making sure each leaf class has RT
       service curve assigned (thus guaranteeing all of  them  will  get  some
       bandwidth), but it doesn't make it any more valid.

       Keep in mind - if you use nonlinear curves and irregularities explained
       above happen only in the first segment, then there's little wrong  with
       "overusing" RT curve a bit:

       A - ls 5.0mbit, rt 9mbit/30ms, then 1mbit
       B - ls 2.5mbit
       C - ls 2.5mbit

       Here,  the  vt of A will "spike" in the initial period, but then A will
       never get more than 1mbit, until B & C catch up. Then  everything  will
       be back to normal.

LINUX AND TIMER RESOLUTION
       In  certain  situations, the scheduler can throttle itself and setup so
       called watchdog to wakeup dequeue function at some time later. In  case
       of  HFSC it happens when for example no packet is eligible for schedul-
       ing, and UL service curve is used to limit the speed at which LS crite-
       rion  is  allowed to dequeue packets. It's called throttling, and accu-
       racy of it is dependent on how the kernel is compiled.

       There're 3 important options in modern kernels, as far as timers' reso-
       lution  goes:  'tickless  system',  'high resolution timer support' and
       'timer frequency'.

       If you have 'tickless system' enabled, then the  timer  interrupt  will
       trigger  as  slowly  as  possible,  but each time a scheduler throttles
       itself (or any other part of the kernel  needs  better  accuracy),  the
       rate  will  be  increased  as  needed / possible. The ceiling is either
       'timer frequency' if 'high resolution timer support' is  not  available
       or  not  compiled  in, or it's hardware dependent and can go far beyond
       the highest 'timer frequency' setting available.

       If 'tickless system' is not enabled, the timer will trigger at a  fixed
       rate  specified  by  'timer  frequency' - regardless if high resolution
       timers are or aren't available.

       This is important to keep those settings in mind, as in scenario  like:
       no tickless, no HR timers, frequency set to 100hz - throttling accuracy
       would be at 10ms. It doesn't automatically mean you would be limited to
       ~0.8mbit/s (assuming packets at ~1KB) - as long as your queues are pre-
       pared to cover for timer inaccuracy. Of course, in case of e.g. locally
       generated  udp  traffic  -  appropriate  socket size is needed as well.
       Short  example  to  make  it  more  understandable   (assume   hardcore
       anti-schedule settings - HZ=100, no HR timers, no tickless):

       tc qdisc add dev eth0 root handle 1:0 hfsc default 1
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 10mbit

       Assuming  packet  of  ~1KB size and HZ=100, that averages to ~0.8mbit -
       anything beyond it (e.g. the above example with specified rate over 10x
       tc qdisc add dev eth0 parent root handle 1:0 hfsc default 1
       tc class add dev eth0 paretn 1:0 classid 1:1 hfsc rt m2 300mbit

       And simple:

       nc -u dst.host.com 54321 </dev/zero
       nc -l -p 54321 >/dev/null

       ...will  yield following effects over period of ~10 seconds (taken from
       /proc/interrupts):

       319: 42124229   0  HPET_MSI-edge  hpet2 (before)
       319: 42436214   0  HPET_MSI-edge  hpet2 (after 10s.)

       That's roughly 31000/s. Now compare it with HZ=1000 setting. The  obvi-
       ous  drawback  of it is that cpu load can be rather extensive with ser-
       vicing that many timer interrupts.  Example  with  300mbit  RT  service
       curve  on  1gbit  link  is  particularly  ugly, as it requires a lot of
       throttling with minuscule delays.

       Also note that it's just an example showing capability of current hard-
       ware.   The  above example (essentially 300mbit TBF emulator) is point-
       less on internal interface to begin with - you will pretty much  always
       want  regular  LS service curve there, and in such scenario HFSC simply
       doesn't throttle at all.

       300mbit RT service curve (selected columns from mpstat -P ALL 1):

       10:56:43 PM  CPU  %sys     %irq   %soft   %idle
       10:56:44 PM  all  20.10    6.53   34.67   37.19
       10:56:44 PM    0  35.00    0.00   63.00    0.00
       10:56:44 PM    1   4.95   12.87    6.93   73.27

       So, in rare case you need those speeds with only RT service  curve,  or
       with UL service curve - remember about drawbacks.

CAVEAT: RANDOM ONLINE EXAMPLES
       For reasons unknown (though well guessed), many examples you can google
       love to overuse UL criterion and stuff it in every node possible.  This
       makes no sense and works against what HFSC tries to do (and does pretty
       damn well). Use UL where it makes sense -  on  the  uppermost  node  to
       match upstream router's uplink capacity. Or - in special cases, such as
       testing (limit certain subtree to some speed) or  customers  that  must
       never  get  more  than  certain speed. In the last case you can usually
       achieve the same by just using  RT  criterion  without  LS+UL  on  leaf
       nodes.

       As for router case - remember it's good to differentiate between "traf-
       fic to router" (remote console, web config, etc.) and  "outgoing  traf-
       fic", so for example:

       tc qdisc add dev eth0 root handle 1:0 hfsc default 0x8002
       tc class add dev eth0 parent 1:0 classid 1:999 hfsc rt m2 50mbit
       tc class add dev eth0 parent 1:0 classid 1:1 hfsc ls m2 2mbit ul m2 2mbit
       Manpage created by Michal Soltys (sol...@ziu.info)



iproute2                        31 October 2011                        HFSC(7)
Man Pages Copyright Respective Owners. Site Copyright (C) 1994 - 2017 Hurricane Electric. All Rights Reserved.