Patches to Linux's traffic control engine to allow it to accurately calculate ATM traffic rates
Time has moved on, and the Linux Traffic control engine now has ATM support, largely due to Jesper Brouer. As a consequence this page is now only of historical interest.
Currently the traffic control engine in Linux can't accurately calculate the time required to send data over an ATM link. For large packets (say >1K bytes) the error generated by the current methods is less than 5%, but for small packets the error rises to over 40%. The patches here allow Linux to calculate ATM traffic rates with no error.
Current patches:
iproute2-20060301-tcatm-1.0.patch | This patch for the iproute2 package (sometimes called iproute) adds an option to the "tc" program that configures the kernel to calculate ATM traffic rates. It works best if combined with the kernel patch that follows. |
linux-2.6.16-tcatm-1.0.patch | This patch for the Linux 2.6.16 kernel enables it to accurately calculate how long it will take to send a packet over an ATM link. |
iproute2-20060301-htb-nohyst-1.0.patch | This patch for the iproute2 package (sometimes called iproute) adds an option to the "tc" program that changes the HTB qdisc hysteresis option. It assumes the tcatm patch has been applied. It will only work on kernels that have had the following patch applied. |
linux-2.6.11-htb-nohyst-1.0.patch | This patch for the Linux 2.6.11 kernels onwards allows the HTB hysteresis option to be changed dynamically per HTB class. |
Patches for older kernels.
linux-2.6.14-tcatm-1.0.patch | For 2.6.14 and 2.6.15 kernels. |
linux-2.6.11-tcatm-1.0.patch | For 2.6.11, 2.6.12 and 2.6.13 kernels. |
linux-2.6.8-tcatm-1.0.patch | For 2.6.8, 2.6.9 and 2.6.10 kernels. |
linux-2.6.8-htb-nohyst-1.0.patch | For 2.6.8, 2.6.9 and 2.6.10 kernels. |
Debian users can find patches for Sarge in my Debian repository. There you will find compiled kernels and the iproute package, both patched with tcatm, htb-nohyst, and the IMQ patch. If you want to apply the tcatm patch to the kernel yourself, look at the kernel-patch-tcatm package. Ditto for the kernel-patch-imq package.
The tcatm patch adds or changes several options in the tc program; these are discussed in detail below.
The htb-nohyst patch adds one option to the tc program:
nohyst: controls the hysteresis optimisation, which is otherwise fixed by the compile time constant HTB_HYSTERESIS in sch_htb.c. The option will only work on kernels that have had the htb-nohyst patch applied. It has no effect on unpatched kernels, and you don't get a warning telling you it hasn't worked.
Use the nohyst option to get more accurate packet scheduling (see Jesper's thesis [1] for documentation).
This increased accuracy comes at the expense of increased CPU usage. It is safe to use the nohyst option for slow links. Slow here means anything slower than a LAN, for example a broadband internet connection. On these links the amount of CPU used by HTB isn't significant and you probably want as much control over the link as you can get. As a general rule if you have spare CPU cycles (and any modern CPU doing server duty will), you should turn hysteresis off.
There are three things to be aware of regarding the atm option. The first is whether you need it or not. If all you are doing is receiving large packets (ie you are just arbitrating between www, p2p and email) then you probably don't need it. In that case if you use wget to estimate the underlying rate your link can handle, reduce that by 5% and pass it on to the qdisc, then it will probably work just fine. On the other hand, if you are loading down your link with small packets then you really do need it. There is only one application I know of that will saturate your link with small packets: VOIP.
The second issue is knowing if your internet link is using ATM. This is easy: if you are using ADSL then your link is using ATM. Otherwise it isn't [2].
The final thing to be aware of is that the atm option will only work accurately if you specify the link speed and overheads correctly. The link speed is not critical providing you err on the small side. The speed your ISP gives you probably errs on the large side, so don't use it directly. You can either just cut your ISP's figure by 10%, or you can measure it. Measuring it isn't technically hard - just wget a large compressed file to and from a local (ie fast) site, then use [3] the formula in the last column of the table below to convert from wget's KiBytes/second figure to the raw Kbits/second figure your ADSL line runs at. The tricky bit is finding someone willing to download a large compressed file from you.
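For example (the figures here are invented purely for illustration), suppose wget reports 150 KiBytes/second on a PPPoA VC/Mux link with an MTU of 1478. Applying the formula from that row of the table below:

# A hypothetical worked example of the wget speed formula for the
# PPPoA, VC/Mux row of the table below.  wget_rate and MTU are made up.
import math

wget_rate = 150.0   # KiBytes/second reported by wget (example figure)
MTU = 1478          # optimal MTU for PPPoA VC/Mux, from the table

link_speed = wget_rate / (MTU - 52.) * math.floor((MTU + 57) / 48) * 434.176
print(f"raw line speed: {link_speed:.0f} Kbits/second")   # roughly 1416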
The overhead is the amount of extra data the various protocols add before sending your packets over the ATM link. Calculating the overhead values requires a working knowledge of the protocol stacks used, so I have summarised in the table below the values to use for the most common setups. Also shown is the optimal MTU to use. This is usually smaller than the largest possible MTU because of the padding ATM inserts into the last cell.
The mpu is the smallest number of bytes the protocol stack will send. It includes the overheads. If there isn't sufficient data to build a packet of this size, padding will be added.
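As a rough sketch of how the overhead and mpu values combine (assuming a packet smaller than the mpu is padded up to it before being split into cells; the overhead and mpu figures below are taken from the PPPoE, VC/Mux+FCS row purely as an example):

# A minimal sketch of how overhead and mpu affect the size ATM carries,
# assuming packets below the mpu are padded up to it.  The overhead (36)
# and mpu (82) are the PPPoE, VC/Mux+FCS values, used only as an example.
import math

def atm_bytes(packet_len, overhead=36, mpu=82):
    """Bytes the ATM link carries for one packet, in whole 53 byte cells."""
    data = max(packet_len + overhead, mpu)   # pad tiny packets up to the mpu
    cells = math.ceil(data / 48)             # each 53 byte cell holds 48 data bytes
    return cells * 53

print(atm_bytes(40))     # a TCP ack: padded up to the mpu, 2 cells = 106 bytes
print(atm_bytes(1400))   # a large packet: the mpu is irrelevant, 30 cells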
Connection | Overhead | Overhead (Ethernet next hop) | Optimal MTU | MPU | wget speed formula |
IPoA, VC/Mux | 8 | -6 | ???? | - | wget_rate / (MTU-52.) * math.floor((MTU+55)/48) * 434.176 |
IPoA, LLC/SNAP | 16 | 2 | ???? | - | wget_rate / (MTU-52.) * math.floor((MTU+63)/48) * 434.176 |
Bridged, VC/Mux | 24 | 10 | ???? | - | wget_rate / (MTU-52.) * math.floor((MTU+71)/48) * 434.176 |
Bridged, VC/Mux+FCS | 28 | 14 | ???? | 82 | wget_rate / (MTU-52.) * math.floor((MTU+75)/48) * 434.176 |
Bridged, LLC/SNAP | 32 | 18 | ???? | - | wget_rate / (MTU-52.) * math.floor((MTU+79)/48) * 434.176 |
Bridged, LLC/SNAP+FCS | 36 | 22 | ???? | 70 | wget_rate / (MTU-52.) * math.floor((MTU+83)/48) * 434.176 |
PPPoA, VC/Mux | 10 | -4 | 1478 | - | wget_rate / (MTU-52.) * math.floor((MTU+57)/48) * 434.176 |
PPPoA, LLC/SNAP | 14 | 0 | ???? | - | wget_rate / (MTU-52.) * math.floor((MTU+61)/48) * 434.176 |
PPPoE, VC/Mux | 32 | 18 | ???? | - | wget_rate / (MTU-52.) * math.floor((MTU+79)/48) * 434.176 |
PPPoE, VC/Mux+FCS | 36 | 22 | ???? | 82 | wget_rate / (MTU-52.) * math.floor((MTU+83)/48) * 434.176 |
PPPoE, LLC/SNAP | 40 | 26 | ???? | - | wget_rate / (MTU-52.) * math.floor((MTU+87)/48) * 434.176 |
PPPoE, LLC/SNAP+FCS | 44 | 30 | ???? | 90 | wget_rate / (MTU-52.) * math.floor((MTU+91)/48) * 434.176 |
There are a lot of these, but don't let that faze you. Including the ethernet frame checksum in the protocol is very rare, so you can ignore the rows with +FCS in them. As for the rest - look at how your ADSL modem is configured. This is typically done via a web browser. On one of the modem's web pages the words in the first column will pop out. Beware that if your modem is in bridged mode and you are running a PPPoE client on your computer then you need to use a PPPoE row, not the bridged row.
Only use the second Overhead column (ie the one with the smaller value) if the kernel adds an Ethernet header to the packet length before looking up the rate table, as described in the discussion of overheads below.
This patch is based on the adsl-optimizer patch, and should perform identically. The adsl-optimizer patch was created by Jesper Brouer as part of his thesis. A detailed explanation of the patch with an empirical analysis of how it performs can be found in the thesis, which is available on-line. A brief summary follows.
Before this patch, the kernel + tc combined to calculate the time to send a packet as:
rate_table[packet_length] = (packet_length + overhead) / link_speed [4]
That formula is 100% accurate for most technologies, but not for ATM. ATM always sends data as cells. Cells have a fixed length: 53 bytes, of which 5 bytes are overhead. If a packet doesn't fit into an exact number of cells then the last cell has padding bytes added so it does. This padding causes the simple calculation above to be wrong in most cases. The important question is "how wrong"?
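A small sketch makes the padding concrete (the 8 byte overhead here is just an example figure):

# A minimal illustration of ATM cell padding, assuming an 8 byte protocol
# overhead (the figure is only an example).
import math

OVERHEAD = 8    # assumed protocol overhead in bytes

def atm_cells(packet_len, overhead=OVERHEAD):
    """Return (cells, padding bytes) needed to carry one packet."""
    data = packet_len + overhead
    cells = math.ceil(data / 48)     # each 53 byte cell carries 48 data bytes
    padding = cells * 48 - data      # unused bytes in the last cell
    return cells, padding

for size in (40, 100, 1500):
    cells, pad = atm_cells(size)
    print(f"{size:5d} byte packet -> {cells:2d} cells, {pad:2d} padding bytes, "
          f"{cells * 53:4d} bytes on the wire")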
It is always possible to tweak [5] the overhead and link_speed parameters above to get the kernel to correctly calculate the transmission time for any given packet size. Let us assume we have done the said tweaking so that it works for a packet length + overhead of L. Let us also assume that when the packet length increases by 1 byte a new ATM cell is required [6]. Ergo the percentage error for a packet length plus overhead of L + 1 is:
100 * (actual_transmission_time - kernel_calculated_transmission_time) / kernel_calculated_transmission_time
    = 100 * ((L+53)/link_speed - (L+1)/link_speed) / ((L+1)/link_speed)
    = 100 * 52 / (L+1)
From that formula we can compute a table showing the error when we optimise for any given packet length:
Packet Length (bytes) | Error | Packet Length (bytes) | Error |
48 | 106% | ... | ... |
96 | 54% | 1392 | 3.7% |
144 | 36% | 1440 | 3.6% |
192 | 27% | 1488 | 3.5% |
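The table is easy to reproduce with a couple of lines of Python applying the formula above:

# Reproduce the error table above using the formula 100 * 52 / (L + 1),
# where L is the packet length plus overhead the rate table is exact for.
for length in (48, 96, 144, 192, 1392, 1440, 1488):
    print(f"{length:5d} bytes: {100 * 52 / (length + 1):5.1f}% error")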
The errors are obviously very large when the packet sizes are small. This looks bad, but we have got by without this patch before, so what has changed?
The existing formula works well for large packets. The 3%..5% error is probably within other errors, and so isn't significant. Up until now network traffic has been dominated by large packets. HTTP, FTP, POP3, SMTP, BitTorrent and other P2P traffic will use MTU size packets if they can. When small packets are used, for example TCP acks, DNS queries, NTP, IRC and so on, they invariably take up a low proportion of the bandwidth because they are small. So the approximation "the internet only carries large packets" has worked well enough for now.
What has changed is how we use the network. We now have a new use: real time streaming, and in particular VOIP. VOIP uses small packets, typically in the 60..160 byte range. As the table shows, packets in this size range have very large errors, so large that you have to understate your available bandwidth by 50% to allow for them. It only takes a few VOIP streams to use up most of an ADSL link, so unlike the small packets of old they can't be ignored. What is worse, whereas if you don't give your web session priority it just becomes irritating, if you don't give VOIP the bandwidth it needs it becomes unusable.
The tcatm patch does three things:
When the tc program is used on an unpatched kernel the kernel will ignore the rate table alignment offset. This means that it will get errors in around 14% of all packet sizes, but the remaining 86% will be calculated accurately. Under the old version of tc all packet sizes bar the one you optimise for will be calculated inaccurately.
Hysteresis delays the movement of an HTB class through the class tree in the hope that the change in traffic load is temporary and thus the move won't be needed in a short while. As there is a significant amount of work involved in moving a class through the tree, this reduces the amount of CPU time HTB uses.
The tradeoff is that the HTB shaping of the class becomes inaccurate. The effect is a higher delay variance or jitter. This effect is larger for slow links, because it is governed by the transmission delay (which is determined by the bandwidth). For documentation see Jesper's thesis [1].
The amount of CPU HTB uses is insignificant on modern CPUs sending traffic over broadband connections [7], and conversely the need to prioritise the limited amount of bandwidth accurately is paramount, particularly for VOIP. The hysteresis optimisation isn't appropriate for such usage. If posts to the LARTC list are anything to go by, most Linux traffic control users fall into this category. Changing hysteresis (which is on by default) from a compile time option to a run time option seems like a good thing to do under such circumstances.
The overheads of the protocol stacks are calculated as follows:
Connection | Protocol | Overhead (bytes) |
IPoA, VC/Mux RFC-2684 | ATM AAL5 SAR | 8 |
Total | 8 | |
IPoA, LLC/SNAP RFC-2684 | ATM LLC | 3 |
ATM SNAP | 5 | |
ATM AAL5 SAR | 8 | |
Total | 16 | |
Bridged, VC/Mux RFC-1483/2684 | Ethernet Header | 14 |
ATM pad | 2 | |
ATM AAL5 SAR | 8 | |
Total | 24 | |
Bridged, VC/Mux+FCS RFC-1483/2684 | Ethernet Header | 14 |
Ethernet PAD [8] | 0 | |
Ethernet Checksum | 4 | |
ATM pad | 2 | |
ATM AAL5 SAR | 8 | |
Total | 28 | |
Bridged, LLC/SNAP RFC-1483/2684 | Ethernet Header | 14 |
ATM LLC | 3 | |
ATM SNAP | 5 | |
ATM pad | 2 | |
ATM AAL5 SAR | 8 | |
Total | 32 | |
Bridged, LLC/SNAP+FCS RFC-1483/2684 | Ethernet Header | 14 |
Ethernet PAD [8] | 0 | |
Ethernet Checksum | 4 | |
ATM LLC | 3 | |
ATM SNAP | 5 | |
ATM pad | 2 | |
ATM AAL5 SAR | 8 | |
Total | 36 | |
PPPoA, VC/Mux RFC-2364 | PPP | 2 |
ATM AAL5 SAR | 8 | |
Total | 10 | |
PPPoA, LLC RFC-2364 | PPP | 2 |
ATM LLC | 3 | |
ATM LLC-NLPID | 1 | |
ATM AAL5 SAR | 8 | |
Total | 14 | |
PPPoE, VC/Mux RFC-2684 | PPP | 2 |
PPPoE | 6 | |
Ethernet Header | 14 | |
ATM pad | 2 | |
ATM AAL5 SAR | 8 | |
Total | 32 | |
PPPoE, VC/Mux+FCS RFC-2684 | PPP | 2 |
PPPoE | 6 | |
Ethernet Header | 14 | |
Ethernet PAD [8] | 0 | |
Ethernet Checksum | 4 | |
ATM pad | 2 | |
ATM AAL5 SAR | 8 | |
Total | 36 | |
PPPoE, LLC/SNAP RFC-2684 | PPP | 2 |
PPPoE | 6 | |
Ethernet Header | 14 | |
ATM LLC | 3 | |
ATM SNAP | 5 | |
ATM pad | 2 | |
ATM AAL5 SAR | 8 | |
Total | 40 | |
PPPoE, LLC/SNAP+FCS RFC-2684 | PPP | 2 |
PPPoE | 6 | |
Ethernet Header | 14 | |
Ethernet PAD [8] | 0 | |
Ethernet Checksum | 4 | |
ATM LLC | 3 | |
ATM SNAP | 5 | |
ATM pad | 2 | |
ATM AAL5 SAR | 8 | |
Total | 44 |
If the next hop of the packet is over an Ethernet link (as opposed to a PPP link, say), then the kernel will add an Ethernet header to the packet length before looking up the rate table. In that case you must subtract the length of an Ethernet header (14 bytes) from the figure above. This is how the negative overhead figures arise.
There are also additional overheads on incoming packets. However, if you are using IMQ for ingress control you can ignore this as it removes them.
The overheads above can then be used to calculate the optimal MTU. The optimal MTU is the one that generates the largest packet size that doesn't need to have its last ATM cell padded [9]. In other words, it is the largest packet size that is evenly divisible by 48. This packet size includes all of the overheads listed above.
The largest packet size in turn is determined by the underlying link layer you are using to transport your data to your modem. For Ethernet and most other transports, it is 1500 bytes. To that you have to add the overheads added by ATM, which is everything bar the PPPoE and PPP headers. For PPPoE VC/Mux say, this is the Ethernet and ATM headers, which total 26 bytes. Hence the largest packet size that could conceivably be seen by ATM when you are using PPPoE is 1526 bytes. This occupies 31.7916... ATM cells. The last cell (the one carrying the fractional .7916...) has padding, which we don't want, so we discard it. This means the largest optimal packet size occupies 31 cells. You can do this calculation for all the other protocols, but it turns out the optimal size for all of them is 31 ATM cells.
31 ATM cells carry 1488 bytes of data. But some of that data is overhead added after the kernel sends the packet. As those overheads are not included in the MTU they must be removed. The overheads are just those listed in the table above - but there is a final twist. The link layer carries its own overheads "for free" as it were, thus they must be removed from the overheads in the table. Generally the link layer is Ethernet, so if the overheads in the table include the Ethernet Header and CRC you remove them, then subtract the resulting figure from 1488 to get the optimal MTU.
For example, for PPPoE VC/Mux there are 16 bytes of non-Ethernet overheads, so the resultant optimal MTU is 1472 bytes.
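A minimal sketch of this calculation, using the PPPoE VC/Mux figures quoted above (the split between overheads the link layer carries "for free" and the rest depends on your setup, so treat the numbers as an illustration rather than a recipe):

# A rough sketch of the optimal MTU calculation described above, using the
# PPPoE VC/Mux figures quoted in the text (1526 byte largest ATM packet,
# 16 bytes of overhead not carried by the link layer).
import math

CELL_DATA = 48    # data bytes carried by each 53 byte ATM cell

def optimal_mtu(largest_atm_packet, non_link_layer_overhead):
    """Largest MTU whose final ATM cell needs no padding."""
    whole_cells = math.floor(largest_atm_packet / CELL_DATA)   # drop the padded cell
    return whole_cells * CELL_DATA - non_link_layer_overhead

print(optimal_mtu(1526, 16))   # prints 1472, as in the text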
In deriving the ADSL link speed from the figure printed by wget, a few simplifying assumptions are made, chief among them that the link is saturated with MTU sized TCP packets, each carrying 52 bytes of TCP/IP headers.
The first step is to calculate the number of ATM cells it would take to send an MTU sized packet. The number of data bytes seen by the ATM link is the MTU plus the overheads from the above table. This is then rounded up to the smallest number of whole ATM cells that will hold the data. As each ATM cell holds 48 bytes of data, this becomes:
floor((MTU+overhead+48-1)/48) {cells/packet}
The next step is to calculate the number of packets being transmitted per second. The amount of data being sent in each packet is the MTU less the TCP/IP overheads. Assuming the TCP/IP overheads are 52 bytes (a 20 byte IP header, a 20 byte TCP header and 12 bytes of TCP options), and since wget reports in units of KiBytes/second, we have:
packet_rate = wget_rate*1024 /(MTU-52) {packets/second}
Finally, we have to convert the transmission rate from cells per second to Kbits per second. So we end up with:
link_speed = wget_rate*1024 / (MTU-52) {packets/second} * floor((MTU+overhead+48-1)/48) {cells/packet} * 53 {bytes/cell} * 8 {bits/byte} / 1000 {1/K}
           = wget_rate / (MTU-52) * floor((MTU+overhead+47)/48) * 434.176 {Kbits/second}
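Wrapped up as a small helper function (the figures in the example call are invented), the derivation looks like this:

# A small helper applying the derivation above.  wget_rate is the
# KiBytes/second figure wget prints; MTU and overhead come from the tables
# on this page.  The figures in the example call are only an illustration.
import math

def adsl_link_speed(wget_rate, mtu, overhead):
    """Convert wget's KiBytes/second into the raw ADSL rate in Kbits/second."""
    packets_per_second = wget_rate * 1024 / (mtu - 52)          # 52 bytes of TCP/IP headers
    cells_per_packet = math.floor((mtu + overhead + 47) / 48)   # whole ATM cells per packet
    return packets_per_second * cells_per_packet * 53 * 8 / 1000

# Example: 150 KiB/s reported by wget on a PPPoA VC/Mux link (overhead 10, MTU 1478).
print(round(adsl_link_speed(150, 1478, 10)))   # roughly 1416 Kbits/second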
My own work in this arose because I was trying to implement a VOIP network. I was getting weird results from the traffic control engine, which I eventually tracked down to tc not allowing for ATM cell padding. I wrote some patches to fix this, and since I don't like carrying patches I intended to post them upstream.
Having written the code, I then discovered that someone had been there and done that: Jesper Brouer and his ADSL Optimizer. Jesper had obviously spent a lot more time than I had analysing the problem and his code was mature - unlike mine. If I was going to post a patch upstream his patches were obviously the better starting point.
Hence the tcatm patches are Jesper's, but with a few changes of my own.
Before posting these patches I invited Jesper to comment on them. He did, and I have incorporated his suggestions, so this is now more of a joint effort. However, when agreeing to put his name on them he relied on my assurances that the new code is semantically identical to his. If it isn't, the fault is entirely mine.
The htb-nohyst patch is mine. The best source of information on HTB is the author's web site about it.
You must do ingress traffic control for VOIP to work. The only way I know to do that effectively under Linux is to use Linux IMQ.
Finally, this would not have happened if my employer, Lube Mobile, wasn't prepared to let me work on this to get their VOIP system going.
[1] | The effects of the hysteresis optimisation on flow control are worse than you might expect, particularly for applications that don't like jitter - such as VOIP. See Jesper's thesis, chapter 7, section 7.3.1, pp 69-70 for details. |
[2] | Well, it might still be an ATM link. But if it is, your job is to run the Internet, and you don't need me to tell you if it is or isn't. |
[3] | The formula is given as a python expression. If you don't understand it you can just run this: python -c 'import math; wget_rate=XXX; MTU=YYY; print formula' to print the raw ADSL line speed in Kbits/second. Replace XXX and YYY with your actual figures, and formula with the formula from the table. |
[4] | I have omitted packet length scaling here as it doesn't affect what follows. |
[5] | "Tweak" here means lie to kernel about the overhead and/or link speed. For large packets you can reduce the link speed by 5%. For small packets, it is simpler to include the ATM padding in the overhead. |
[6] | The intention is to approximate the worst case scenario. |
[7] | Connections running at say less than 10Mbits/second. |
[8] | When the Ethernet Frame Checksum (FCS) is added, the Ethernet frame padding is included as well. The padding ensures the frame is at least 64 bytes long. However, a typical Linux packet already carries within its Ethernet frame: 14 bytes of Ethernet Header, 6 bytes of PPPoE Header, 2 bytes of PPP Header, 20 bytes of IP Header, 20 bytes of TCP Header, 12 bytes of TCP RTT Options, and 4 bytes of Ethernet FCS. That is more than 64 bytes. There is no data because a TCP ACK carries no data. UDP (and in particular VOIP) has fewer headers, but makes up for it by always carrying data. |
[9] | This is not strictly true. Splitting a packet carries its own penalty. When you add the IP and TCP headers to all the other bits and pieces listed in the table, an IP packet carries around 78 bytes of overheads. If the MTU is 1488 bytes, this is about 4.8%. One or two bytes of ATM cell padding actually incur less overhead than this. Mercifully that case doesn't arise for a 1500 byte link layer MTU. While you are thinking about this, consider the implications of those overheads on VOIP, whose packets typically carry around 40 bytes of voice data. And that is not counting the ATM cell overheads, which average an additional 34 bytes. |
You can email me at "russell-tcatm [at] stuart [dot] id [dot] au". If you want immediate feedback you can contact me via a messenger client. I am not into just chatting, but if you have questions or just need a sounding board I am happy to help. Here are my messenger IDs:
Time Zone: Australia/Queensland, which is 10 hours ahead of UTC.
Russell Stuart, 25/May/2006.