Friday, December 14, 2012

95th percentile bandwidth metering explained and analyzed | Semaphore Corporation


95th percentile bandwidth metering explained and analyzed

We frequently see a fair amount of confusion from new customers about what 95th percentile bandwidth metering and billing is and how it works. While we have a nice text description of how it's calculated, I figured that since it's such a prevalent billing method I'd explain it with a more visual representation, and also provide some analysis of 95th percentile vs. other potential metering methods and why it is used at all. Hopefully our future posts here will be somewhat more technically interesting, but this topic is at least informative and answers a question we hear a lot.

What is 95th Percentile metering?

The short answer is that 95th percentile is a way to meter bandwidth usage that allows a customer to burst beyond their committed base rate, and still provides the carrier with the ability to scale their billing with the cost of the infrastructure and transit commits (if any). It's an alternative to either capped ports with fixed billing or billing on actual data transferred, which are models more frequently seen outside the datacenter where occasional bursting is either not allowed or penalized with higher bills.
Carriers sample the amount of data transferred on a customer's port(s) every 5 minutes and use that value to derive a data rate (typically in megabits per second or Mbps) for that 5-minute interval. Over the course of a customer's monthly billing cycle, around 8000 of these samples are taken (8,640 in a 30-day month). These values are then sorted and ranked by percentile, and the value that falls on the 95th percentile becomes the rate the customer is billed for the month if it exceeds their base commit rate. The higher a customer's base commit rate, the lower their per-Mbps cost will be, which allows for bulk purchase of bandwidth as well as less volatility in the event their 95th percentile rate exceeds their base commit. For a fairly normal business traffic pattern, that provides a value that is fair to both the carrier and the customer in terms of service delivered to the customer and the ability of the carrier to scale its infrastructure to meet customer needs over time.
That's all well and good, but without having a month's worth of data in front of you, it's hard to tell what exactly that means. I'll provide some samples of traffic patterns, beginning with a fairly typical one, as well as some more abnormal patterns later on, to explain both how 95th percentile works and how it compares with other burstable and non-burstable methods.
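The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not any carrier's actual billing code: the function names are made up for this example, and the rule of discarding exactly the top 5% of samples (rounded down) is an assumption, since carriers differ on rounding and on how inbound and outbound traffic are combined.

```python
SAMPLE_SECONDS = 300  # one sample every 5 minutes


def to_mbps(byte_count, seconds=SAMPLE_SECONDS):
    """Convert the bytes moved during one interval into an average rate in Mbps."""
    return byte_count * 8 / seconds / 1_000_000


def percentile_95(rates_mbps):
    """Sort the samples, discard the top 5%, and return the highest one left."""
    ordered = sorted(rates_mbps)
    discard = len(ordered) // 20          # assumed rounding: 5% of samples, rounded down
    return ordered[len(ordered) - discard - 1]


def billable_rate(rates_mbps, commit_mbps):
    """The 95th percentile only matters for billing once it exceeds the base commit."""
    return max(commit_mbps, percentile_95(rates_mbps))


if __name__ == "__main__":
    # 375,000,000 bytes in 5 minutes works out to 10 Mbps.
    print(to_mbps(375_000_000))
    # Hypothetical month reduced to 100 samples of 1..100 Mbps.
    samples = list(range(1, 101))
    print(percentile_95(samples))
    print(billable_rate(samples, commit_mbps=50))
```

With a commit higher than the measured 95th percentile, `billable_rate` simply returns the commit, which is why a correctly sized commit reduces month-to-month volatility.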

95th percentile metering of normal business traffic

Normal Traffic Patterns - Original Data
The above chart shows a fairly normal month of usage for a business customer. Weekend usage is minimal, and weekday usage, when smoothed, follows a curve during normal business hours. Usage during off-peak hours is minimal, even during the week. The vast majority of customers fall into this pattern. You can also see short bursts throughout the day to nearly double the top of the daily curve. These bursts are what 95th percentile is designed to address.
Normal Traffic Patterns - Ranked Data
The chart above shows what the same month of traffic looks like when sorted and ranked. The 95th percentile falls around 6Mbps, or 60% of the highest burst. Above the 95th percentile, the rate increases rapidly, demonstrating the customer's ability to make use of their full available bandwidth on a momentary basis without penalty. (Remember, these are 5-minute samples; the momentary data rate for those bursts might have been as high as the 100Mbps Fast Ethernet line rate for a few seconds.) The curve above is a fairly typical distribution.
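To put a number on "without penalty", the arithmetic below counts how many 5-minute samples sit above the 95th percentile in a billing cycle and how much wall-clock time they cover. The function name and the 30-day cycle are assumptions for illustration; actual cycle lengths and rounding rules vary by carrier.

```python
def burst_allowance(days_in_cycle, sample_minutes=5):
    """Return (total samples, samples discarded above the 95th percentile, hours covered)."""
    samples = days_in_cycle * 24 * 60 // sample_minutes
    discarded = samples // 20             # top 5% of samples, rounded down (assumed)
    hours = discarded * sample_minutes / 60
    return samples, discarded, hours


if __name__ == "__main__":
    samples, discarded, hours = burst_allowance(30)
    print(samples)    # 8640 samples in a 30-day cycle
    print(discarded)  # 432 of them are ignored for billing
    print(hours)      # 36.0 hours of bursting per month
```

In other words, roughly a day and a half of the heaviest usage per month does not affect the billed rate, provided it is spread over no more than 5% of the samples.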

Alternate metering methodologies

There are a couple of other methodologies for bandwidth metering that are commonly seen in the Internet world. Each has its own application for specific types of services. I'll provide a brief outline of how these methods work and where they are likely to be found.

Committed Information Rate (CIR)

CIR is a guarantee that the port will always have the bandwidth you're paying for available to it. This is similar to the base commit in 95th percentile metering. Traditional TDM circuits such as Frame Relay use this method for provisioning so a customer always knows what they are going to get. Most customers are familiar with this model through home broadband services such as cable, DSL or home fiber connections. Somebody purchasing a 7Mbps/3Mbps DSL line will always have 7Mbps of download speed and 3Mbps of upload speed (in theory at least), and they will always pay for that rate. However, they are also not allowed to exceed that rate; it is a hard cap without any allowance for bursting. This allows the ISP to tightly control the amount of bandwidth entering and leaving their network, which is required on large broadband networks where a given segment is likely oversubscribed rather than overprovisioned.
These rates are controlled by the protocol or physical limitations of the circuit itself, or by shaping and policing the rates on the circuit, depending on the technology. Frame Relay (still used, but becoming less and less common) also allows for limited best-effort bursting. A Frame Relay circuit may have a CIR of 128kbps which is guaranteed to be available, and a PIR (Peak Information Rate) of 256kbps which allows for bursting up to double the CIR if excess bandwidth on the carrier network is available. Typically the PIR is no more than double the CIR and is often only available for customer use for a very limited time (either due to traffic control on the carrier network, or due to the bandwidth being unavailable).
Because most carrier connectivity is now Ethernet, this model causes problems. Ethernet has no inherent ability to limit throughput at layer-2 like the above protocols, and the data rates typically used make shaping more expensive and management more complex. To avoid the cost and complexity of controlling the rate at layer-3, the carrier would only be able to provide services at 10Mbps, 100Mbps, 1Gbps or 10Gbps, which are the Ethernet protocol's physical caps. Purchasing bandwidth only in these increments would be cost prohibitive and undesirable for most customers.

Actual throughput billing

Billing based on actual throughput is metered similarly to 95th percentile. However, rather than calculating a rate, the ISP will simply record how much data you moved over the circuit for that interval. This value is usually expressed in megabytes or gigabytes. This model is typically seen in limited and shared bandwidth networks such as mobile data networks, where overprovisioning is not possible due to limited spectrum availability. You also see this method frequently used by web and virtual hosting companies, where a given site may not use much bandwidth on a per-second basis, so a more granular unit is needed to bill the customer. This method also has the potential for the most volatility, as bursts are not smoothed in any way; they are simply invoiced. This is one of the significant customer advantages of 95th percentile metering vs actual throughput used.
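As a rough sketch of this model, the snippet below totals the bytes recorded in each interval and multiplies by a per-gigabyte price. The function names, the decimal gigabyte, and the price are all hypothetical; they are only there to show that every byte of a burst lands directly on the invoice.

```python
def total_transfer_gb(interval_bytes):
    """Sum the bytes recorded for every interval and express the total in gigabytes."""
    return sum(interval_bytes) / 1_000_000_000


def transfer_bill(interval_bytes, price_per_gb):
    """Bill purely on volume: nothing is clipped or smoothed."""
    return total_transfer_gb(interval_bytes) * price_per_gb


if __name__ == "__main__":
    quiet_month = [500_000_000] * 4                   # four hypothetical intervals
    bursty_month = [500_000_000] * 3 + [4_500_000_000]  # same, but one large burst
    price = 0.10                                      # hypothetical price per GB
    print(transfer_bill(quiet_month, price))
    print(transfer_bill(bursty_month, price))
```

A single large interval triples the total in this toy example, whereas under 95th percentile metering a burst confined to the top 5% of samples would not change the bill at all.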

Average data rate

This sounds like it should be a potential method for billing: take the 5-minute data rates, as with 95th percentile, and simply calculate the mean data rate for the month. This would certainly have the effect of smoothing out peaks, but it would require significant overprovisioning on the part of the carrier, even beyond what is currently typical. To the best of my knowledge, this method is not used for bandwidth metering anywhere. (I'll show why in a later section.)
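A small synthetic example hints at the problem. The sample values below are invented for illustration and are not taken from the charts in this post: a customer who idles at 1Mbps but spends 10% of the month at 10Mbps would be billed on a mean of under 2Mbps, while the carrier still has to carry the full 10Mbps whenever it appears.

```python
def mean_rate(rates_mbps):
    """Arithmetic mean of the 5-minute data rates."""
    return sum(rates_mbps) / len(rates_mbps)


if __name__ == "__main__":
    # Synthetic month reduced to 100 samples: 90 idle, 10 at a sustained burst.
    rates = [1] * 90 + [10] * 10
    print(mean_rate(rates))  # 1.9 Mbps would be billed
    print(max(rates))        # 10 Mbps of capacity must exist to carry the bursts
```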

Atypical traffic patterns

The pairs of charts below will show some less typical usage patterns and how they affect 95th percentile metering. We'll use these to perform some analysis on why 95th percentile is used for transit connections vs some other potential methods.

Sustained traffic patterns

Sustained Traffic Patterns - Original Data
Sustained Traffic Patterns - Ranked Data
This traffic pattern looks a good deal like the normal one, except that the average rates in its 24-hour cycle are fairly flat rather than curved. The slope of the sorted data is much more gradual and the area under the curve is greater on a relative scale. However, the bursts still occupy the top 5 percent of samples and will be clipped prior to billing. This pattern in particular demonstrates the advantage of 95th percentile vs actual throughput for a customer.

Bursty traffic patterns

Bursty Traffic Patterns - Original Data
Bursty Traffic Patterns - Ranked Data
In this traffic pattern, you can see short periods of extremely high bursting with minimal data rates outside these periods. This performance-driven traffic pattern is seen for customers who need extremely high throughput but typically don't use it. Most carriers will require a minimum base commit rate for a customer attached to a Gigabit Ethernet port like the one above, so the 95th percentile measurement isn't all that relevant to this traffic pattern. The minimum commit allows the carrier to provision their network for the large bursts required by the customer and still provide a reasonable compromise of price vs performance.

Sustained-burst traffic patterns

Sustained-burst Traffic Patterns - Original Data
Sustained-burst Traffic Patterns - Ranked Data
This unusual traffic pattern shows several long bursts at line rate (100Mbps). Because those bursts make up more than 5% of the month's samples, the line-rate plateau extends below the 95th percentile, so the 95th percentile value is itself the line rate. This is effectively a capped service similar to a DSL model, where the customer would commit to 100Mbps to receive the best per-Mbps pricing, assuming this is a typical month.
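The cliff this pattern runs into can be shown with a small synthetic example. The sample values and helper names below are assumptions for illustration, not the data behind the charts: once more than 5% of samples sit at line rate, the 95th percentile jumps from the baseline straight to line rate.

```python
LINE_RATE_MBPS = 100  # Fast Ethernet line rate
BASELINE_MBPS = 5     # hypothetical rate outside the bursts


def percentile_95(rates_mbps):
    """Sort the samples, discard the top 5% (rounded down), return the highest one left."""
    ordered = sorted(rates_mbps)
    discard = len(ordered) // 20
    return ordered[len(ordered) - discard - 1]


def synthetic_month(percent_at_line_rate, total_samples=100):
    """Build a month where the given share of samples is at line rate."""
    at_line_rate = total_samples * percent_at_line_rate // 100
    return [LINE_RATE_MBPS] * at_line_rate + [BASELINE_MBPS] * (total_samples - at_line_rate)


if __name__ == "__main__":
    for percent in (4, 5, 6, 8):
        print(percent, percentile_95(synthetic_month(percent)))
```

At 4% and 5% of samples the bursts are clipped entirely and the result is the 5Mbps baseline; at 6% and above the result is the full 100Mbps.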

Data Analysis

95th Percentile Data Table
As you can see from the data table above, the 95th percentile measurement smoothed the peaks out effectively in all but the last model. However, the ratio of actual data to 95th percentile data is fairly proportional in all 4 cases. As expected, the 95th percentile calculation is higher for more bursty traffic, but not unfairly so. However, if you compare the Mean Data Rate to 95th Percentile, the mean data rate ends up higher relative to 95th percentile for customers with smoother traffic patterns (more area under the curve). This is exactly the opposite behavior of what is desired.
Using this data, we can compare 95th percentile to the other potential models.
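The relationship between mean data rate and 95th percentile described in this section can be illustrated with two synthetic customers who move exactly the same amount of data. The numbers are invented for this sketch and are not the values from the data table; the helper names are likewise hypothetical.

```python
def percentile_95(rates_mbps):
    """Sort the samples, discard the top 5% (rounded down), return the highest one left."""
    ordered = sorted(rates_mbps)
    discard = len(ordered) // 20
    return ordered[len(ordered) - discard - 1]


def mean_rate(rates_mbps):
    """Arithmetic mean of the 5-minute data rates."""
    return sum(rates_mbps) / len(rates_mbps)


def mean_to_p95_ratio(rates_mbps):
    """How large the mean is relative to the 95th percentile."""
    return mean_rate(rates_mbps) / percentile_95(rates_mbps)


if __name__ == "__main__":
    sustained = [4] * 100            # flat 4 Mbps all month
    bursty = [1] * 90 + [31] * 10    # same total data, concentrated in bursts
    for name, rates in (("sustained", sustained), ("bursty", bursty)):
        print(name, mean_rate(rates), percentile_95(rates), round(mean_to_p95_ratio(rates), 3))
```

Both profiles have the same mean, so mean-based billing would charge them identically, while the 95th percentile is far higher for the bursty profile. Put the other way around, the mean is highest relative to the 95th percentile for the smoothest customer.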

95th Percentile vs CIR

Advantages

  • Bursting up to line rate is available, so the effective bandwidth cap can be many times the committed rate
  • Customers can pay for what they use (to some extent) rather than having a bandwidth cap applied
  • Carriers do not need to try to shape or police Ethernet traffic down to the customer's commit rate, which lowers the carrier's cost basis for infrastructure and management
  • 95th Percentile is industry standard for Tier 1 and Tier 2 settlement-based peering, transit and customer connections

Disadvantages

  • 95th percentile is more volatile than CIR: on a particularly bursty month, your monthly invoice could swing dramatically. (Somewhat offset by base commit rates being correctly set)

95th Percentile vs Actual Usage

Advantages

  • 95th percentile is a good deal less volatile than actual usage, since it is built on 5-minute averages and discards the highest peaks
  • The carrier has guaranteed minimum bandwidth commits that allow them to scale upstream capabilities more consistently
  • Bursty customers are billed more than sustained customers for the same data, which takes into account that the impact of bursty traffic is greater and more difficult to plan for
  • 95th Percentile is industry standard for Tier 1 and Tier 2 settlement-based peering, transit and customer connections

Disadvantages

  • 95th percentile results in paying for a lot of idle bandwidth in most circumstances. (Although not as much as CIR)

Conclusions

None! I invite you to draw your own conclusions. =)
I hope the method by which 95th percentile is calculated is now a good deal clearer, and the ramifications of various types of traffic on the final number are more obvious.
The one thing I will mention about the comparisons between the different models is that 95th percentile strikes a balance between the other two methods. Many of the advantages of 95th percentile over CIR are disadvantages against Actual Usage, and vice versa. By design, 95th percentile seeks a compromise between scalability, cost and volatility for both the carrier and the customer. I think it does a fairly good job at that and that the above data bears that out, but no model is perfect.