The Weak Link in Data Protection
All too often, the Wide Area Network (WAN) link is the weak link in
data protection. Limited bandwidth, high latency, lost packets, and
out-of-order packets can all jeopardize strategic data replication
and backup initiatives, resulting in missed Recovery Point
Objectives and Recovery Time Objectives (RPO/RTO).
As data volumes grow, and as the distance between data centers
increases to protect business data from catastrophic disasters,
pressure on the WAN keeps rising. This has heightened the demand
for optimization tools that can shorten data replication times
across the WAN while maximizing bandwidth efficiency.
There are unique requirements for deploying WAN optimization in a
disaster recovery (DR) environment. Replication, for example,
involves a high volume of sustained traffic that is highly
susceptible to lost and out-of-order packets. Transfer times must
fit within tightly defined windows, and DR traffic is often
encrypted to protect sensitive business data. Replication
solutions can run over TCP, UDP, and in some cases proprietary and
encapsulated transport protocols (depending on the solution). They
often have their own optimization techniques that can affect the
performance of downstream WAN optimization devices.
Once these requirements are understood and guidelines for addressing
them are established, WAN optimization can be deployed with maximum
effectiveness and can live up to its potential as a key enabler for
strategic disaster recovery initiatives.
Asking the Right Questions
The first step when deploying any networking solution is to ask the
right questions. When it comes to WAN optimization, the following
questions should be at the forefront of the evaluation
process:
• Know your applications and protocols – It is impossible to
measure results without first baselining the existing situation.
How much traffic is being generated in an “average” replication
stream or backup process? How much traffic is generated in a single
delta set? What are peak loads, and how many times a day are they
being hit? How long does it take to transfer the data from point A
to point B?
• Know your network – It is easy to determine how much
bandwidth you are paying for, but it is harder to know how much
throughput is really being achieved across the WAN. In many
environments, such as MPLS and IP VPNs, packets are lost or
delivered out of order due to router oversubscription. This can
lead to excessive re-transmissions, which can drop the effective
throughput (aka “goodput”) of traffic across the WAN. As Figure
1 shows, just a small amount of packet loss (0.075%) can drop
goodput to less than 5 Mbps, regardless of how much bandwidth is
actually available on the WAN link. Most large backup/replication
processes cannot recover from such a significant drop in
throughput. As a result, WAN optimization technologies like Forward
Error Correction (FEC) and Packet Order Correction (POC) are
extremely valuable in some replication/backup environments.
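The relationship between loss and goodput described above can be
approximated with the widely used Mathis model for steady-state TCP
throughput: goodput is roughly (MSS / RTT) x (C / sqrt(p)), where p
is the packet loss rate. The sketch below is illustrative only. The
1460-byte MSS, the 100 ms round-trip time, and the constant C = 1.22
are assumptions, not values taken from Figure 1. Note that link
bandwidth does not appear in the formula at all.

```python
import math

def mathis_goodput_mbps(loss_rate, rtt_s, mss_bytes=1460, c=1.22):
    """Approximate ceiling on single-flow TCP goodput (Mathis model).

    loss_rate is a fraction (0.00075 == 0.075%); rtt_s is in seconds.
    mss_bytes and c are assumed defaults, not values from the article.
    """
    if loss_rate <= 0 or rtt_s <= 0:
        raise ValueError("loss_rate and rtt_s must be positive")
    bits_per_segment = mss_bytes * 8
    return (bits_per_segment / rtt_s) * (c / math.sqrt(loss_rate)) / 1e6

# Goodput ceiling at an assumed 100 ms RTT for several loss rates.
for loss_pct in (0.01, 0.075, 0.5, 1.0):
    mbps = mathis_goodput_mbps(loss_pct / 100, rtt_s=0.100)
    print(f"loss {loss_pct}% -> about {mbps:.1f} Mbps per flow")
```

Under these assumptions, a loss rate of 0.075% limits a single flow
to roughly 5 Mbps, whether the underlying link is 10 Mbps or 1 Gbps.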
The more quantifiable information one can collect, the easier it
will be to size an appropriate WAN optimization device and gauge
the level of improvement it provides. Think ahead: factor in an
expected rate of data growth to ensure that a WAN optimization
solution can grow with evolving replication/backup needs.
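As a rough illustration of factoring growth into sizing, the sketch
below projects a daily change volume forward at a compound annual
growth rate and computes the sustained throughput needed to move it
within a replication window. All figures (500 GB per day, 40% annual
growth, an 8-hour window) are hypothetical placeholders for your own
baseline measurements.

```python
def required_mbps(data_gb, window_hours):
    """Sustained throughput (Mbps) needed to move data_gb within the window."""
    bits = data_gb * 8e9                      # decimal gigabytes to bits
    return bits / (window_hours * 3600) / 1e6

def projected_gb(data_gb, annual_growth, years):
    """Compound growth: volume after `years` at `annual_growth` (0.4 == 40%)."""
    return data_gb * (1 + annual_growth) ** years

# Hypothetical baseline: 500 GB/day, 40% growth, 8-hour window.
for year in range(4):
    volume = projected_gb(500, 0.40, year)
    print(f"year {year}: {volume:7.0f} GB/day -> "
          f"{required_mbps(volume, 8):6.1f} Mbps sustained")
```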
It is also extremely valuable to know how the replication/backup
applications work. Do they run over TCP (e.g., EMC SRDF/A, NetApp
SnapMirror/SnapVault, Double-Take) or UDP (e.g., Veritas Volume
Replicator, EMC CLARiiON disk library, Aspera/Isilon)? Are
proprietary or encapsulated protocols being used, as is the case
with some Fibre Channel over IP (FCIP) implementations? If anything
other than standard TCP is being used to communicate between host
devices, make sure that WAN optimization appliances can support
those protocols.
In addition to bandwidth and packet loss, latency can be a silent
killer for many DR applications. There is no getting around the
laws of physics – when there is a significant distance between
target and host devices it will take time for packets to travel
back and forth across the WAN. This problem is only getting worse
as enterprises look to extend the geographic distances between data
centers, be it for better protection from catastrophic disasters or
to take advantage of cheaper power in rural environments. In many
instances, TCP acceleration techniques like selective
acknowledgements and adjustable window sizing will help address
latency challenges across the WAN.
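Why window sizing matters can be seen with a little arithmetic. A
single TCP flow cannot have more than one window of data in flight
per round trip, so its throughput is capped at window / RTT, and the
window needed to fill a link is the bandwidth-delay product. The
sketch below assumes a classic 64 KB window (no window scaling) and
a 155 Mbps link; both values are illustrative.

```python
def window_limited_mbps(window_bytes, rtt_ms):
    """Maximum single-flow throughput when limited by the TCP window."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6

def bdp_bytes(link_mbps, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return link_mbps * 1e6 * (rtt_ms / 1000) / 8

for rtt_ms in (10, 50, 100, 200):
    cap = window_limited_mbps(65535, rtt_ms)
    need = bdp_bytes(155, rtt_ms)
    print(f"RTT {rtt_ms:3d} ms: 64 KB window caps a flow at {cap:6.1f} Mbps; "
          f"filling 155 Mbps needs about {need / 1024:.0f} KB in flight")
```

At 100 ms of round-trip latency, the unscaled window holds a flow to
about 5 Mbps no matter how large the link is, which is why larger,
adjustable windows are a common latency mitigation.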
If a WAN upgrade is underway, such as a switch to a new WAN
technology (e.g., MPLS or IP VPN) or the build-out of a new data
center, it is advisable to simulate potential WAN conditions as part
of a WAN optimization evaluation process. A good WAN emulator will
effectively reproduce bandwidth, latency, packet loss, and
out-of-order packets to provide a real-world experience.
• Know your limits – How many simultaneous flows are generated
during a typical replication process? How many are generated when
multiple processes are taking place simultaneously, such as the
backup of dozens of remote branch offices? If other traffic is
using the same WAN as your DR traffic, how many flows is it
generating?
By understanding the quantity of flows across the network, one can
ensure that WAN optimization devices handle the volume
appropriately. Be sure to understand how a WAN optimization device
reacts when its flow limits are reached. Is traffic blocked when
limits are reached, or sent through unaccelerated?
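The two overflow behaviors mentioned above can be modeled with a toy
flow table. This is a hypothetical sketch, not the behavior or API
of any particular appliance; the class name, policy names, flow
names, and limit are invented for illustration.

```python
class FlowTable:
    """Toy model of an optimizer with a fixed limit on accelerated flows."""

    def __init__(self, max_flows, overflow_policy="pass_through"):
        if overflow_policy not in ("pass_through", "block"):
            raise ValueError("unknown overflow policy")
        self.max_flows = max_flows
        self.overflow_policy = overflow_policy
        self.accelerated = set()

    def admit(self, flow_id):
        """Return how the flow is handled: accelerated, pass_through, or block."""
        if flow_id in self.accelerated:
            return "accelerated"
        if len(self.accelerated) < self.max_flows:
            self.accelerated.add(flow_id)
            return "accelerated"
        # Limit reached: the configured policy decides the flow's fate.
        return self.overflow_policy

    def close(self, flow_id):
        """Free the slot held by a finished flow."""
        self.accelerated.discard(flow_id)

table = FlowTable(max_flows=2)
for flow in ("replication-1", "replication-2", "branch-backup-1"):
    print(flow, "->", table.admit(flow))
```

With a "block" policy the third flow would be refused outright,
which for a DR job means a failed transfer rather than a slow one.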
In addition to the above, it is useful to know if there are
throughput limits being placed on individual flows by routers,
firewalls, and other network elements. A router, for example, may
limit the amount of throughput per flow to ensure that all flows
get serviced properly. Or, a firewall may limit the amount of
throughput per flow to prevent malicious applications from hogging
precious bandwidth. Whether deliberately set or not, throughput
limits can wreak havoc on high-volume traffic and should be
addressed accordingly (either through reconfiguration of the
network element or through WAN optimization techniques like packet
striping).
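To see why per-flow limits matter, consider the arithmetic: if a
network element caps each flow, a replication job carried on one
flow can never exceed that cap, whereas spreading its packets across
several flows raises the ceiling up to the link rate. The sketch
below models this and shows a simple round-robin assignment of
packets to flows. It is a simplification of packet striping, and
the 10 Mbps cap and 100 Mbps link are assumed values.

```python
from itertools import cycle

def achievable_mbps(link_mbps, per_flow_cap_mbps, flows):
    """Aggregate throughput when each flow is capped individually."""
    return min(link_mbps, per_flow_cap_mbps * flows)

def stripe(packets, flows):
    """Assign packets to flows round-robin; returns {flow_index: [packets]}."""
    lanes = {i: [] for i in range(flows)}
    for packet, lane in zip(packets, cycle(range(flows))):
        lanes[lane].append(packet)
    return lanes

for n in (1, 4, 16):
    print(f"{n:2d} flow(s): {achievable_mbps(100, 10, n)} Mbps")
print(stripe(list(range(8)), 3))
```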
Best Practice Configuration Guidelines
Many WAN optimization techniques (e.g., data
reduction, QoS, compression, latency mitigation, and loss
mitigation) work transparently to storage devices and DR software.
During normal operations, the storage medium should not even know
that the traffic sent across the WAN is being accelerated. However,
different deployment scenarios can result in different levels of
performance across the WAN.
The following configuration guidelines help to maximize end-to-end
performance when performing data replication across a WAN:
• Compression / de-duplication. Many storage devices perform
basic payload compression (e.g. LZ). This does not prevent
downstream WAN optimization devices from working, but it can reduce
the overall effectiveness of these devices by limiting visibility
into “raw” data. Because this functionality is not unique to
the storage device (most WAN optimization devices can perform the
same or better compression than a storage array), it is typically
recommended that this functionality be turned off in the array.
This generally leads to better overall net performance from a
compression standpoint. In addition, because compression is very
CPU-intensive, moving this functionality off the host (array) and
onto a dedicated WAN optimization appliance can result in better
scalability within the storage medium.
De-duplication is a slightly different story, as in many
environments this functionality is desired within the storage
medium and turning it off is not an option. As long as the WAN
optimization device has byte-level granularity when doing its own
data reduction, working with de-duplication should not be a
problem. In fact, expect an additional 10-20x performance
improvement when WAN data reduction is performed in conjunction
with de-duplication.
This is particularly true when multiple applications are sent
across the same WAN because the optimization device has a larger
data set to sample and match from. For example, if someone sends an
email across the WAN, it will be fingerprinted and stored by a WAN
optimization device performing data reduction. When that email is
backed up, the WAN optimization device will have already seen the
data, leading to immediate data reduction benefits. In contrast,
this might be the first time that the storage device is backing up
the data, so de-duplication may be minimal. As one can expect,
having a WAN optimization device plus de-duplication in the mix
yields the best overall net results.
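The email example can be sketched as a fingerprint cache. This is a
deliberately simplified model: it uses fixed-size chunks and SHA-256
fingerprints, whereas real appliances use their own (often
byte-granular) matching schemes. The class and method names are
hypothetical, and random bytes stand in for the message content.

```python
import hashlib
import os

class ReductionCache:
    """Toy WAN data-reduction cache keyed by chunk fingerprint."""

    def __init__(self, chunk_size=1024):
        self.chunk_size = chunk_size
        self.seen = set()

    def bytes_on_wire(self, payload):
        """Return the bytes that must cross the WAN for this payload.

        Chunks already seen are replaced by a 32-byte fingerprint reference.
        """
        sent = 0
        for i in range(0, len(payload), self.chunk_size):
            chunk = payload[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            if digest in self.seen:
                sent += len(digest)      # send the reference only
            else:
                self.seen.add(digest)
                sent += len(chunk)       # first sighting: send the data
        return sent

cache = ReductionCache(chunk_size=1024)
email = os.urandom(16 * 1024)            # 16 KB of stand-in message data
first = cache.bytes_on_wire(email)       # the email itself crosses the WAN
backup = cache.bytes_on_wire(email)      # later, the backup of that email
print(f"email: {first} bytes, backup of same data: {backup} bytes")
```

The first transfer sends every chunk in full; the later backup of
the same data sends only fingerprint references, even though the
backup application itself has never seen that data before.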
• Encryption. When communicating across a WAN, many enterprises
look to encryption as a necessary tool for protecting sensitive
information. However, when WAN optimization is deployed with a
storage medium, one must be careful as to where this encryption
takes place. When encryption takes place “upstream” of the WAN
optimization device, special actions must be taken to terminate the
encryption session on the appliance, decrypt the traffic,
optimize the traffic, and then re-encrypt the traffic. Otherwise,
the WAN optimization device does not have visibility into the data
and cannot perform its optimization functions. Because this process
can be difficult to coordinate and can have an adverse effect on
performance, it is generally not recommended unless encryption is
absolutely necessary at the source. Instead, it is recommended that
encryption be left to the WAN optimization device.
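The reason upstream encryption defeats optimization can be
demonstrated with the standard library alone. Well-encrypted data
is statistically indistinguishable from random bytes, so the sketch
below uses os.urandom as a stand-in for ciphertext (an assumption
made to avoid a third-party crypto dependency) and compares how
well zlib compresses it against repetitive plaintext. The sample
record is invented.

```python
import os
import zlib

def compression_ratio(data):
    """Original size divided by zlib-compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data))

# Repetitive plaintext, loosely typical of log records or database pages.
plaintext = b"replication block written OK\n" * 2000
# Stand-in for the same volume of data after upstream encryption.
ciphertext_like = os.urandom(len(plaintext))

print(f"plaintext ratio:       {compression_ratio(plaintext):.1f}x")
print(f"ciphertext-like ratio: {compression_ratio(ciphertext_like):.2f}x")
```

The plaintext shrinks dramatically, while the ciphertext-like data
does not shrink at all, so an appliance that sees only encrypted
payloads has nothing left to reduce.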
Best practices dictate that WAN optimization devices perform two
types of encryption. One is encryption of data at rest (that is,
data stored on the appliance). The other is encryption of data sent
between appliances. The former is particularly needed when the WAN
optimization device is using local disk drives for data reduction,
which can store several terabytes worth of information. The latter
is most often needed on shared networks, such as IP VPNs, where
IPsec and other VPN solutions can provide an added layer of
security. In both scenarios, it is recommended that encryption take
place in dedicated hardware so as not to impact performance.
• High Availability. When WAN optimization is used for disaster
recovery, it takes on added importance. Poor transfer times can
mean failed replication/backup processes, which place business
information at a higher level of risk. To
avoid this scenario, it is often recommended that WAN optimization
be deployed in a redundant configuration when used as part of
disaster recovery operations.
Redundancy between appliances is typically achieved using standard
redirection techniques, like Policy Based Routing (PBR) and Web
Cache Communication Protocol (WCCP), which can be used to redirect
traffic in the event of a problem.
Redundant power, disk drives, and other modules will help ensure
maximum uptime within the appliance.
Understanding Success Criteria
With the above information, one can effectively define criteria for
a successful
WAN optimization evaluation. More specifically, enterprises can
collect quantitative data that will justify whether an investment
in WAN optimization is the correct choice for their DR environment.
Specific items to look for include:
• Reduced transfer times. How much faster is the
replication/backup process? This is easy to measure, and easy to
compare to baseline numbers (assuming they were collected prior to
deploying WAN optimization).
• Increased LAN-side throughput. In many instances, removing a WAN
bottleneck enables more data to be sent from the storage medium.
This means that more data can be protected within allocated
windows.
• Improved WAN utilization (that is, more “virtual
bandwidth”). If LAN-side throughput is constant, then WAN
utilization should decrease when using WAN optimization. However,
in many instances LAN-side throughput goes up, which can result
in an increase in overall WAN traffic. This may seem to run
counter to the goal of WAN optimization, but it actually
means that one is getting more efficient usage out of available WAN
bandwidth.
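The three criteria above reduce to simple ratios between baseline
and optimized measurements. The sketch below shows one way to
compute them. The field names and sample numbers are hypothetical
and would come from your own baseline and evaluation runs.

```python
from dataclasses import dataclass

@dataclass
class Run:
    """Measurements from one replication/backup run."""
    transfer_seconds: float   # wall-clock time for the job
    lan_bytes: int            # bytes sent by the storage device
    wan_bytes: int            # bytes actually placed on the WAN

def lan_mbps(run):
    """Average LAN-side throughput of a run in Mbps."""
    return run.lan_bytes * 8 / run.transfer_seconds / 1e6

def compare(baseline, optimized):
    """Return the three success metrics as a dict of ratios."""
    return {
        "transfer_time_speedup":
            baseline.transfer_seconds / optimized.transfer_seconds,
        "lan_throughput_gain": lan_mbps(optimized) / lan_mbps(baseline),
        # LAN bytes delivered per WAN byte: the "virtual bandwidth" multiplier
        "wan_reduction_ratio": optimized.lan_bytes / optimized.wan_bytes,
    }

baseline = Run(transfer_seconds=7200, lan_bytes=9_000_000_000,
               wan_bytes=9_000_000_000)
optimized = Run(transfer_seconds=1800, lan_bytes=9_000_000_000,
                wan_bytes=1_500_000_000)
for name, value in compare(baseline, optimized).items():
    print(f"{name}: {value:.1f}x")
```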
As the last point indicates, removing a WAN bottleneck can actually
expose bottlenecks elsewhere in an enterprise. For example, a bad
server NIC or outdated LAN hub may have worked “fine” when WAN
throughput was limited to 10 Mbps, but they may strain to keep up
with a WAN that can now handle 100 Mbps of traffic. Similarly,
replication software can be physically limited in the amount of
data it can push out, or it might have been manually configured to
limit throughput based on WAN conditions. This may lead to
sub-optimal performance gains when WAN optimization is deployed.
For example, the amount of traffic on the WAN might be
significantly reduced with WAN optimization, but transfer times
across the WAN may not show a significant improvement. This might
be something that can be corrected with minor configuration changes
in the storage medium, or it may be a fundamental limitation of
that medium.
Lastly, effective management tools are important when evaluating,
and subsequently deploying, a WAN optimization solution. These
will help baseline existing network
and application behavior, optimize configuration for seamless
deployment, and monitor behavior on an ongoing basis to assess
performance over time.
Making the Business Case
Faster transfer times and higher LAN/WAN throughput mean better RPO.
The more data that can be pumped out by the storage device and
subsequently sent across the WAN, the more data can be protected in
a given period of time.
Faster transfer times also mean better RTO. WAN optimization not
only accelerates replication and backup functions, but it ensures
that transfers in the opposite direction — that is, during a
recovery — also happen as quickly as possible.
What is the value placed on improved RPO and RTO? How much is it
worth to protect more data and recover it faster? Do these benefits
outweigh the investment in WAN optimization equipment?
Consider the alternative — adding more WAN bandwidth. This may
seem like the path of least resistance when data protection is not
performing as desired across the WAN, but it has several major
drawbacks.
For starters, it assumes that bandwidth is the only issue that
needs to be addressed when doing replication and backup across the
WAN. However, if packet loss, packet ordering, and latency are also
issues, adding more bandwidth will not solve the problem. (In fact,
loss is often exacerbated as WAN links grow in size.)
Secondly, in many regions it can take quite a long time to get a
large WAN connection ordered and provisioned from a service
provider. If problems exist today, waiting several months for an
OC-3 or OC-12 circuit may not be a viable option.
Lastly, when all factors are considered, the cost of adding more
WAN bandwidth is often significantly higher than the cost of
deploying WAN acceleration. Aside from a dramatic increase in
recurring bandwidth expenditures (30–60% on average), routers and
other network equipment may have to be added or upgraded, the
storage medium may need to be upgraded, new licenses might be
required in the replication/backup software to accommodate
additional WAN links, and new operational expenditures might be
required to handle the added complexity of new and larger WAN
connections. One can argue that bandwidth expenditures are
decreasing over time, but the recurring costs are still significant
and the tangential costs associated with upgrading the WAN can be
quite substantial.
In the end, WAN optimization offers the best performance
improvements in disaster recovery environments with the lowest
total cost of ownership. When it is deployed correctly, the benefits
of WAN optimization are very tangible, from improved data transfer
times to more efficient usage of available WAN bandwidth. When best
practice recommendations are followed, WAN optimization becomes an
indispensable tool in day-to-day disaster recovery operations.