When everything is down

Most of the Greater Toronto Area and Southern Ontario have been reeling in the aftermath of the ice storm that, at its peak, left over 300,000 people and businesses without power. For businesses, the most painful part of this outage was the risk of losing access to their data. Now that everything relies on data services, the internet and connected services are as important as power and heating, if not more so. Connectivity is a utility and, as with any utility, you only notice its importance when it is broken.

While many of our clients may have had local power outages, in a lot of cases Pathway BCP (business continuity planning), backup, and DR (disaster recovery) services have helped mitigate them through features like multiple transparent uplinks (connectivity continuity), power continuity infrastructure, transparent remote desktop access, synchronized global data availability, and geographical DR.

A connected world

Although the above statement and list of methods may read like an unfair plug, they are there to emphasize the importance of implementing disaster recovery and business continuity planning at multiple layers, from facilities right up to staff and their iPads, even if it’s not implemented with Pathway.

Why? Your uptime helps others, and the economy as a whole. A healthy, working economy where our suppliers and clients are up and running is also good for us, because we all rely on one another.

Controlling the controllable

Business owners and operators often feel as though the stakes are higher for them following an outage (in contrast to residential data service users) because the ability to generate revenue by the minute, and the safety and integrity of data and operations, are compromised.

While events like ice storms, heavy rains, floods, and tornadoes can’t be avoided or controlled, there are at least two aspects that can be controlled (details are left out):

The integrity and availability of business data, and of the systems that control and manipulate that data. Security, capacity, and uptime are often discussed under this umbrella.
The speed with which the business recovers from an outage-causing event. This encapsulates concepts like recovery time and recovery point.
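
To make recovery time and recovery point concrete, here is a minimal Python sketch. The function names, timestamps, and targets are all hypothetical and chosen only for illustration; they do not describe any particular product or plan.

```python
from datetime import datetime, timedelta


def recovery_point_exposure(last_good_backup: datetime, outage_start: datetime) -> timedelta:
    """Window of data at risk: everything written after the last good backup."""
    return outage_start - last_good_backup


def recovery_time(outage_start: datetime, service_restored: datetime) -> timedelta:
    """How long the business was without the system."""
    return service_restored - outage_start


def meets_objectives(last_good_backup: datetime, outage_start: datetime,
                     service_restored: datetime,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True when both the recovery point and recovery time targets were met."""
    return (recovery_point_exposure(last_good_backup, outage_start) <= rpo
            and recovery_time(outage_start, service_restored) <= rto)


# Hypothetical incident: backup at 22:00, outage at 01:30, service back at 05:00.
backup = datetime(2013, 12, 21, 22, 0)
outage = datetime(2013, 12, 22, 1, 30)
restored = datetime(2013, 12, 22, 5, 0)

print(recovery_point_exposure(backup, outage))
print(recovery_time(outage, restored))
print(meets_objectives(backup, outage, restored,
                       rpo=timedelta(hours=4), rto=timedelta(hours=4)))
```

Running a check like this after every test or real incident is one simple way to see whether stated targets are actually being met.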

A few facts

Consider a few key facts and stats:

DR is worth the money — bad stuff happens very often. Technology outages are at least 100 times more frequent than automobile accidents. The frequency of hard drive, network, power, CPU and desktop failures is very high.
Costing is very poorly done. When events of increasing potency occur more frequently, mid-sized to large businesses have no choice but to take notice. Look at variables like the frequency of interruptions, reliance on certain systems, and the amount of revenue lost for each hour that an integral system is down.
Growth plans can be affected. For example, the retail industry was planning for 3-8 percent growth in 2013 Christmas sales and must now make up for the downtime in both physical footfall and online commerce. Retail isn’t the only industry affected; during an outage, suppliers and customers slow down entire verticals, not just their own.
Frequency and severity are both factors. In Toronto, summer storms have been more frequent and extreme every year. Tornadoes, once a statistical outlier, are now a staple. Between 2010 and 2013, the global distribution, frequency, and average severity of weather events have continued to increase.
Not just summer. Although winter storms are less frequent, they can sometimes have very adverse effects on the economy. Entire supply chains can be disrupted as a result.
Not just power. Security-related outages have grown exponentially in the last three years. Between 2011 and 2013, almost all major technology providers were breached in some manner. Political and economic events have made security, and the related threat tools and countermeasures, a new battleground for hackers and businesses. As a result, the need to privately and resiliently implement a global and secure availability cloud of services has grown. We believe this field will only grow, so that no single point of breach compromises any given service in its entirety.
Other factors include capacity-related and general monitoring-related outages. For example, running out of disk space is an embarrassing reason to declare an outage. Modern software and the growth of BYOD (bring your own device) and data obfuscate a lot of these metrics and compound the problem. Poor local asset monitoring and capacity planning account for almost 35% of avoidable outages on the business premises (i.e. when systems are not placed in a properly monitored setup).
Service level agreements need to have teeth. A promise of “five nines” uptime (99.999%) for a given service on an “around the clock” basis means less than 6 minutes of downtime (about 5.3 minutes) for an entire year. Ask questions about the services to which it applies (is it power, network, apps, everything?), the list of exclusions, how guarantees are honoured and penalties paid, and whether it applies around the clock.
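
The “five nines” arithmetic is easy to check for yourself. The short Python sketch below converts an uptime percentage into a yearly downtime allowance. The function name and the `covered_hours_per_day` parameter are our own illustrative assumptions; the parameter shows how the allowance shrinks when an agreement only covers part of the day.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float,
                            covered_hours_per_day: float = 24.0) -> float:
    """Minutes of downtime per year that an uptime promise still allows.

    covered_hours_per_day is a hypothetical knob for agreements that
    only apply during part of the day (for example, business hours).
    """
    covered_minutes = MINUTES_PER_YEAR * (covered_hours_per_day / 24.0)
    return covered_minutes * (1 - availability_percent / 100.0)


# Compare a few common promises on an around-the-clock basis.
for promise in (99.0, 99.9, 99.99, 99.999):
    print(f"{promise}% uptime allows about "
          f"{downtime_budget_minutes(promise):.1f} minutes of downtime per year")
```

Note how much each extra nine matters: on the same around-the-clock basis, a 99.9% promise still permits well over eight hours of downtime in a year.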

Identifying risks through reliance

Pathway’s core network services and data center have performed well above par during outages. If and when interruptions to our infrastructure availability do occur, we aim to ensure they are highly localized and affect only a few people.

The Northeast outage of 2003 could indeed have been tolerated with the help of resiliency measures, but those measures come at a cost: satellite uplinks, diesel generators, custom dark fiber networks and so on are expensive, though all have value. It’s important to quantify the degree of a company’s reliance on a given system in order to gauge things like opportunity cost and determine the best problem-solution fit.
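
One rough way to quantify reliance is to estimate the expected yearly loss from a system being down and compare it with the yearly cost of a resiliency measure. The Python sketch below is only an illustration: the class, the function, and every figure in it are hypothetical assumptions, not Pathway data.

```python
from dataclasses import dataclass


@dataclass
class SystemReliance:
    """Rough downtime profile for one system (all figures are estimates)."""
    name: str
    outages_per_year: float
    hours_per_outage: float
    revenue_lost_per_hour: float

    def expected_annual_loss(self) -> float:
        """Frequency x duration x hourly revenue at risk."""
        return (self.outages_per_year
                * self.hours_per_outage
                * self.revenue_lost_per_hour)


def measure_pays_off(system: SystemReliance,
                     annual_cost_of_measure: float,
                     fraction_of_downtime_avoided: float) -> bool:
    """True when the loss a measure avoids exceeds what the measure costs."""
    avoided_loss = system.expected_annual_loss() * fraction_of_downtime_avoided
    return avoided_loss > annual_cost_of_measure


# Hypothetical example: an order entry system that fails 4 times a year,
# 3 hours each time, at 2,500 in lost revenue per hour.
order_entry = SystemReliance("order entry", 4, 3, 2500)
print(order_entry.expected_annual_loss())
print(measure_pays_off(order_entry, annual_cost_of_measure=12000,
                       fraction_of_downtime_avoided=0.8))
```

This ignores harder-to-measure costs such as reputation and lost data, so treat the result as a lower bound on what an outage costs rather than a full business case.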

Ask yourself what would happen to your overall processes and ability to run if a given system were compromised. Modular design, good isolation, and a solid understanding of the single points of reliance for YOUR individual business are absolutely crucial basics in designing any business continuity, fault tolerance and recovery plan. Pathway’s BCP engineers have become quite good at executing methods to identify reliance in a practical setting and putting realistic resiliency measures in place. After all, we’ve had the benefit of hard lessons along the way ourselves, over the past 18 years.

Start small and be relentless

What’s the best way to start? Develop a set of small goals, with tests, and build them into your organization’s continual improvement processes. We’re happy to help kick-start a structured and milestone-oriented discussion within your organization and offer a few tips, no strings attached. We genuinely believe in the concept of a connected economy where we share prosperity through resiliency.

We’d also love to hear what you’ve learned through your experiences. If you’re interested in attending our BCP round-tables or even want to participate, we’d be delighted to hear from you. Come out, share and listen. Check back soon for our post on cloud-based continuity and resiliency planning.
