Why is Storage Management So Painful?

If you want to understand the vastness of data, don’t measure it in petabytes. Measure the number of people managing your storage.

In 2015, I walked into a customer’s New Jersey IT center. Their head of storage pointed at an endless sea of cubes and boasted, “That’s our storage team area.” The VP of Infrastructure (a networking guy) asked in amazement, “What do all these people do?” The head of storage chuckled (in that storage person way), “We keep the business running.”

My conclusion:

  1. If that many people work on it, storage management is a really big deal.
  2. If that many people work on it, storage management is really broken.

Why is storage management so painful and expensive? Why do we need all those people and what do they do all day?

Cost, Performance, Capacity — Pick 2

Imagine a storage continuum. On one end is “fast”. On the other is “big”. Each application wants a different point on that continuum. To make it worse, in the real world that continuum is multi-dimensional. There isn’t just one way to define “performance” and one way to define “capacity”.

Cost vs. Performance

Real World Performance Challenge: Applications define “performance” differently.

Some need storage to respond quickly to a transaction (e.g. a payroll application). Others want to do many transactions at the same time (e.g. an online purchasing system). Still others want to process oceans of data (e.g. data analytics).

Real World Performance Answer: Many types of storage arrays.

Vendors offer Tier-1, All-Flash, Hybrid Disk-Flash, Scale-Out NAS, Deduplication, and other types of arrays because each type of storage solves a different performance challenge.

Storage teams need people with special expertise to manage each type of array.

Cost vs. Capacity

Real World Capacity Challenge: Companies don’t want to pay to store unimportant data.

Everything generates data, and nobody deletes any of it. It piles up like garbage during a Parisian sanitation strike. Why waste the high-performance storage on old data? Move it somewhere cheaper.

Real World Capacity Answer: Many types of tiering.

Some businesses move pieces of data (e.g. Carl in Accounting’s vacation photos from Decorah, Iowa) to cheaper storage. Others move whole applications (e.g. the performance review application we stopped using in 2012). Some tier storage to cloud. Some tier to other arrays. Some still tier to tape.

Storage teams need people with special expertise to manage tiering.

Storage teams balance requirements for performance, capacity, and cost. With a lower budget every quarter.

Availability vs. Cost

Storage hardware breaks. Components age, power surges hit, or disasters strike the site. Sometimes the hardware fails outright. Other times, it returns incorrect data.

Unlike servers, you can’t “just get a new one” because that hardware holds your data. You either need to have a copy of it elsewhere or figure out how to get it back from the broken device.

Storage systems use error-correction codes, RAID, mirroring, and other reliability techniques. Some protect you from more failures than others. Those usually cost more.
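
The cost difference is easy to see in the capacity math. Here's a rough sketch (the schemes and ratios are illustrative; real arrays vary) of how much usable space each protection level leaves you:

```python
# Illustrative sketch: more failure protection consumes more raw capacity.
def usable_tb(raw_tb: float, scheme: str) -> float:
    """Usable capacity after protection overhead (approximate ratios)."""
    efficiency = {
        "raid6_8_plus_2": 0.80,   # 8 data + 2 parity drives, survives 2 failures
        "mirroring": 0.50,        # 2 full copies, survives 1 failure per pair
        "triple_mirror": 1 / 3,   # 3 full copies, survives 2 failures
    }
    return raw_tb * efficiency[scheme]

for scheme in ("raid6_8_plus_2", "mirroring", "triple_mirror"):
    print(f"{scheme}: {usable_tb(100, scheme):.1f} TB usable from 100 TB raw")
```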

Storage teams need people with special expertise to manage different availability configurations.

Storage Teams Plan, Users Laugh

Then users come in.

All that application owners want from storage is:

  • Performance specific to their application’s needs
  • Capacity specific to their application’s needs
  • At the lowest cost
  • Without any errors or downtime

But they don’t know:

  • How much capacity they’ll need.
  • How much performance they’ll need.

Still, let’s pretend that the storage team has done the impossible. They built a stable, cost-optimized storage environment that meets all their users’ needs!

Then the requirements change because:

  • Business priorities changed
  • Government regulations changed
  • The application itself changed
  • An executive ate a blueberry muffin instead of a blueberry bagel for breakfast.

The Road to Hell is Paved with Storage Migration

All storage administrators end up in the land of storage migration. It hurts worse than rubbing your eyes while chopping ghost peppers. We all hate the “application downtime for maintenance over the weekend” email. Guess what? The storage admin, working all weekend to move all the data, hates it a lot more than you.

As one storage admin said, “I start planning to migrate an application the day I deploy it.”

Storage teams need people to maintain the environment.

Can We Solve This Problem?

With shrinking budgets, companies and admins can’t survive managing storage this way. It’s too slow, too complex, and too expensive.

In the last decade, we have simplified storage management:

  • All-Flash Arrays – These arrays reduce the effort to manage storage performance. Despite the marketing, All-Flash cannot handle all workloads at the best cost.
  • Hyperconverged – With flash storage inside the server, you can eliminate dedicated storage management. Everything is “good enough”. It works well in smaller deployments.
  • Data Analytics – Analytics (e.g. Nimble’s InfoSight) can recommend optimizations.
  • Software-Defined – Storage admins can change storage configurations more easily.

We’ve been making storage simpler, but it’s not simple, yet.

When storage can meet the users’ changing requirements without administrator involvement, then we shall be free.

Until then, storage management is like sprinting on a treadmill covered in Legos. You can’t win; you just try to not lose the will to keep running.

Cloud Didn’t Kill Storage

At a big IT event, I asked, “If you care about storage management, raise your hand.” I saw a smattering of hands.

Next, “Put your hand down if you aren’t a storage admin.”

One hand stayed in the air.

I asked the lone holdout, “Why do you care about storage management?”

“What? No. I need a raffle ticket. There’s a raffle at the end of this talk, right?”

Storage management is the least appreciated part of IT infrastructure… which is saying something. Business users understand server and network issues. Security and backup teams overwhelm them with graphic horror stories. The storage team just gets a budget cut and requirements to store twice as much data.

What is Storage Management?

When you buy storage, you have to consider (at least) five factors:

  1. Capacity — How much data do I want to store?
  2. Durability — How much do I not want to lose my data?
  3. Availability — How long can I go without being able to read or write my data?
  4. Performance — How fast do I need to get at my data?
  5. Cost — How much am I willing to pay?

Different products optimize for different factors. That’s why you see hundreds of storage products on the market. It’s why you see individual vendors sell dozens of storage products. There is no one-size-fits-all product.

Storage management is meeting the storage needs of all the different business applications.

Storage management is also knowing that you’ll always fall short.

Doesn’t Cloud Make Storage Management Go Away?

No.

The same five factors still matter to your applications, even in the cloud.

That’s why cloud providers offer different types of storage. AWS alone offers:

  1. Local Storage (3 types) – Hard Drive, Flash, NVMe Flash
  2. Block Storage (4 types) – IOPS Flash (io1), General Purpose Flash (gp2), Throughput Optimized Hard Drive (st1), and Cold Hard Drive (sc1)
  3. Object Storage (4 types) – Standard S3, Standard Infrequently Accessed S3, One-Zone Infrequently Accessed S3, Glacier

That’s 11 types of storage for one cloud. Has your head started to spin?
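
If you want to see how quickly those choices show up in practice, here is a minimal sketch with boto3 (the AWS Python SDK). The region, sizes, and use cases are made-up examples; the only real decision is the VolumeType:

```python
# Minimal sketch: the same API call, but the volume type you choose
# determines both the performance profile and the price.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

# General purpose flash (gp2) for a latency-sensitive database
db_volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100, VolumeType="gp2")

# Cold hard drive (sc1) for logs nobody reads until the auditors show up
log_volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=500, VolumeType="sc1")

print(db_volume["VolumeId"], log_volume["VolumeId"])
```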

It gets worse. You face all the old storage management challenges plus some new ones.

Wait, Cloud Makes Storage Management MORE Important?

Yes.

Cost Overruns — Overprovisioning

It’s easy for application teams to run up a massive cloud storage bill.

On-premises environments built up checks and balances. The application team asks the storage management team for storage resources. The storage team makes them justify the request. When the storage management team needs to buy more hardware, Purchasing makes them justify the request. The process slows down the application team, but it prevents reckless consumption and business surprises.

Cloud environments wipe away the checks and balances. Application owners pick a type of cloud storage for their application. The cloud provider tells them how much performance they get for each GB of capacity. They see that it costs a dime or less per GB each month. They don’t have to ask anyone for approval. So they buy a big pool of storage. Why not? Turn on the faucet. It’s cheap.

Then they need more capacity. So they buy more. Turn on the faucet again. And it’s cheap.

Then they need more performance. So they buy more. Turn on the faucet again and again and again. And it’s still cheap.

The bill for 100s of TBs of storage comes in at the end of the month. Nobody turned off the faucet. They’ve run up the water bill and flooded the house.

It’s not cheap anymore.
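
To put rough numbers on the faucet (using an assumed rate of about a dime per GB each month, in the ballpark of general purpose cloud block storage):

```python
# Back-of-the-envelope sketch: a dime per GB feels cheap per request,
# but nobody is watching the total as capacity keeps growing.
PRICE_PER_GB_MONTH = 0.10    # assumed illustrative rate

capacity_tb = 1.0
for month in range(1, 7):
    bill = capacity_tb * 1000 * PRICE_PER_GB_MONTH
    print(f"Month {month}: {capacity_tb:>6.0f} TB -> ${bill:>9,.0f} per month")
    capacity_tb *= 4         # the faucet stays on

# By month 6 the team is paying for roughly a petabyte nobody budgeted for.
```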

Cost Overruns — Storage Silos

It’s easy for application teams to waste money in the cloud.

Twenty years ago, each server had its own storage. Server A used 1% of its storage. Server B used 100% of its storage and ran out. Server B couldn’t use Server A’s storage. Then storage teams adopted shared storage (SAN or NAS), which let them give storage resources to any application or system that needed it. Shared storage eliminates the waste from those islands of server storage.

Today, cloud environments don’t share storage.

When I create a cloud instance (aka server), I buy storage for it.

When I create a second cloud instance, I buy storage for it.

When I create a third cloud instance, I buy storage for it!

When I … you get the idea.

We’ve re-created the “island of storage” problem. Except at cloud scale, application teams end up with island chains of storage. Even Larry Ellison doesn’t have archipelago money.

Data Loss — Trusting the Wrong Type of Storage

It’s easy for application teams to lose their data in the cloud.

On-premises environments use resilient shared storage systems. Application owners don’t think about RAID, mirroring, checksums, and data consistency tools. That’s because the storage management team does.

In the cloud, picking storage is like a “Choose Your Own Adventure” story, where you always lose:

  • You choose local storage. When the node goes away, so does your storage. One day you shut down the node to save money. You lost all your data. Start over.
  • You choose block storage. AWS states, “Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% — 0.2%, where failure refers to a complete or partial loss of the volume.” At some point this year you’ll lose a volume. You lost an application’s data. Start over.
  • You choose S3 object storage. You chose resilient storage! You win! **

** Your application can’t run with such slow performance. You lost your customers. Start over.

NOTE: Backups can help. But you should have resilient production storage and backups. Backups should be your last resort, not your first option.
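
Taking the published EBS failure rate above at face value, the odds stack up fast across a fleet of volumes. A quick, simplified sketch:

```python
# Simplified sketch: chance of losing at least one volume in a year,
# assuming independent failures at the quoted 0.1%-0.2% annual rate.
def p_at_least_one_failure(volumes: int, afr: float) -> float:
    return 1 - (1 - afr) ** volumes

for n in (10, 100, 500):
    low = p_at_least_one_failure(n, 0.001)
    high = p_at_least_one_failure(n, 0.002)
    print(f"{n:>3} volumes: {low:.0%} to {high:.0%} chance of a loss this year")
```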

When Should I Plan for Cloud Storage Management?

Now.

It’s time to start thinking about storage management in the cloud.

Today, you’re running cloud applications that don’t need to store data. Or that don’t care if they lose data. Or you’ve just been lucky.

The time to manage cloud storage is coming. You need to start planning now. It can save your applications, your business and your career.

Oh, and don’t spend time worrying about the raffle. They’re always rigged for the customer with the biggest deal pending, anyway.

Why Can’t You Find a Good IT Job?

It hurts to hunt for a job in IT infrastructure right now. Every rejection finds new ways to embarrass and frustrate you. Even the offers carry painful tradeoffs. Cloud has changed the job options for infrastructure engineers. There are no perfect jobs, but there are opportunities.

I’ve seen five types of companies hiring infrastructure engineers. Each has rewards and risks.

Legacy Whale — On-Premises Tech Giants

Legacy infrastructure companies put profit over growth — including yours. Their market may be shrinking, but they still run enterprises’ most important applications. The last company standing will charge a premium for their technology. That’s why the legacy giants need engineers to deliver products for their core markets.

The positives:

  • Salary. They pay good salaries from their profit margins.
  • Enterprise Experience. You learn how to work with a mature product for enterprise customers.

The risks:

  • Layoffs. Profit comes when you earn more than you spend. Products that need incremental development don’t need expensive engineers.
  • Stagnation. You’re working on the same product for the same customers. Everything is incremental. You’re missing sweeping technical and business trends.
  • Left too Late. If you stay too long, interviewers will wonder why. Were you too lazy to move? Too comfortable? Nobody wanted you?

The legacy whales can be a lucrative home, and they teach you how to work with big customers. You just have to ask, “When is the right time to jump ship?”

Legacy Piranha — On-Premises Startups

Legacy piranha companies have to grow fast. The legacy market may be shrinking, but it’s still huge. The legacy whales can’t always move fast enough to block small companies (either with technology or sales). Some piranhas can eat enough of the whales to IPO or get bought.

The positives:

  • System View: You design products from scratch, so you can see new parts of the system.
  • Customer Experience: In a smaller company, you can work directly with customers.
  • Financial Upside: If the company takes off, so does your equity.

The risks:

  • Limited growth: Piranhas need you to do what you’ve done before. The race is on, and they can’t afford to train you on something else.
  • No market: In a shrinking market, everything has to be perfect. The product. The go-to-market. And you need the whale to miss you. For every Pure or Rubrik, there are a dozen companies like Tintri and Primary Data.

The legacy piranhas can be an exciting gamble. You can see the whole system and work with hands-on customers. You just have to ask, “What happens if this fails?”

Killer Whales — The Big 3 in Public Cloud

The killer whales (AWS, Azure, Google Cloud) control the new ocean of IT infrastructure. They’re taking share in the growing market of public cloud. The customers, requirements, and technology are different from the legacy environment. Their scale dwarfs even the largest enterprises. The problems are the same, but the rules are different.

The positives:

  • New Technology: Killer whales mix commodity technology with bleeding edge. They must innovate to stay ahead.
  • New Perspective: The scale is orders of magnitude greater than what we’re used to. The integration of the stack eliminates our siloed view.
  • Growth: The killer whales can afford to pay and give new opportunities.

The risks:

  • Getting Hired: They have their pick of new hires. They may see your experience as a limitation, since they want to build things in a new way.
  • Succeeding: The environment is different. The way you did things won’t work. They’re moving fast. You’re going to be very uncomfortable.
  • Limited Customer Interaction: At their scale, it’s difficult to get direct customer interaction. You’re one of the masses building for the masses.

The killer whales will be an exciting ride that sets you up for the future. You just have to ask, “Am I ready?”

Inside the Blue Whales — Joining IT

Some of the biggest companies in the world build their own IT infrastructure. They create some of the most interesting infrastructure innovation (e.g. Yahoo, Google, Facebook, Medtronic, Tesla). Nothing makes infrastructure requirements more real than building an application on top of it.

The positives:

  • New Technology: You’re building custom technology because vendors’ products don’t work for them.
  • New Perspective: The scale and integration with business applications changes how you view infrastructure.
  • Growth: You could move from infrastructure to building the application.

The risks:

  • Getting Hired and Making a Difference. See “Killer Whales”.
  • You’re a Cost Center: When you build the product, you are the business. When you provide services for the product, you’re a cost center. At Morgan Stanley, an IT member advised me, “Don’t work here. We’re the most innovative technical company on Wall Street, but we’re still the help. The traders are the business. Never be the help.”

The Blue Whales are technology users that push the boundaries of infrastructure. You just have to ask, “Am I comfortable being a cost center?”

Riding the Killer Whales — Building on the Public Cloud

The Killer Whales can’t do everything well. No matter how quickly they hire, they can’t build decades of functionality in a few years. Furthermore, nobody wants to lock into one Killer Whale. They know how that story ends. That’s why companies are adding multi-cloud infrastructure services on top of the public cloud.

The positives:

  • New Technology: You’re riding the new technology, trying to tame it.
  • New Perspective: You learn how companies are trying to use public cloud and what challenges they face. You can see how they’re evolving from legacy to public cloud.
  • Upside: If the company takes off, so do you. You’re the expert in a new market area. Oh, and the financial equity will be rewarding, too.

The risks:

  • No Market: You have the traditional startup concerns (funding, customers, competitors) and more. You worry that the killer whales will add your functionality as a free service. You worry that the killer whales will break your product with their newest APIs. Riding killer whales is scary!
  • Financial Downside: Low salary. Even lower job security.

Some Killer Whale Riders will become the next great technology infrastructure companies. You just have to ask, “How much risk am I comfortable with?”

Conclusion

A decade ago, even incompetent IT infrastructure vendors could grow 10% a year because the market was so strong. No more. Today, there are no infrastructure jobs without risk. Of course, there are still great opportunities.

I’m riding the killer whales because I’d gotten disconnected from new technology and new customer challenges. The risk is terrifying, but I’ve never been happier. The choice was right for me.

What did you choose and why?

Backup Sucks, Why Can’t We Move On?

“Tape Sucks, Move On” (Data Domain)

“Don’t Backup. Go Forward.” (Rubrik)

“Don’t even mention backup in our slogan” (Every other company)

Everybody hates backup — executives, users, and administrators. Even backup companies hate it (at least their slogan writers do). Organizations run backup only because they have to protect the business. I’ve met hundreds of frustrated backup customers who have tried snapshots, backup appliances, cloud, backup as a service, and scores of other “fixes”. They all ask one question –

“Why is backup so painful?!?”

Performance: “I’m Givin’ Her All She’s Got, Captain!”

Backup is painful because it is slow and there is so much data.

Companies expect the backup team to:

  1. Back up PBs of data for thousands of applications every day
  2. Not affect application performance (compute, network, and storage)
  3. Spend less on the backup infrastructure (and team)
  4. Rinse and Repeat next year with twice as much data

Everybody underestimates the cost of backups. While I was at EMC, a federal agency (no way I’m naming this one) complained about their backup performance. In their words, “The data trickles like an old man’s piss.” They were using less than 1% of the Data Domain’s performance. Their production environment, however, was running harder than Tom Cruise (and just as slow). When they set up their application environment, they hadn’t thought about backup. To meet their application and backup SLAs, they had to buy 4x the equipment and run backups 24 hours a day. NOTE: Unless you can pay for IT gear with tax dollars, I would not depend on that approach.

Backups run for a long time and they use a lot of resources. Teams have to balance application performance with backup SLAs across vast oceans of data. It’s an impossible balancing act. That’s why backup schedules are so complex.
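
The arithmetic behind that balancing act is brutal. A rough sketch, with illustrative throughput numbers and ignoring contention with production I/O:

```python
# Rough sketch: how long a full backup runs at a sustained throughput.
def backup_hours(data_tb: float, throughput_gb_per_sec: float) -> float:
    return (data_tb * 1000) / throughput_gb_per_sec / 3600

for data_tb, gbps in ((100, 1), (1000, 1), (1000, 10)):
    hours = backup_hours(data_tb, gbps)
    print(f"{data_tb:>5} TB at {gbps:>2} GB/s -> {hours:6.1f} hours")

# Even at 10 GB/s of sustained throughput, a petabyte takes more than a day.
```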

Backup will be painful until we solve the performance problem. Imagine that you could make a backup in an instant. You could make a simple schedule (e.g. hourly) and not worry. Users could create extra copies whenever they wanted. Backup would be painless!

That was the promise of snapshots. Of course, they ran into the next problem.

Multiple Offsite Copies: “Scotty, Beam Us Up”

Backup is painful because you need to keep many offsite copies.

Companies expect their backup teams to:

  1. Store daily backups, so they can restore data from any day in the past months or years
  2. Restore the applications if something happens to the hardware, the data center, or the region.
  3. Spend less on the backup infrastructure (and team)

That’s why snapshots were never enough. Customers who lost their production system lost their snapshots. Replicating snapshots to a second array didn’t solve the problem, either…

At NetApp, a sales representative asked me to calm Bear Stearns. The director of IT complained that the backup solution (SnapVault to another NetApp system) cost more than the production environment. “You’re lucky that we don’t have to worry about money at Bear Stearns.” (Good times!) Then, he peppered me with questions about exotic failures— e.g. hash collisions, solar flares, and quantum bit flips. Our salesman had asked me to “distract him” from these phantasms, so I did. “I wouldn’t worry about those issues. We’re way more likely to corrupt data with a software bug. And that would corrupt your production and backup copies.” The blood drained from the customer’s face and he stopped asking questions (Mission accomplished!). As we left, the salesman snarled, “Next time, try to distract the customer by saying something good about our product.”

Companies store backups on alternate media (tape, dedupe disk, cloud) for reliability at a reasonable cost. That’s why backup software translates data into proprietary formats tuned for that media. The side effect is that only your backup software can read those copies. Result: Backup vendor lock-in!

Backup will be painful until we can solve the problems of performance and storing offsite copies. Imagine that you could make a resilient, secure offsite backup in an instant. You could make a simple schedule and recover from anything. Backup would be painless!

Until, of course, you met an application owner.

Silos: “Resistance is Futile”

Backup is painful because you have to connect the backup process to the application teams.

Companies expect their backup teams to:

  1. Work across all applications in the environment
  2. Respond quickly to application requests
  3. Spend less on the backup infrastructure (and team)

As difficult as technology is, connecting people is even more challenging. Application owners don’t trust what they can’t see or control.

At one EMC World, I hosted a session for backup administrators and DBAs. At first, it was a productive discussion. One DBA explained, “If you can’t recover the database, it’s still my application that’s down. That scares me.” The group started brainstorming ways to give DBAs more visibility into the backups. Then a DBA blurted out, “I just can’t trust you guys with my database backups. You became backup admins because you weren’t smart enough to be DBAs. I’m going to keep making my own local database dumps.” After that, we decided to try to solve the wrestling feud between Bret Hart and Shawn Michaels instead. It seemed more productive.

Companies need to manage complex backup schedules and create offsite copies. That’s why we have backup software. Backup software and schedules are so complex that companies hired backup teams to manage them. That extra layer is why business application owners don’t trust the backups.

Backup will be painful until application teams can trust and verify the backups of their applications.

Moving On? “I canna’ change the laws of physics”

Why is backup so painful?

It’s slow and expensive. It locks you into a backup vendor. It creates a backup silo that slows the business down. Other than that, backup is great.

Why have 25 years of innovative companies not eliminated the pain of backup?

Because we couldn’t change the laws of physics in the data center. Too much data. Too expensive to get data offsite. Too hard to connect backup teams and application teams.

Why am I optimistic for the future?

Because the cloud changes the laws of physics for backup. We can stop tweaking backup and finally fix it. We’ll save that mystery for next time.

Merry Misadventures in the Public Cloud

My first Amazon Web Services (AWS) bill shocked and embarrassed me. I feared I was the founding member of the “Are you &#%& serious, that’s my cloud bill?” club. I wasn’t. If you’ve recently joined, don’t worry. It’s growing every day.

The cloud preyed on my worst IT habits. I act without thinking. I overestimate the importance of my work (aka rampaging ego). I don’t clean up after myself. (Editor’s note: These bad habits extend beyond IT). The cloud turned those bad habits into zombie systems driving my bill to horrific levels.

When I joined Nuvoloso, I wanted to prove myself to the team. I volunteered to benchmark cloud storage products. All I needed to do was learn how to use AWS, Kubernetes, and Docker, so I could then install and test products I’d never heard of. I promised results in seven days. It’s amazing how much damage you can do in a week.

Overprovisioning — Acting without Thinking

I overprovisioned my environment by 100x. The self-imposed urgency gave me an excuse to take shortcuts. Since I believed my on-premises storage expertise would apply to cloud, I ran full speed into my first two mistakes.

Mistake 1: Overprovisioned node type.

AWS has dozens of compute node configurations. Who has time to read all those specs? I was benchmarking storage, so I launched 5 “Storage Optimized” instances. Oops. They’re called “Storage Optimized” nodes because they offer better local storage performance. The cloud storage products don’t use local storage. I paid a 50% premium because I only read the label.

Mistake 2: Overprovisioned storage.

You buy on-premises storage in 10s or 100s of TB, so that’s how I bought cloud storage. I set a 4 TB quota of GP2 (AWS’ flash storage) for each of the 5 nodes — 20TB in total. The storage products, which had been built for on-premises environments, allocated all the storage. In fact, they doubled the allocation to do mirroring. In less than 5 minutes, I was paying for 40TB. It gets worse. The benchmark only used 40GB of data. I had so much capacity that the benchmark didn’t measure the performance of the products. I paid a 1000x premium for worthless results!
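
For a sense of the damage, here is the back-of-the-envelope math, assuming roughly $0.10 per GB-month for gp2 (the list price at the time was in that neighborhood):

```python
# Sketch of Mistake 2: provisioned capacity vs. what the benchmark used.
PRICE_PER_GB_MONTH = 0.10            # assumed gp2 rate

provisioned_gb = 5 * 4000 * 2        # 5 nodes x 4 TB quota x 2 for mirroring
used_gb = 40                         # what the benchmark actually touched

print(f"Provisioned: {provisioned_gb / 1000:.0f} TB -> ${provisioned_gb * PRICE_PER_GB_MONTH:,.0f} per month")
print(f"Needed:      {used_gb} GB  -> ${used_gb * PRICE_PER_GB_MONTH:,.0f} per month")
print(f"Overprovisioned by {provisioned_gb // used_gb}x")
```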

Just Allocate A New Cluster — Ego

I allocated 4x as many Kubernetes clusters as I needed.

When you’re trying new products, you make mistakes. With on-premises systems, you have to fix the problem to make progress. You can’t ignore your burning tire fire and reserve new lab systems. If you try, your co-workers will freeze your car keys in mayonnaise (or worse).

The cloud eliminates resource constraints and peer pressure. You can always get more systems!

Mistakes 3 & 4: “I’ll Debug that Later” / “Don’t Touch it, You’ll Break It!”

Day 1: Tuesday. I made mistakes setting up a 5-node Kubernetes cluster. I told myself I’d debug the issue later.

Day 2: Wednesday. I made mistakes installing a storage product on a new Kubernetes cluster. I told myself I’d debug the issue later.

Day 3: Thursday. I made mistakes installing the benchmark on yet another Kubernetes cluster running the storage. I told myself that I’d debug the issue later.

Day 4: Friday. Everything worked on the 4th cluster, and I ran my tests. I told myself that I was awesome.

Days 5 & 6: Weekend. I told myself that I shouldn’t touch the running cluster because it took so long to set up. Somebody might want me to do something with it on Monday. Oh, and I’d debug the issues I’d hit later.

Day 7: Monday. I saw my bill. I told myself that I’d better clean up NOW.

In one week, I had created 4 mega-clusters that generated worthless benchmark results and no debug information.

Clicking Delete Doesn’t Mean It’s Gone — Cleaning up after Myself

After cleaning up, I still paid for 40TB of storage for a week and 1 cluster for a month.

The maxim, “Nothing is ever deleted on the Internet” applies to the cloud. It’s easy to leave remnants behind, and those remnants can cost you.

Mistake 5: Cleaning up a Kubernetes cluster via the AWS GUI.

My horror story began when I terminated all my instances from the AWS console. As I was logging out, AWS spawned new instances to replace the old ones! I shut those down. More new ones came back. I deleted a subset of nodes. They came back. I spent two hours screaming silently, “Why won’t you die?!?!” Then I realized that the nodes kept spawning because that’s what Kubernetes does. It keeps your applications running, even when nodes fail. A search showed that deleting the AWS Auto Scaling Group would end my nightmare. (Today, I use kops to create and delete Kubernetes clusters).
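
For anyone fighting the same undead nodes, this is roughly the fix, sketched with boto3 (the region and Auto Scaling Group name are placeholders; kops now handles this cleanup for you):

```python
# Sketch: Kubernetes nodes kept respawning because the AWS Auto Scaling
# Group relaunched them. Deleting the group, not the instances, stops it.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")  # placeholder region

autoscaling.delete_auto_scaling_group(
    AutoScalingGroupName="nodes.k8s-benchmark.example.com",  # placeholder name
    ForceDelete=True,  # terminate the group's instances along with it
)
```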

Mistake 6: Deleting Instances does not always delete storage

After deleting the clusters, I looked for any excuse not to log into the cloud. When you work at a cloud company, you can’t hide out for long. A week later, I logged into AWS for more punishment. I saw that I still had lots of storage (aka volumes). Deleting the instances hadn’t deleted the storage! The storage products I’d tested did not select the AWS option to delete the volume when terminating the node. I needed to delete the volumes myself.
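
A quick way to find those leftovers is to ask for volumes in the "available" state, which means they exist (and bill) but nothing is attached to them. A minimal boto3 sketch:

```python
# Minimal sketch: list EBS volumes that no instance is using anymore.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # placeholder region
orphans = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]

for volume in orphans:
    print(volume["VolumeId"], f'{volume["Size"]} GiB', volume["AvailabilityZone"])
```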

Mistake 7: Clean Up Each Region

I created my first cluster in Northern Virginia. I’ve always liked that area. When I found out that AWS charges more for Northern Virginia, I made my next 3 clusters in Oregon. The AWS console splits the view by region. You guessed it. While freaking out about undead clusters, I forgot to delete the cluster in Northern Virginia! When the next month’s sky-high bill arrived, I corrected my final mistake (of that first week).
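
The lesson from Mistake 7, in code form: sweep every region, because the console only shows one at a time. A sketch that extends the orphaned-volume check above:

```python
# Sketch: repeat the orphaned-volume check in every region.
import boto3

regions = boto3.client("ec2", region_name="us-east-1").describe_regions()["Regions"]
for region in regions:
    name = region["RegionName"]
    volumes = boto3.client("ec2", region_name=name).describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]
    if volumes:
        print(f"{name}: {len(volumes)} unattached volume(s) still billing")
```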

Welcome to the Family

Cloud can feel imaginary until that first bill hits you. Then things get real, solid, and painful. When that happens, welcome to the family of cloud experts! Cloud changes how we consume, deploy, and run IT. We’re going to make mistakes (hopefully not 7 catastrophic mistakes in one week), but we’ll learn together. I’m glad to be part of the cloud family. I don’t want to face those undead clusters alone. Bring your boomstick.

Cloud Data Protection is Business Protection

“As I waded through a lake of rancid yogurt, each vile step fueled my rage over failed backups.” The server that ran a yogurt manufacturer’s automated packaging facility crashed. The IT team could recover some of the data, but not all. They hoped everything would “be OK”. When they restarted the production line, they learned that hope is not a plan. Machines sprayed yogurt like a 3-year-old with a hose in a crowded church. By the time they shut down the line, they’d created a yogurt lake. It took two months to clean and re-certify the factory. They missed their quarterly earnings. People lost their jobs.

Data protection matters because data recovery matters. Even in the cloud. Especially in the cloud.

Businesses Run on Data

Digital transformation has turned every company into an application business.

Have you ever thought about the lightbulb business? Osram manufactured lightbulbs for almost a century. Then, LED bulbs decimated the lightbulb replacement business. Osram evolved into a lighting solution company. Osram applications optimize customers’ lighting for their houses, businesses, and stadiums. Now a high-tech company, they sold the traditional lightbulb manufacturing business in 2017.

How about the fruit business? Driscoll’s has grown berries for almost 150 years. In 2016, berries were the largest and fastest growing retail produce. Driscoll’s leads the market. They credit their “Driscoll’s Delight Platform”. It tracks and manages the berries from the first mile (growing) through the middle miles (shipping) to the last mile (retail consumer). Driscoll’s analyzes data at every stage to optimize the production and consumption of berries. Driscoll’s is a technology company that sells berries.

Every company is in the application business. Applications need data. To design lighting, Osram uses data about your house. To deliver the best berries, Driscoll’s analyzes data about the farms (e.g. soil, climate), shipping (e.g. temperature and route), and customer preferences.

Modern businesses depend on applications. Applications depend on data. Therefore, modern businesses depend on data.

Data Protection: Because of Bad Things and Bad People

Every company protects their data center because there are so many ways to lose data.

CIOs have seen their companies suffer through catastrophes. Hurricane Harvey flooded Houston data centers. Hardware fails and sometimes catches fire. Software bugs corrupt data. People delete the presentation before the biggest meeting of their lives, so they throw a stone at a wasps’ nest to incite a swarm, get rushed to the hospital with dozens of vicious stings to have an excuse to re-schedule (or so I’ve heard).

IT organizations have also survived deliberate attacks. External hackers strike for fun and profit. Ransomware has become mainstream; cyber criminals can now subscribe to Ransomware as a Service! Now, anybody can become a hacker. Some attacks happen from inside, too. A terminated contractor at an Arizona bank destroyed racks of systems with a pickaxe. (I’ll never forget the dumbfounded CIO muttering, “We think he brought the pickaxe from home.” Because that’s what mattered.)

After decades of enduring data loss, IT knows to protect the data center. Do we also need to protect data in the cloud?

Data Protection: Bad Things and Bad People Affect the Cloud

Every company needs to protect their data in the cloud because there are even more ways to lose it.

Bad things happen in the cloud. First, users still make mistakes. The cloud provider is not responsible for recovering from user error. Second, the cloud is still built of hardware and software that can fail. Vendors explain, “Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% — 0.2%, where failure refers to a complete or partial loss of the volume.” The applications you lose may be unimportant… or they may decimate your business. Third, since you are sharing resources, performance issues can affect data access. Amazon Prime Day is the most recent example. Finally, storms trigger data loss in a public cloud data center, just like they do in a corporate data center.

Public clouds are a bigger target for bad actors. Aggressive nations (with names that rhyme with Russia, Iran, North Korea, and China), bitcoin miners, and traditional criminals hack companies running in the cloud. Those hacks obliterate companies. Hackers deleted Code Spaces’ data in AWS. Two days later, the business shut down. Meanwhile, the scope of the public cloud makes internal threats more serious. The pickaxe (or virus)-wielding employee can now damage hundreds of companies instead of one!

Data is not any safer in the cloud than it is on-premises. Cloud providers try to protect your data, but it’s not enough. Even in the cloud, it’s your data. It’s your business. It’s your responsibility.

Protect the Cloud Data, Protect the Business

Modern businesses run on applications. Applications run on data. Most companies that lose data go out of business in 6 months or less.

Unfortunately, bad things and bad people destroy, steal, or disable access to the data. Whether you run on-premises or in the cloud, one day you will lose data. If you have a good backup and disaster recovery solution, you can recover the data. Your business can survive.

Amazon CTO Werner Vogels declared, “Everything fails all the time.” Companies need to protect their data in the cloud, so they can recover from those failures. Now, more than ever.

Traditional Applications Run Better in Public Cloud

Public Cloud = More Choice + Better Data Protection

Public cloud works. Not just for SaaS, cloud native applications, or test and development. Not just for startups or executives bragging to each other on the golf course. Public cloud works for traditional, stable applications. It can deliver better service levels and reduce costs … even compared to a well-run on-premises environment.

To date, market analysts have focused on cloud disrupting who buys IT infrastructure. Frustrated lines of business pounced on the chance to bypass IT. Cloud let them “Fail Fast or Scale Fast”. They didn’t have to wait for IT approval, change control, hardware acquisition, or governance. Lines of business continue to embrace cloud’s self-service provisioning at a low monthly cost.

Still, conventional wisdom says public cloud can’t compete with a well-run on-premises environment. IT architects argue that public cloud can’t match the performance and functionality of legacy environments. IT Administrators can’t tweak low level knobs. IT Directors can’t demand custom releases. How can vanilla cloud handle the complex requirements of legacy applications? Financial analysts note that public cloud charges a premium for its flexible consumption. Stable workloads don’t need that flexibility, so why pay the premium?

Conventional wisdom is wrong. Most traditional workloads don’t need custom-built environments. You don’t need a Formula-1 race car to pick up groceries, and you don’t need specially-made infrastructure to run most applications. Moreover, public cloud’s architectural advantages can reduce IT costs, even with the pricing premium.

In the next stage, public cloud will change how we architect IT infrastructure. Public cloud has two architectural advantages for traditional applications: more price/performance options and on-demand provisioning for data protection.

Public cloud offers more price/performance choices than on-premises infrastructure. Outside of the Fortune 50, most companies don’t get to buy “one of everything” for their infrastructure. Instead, they buy a one-size-fits-all workhorse system to support all the workloads. The public cloud offers more technology choices than even the largest IT shop. It is the biggest marketplace (pun intended) for different technology configurations. Cloud levels the playing field between smaller and bigger companies.*

* NOTE: For this to happen, we need to solve the operational challenges of running different cloud configurations.

Public cloud can improve data protection. For years, IT has struggled to deliver high-performance disaster recovery, backup, and archive. Companies can’t afford to run DR and archive environments for all their applications; maintaining two near-identical sites costs too much. That’s why they pretend that their backups can be DR and archive copies. Unfortunately, when disasters or (even worse) legal issues strike, recovery cannot begin until IT provisions a new environment. Companies collapse before recoveries can complete.

Public cloud’s on-demand provisioning enables cost-effective first-class DR, archive, and backup. Customers don’t waste money on idle standby environments. Nor do they treat “hope that nothing goes wrong” as a strategy. Instead, when necessary, they near-instantly spin up compute and storage in a new location. Then, they near-instantly restore the data and start running.* With public cloud, IT can unify enterprise-class DR, backup, and archive.

Organizations are already moving backup copies to cloud object storage. The next step will be to use those copies for unified data protection.

*NOTE: For this to happen, we must create cost-effective cloud protection storage and build near-instant data recovery mechanisms.

Public cloud works for traditional applications. You can run applications on the best configuration, rather than what is available. You can have first-class DR and archive, rather than “best effort” with backup copies. You can replace your hand-crafted environments with something less expensive and more functional. Public cloud should not threaten IT; instead its architecture should help IT to deliver better services. It’s time to stop resisting and start building.

How I Found My Path to the Cloud

“Dell EMC’s Data Protection Division won’t need a CTO in the future.”

I started 2017 as an SVP and CTO at the world’s largest on-premises infrastructure provider. I ended the year at a 10-person startup building data management for the public cloud. Like many, my journey to the cloud began with a kick in the gut. Like most, I have no idea how it will end.

The Dell layoff didn’t depress me. I’d seen the budget cut targets, so I knew I wasn’t alone. The layoff felt personal rather than professional, so my ego wasn’t bruised. Since cloud is eating the on-premises infrastructure market, I’d wanted to move. Since I’d always had my choice of jobs, I looked forward to new opportunities.

The job hunt, however, plunged me into the chasm of despair. I wanted to be cutting edge, so I applied to cloud providers and SaaS vendors. What’s worse than companies rejecting you? Companies never responding. Even with glowing internal introductions from former colleagues, I heard nothing. No interview. No acknowledgement. Not even rejection. My on-premises background made me invisible. Then, I applied to software companies moving to the cloud. They interviewed me. They rejected me for candidates with “cloud” expertise. My on-premises background made me undesirable. Legacy infrastructure companies called, but I needed to build a career for the next 20 years, not to cling to a job for 5 more years. For the first time in my working life, I worried about becoming obsolete.

Then I found hope. I met a recently “promoted” Cloud Architect whose boss wanted him to “move IT to cloud”. His angst-ridden story sounded familiar: change-resistant organization, insufficient investment, and unsatisfactory tools. He couldn’t deliver data protection, data compliance and security, data availability, or performance. He couldn’t afford to build custom data management solutions. The business didn’t even want to think about it. They did, however, expect an answer.

I realized data management was my ticket into the cloud. Even in cloud, data management problems don’t go away. The problems I know how to solve still matter. In fact, expanding digitization and new regulations (e.g. GDPR, ePrivacy Directive) make solving those problems more important. Even better, the public cloud’s architecture opens better ways to build data management. Electricity surged through me. Cloud gave me the opportunity to build the data management solution I’d spent my career trying to create. Now, I needed to find a place to build it.

Nuvoloso, our startup, wants to help people like me get to the cloud. Individually, each member of the team has built data management for client-server, appliances, and virtualization. Now, together, we’re building data management for cloud. The requirements don’t change, but the solutions must. Each of us adds value with our existing skills, while learning about the public cloud. Our product will enable infrastructure IT professionals to follow our path. We will help them use their experience to add value and get a foothold in the cloud.

The journey to the cloud still ties my stomach in knots. When I started at Nuvoloso, I felt helpless and terrified. Cloud took everything I knew, and changed it just enough to confuse me. As I’ve adjusted, I feel helpful, excited and (still) terrified. Public cloud is real. Public cloud changes how businesses buy and use technology. Public cloud does not, however, eliminate the requirements for data management; it amplifies them. Public cloud will not replace us. Public cloud needs our skills and experience. No matter where the applications run, somebody needs to manage the data infrastructure.

Your journey to the cloud may begin with a project, a promotion, or (like me) a layoff. Regardless of how you start, remember: There’s a future for people like us.