How Cloud Apps Go Wrong

Data Can Bring You Down

Welcome to the Cloud

It’s exciting! You’ve got a huge new toolbox with a cornucopia of choices. And forums with a bonanza of advice.

Whether you are creating, migrating, or something in-between, everybody has an opinion about building cloud apps:

  • Development Model: Serverless? Micro-services? Containerized? Traditional?
  • Platform: Cloud-Provider? PaaS? IaaS packages?
  • Database Stores: MongoDB? PostgreSQL? DynamoDB? S3?

While you dig through the avalanche of advice, there is one topic conspicuous by its absence… what is the right way to handle persistent data?

Data storage in the cloud is clean and easy? Courtesy:

Day 1: How Should I Manage My Application's Data?

There are even more ways to store data in the cloud than there were on-premises!

You can use:

  • Database as a Service: AWS RDS, MongoDB Atlas, etc.
  • Standalone Databases: PostgreSQL, MongoDB, MySQL, etc.
  • Files: local files or cloud NAS
  • Objects: more tiers than an Instagram wedding cake

Most applications use a combination of tools to get the best results.

Once you decide how to store the data, you need to provision for capacity and performance. You might even consider how you’ll scale, deal with failures, secure the data, and meet compliance regulations (Hey, I’m an optimist).

It’s a stressful decision. Data is “the new oil”, “the lifeblood of business”, and “the new bacon”. You want to pick the right way to manage your data; your business depends on that information. Unfortunately, with so many variables, making the right decision seems impossible. Still, you’ll pick something reasonable, and then you’ll never have to worry about it again. Phew.

Except… data doesn’t take care of itself. In the cloud, there is no storage, backup, disaster recovery, or compliance team watching over your data. If you care about your application, you’ll be managing its data every day of the application’s life.

Even the Gingerbread man can’t escape data management. Courtesy:

Day 2: A Day in the Life of a Cloud App Developer

Congratulations! You’ve released an application into your environment. You’ve created something that helps people. Take a moment to celebrate.

Now, take a deep breath. Your life will never be the same. Here’s a day in your life.

06:30: Make sure backups are happening and that they’re being replicated off site. Consider testing a restore to make sure the backups are good. Resolve to do it tomorrow… for the 100th straight day. Push down the guilt.

10:30: You get an escalation about application performance. After looking at processing and network, check out the data path. Figure out which storage resources are serving the application. Look at historical data to find changes in workload, resource performance, etc. Hope that application performance goes back to normal on its own.

12:10: Discuss with others how to scale the application. If you add more compute resources, how can they share the same data? Will that become a bottleneck?

14:07: Somebody hits a bug in the application. Create a clone environment, so you can reproduce the bug and test a fix. Wait hours to clone or create a data set.

16:20: Finance has asked you to reduce the cost of running your application. Explore ways to either use less expensive storage or use it more efficiently/dynamically.

18:30: As you walk out of the building, wonder when you will have time to ever build a new application.

23:30: Security has identified a flaw in the configuration of some AWS S3 buckets. They want to make sure your data is secure and encrypted. You need to send them details of what you’re doing and how. They also demand full audit logs of who has access to the data and what they’ve done.

06:30: Do it all over again…

The cloud offers infinite tools for building applications: UI, queuing, analytics, etc. Without data management, however, you won’t have time to create new applications. Instead, you’ll become a de facto platform administrator: managing storage, backup, and compliance for your existing apps.

There is another way. Credit: Unknown

Day 2, Take 2: A Better Day

It may feel like you only have two options:

  • Bad: Spend every day in data management hell.
  • Worse: Use only one type of storage — e.g. object, DBaaS, etc.

There is a third answer: a data management solution that does the job of managing storage, data protection, and compliance for you.

In this world, today is not the worst day of your life…

06:30: Breakfast. [Backups have been happening automatically every 4 hours.]

10:30: Develop your new application. [Storage resources automatically scaled up to meet your existing application’s performance needs.]

12:10: Eat lunch and talk with co-workers about your new application.

14:07: Somebody hits a bug in the application. Instantly clone the environment to reproduce and test a fix; send the fix through the CI/CD pipeline.

16:20: [Storage resources scale down as the load reduces.]

18:30: Kick off a set of tests with real data for your new application; head home.

23:30: Sleep. [Your data is secure and the records are available to security, auditors, etc.]

Aim for the head and take control of data in the cloud. Courtesy:

The First Day of the Rest of Your Life

Cloud offers a tantalizing array of options to help build applications. Without data management, those tools will torment you because you won’t have time to use them. Plan ahead, though, and you can find a better way. Day 2 can be the best day of your life, not a nightmare.

Cloud should give you choice. Cloud should give you automation. Cloud should make you more productive. If that’s not happening, it’s time to look for a cloud-first data management solution.

Backup Just Means Backup

How the Backup Industry Got It Wrong for Two Decades

I surveyed the room of mostly empty chairs at my VMworld 2017 presentation about Backup and Recovery, and I thought, “I’ve made a huge mistake.”

I spent my career being wrong about backup. For two decades I believed customers wanted to get more value out of their backups. They wanted us to add security, compliance, archive, and data intelligence to backup software, right? Backup hardware should do something to add value instead of holding idle copies!

The backup industry keeps believing that customers want to go “Beyond Backup”. They don’t.


Backup companies listening to their customers. Courtesy of:

Customers Didn't Want Beyond Backup

At NetApp, I thought customers should use backup snapshots for test and development. They didn’t. At EMC, I believed Data Domain should store all secondary data and NetWorker should unify archive, compliance and analytics. Customers did not share that belief.

They didn’t want to do more with their backup. They wanted backup to work.


"Cost less, Work More" Courtesy:

Customers Wanted Simpler Backups

Most customers want two things from their backups:

  1. Cost less.
  2. Work more.

Backups and restores fail too often. It takes too many people to manage (read: troubleshoot) backups. Backup software and hardware cost too much. IT directors do not want to think about doing more with their backups. They want to stop thinking about backups!

Those IT silos will kill you. Courtesy: wikipedia

"Beyond Backup" Could Never Have Worked

Not only do customers not want it, but “Beyond Backup” can’t work in a traditional environment. IT departments aren’t built to share backup data or infrastructure. In an enterprise, different teams own backup, archive, file storage, and test/dev (or CI/CD), and they don’t like to share. Each function has different tools, security and access requirements. “Beyond Backup” won’t break down decades of IT silos.

Organizational structure drives architecture more than technology, and IT is not set up to use backup for anything other than backup.  


Why Was I So Wrong for So Long?

Enterprise Backup is NOT Simple

We couldn’t make backup simple. We made backup better for some applications and some environments, but we couldn’t make enterprise backup simple.

  • Backup is Complex: Enterprise backup will never be simple. The array of applications, servers, and OSes makes it a miracle that backup ever works. Anybody who says they can make an existing traditional backup environment simple is lying, crazy, or both.
  • Snapshots are not Enough: Local and remote snapshots (e.g. NetApp SnapMirror) augment backup, but nobody trusts all their data and backups to one storage vendor. Since they build a heterogeneous environment, they need traditional backup.
  • Self-Service is not Enough: Some applications can protect their own data (e.g. Oracle), but enterprises don’t have the processes and tools to support self-service protection. Self-service protection augments backup, but customers still anchor on traditional backup.

Legacy Backup is NOT simple. Courtesy:

It's All About the Money

Since we couldn’t build what the customers wanted, we tried to convince the customers to buy what we could build.

The business demanded:

Profit: Backup products boast staggering margins. How do you justify those margins if the product does the same thing every year, just with a faster processor (hello: iPhone)? So you add features, regardless of their value.

Growth: Every company has a backup solution, so you grow by swapping out an existing product. If they’re going to suffer through a switch, customers need to believe that you’re leading them to the promised land. Otherwise, why replace one dinosaur with another?


I Wanted to Believe

I believed in “Beyond Backup” because I needed to believe that I could do something of value. I wanted to build something that mattered. And if you build backup products, everything becomes a backup problem.



Backup has improved in the last 20 years.

Snapshots … Deduplicating backup appliances … VMware snapshots and backup APIs (and modern backup software that uses them) … Application self-service … Backup as a Service…

Despite these advances, customers still spend too much time and money running backups. And vendors are still selling features that customers don’t need.

I spent too much time trying to sell what we could build, instead of solving the customers’ real problem.

At Nuvoloso, we have a new mission: make backup simple and inexpensive.


Why Urgency for Cloud Went to 11

What the Executives Aren’t Telling You

Congratulations! You own “the cloud platform” for your company. Maybe you applied for the role. Maybe you got volunteered. Most of you are just doing the job because somebody has to.

Regardless, your job is simple: lay tracks in front of a speeding freight train without getting flattened. (I said the job is simple, not easy.)

Why did the company put you in this position? Why are they asking you to move legacy workloads? And why are they pushing so hard now?

The #1 reason I hear from cloud practitioners is: “Because my Management said so.” If you want to be successful, that answer is not good enough. You need to know why the company wants to use public cloud, so you know how they’re measuring success… and you.

Your boss, talking about cloud. Courtesy: Bryan Valenza

Why Public Cloud?

Why are most companies adopting cloud?


They aspire to move faster than their competitors. Executives imagine that first to the cloud will get the “multi-cloud, serverless, Kubernetes, microservices, automated, agile, synergistic, digital transformation, IT modernization orgasm of profit!”*

Buzzwords aside, there are real benefits to cloud. It helps companies develop, deploy, and scale applications. It shifts technology costs from large irregular capital expenses to predictable operational expense. Underneath the hype, cloud has value. That’s why it’s growing.

* NOTE: These are actual statements from actual CEO/CIO/CFOs.

The Executive Conference Room for “Orgasm of Profit” Courtesy: Disney

Why Move Old Workloads to Public Cloud?

If the business wants to move forward faster, why spend time on legacy applications?

Critical Mass.

Companies have legacy environments, private cloud, and public cloud. The legacy runs the business. Most IT professionals are experts in one legacy discipline — e.g. compute, storage, networking. Since people want to feel useful, they focus on their silo in the legacy environment. That’s why the public cloud never gets enough attention from IT. The only way to drive critical mass to the cloud is to force IT to move the legacy applications to the cloud. And if that saves the company capital expense on equipment and data centers, bonuses for everyone!*

* NOTE: “Everyone” being only those with access to the conference room dedicated to the “orgasm of profit”.

The business pressure to move to cloud now is real. Courtesy: South Park

Why are Companies Moving NOW?

Why is management putting so much stress on moving to cloud now?

They’re not. It just feels that way. You moved the EASY workloads to the cloud. Moving the next workloads will be HARD. But the schedule is the same. That’s stressful.*

Executives have been pushing for agility and savings via cloud for years. First, companies adopted SaaS for basic functions. Second, they moved test and development to cloud. Third, they stored cold data in the cloud.

Now that you’ve done the “easy” work, it’s time for the hard job — moving real applications. Real applications keep persistent customer data in databases and files. Real applications are complex. Real applications need availability, security, data protection, and predictable performance. Real applications run the business. (Don’t panic, though. There are many real applications to move before getting to SAP and Oracle.)

Executives are hooked on cloud wins. Those wins “prove” that they’re innovating and beating the competition. The savings feel good, too. At each hardware refresh cycle, moving to the cloud cuts capital expenses. The savings from each cloud step funds the next one. It doesn’t matter that each step gets more difficult. Everything depends on the next hit of capital savings. That’s why executives need you to deliver the next step… now.

* NOTE: I took a class taught by Turing Award winner Michael Rabin. He spent half of each lecture covering simple arithmetic. At the end, he raced through complex math proofs. We asked why he spent so much time on the simple math vs. the hard math. His answer: “It’s all simple to me.” That’s how executives think about cloud. It’s all simple to them.

Most executives thought Spinal Tap was a documentary. Courtesy:


Businesses need to move to the cloud to compete. It’s not enough to just build some cloud-native applications. They need critical mass on the cloud. That’s why they’re asking IT to migrate legacy workloads.

IT feels tremendous pressure from the business because the next cloud migrations will be hard. There are no more easy wins. You’ve done SaaS, test and development, and archive. Now, it’s time to move business applications. They’re complicated. They have data. They run the business. And they need to be moved now.

Congratulations on owning the cloud platform! Keep running, the train is always coming.

How to Begin Your Cloud Career

Codeword: Agile

The #1 question people used to ask: “How can I get management to buy into my idea?”

Now it’s: “How can I get management to buy into my idea about cloud?”

Then they talk about their attempts to sway their bosses. I’m not surprised they’re not succeeding. I’m surprised that they haven’t been fired.

Don’t jump in front of a runaway cloud train. Courtesy: Thomas the Tank Engine

What Not To Do

If you’re about to use any of these approaches, stop yourself. Even if you have to tap into your inner Tyler Durden and knock yourself out.


Here’s Why It Won’t Work!!

You’ve seen the cloud plan. Your company has been playing with the cloud — test and development and some cloud-native toy applications. It’s gone well. Now they’re planning to run applications with data (aka — real applications).

Now is your moment! You warn everybody that there’s a looming disaster. There’s no plan for handling the storage failures (0.1% of devices) … or backup … or security. There’s no strategy to avoid vendor lock-in. And they sideline you. What?!

Lesson: Everybody has bought in, and you can’t stop the train. Nobody wants to hear why the train will derail. Instead of seeming wise, you sound like you’re protecting your job.

It’s Going to Be Too Expensive!!

This is a favorite criticism of the cloud. Especially from legacy IT vendors. The argument goes:

  • A well-run IT department can deliver the same services at a cheaper price.
  • You’re paying for flexibility in the cloud, so it must be more expensive.

Despite this wisdom, the business units ignore you.

Lesson: Cloud isn’t about cutting costs. Businesses are frustrated with IT’s lack of agility, and cloud lets them move faster. Since you’ve just aligned yourself with IT, you’re now “part of the problem”.

This is how the business thinks of IT. Courtesy:

If You Give Me 6 People and 6 months, I Can “Do It Right”

Businesses are already “swiping a credit card” and running in the cloud. You asked for a team of people and time to come up with a plan. That sounds like you’re using a legacy approach to design a new environment. They hear warning bells, and find somebody who will do it faster with fewer people.

Lesson: Executives like cloud because there’s no lead time. If you’re going to appeal to them, you can’t talk in quarters or even months. Think weeks.

It’s your boss when you bring up new tech. We both know it. Courtesy:

Let me Try this New Technology!

You know Docker, Kubernetes, and/or MongoDB would help the company develop applications faster. Somehow. You extol the virtues of Docker Overlay Networks, Kubernetes StatefulSets, and eventually consistent NoSQL databases. Unfortunately, your boss refuses to commit and asks you to write up a report. You know nothing is going to happen.

Lesson: Your managers do not have grounding in the new technology, so they feel insecure. They were probably last “hands-on” with VMs. They’re not going to risk their necks for something they don’t understand.


Don’t be negative. Don’t be slow. Don’t make your boss feel stupid.

(Before you laugh, be honest. How many times have you broken these rules?)

Be Agile. Agile is Awesome! Courtesy: IDG Connect

What To Do

Be Agile. Agile is the term of the day. Executives, businesses, and managers love the word and what it symbolizes. Everybody wants to move faster and cheaper. Everybody wants to “Be Agile.”

To change your approach, follow this formula:

  1. Explain your business value (bonus points if you use the word agile!)
  2. Bring solutions to the problems

A More Resilient Cloud Makes the Business More Agile

Business Value: A more resilient cloud environment makes us more agile. With a resilient cloud, we can lift-and-shift existing applications. Without it, we need to re-architect everything to be cloud-native. That will be slow and expensive.

Problem: AWS block storage has an annual failure rate of about 0.1%.

Solution: We should mirror the block devices. We should make backups on the resilient object storage in multiple clouds.

Don’t worry, business units will learn to love best practices. Courtesy:

Centralized Cloud Best Practices Make the Business More Agile

Business Value: Central management of the cloud makes us more agile. Business units won’t have to figure out what cloud configuration works best with trial and error. We’ll do that work, so they can focus on building revenue-generating applications.

Problem: Each business unit is buying its own cloud resources. There are billions of combinations. They don’t have the time to figure out what works best. They’re picking something and hoping it’s reasonable.

Solution: A small central team can work on best practices. We can even A-B test across groups to find out what works best.

Give your developers the cloud EASY button. Courtesy:

A Simpler Cloud Makes Developers More Agile

Business Value: We can use cloud for more applications, if we give the application teams a more mature environment. Otherwise, they need to learn to build microservices before they’re productive.

Problem: The cloud lacks data management: availability, performance management, and data protection. The application teams have to build data management into their apps. The extra work slows them down.

Solution: We will build cloud data management, so more application developers can be productive.

You can help the lightbulb go on for your boss. Courtesy: Disney

New Technology Can Help Us Be More Agile With Cloud Providers

Business Value: Running in multiple clouds gives us leverage against any one vendor. We can run different applications in different clouds.

Problem: It’s a big learning curve to run in different clouds.

Solution: Technologies like Kubernetes and Docker can help virtualize the cloud. They do for public cloud what VMware did for servers. Let me just walk you through how it might work… (Now you have your chance to educate them!)


“How can I get management to buy into my idea about cloud?” is the right question. Cloud is the future.

You just need to know how to approach management. Don’t be “Dr. No” or “Dr. Slow”. That’s what they don’t like about IT. They’ve fought for cloud and they want people who will fight for them and their success.

Give them:

  • Agile Business Value
  • Problem
  • Solution

And if you think you’re saying “Agile” too often… you’re not. Don’t roll your eyes. Agility is the rare buzzword that actually delivers value to the business.

Agile is Awesome.

Why is Storage Management So Painful?

or What Do All Those Storage Admins Do?

If you want to understand the vastness of data, don’t measure it in petabytes. Measure the number of people managing your storage.

In 2015, I walked into a customer’s New Jersey IT center. Their head of storage pointed at an endless sea of cubes and boasted, “That’s our storage team area.” The VP of Infrastructure (a networking guy) asked in amazement, “What do all these people do?” The head of storage chuckled (in that storage person way), “We keep the business running.”

My conclusion:

  1. If that many people work on it, storage management is a really big deal.
  2. If that many people work on it, storage management is really broken.

Why is storage management so painful and expensive? Why do we need all those people and what do they do all day?

The project management "iron triangle" drives people to drink.

Cost, Performance, Capacity — Pick 2

Imagine a storage continuum. On one end is “fast”. On the other is “big”. Each application wants a different point on that continuum. To make it worse, in the real world that continuum is multi-dimensional. There isn’t just one way to define “performance” and one way to define “capacity”.

Cost vs. Performance

Real World Performance Challenge: Applications define “performance” differently.

Some need storage to respond quickly to a transaction (e.g. a payroll application). Others want to do many transactions at the same time (e.g. online purchasing system). Still others want to process oceans of data (e.g. data analytics).

Real World Performance Answer: Many types of storage arrays.

Vendors offer Tier-1, All-Flash, Hybrid Disk-Flash, Scale-Out NAS, Deduplication, and other types of arrays because each type of storage solves a different performance challenge.

Storage teams need people with special expertise to manage each type of array.

Cost vs. Capacity

Real World Capacity Challenge: Companies don’t want to pay to store unimportant data.

Everything generates data, and nobody deletes any of it. It piles up like garbage during a Parisian sanitation strike. Why waste the high-performance storage on old data? Move it somewhere cheaper.

Real World Capacity Answer: Many Types of Tiering

Some businesses move pieces of data (e.g. Carl in Accounting’s vacation photos from Decorah, Iowa) to cheaper storage. Others move whole applications (e.g. the performance review application we stopped using in 2012). Some tier storage to cloud. Some tier to other arrays. Some still tier to tape.

Storage teams need people with special expertise to manage tiering.

Storage teams balance requirements for performance, capacity, and cost. With a lower budget every quarter.

Sometimes hardware just ... breaks. Courtesy: Office Space

Availability vs. Cost

Storage hardware breaks. Components get old, power surges, or disasters hit the site. Sometimes, the hardware fails. Other times, it returns incorrect data.

Unlike servers, you can’t “just get a new one” because that hardware holds your data. You either need to have a copy of it elsewhere or figure out how to get it back from the broken device.

Storage systems use error-correction codes, RAID, mirroring, and other reliability techniques. Some protect you from more failures than others. Those usually cost more.

Storage teams need people with special expertise to manage different availability configurations.

And we wonder why requirements change. Courtesy:

Storage Teams Plan, Users Laugh

Then users come in.

All application owners want from storage is:

  • Performance specific to their application’s needs
  • Capacity specific to their application’s needs
  • At the lowest cost
  • Without any errors or downtime

But they don’t know:

  • How much capacity they’ll need.
  • How much performance they’ll need.

Still, let’s pretend that the storage team has done the impossible. They built a stable, cost-optimized storage environment that meets all their users’ needs!

Then the requirements change because:

  • Business priorities changed
  • Government regulations changed
  • The application itself changed
  • An executive ate a blueberry muffin instead of a blueberry bagel for breakfast.

Another weekend of migrations? Noooooo! Courtesy: LA Beast

The Road to Hell is Paved with Storage Migration

All storage administrators end up in the land of storage migration. It hurts worse than rubbing your eyes while chopping ghost peppers. We all hate the “application downtime for maintenance over the weekend” email. Guess what? The storage admin, working all weekend to move all the data, hates it a lot more than you.

As one storage admin said, “I start planning to migrate an application the day I deploy it.”

Storage teams need people to maintain the environment.

What storage management feels like today. Courtesy: WheresMyChallenge on YouTube

Can We Solve This Problem?

With shrinking budgets, companies and admins can’t survive managing storage this way. It’s too slow, too complex, and too expensive.

In the last decade, we have simplified storage management:

  • All-Flash Arrays: These arrays reduce the effort to manage storage performance. Despite the marketing, All-Flash cannot handle all workloads at the best cost.
  • Hyperconverged: With flash storage inside the server, you can eliminate the storage management. Everything is “good enough”. It works well in smaller deployments.
  • Data Analytics: Analytics (e.g. Nimble’s InfoSight) can recommend optimizations.
  • Software-Defined: Storage admins can change storage configurations more easily.

We’ve been making storage simpler, but it’s not simple, yet.

When storage can meet the users’ changing requirements without administrator involvement, then we shall be free.

Until then, storage management is like sprinting on a treadmill covered in Legos. You can’t win; you just try to not lose the will to keep running.

Cloud Didn’t Kill Storage

But it did just lose your data

At a big IT event, I asked, “If you care about storage management, raise your hand.” I saw a smattering of hands.

Next, “Put your hand down if you aren’t a storage admin.”

One hand stayed in the air.

I asked the lone holdout, “Why do you care about storage management?”

“What? No. I need a raffle ticket. There’s a raffle at the end of this talk, right?”

Storage management is the least appreciated part of IT infrastructure… which is saying something. Business users understand server and network issues. Security and backup teams overwhelm them with graphic horror stories. The storage team just gets a budget cut and requirements to store twice as much data.

An average day for a Storage Administrator. Photo Courtesy: Expedia Norway

What is Storage Management?

When you buy storage, you have to consider (at least) five factors:

  1. Capacity — How much data do I want to store?
  2. Durability — How much do I not want to lose my data?
  3. Availability — How long can I go without being able to read or write my data?
  4. Performance — How fast do I need to get at my data?
  5. Cost — How much am I willing to pay?

Different products optimize for different factors. That’s why you see hundreds of storage products on the market. It’s why you see individual vendors sell dozens of storage products. There is no one-size-fits-all product.

Storage management is meeting the storage needs of all the different business applications.

Storage management is also knowing that you’ll always fall short.

If only the cloud storage choices were so obvious... Courtesy:

Doesn’t Cloud Make Storage Management Go Away?


The same five factors still matter to your applications, even in the cloud.

That’s why cloud providers offer different types of storage. AWS alone offers:

  1. Local Storage (3 types): Hard Drive, Flash, NVMe Flash
  2. Block Storage (4 types): IOPS Flash (io1), General Purpose Flash (gp2), Throughput Optimized Hard Drive (st1), and Cold Hard Drive (sc1)
  3. Object Storage (4 types): Standard S3, Standard Infrequently Accessed S3, One-Zone Infrequently Accessed S3, Glacier

That’s 11 types of storage for one cloud. Has your head started to spin?
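If it helps to see the menu as data, here is a minimal Python sketch of the AWS list above. The names come straight from the list; the `survives_instance_stop` flag is my own simplification (local storage disappears with the node, while block and object storage live on independently), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageOption:
    category: str                  # "local", "block", or "object"
    name: str                      # name as listed in the text
    survives_instance_stop: bool   # simplification: does data outlive the node?


# The 11 AWS storage types listed in the text.
CATALOG = [
    StorageOption("local", "Hard Drive", False),
    StorageOption("local", "Flash", False),
    StorageOption("local", "NVMe Flash", False),
    StorageOption("block", "IOPS Flash (io1)", True),
    StorageOption("block", "General Purpose Flash (gp2)", True),
    StorageOption("block", "Throughput Optimized Hard Drive (st1)", True),
    StorageOption("block", "Cold Hard Drive (sc1)", True),
    StorageOption("object", "Standard S3", True),
    StorageOption("object", "Standard Infrequently Accessed S3", True),
    StorageOption("object", "One-Zone Infrequently Accessed S3", True),
    StorageOption("object", "Glacier", True),
]


def count_by_category(catalog):
    """Count how many storage types exist in each category."""
    counts = {}
    for option in catalog:
        counts[option.category] = counts.get(option.category, 0) + 1
    return counts


def persistent_options(catalog):
    """Keep only the options whose data outlives the instance."""
    return [o for o in catalog if o.survives_instance_stop]


print(count_by_category(CATALOG))
print(len(CATALOG), "types in total;",
      len(persistent_options(CATALOG)), "keep data after the node is gone")
```

Even this toy filter uses only one of the five factors. A real choice also has to weigh capacity, availability, performance, and cost for every application.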

It gets worse. You face all the old storage management challenges plus some new ones.

Wait, Cloud Makes Storage Management MORE Important?


Cost Overruns — Overprovisioning

It’s easy for application teams to run up a massive cloud storage bill.

On-premises environments built up checks-and-balances. The application team asks the storage management team for storage resources. The storage team makes them justify the request. The storage management team needs to buy more hardware. Purchasing makes them justify the request. The process slows down the application team, but it prevents reckless consumption and business surprises.

Cloud environments wipe away the checks-and-balances. Application owners pick a type of cloud storage for their application. The cloud provider tells them how much performance they get for each GB of capacity. They see that it costs a dime or less a month. They don’t have to ask anyone for approval. So they buy a big pool of storage. Why not? Turn on the faucet. It’s cheap.

Then they need more capacity. So they buy more. Turn on the faucet again. And it’s cheap.

Courtesy: Raphael Cushnir

Then they need more performance. So they buy more. Turn on the faucet again and again and again. And it’s still cheap.

The bill for 100s of TBs of storage comes in at the end of the month. Nobody turned off the faucet. They’ve run up the water bill and flooded the house.

It’s not cheap anymore.
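To put rough numbers on the faucet, here is a small Python sketch. The price of a dime per GB per month comes from the story above; the decimal terabyte conversion, the capacity steps, and the function name are my assumptions for illustration, not real provider pricing.

```python
# Hypothetical numbers for illustration only; real cloud pricing varies
# by provider, region, and storage type.
PRICE_PER_GB_MONTH = 0.10   # "a dime" per GB per month, as in the text
GB_PER_TB = 1000            # decimal terabytes (an assumption)


def monthly_storage_bill(provisioned_tb: float,
                         price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Return the monthly bill in dollars for the provisioned capacity."""
    return provisioned_tb * GB_PER_TB * price_per_gb_month


# Each "turn of the faucet" provisions more capacity.
for tb in (1, 10, 100, 500):
    bill = monthly_storage_bill(tb)
    print(f"{tb:>4} TB provisioned -> ${bill:,.2f} per month")
```

At a dime per GB, one terabyte costs about $100 a month, which nobody notices. A few hundred terabytes costs tens of thousands of dollars a month, which Finance notices.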

Cost Overruns — Storage Silos

It’s easy for application teams to waste money in the cloud.

Twenty years ago, each server had its own storage. Server A used 1% of its storage. Server B used 100% of its storage and ran out. Server B couldn’t use Server A’s storage. Then storage teams adopted shared storage, SAN or NAS. Now they can give storage resources to any application or system that needs it. Shared storage eliminates the waste from the islands of server storage.

Today, cloud environments don’t share storage.

When I create a cloud instance (aka server), I buy storage for it.

When I create a second cloud instance, I buy storage for it.

When I create a third cloud instance, I buy storage for it!

When I … you get the idea.

We’ve re-created the “island of storage” problem. Except at cloud scale, application teams end up with island chains of storage. Even Larry Ellison doesn’t have archipelago money.

Data Loss — Trusting the Wrong Type of Storage

It’s easy for application teams to lose their data in the cloud.

On-premises environments use resilient shared storage systems. Application owners don’t think about RAID, mirroring, checksums, and data consistency tools. That’s because the storage management team does.

In the cloud, picking storage is like a “Choose Your Own Adventure” story, where you always lose:

  • You choose local storage. When the node goes away, so does your storage. One day you shut down the node to save money. You lost all your data. Start over.
  • You choose block storage. AWS states, “Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% — 0.2%, where failure refers to a complete or partial loss of the volume.” You’ll lose a volume – i.e. an application’s data. Start over.
  • You choose S3 object storage. You chose resilient storage! You win! **

** Oops. Your app can’t run with slow performance. You lost your customers. Start over.

NOTE: Backups can help. But you should have resilient production storage and backups. Backups should be your last resort, not your first option.
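To see why a 0.1% to 0.2% annual failure rate still means "you'll lose a volume", here is a rough sketch of the odds across a fleet. It assumes volume failures are independent, which is a simplification, and the fleet sizes are made up for illustration.

```python
def prob_any_volume_lost(volume_count, annual_failure_rate):
    """Probability that at least one volume fails within a year.

    Assumes each volume fails independently with the same annual
    failure rate (a simplifying assumption, not an AWS statement).
    """
    return 1 - (1 - annual_failure_rate) ** volume_count


for volumes in (1, 100, 1000):
    low = prob_any_volume_lost(volumes, 0.001)   # 0.1% AFR
    high = prob_any_volume_lost(volumes, 0.002)  # 0.2% AFR
    print(f"{volumes:>5} volumes: {low:.1%} to {high:.1%} chance of losing at least one per year")
```

Under these assumptions, a fleet of 1,000 volumes is more likely than not to lose at least one volume in any given year.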

It's not too late to manage storage for containers in the cloud. Courtesy: tee

When Should I Plan for Cloud Storage Management?


It’s time to start thinking about storage management in the cloud.

Today, you’re running cloud applications that don’t need to store data. Or that don’t care if they lose data. Or you’ve just been lucky.

The time to manage cloud storage is coming. You need to start planning now. It can save your applications, your business and your career.

Oh, and don’t spend time worrying about the raffle. They’re always rigged for the customer with the biggest deal pending, anyway.

Why Can’t You Find a Good IT Job?

The Risks for the Five Types of IT Infrastructure Jobs

It hurts to hunt for a job in IT infrastructure right now. Every rejection finds new ways to embarrass and frustrate you. Even the offers carry painful tradeoffs. Cloud has changed the job options for infrastructure engineers. There are no perfect jobs, but there are opportunities.

I’ve seen five types of companies hiring infrastructure engineers. Each has rewards and risks.

Legacy Whale Infrastructure Companies haven't been doing so well. Photo Credit:

Legacy Whale — On-Premises Tech Giants

Legacy infrastructure companies put profit over growth — including yours. Their market may be shrinking, but they still run enterprises’ most important applications. The last company standing will charge a premium for their technology. That’s why the legacy giants need engineers to deliver products for their core markets.

The positives:

  • Salary. They pay good salaries from their profit margins.
  • Enterprise Experience. You learn how to work with a mature product for enterprise customers.

The risks:

  • Layoffs. Profit comes when you earn more than you spend. Products that need incremental development don’t need expensive engineers.
  • Stagnation. You’re working on the same product for the same customers. Everything is incremental. You’re missing sweeping technical and business trends.
  • Left too Late. If you stay too long, interviewers will wonder why. Were you too lazy to move? Too comfortable? Nobody wanted you?

The legacy whales can be a lucrative home, and they teach you how to work with big customers. You just have to ask, “When is the right time to jump ship?”

On-premises startups need to move faster than even Chuck Norris. Photo credit:

Legacy Piranha — On-Premises Startups

Legacy piranha companies have to grow fast. The legacy market may be shrinking, but it’s still huge. The legacy whales can’t always move fast enough to block small companies (either with technology or sales). Some piranhas can eat enough of the whales to IPO or get bought.

The positives:

  • System View. You design products from scratch, so you can see new parts of the system.
  • Customer Experience. In a smaller company, you can work directly with customers.
  • Financial Upside. If the company takes off, so does your equity.

The risks:

  • Limited Growth. Piranhas need you to do what you’ve done before. The race is on, and they can’t afford to train you on something else.
  • No Market. In a shrinking market, everything has to be perfect. The product. The go-to-market. And you need the whale to miss you. For every Pure or Rubrik, there are a dozen Tintri and Primary Data.

The legacy piranhas can be an exciting gamble. You can see the whole system and work with hands-on customers. You just have to ask, “What happens if this fails?”

The Big 3 Cloud Providers are hungry for more. Photo Credit: killer-whale.org

Killer Whales — The Big 3 in Public Cloud

The killer whales (AWS, Azure, Google Cloud) control the new ocean of IT infrastructure. They’re taking share in the growing market of public cloud. The customers, requirements, and technology are different from the legacy environment. Their scale dwarfs even the largest enterprises. The problems are the same, but the rules are different.

The positives:

  • New Technology. Killer whales mix commodity technology with bleeding edge. They must innovate to stay ahead.
  • New Perspective. The scale is orders of magnitude greater than what we’re used to. The integration of the stack eliminates our siloed view.
  • Growth. The killer whales can afford to pay and give new opportunities.

The risks:

  • Getting Hired. They have their pick of new hires. They may see your experience as a limitation, since they want to build things in a new way.
  • Succeeding. The environment is different. The way you did things won’t work. They’re moving fast. You’re going to be very uncomfortable.
  • Limited Customer Interaction. At their scale, it’s difficult to get direct customer interaction. You’re one of the masses building for the masses.

The killer whales will be an exciting ride that sets you up for the future. You just have to ask, “Am I ready?”

Inside the belly of a whale, you can go places. But you're still stuck inside a whale. Photo Credit: Pinocchio.

Inside the Blue Whales — Joining IT

Some of the biggest companies in the world build their own IT infrastructure. They create some of the most interesting infrastructure innovation (e.g. Yahoo, Google, Facebook, Medtronic, Tesla). Nothing makes infrastructure requirements more real than building an application on top of it.

The positives:

  • New Technology. You’re building custom technology because vendors’ products don’t work for your company.
  • New Perspective. The scale and integration with business applications changes how you view infrastructure.
  • Growth. You could move from infrastructure to building the application.

The risks:

  • Getting Hired and Making a Difference. See “Killer Whales”.
  • You’re a Cost Center. When you build the product, you are the business. When you provide services for the product, you’re a cost center. At Morgan Stanley, an IT member advised me, “Don’t work here. We’re the most innovative technical company on Wall Street, but we’re still the help. The traders are the business. Never be the help.”

The Blue Whales are technology users that push the boundaries of infrastructure. You just have to ask, “Am I comfortable being a cost center?”

Building services on top of public cloud is exhilarating and terrifying. Photo Credit: Sarah Kim

Riding the Killer Whales — Building on the Public Cloud

The Killer Whales can’t do everything well. No matter how quickly they hire, they can’t build decades of functionality in a few years. Furthermore, nobody wants to lock into one Killer Whale. They know how that story ends. That’s why companies are adding multi-cloud infrastructure services on top of the public cloud.

The positives:

  • New Technology. You’re riding the new technology, trying to tame it.
  • New Perspective. You learn how companies are trying to use public cloud and what challenges they face. You can see how they’re evolving from legacy to public cloud.
  • Upside. If the company takes off, so do you. You’re the expert in a new market area. Oh, and the financial equity will be rewarding, too.

The risks:

  • No Market. You have the traditional startup concerns (funding, customers, competitors) and more. You worry that the killer whales will add your functionality as a free service. You worry that the killer whales will break your product with their newest APIs. Riding killer whales is scary!
  • Financial Downside. Low salary. Even lower job security.

Some Killer Whale Riders will become the next great technology infrastructure companies. You just have to ask, “How much risk am I comfortable with?”

It only felt like Moses led me to Nuvoloso. Photo Credit: The Ten Commandments


A decade ago, even incompetent IT infrastructure vendors could grow 10% a year because the market was so strong. No more. Today, there are no infrastructure jobs without risk. Of course, there are still great opportunities.

I’m riding the killer whales because I’d gotten disconnected from new technology and new customer challenges. The risk is terrifying, but I’ve never been happier. The choice was right for me.

What did you choose and why?

Backup Sucks, Why Can’t We Move On?

Solving the Mystery of “Why is backup so painful?”

“Tape Sucks, Move On” (Data Domain)

“Don’t Backup. Go Forward.” (Rubrik)

“Don’t even mention backup in our slogan” (Every other company)

Everybody hates backup — executives, users, and administrators. Even backup companies hate it (at least their slogan writers do). Organizations run backup only because they have to protect the business. I’ve met hundreds of frustrated backup customers who have tried snapshots, backup appliances, cloud, backup as a service, and scores of other “fixes”. They all ask one question –

“Why is backup so painful?!?”


Performance: “I’m Givin’ Her All She’s Got, Captain!”

No matter how hard you run, backup isn’t fast enough. Photo Credit: Mission Impossible 4

Backup is painful because it is slow and there is so much data.

Companies expect the backup team to:

  1. Back up PBs of data for thousands of applications every day
  2. Not affect application performance (compute, network, and storage)
  3. Spend less on the backup infrastructure (and team)
  4. Rinse and Repeat next year with twice as much data

Everybody underestimates the cost of backups. While I was at EMC, a federal agency (no way I’m naming this one) complained about their backup performance. In their words, “The data trickles like an old man’s piss.” They were using less than 1% of the Data Domain’s performance. Their production environment, however, was running harder than Tom Cruise (and just as slow). When they set up their application environment, they hadn’t thought about backup. To meet their application and backup SLAs, they had to buy 4x the equipment and run backups 24 hours a day. NOTE: Unless you can pay for IT gear with tax dollars, I would not depend on that approach.

Backups run for a long time and they use a lot of resources. Teams have to balance application performance with backup SLAs across vast oceans of data. It’s an impossible balancing act. That’s why backup schedules are so complex.
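A rough calculation makes the balancing act concrete. The sketch below assumes a full copy of the data has to move inside the backup window, with no deduplication or incremental tricks, and uses decimal units (1 PB = 1,000,000 GB). The 1 PB figure and the window lengths are illustrative.

```python
def required_throughput_gb_per_s(data_pb, window_hours):
    """Sustained throughput (GB/s) needed to move data_pb petabytes
    within window_hours.

    Assumes every byte is copied once and uses decimal units
    (1 PB = 1,000,000 GB).
    """
    data_gb = data_pb * 1_000_000
    window_seconds = window_hours * 3600
    return data_gb / window_seconds


# Hypothetical windows: an overnight slot versus running around the clock.
for window_hours in (8, 24):
    rate = required_throughput_gb_per_s(1, window_hours)
    print(f"1 PB in {window_hours} h needs about {rate:.1f} GB/s sustained")
```

Every GB/s the backup consumes comes out of the same compute, network, and storage that the applications are using.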

Backup will be painful until we solve the performance problem. Imagine that you could make a backup in an instant. You could make a simple schedule (e.g. hourly) and not worry. Users could create extra copies whenever they wanted. Backup would be painless!

That was the promise of snapshots. Of course, they ran into the next problem.

Offsite backups are painful. Image Credit:

Multiple Offsite Copies: “Scotty, Beam Us Up”

Backup is painful because you need to keep many offsite copies.

Companies expect their backup teams to:

  1. Store daily backups, so they can restore data from any day from the past months or years
  2. Restore the applications if something happens to the hardware, the data center, or the region
  3. Spend less on the backup infrastructure (and team)

That’s why snapshots were never enough. Customers who lost their production system lost their snapshots. Replicating snapshots to a second array didn’t solve the problem, either…

At NetApp, a sales representative asked me to calm Bear Stearns. The director of IT complained that the backup solution (SnapVault to another NetApp system) cost more than the production environment. “You’re lucky that we don’t have to worry about money at Bear Stearns.” (Good times!) Then, he peppered me with questions about exotic failures (e.g. hash collisions, solar flares, and quantum bit flips). Our salesman had asked me to “distract him” from these phantasms, so I did. “I wouldn’t worry about those issues. We’re way more likely to corrupt data with a software bug. And that would corrupt your production and backup copies.” The blood drained from the customer’s face and he stopped asking questions (Mission accomplished!). As we left, the salesman snarled, “Next time, try to distract the customer by saying something good about our product.”

Companies store backups on alternate media (tape, dedupe disk, cloud) for reliability at a reasonable cost. That’s why backup software translates data into proprietary formats tuned for that media. The side effect is that only your backup software can read those copies. Result: Backup vendor lock-in!

Backup will be painful until we can solve the problems of performance and storing offsite copies. Imagine that you could make a resilient, secure offsite backup in an instant. You could make a simple schedule and recover from anything. Backup would be painless!

Until, of course, you met an application owner.

Nobody trusts the backup silo. Image Source:

Silos: “Resistance is Futile”

Backup is painful because you have to connect the backup process to the application teams.

Companies expect their backup teams to:

  1. Work across all applications in the environment
  2. Respond quickly to application requests
  3. Spend less on the backup infrastructure (and team)

As difficult as technology is, connecting people is even more challenging. Application owners don’t trust what they can’t see or control.

DBA vs. Backup Admin

At one EMC World, I hosted a session for backup administrators and DBAs. At first, it was a productive discussion. One DBA explained, “If you can’t recover the database, it’s still my application that’s down. That scares me.” The group started brainstorming ways to give DBAs more visibility into the backups. Then a DBA blurted out, “I just can’t trust you guys with my database backups. You became backup admins because you weren’t smart enough to be DBAs. I’m going to keep making my own local database dumps.” After that, we decided to try to solve the wrestling feud between Bret Hart and Shawn Michaels instead. It seemed more productive.

Companies need to manage complex backup schedules and create offsite copies. That’s why we have backup software. Backup software and schedules are so complex that companies hired backup teams to manage them. That extra layer is why business application owners don’t trust the backups.

Backup will be painful until application teams can trust and verify the backups of their applications.

Cloud is going to change things for backup. Image Source:

Moving On? “I canna’ change the laws of physics”

Why is backup so painful?

It’s slow and expensive. It locks you into a backup vendor. It creates a backup silo that slows the business down. Other than that, backup is great.

Why have 25 years of innovative companies not eliminated the pain of backup?

Because we couldn’t change the laws of physics in the data center. Too much data. Too expensive to get data offsite. Too hard to connect backup teams and application teams.

Why am I optimistic for the future?

Because the cloud changes the laws of physics for backup. We can stop tweaking backup and finally fix it. We’ll save that mystery for next time.

Merry Misadventures in the Public Cloud

Seven Costly Cloud Catastrophes in Seven Days

My first Amazon Web Services (AWS) bill shocked and embarrassed me. I feared I was the founding member of the “Are you &#%& serious, that’s my cloud bill?” club. I wasn’t. If you’ve recently joined, don’t worry. It’s growing every day.

The cloud preyed on my worst IT habits. I act without thinking. I overestimate the importance of my work (aka rampaging ego). I don’t clean up after myself. (Editor’s note: These bad habits extend beyond IT). The cloud turned those bad habits into zombie systems driving my bill to horrific levels.

When I joined Nuvoloso, I wanted to prove myself to the team. I volunteered to benchmark cloud storage products. All I needed to do was learn how to use AWS, Kubernetes, and Docker, so I could then install and test products I’d never heard of. I promised results in seven days. It’s amazing how much damage you can do in a week.


Sometimes too much is too much. Photo Credit: Danny Sullivan

Overprovisioning - Acting without Thinking

I overprovisioned my environment by 100x. The self-imposed urgency gave me an excuse to take shortcuts. Since I believed my on-premises storage expertise would apply to cloud, I ran full speed into my first two mistakes.

Mistake 1: Overprovisioned node type.

AWS has dozens of compute node configurations. Who has time to read all those specs? I was benchmarking storage, so I launched 5 “Storage Optimized” instances. Oops. They’re called “Storage Optimized” nodes because they offer better local storage performance. The cloud storage products don’t use local storage. I paid a 50% premium because I only read the label.

Mistake 2: Overprovisioned storage.

You buy on-premises storage in 10s or 100s of TB, so that’s how I bought cloud storage. I set a 4 TB quota of GP2 (AWS’ flash storage) for each of the 5 nodes — 20TB in total. The storage products, which had been built for on-premises environments, allocated all the storage. In fact, they doubled the allocation to do mirroring. In less than 5 minutes, I was paying for 40TB. It gets worse. The benchmark only used 40GB of data. I had so much capacity that the benchmark didn’t measure the performance of the products. I paid a 1000x premium for worthless results!
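The arithmetic behind that 1000x figure is worth spelling out. This sketch uses only the numbers from the story (5 nodes, a 4 TB quota each, mirroring that doubles the allocation, and a 40 GB benchmark data set) and assumes decimal units (1 TB = 1,000 GB). The function name is just for illustration.

```python
def overprovision_summary(nodes, quota_tb_per_node, mirror_copies, benchmark_data_gb):
    """Return (requested TB, allocated TB, overprovisioning factor).

    The factor compares the capacity being paid for with the data the
    benchmark actually touched. Assumes 1 TB = 1,000 GB.
    """
    requested_tb = nodes * quota_tb_per_node
    allocated_tb = requested_tb * mirror_copies
    factor = (allocated_tb * 1000) / benchmark_data_gb
    return requested_tb, allocated_tb, factor


requested_tb, allocated_tb, factor = overprovision_summary(
    nodes=5, quota_tb_per_node=4, mirror_copies=2, benchmark_data_gb=40
)
print(f"Requested {requested_tb} TB, paying for {allocated_tb} TB, "
      f"{factor:.0f}x more than the benchmark used")
```

Running the same numbers before clicking "create" would have taken less time than reading the bill.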

Eventually, you have to clean up the mess. Photo Credit: Reuters

Just Allocate A New Cluster - Ego

I allocated 4x as many Kubernetes clusters as I needed.

When you’re trying new products, you make mistakes. With on-premises systems, you have to fix the problem to make progress. You can’t ignore your burning tire fire and reserve new lab systems. If you try, your co-workers will freeze your car keys in mayonnaise (or worse).

The cloud eliminates resource constraints and peer pressure. You can always get more systems!

Mistakes 3 & 4: “I’ll Debug that Later” / “Don’t Touch it, You’ll Break It!”

Day 1: Tuesday. I made mistakes setting up a 5-node Kubernetes cluster. I told myself I’d debug the issue later.

Day 2: Wednesday. I made mistakes installing a storage product on a new Kubernetes cluster. I told myself I’d debug the issue later.

Day 3: Thursday. I made mistakes installing the benchmark on yet another Kubernetes cluster running the storage. I told myself that I’d debug the issue later.

Day 4: Friday. Everything worked on the 4th cluster, and I ran my tests. I told myself that I was awesome.

Days 5 & 6: Weekend. I told myself that I shouldn’t touch the running cluster because it took so long to set up. Somebody might want me to do something with it on Monday. Oh, and I’d debug the issues I’d hit later.

Day 7: Monday. I saw my bill. I told myself that I’d better clean up NOW.

In one week, I had created 4 mega-clusters that generated worthless benchmark results and no debug information.

"Terminate Instance" - I do not think it means what you think it means. Photo Credit: Princess Bride

Clicking Delete Doesn't Mean It's Gone - Cleaning up after Myself

After cleaning up, I still paid for 40TB of storage for a week and 1 cluster for a month.

The maxim, “Nothing is ever deleted on the Internet” applies to the cloud. It’s easy to leave remnants behind, and those remnants can cost you.

Mistake 5: Cleaning up a Kubernetes cluster via the AWS GUI.

My horror story began when I terminated all my instances from the AWS console. As I was logging out, AWS spawned new instances to replace the old ones! I shut those down. More new ones came back. I deleted a subset of nodes. They came back. I spent two hours screaming silently, “Why won’t you die?!?!” Then I realized that the nodes kept spawning because that’s what Kubernetes does. It keeps your applications running, even when nodes fail. A search showed that deleting the AWS Auto Scaling Group would end my nightmare. (Today, I use kops to create and delete Kubernetes clusters).

Mistake 6: Deleting Instances does not always delete storage

After deleting the clusters, I looked for any excuse not to log into the cloud. When you work at a cloud company, you can’t hide out for long. A week later, I logged into AWS for more punishment. I saw that I still had lots of storage (aka volumes). Deleting the instances hadn’t deleted the storage! The storage products I’d tested did not select the AWS option to delete the volume when terminating the node. I needed to delete the volumes myself.
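A small script can spot these orphans before the bill does. The sketch below is a minimal illustration, not a cleanup tool: it filters a list of volume records for ones that nothing is attached to. The field names (`VolumeId`, `Size`, `State`, `Attachments`) follow the shape of the EC2 DescribeVolumes response as I understand it, and the sample records are invented. In practice you would fetch the real list per region with the AWS CLI or an SDK.

```python
def find_unattached_volumes(volumes):
    """Return (volume_id, size_gib) for volumes that no instance is using.

    Each record is assumed to look like an entry in the EC2
    DescribeVolumes response: a volume with State "available" and an
    empty Attachments list is still billed but attached to nothing.
    """
    leftovers = []
    for volume in volumes:
        if volume.get("State") == "available" and not volume.get("Attachments"):
            leftovers.append((volume["VolumeId"], volume["Size"]))
    return leftovers


# Invented sample data standing in for one region's volume listing.
sample_volumes = [
    {"VolumeId": "vol-0aaa", "Size": 4000, "State": "available", "Attachments": []},
    {"VolumeId": "vol-0bbb", "Size": 100, "State": "in-use",
     "Attachments": [{"InstanceId": "i-0123", "DeleteOnTermination": True}]},
    {"VolumeId": "vol-0ccc", "Size": 4000, "State": "available", "Attachments": []},
]

for volume_id, size_gib in find_unattached_volumes(sample_volumes):
    print(f"{volume_id}: {size_gib} GiB is still billed with nothing attached")
```

The same check has to run in every region you have used, because each region keeps its own list of volumes.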

Mistake 7: Clean Up Each Region

I created my first cluster in Northern Virginia. I’ve always liked that area. When I found out that AWS charges more for Northern Virginia, I made my next 3 clusters in Oregon. The AWS console splits the view by region. You guessed it. While freaking out about undead clusters, I forgot to delete the cluster in Northern Virginia! When the next month’s sky-high bill arrived, I corrected my final mistake (of that first week).

Welcome to the Family

Cloud can feel imaginary until that first bill hits you. Then things get real, solid, and painful. When that happens, welcome to the family of cloud experts! Cloud changes how we consume, deploy, and run IT. We’re going to make mistakes (hopefully not 7 catastrophic mistakes in one week), but we’ll learn together. I’m glad to be part of the cloud family. I don’t want to face those undead clusters alone. Bring your boomstick.

Cloud Data Protection is Business Protection

In the Cloud, Data Protection Matters More Than Ever

Cleaning up after data loss. Photo Credit:

“As I waded through a lake of rancid yogurt, each vile step fueled my rage over failed backups.” The server that ran a yogurt manufacturer’s automated packaging facility crashed. The IT team could recover some of the data, but not all. They hoped everything would “be OK”. When they restarted the production line, they learned that hope is not a plan. Machines sprayed yogurt like a 3-year-old with a hose in a crowded church. By the time they shut down the line, they’d created a yogurt lake. It took two months to clean and re-certify the factory. They missed their quarterly earnings. People lost their jobs.

Data protection matters because data recovery matters. Even in the cloud. Especially in the cloud.

Businesses Run on Data

Digital transformation has turned every company into an application business.

Have you ever thought about the lightbulb business? Osram manufactured lightbulbs for almost a century. Then, LED bulbs decimated the lightbulb replacement business. Osram evolved into a lighting solution company. Osram applications optimize customers’ lighting for their houses, businesses, and stadiums. Now a high-tech company, they sold the traditional lightbulb manufacturing business in 2017.

How about the fruit business? Driscoll’s has grown berries for almost 150 years. In 2016, berries were the largest and fastest growing retail produce. Driscoll’s leads the market. They credit their Driscoll’s Delight Platform. It tracks and manages the berries from the first mile (growing) through the middle miles (shipping) to the last mile (retail consumer). Driscoll’s analyzes data at every stage to optimize the production and consumption of berries. Driscoll’s is a technology company that sells berries.

Every company is in the application business. Applications need data. To design lighting, Osram uses data about your house. To deliver the best berries, Driscoll’s analyzes data about the farms (e.g. soil, climate), shipping (e.g. temperature and route), and customer preferences.

Modern businesses depend on applications. Applications depend on data. Therefore, modern businesses depend on data.

Driscoll's: A tech company that sells berries. Credit: Jacks Sachs, The New Yorker

Flooded data centers cause data loss. Photo Credit: Unknown

Data Protection: Because of Bad Things and Bad People

Every company protects their data center because there are so many ways to lose data.

CIOs have seen their companies suffer through catastrophes. Hurricane Harvey flooded Houston data centers. Hardware fails and sometimes catches fire. Software bugs corrupt data. People delete the presentation before the biggest meeting of their lives, then throw a stone at a wasps’ nest to incite a swarm and get rushed to the hospital with dozens of vicious stings, just to have an excuse to reschedule (or so I’ve heard).

IT organizations have also survived deliberate attacks. External hackers strike for fun and profit. Ransomware has become mainstream; cyber criminals can now subscribe to Ransomware as a Service! Now, anybody can become a hacker. Some attacks happen from inside, too. A terminated contractor at an Arizona bank destroyed racks of systems with a pickaxe. (I’ll never forget the dumbfounded CIO muttering, “We think he brought the pickaxe from home.” Because that’s what mattered.)

After decades of enduring data loss, IT knows to protect the data center. Do we also need to protect data in the cloud?


A data center destroyed by a fire. Photo Credit:

Data Protection: The Cloud Has Bad Things and Bad People

Every company needs to protect their data in the cloud because there are even more ways to lose it.

Bad things happen in the cloud. First, users still make mistakes. The cloud provider is not responsible for recovering from user error. Second, the cloud is still built of hardware and software that can fail. Vendors explain, “Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% — 0.2%, where failure refers to a complete or partial loss of the volume.” The applications you lose may be unimportant… or they may decimate your business. Third, since you are sharing resources, performance issues can affect data access. Amazon Prime Day is the most recent example. Finally, storms trigger data loss in a public cloud data center, just like they do in a corporate data center.

Public clouds are a bigger target for bad actors. Aggressive nations (with names that rhyme with Russia, Iran, North Korea, and China), bitcoin miners, and traditional criminals hack companies running in the cloud. Those hacks obliterate companies. Hackers deleted Code Spaces’ data in AWS. Two days later, the business shut down. Meanwhile, the scope of the public cloud makes internal threats more serious. The pickaxe (or virus)-wielding employee can now damage hundreds of companies instead of one!

Data is not any safer in the cloud than it is on-premises. Cloud providers try to protect your data, but it’s not enough. Even in the cloud, it’s your data. It’s your business. It’s your responsibility.

Protect the Cloud Data, Protect the Business

Modern businesses run on applications. Applications run on data. Most companies that lose data go out of business in 6 months or less.

Unfortunately, bad things and bad people destroy, steal, or disable access to the data. Whether you run on-premises or in the cloud, one day you will lose data. If you have a good backup and disaster recovery solution, you can recover the data. Your business can survive.

Amazon CTO Werner Vogels declared, “Everything fails all the time.” Companies need to protect their data in the cloud, so they can recover from those failures. Now, more than ever.

Data Protection - Business Protection in the Cloud. Photo Credit: