Cloud Didn’t Kill Storage

At a big IT event, I asked, “If you care about storage management, raise your hand.” I saw a smattering of hands.

Next, “Put your hand down if you aren’t a storage admin.”

One hand stayed in the air.

I asked the lone holdout, “Why do you care about storage management?”

“What? No. I need a raffle ticket. There’s a raffle at the end of this talk, right?”

Storage management is the least appreciated part of IT infrastructure… which is saying something. Business users understand server and network issues. Security and backup teams overwhelm them with graphic horror stories. The storage team just gets a budget cut and requirements to store twice as much data.

What is Storage Management?

When you buy storage, you have to consider (at least) five factors:

  1. Capacity — How much data do I want to store?
  2. Durability — How much do I not want to lose my data?
  3. Availability — How long can I go without being able to read or write my data?
  4. Performance — How fast do I need to get at my data?
  5. Cost — How much am I willing to pay?

Different products optimize for different factors. That’s why you see hundreds of storage products on the market. It’s why you see individual vendors sell dozens of storage products. There is no one-size-fits-all product.

Storage management is balancing those five factors to meet the storage needs of all the different business applications.

Storage management is also knowing that you’ll always fall short.

Doesn’t Cloud Make Storage Management Go Away?

No.

The same five factors still matter to your applications, even in the cloud.

That’s why cloud providers offer different types of storage. AWS alone offers:

  1. Local Storage (3 types) — Hard Drive, Flash, NVMe Flash
  2. Block Storage (4 types) — IOPS Flash (io1), General Purpose Flash (gp2), Throughput Optimized Hard Drive (st1), and Cold Hard Drive (sc1)
  3. Object Storage (4 types) — Standard S3, Standard Infrequently Accessed S3, One-Zone Infrequently Accessed S3, Glacier

That’s 11 types of storage for one cloud. Has your head started to spin?
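Those types aren’t marketing labels; they’re parameters you pass on every API call. Here’s a minimal sketch using Python’s boto3 SDK (the region, sizes, and bucket name are made-up examples):

```python
import boto3

# A minimal sketch of how the menu shows up in the APIs. The region,
# sizes, and bucket name below are made-up examples.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Block storage: VolumeType picks io1, gp2, st1, or sc1.
# (io1 would also require an Iops=... argument.)
ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=100,          # GiB
    VolumeType="gp2",
)

# Object storage: StorageClass picks the S3 tier.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-example-bucket",      # hypothetical bucket
    Key="report.csv",
    Body=b"some,data\n",
    StorageClass="STANDARD_IA",      # or "ONEZONE_IA", "GLACIER"
)
```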

It gets worse. You face all the old storage management challenges plus some new ones.

Wait, Cloud Makes Storage Management MORE Important?

Yes.

Cost Overruns — Overprovisioning

It’s easy for application teams to run up a massive cloud storage bill.

On-premises environments built up checks and balances. The application team asks the storage team for storage resources, and the storage team makes them justify the request. When the storage team needs to buy more hardware, purchasing makes them justify that request. The process slows down the application team, but it prevents reckless consumption and business surprises.

Cloud environments wipe away those checks and balances. Application owners pick a type of cloud storage for their application. The cloud provider tells them how much performance they get for each GB of capacity, and that each GB costs a dime or less a month. They don’t have to ask anyone for approval. So they buy a big pool of storage. Why not? Turn on the faucet. It’s cheap.

Then they need more capacity. So they buy more. Turn on the faucet again. And it’s cheap.

Then they need more performance. So they buy more. Turn on the faucet again and again and again. And it’s still cheap.

The bill for 100s of TBs of storage comes in at the end of the month. Nobody turned off the faucet. They’ve run up the water bill and flooded the house.

It’s not cheap anymore.
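The faucet math is brutal. A back-of-the-envelope sketch (the $0.10 per GB-month price is illustrative, roughly gp2 territory; check current pricing):

```python
# Back-of-the-envelope faucet math. The $0.10/GB-month price is
# illustrative (roughly gp2 territory); check current pricing.
PRICE_PER_GB_MONTH = 0.10

for tb in (1, 10, 100, 500):
    gb = tb * 1000
    print(f"{tb:>4} TB provisioned -> ${gb * PRICE_PER_GB_MONTH:>9,.2f}/month")

#    1 TB provisioned -> $   100.00/month
#   10 TB provisioned -> $ 1,000.00/month
#  100 TB provisioned -> $10,000.00/month
#  500 TB provisioned -> $50,000.00/month
```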

Cost Overruns — Storage Silos

It’s easy for application teams to waste money in the cloud.

Twenty years ago, each server had its own storage. Server A used 1% of its storage. Server B used 100% of its storage and ran out. Server B couldn’t use Server A’s storage. Then storage teams adopted shared storage (SAN or NAS), which let them give storage resources to any application or system that needed it. Shared storage eliminated the waste from those islands of server storage.

Today, cloud environments don’t share storage.

When I create a cloud instance (aka server), I buy storage for it.

When I create a second cloud instance, I buy storage for it.

When I create a third cloud instance, I buy storage for it!

When I … you get the idea.

We’ve re-created the “island of storage” problem. Except at cloud scale, application teams end up with island chains of storage. Even Larry Ellison doesn’t have archipelago money.
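If you want to count your own islands, here’s a rough audit sketch in Python with boto3 (the region is an example, and pagination is elided, so it’s only honest for small accounts):

```python
import boto3
from collections import defaultdict

# A rough audit sketch: tally provisioned EBS capacity per instance to
# count your islands. Region is an example; pagination is elided, so
# this is only honest for small accounts.
ec2 = boto3.client("ec2", region_name="us-west-2")

islands = defaultdict(int)
for vol in ec2.describe_volumes()["Volumes"]:
    for att in vol["Attachments"]:
        islands[att["InstanceId"]] += vol["Size"]  # GiB

for instance_id, gib in sorted(islands.items(), key=lambda kv: -kv[1]):
    print(f"{instance_id}: {gib} GiB, visible only to that one instance")
```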

Data Loss — Trusting the Wrong Type of Storage

It’s easy for application teams to lose their data in the cloud.

On-premises environments use resilient shared storage systems. Application owners don’t think about RAID, mirroring, checksums, and data consistency tools. That’s because the storage management team does.

In the cloud, picking storage is like a “Choose Your Own Adventure” story, where you always lose:

  • You choose local storage. When the node goes away, so does your storage. One day you shut down the node to save money. You lost all your data. Start over.
  • You choose block storage. AWS states, “Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% — 0.2%, where failure refers to a complete or partial loss of the volume.” Run enough volumes, and you will lose one this year (see the quick math after this list). You lost an application’s data. Start over.
  • You choose S3 object storage. You chose resilient storage! You win! **

** Your application can’t run with such slow performance. You lost your customers. Start over.
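Quick math on that AFR quote. With N independent volumes, the chance of losing at least one this year is 1 − (1 − AFR)^N. A sketch:

```python
# Quick math on that AFR quote: with N volumes, the chance of losing
# at least one this year is 1 - (1 - AFR)^N (assuming independence).
def p_at_least_one_loss(afr: float, volumes: int) -> float:
    return 1 - (1 - afr) ** volumes

for n in (1, 50, 500):
    low, high = p_at_least_one_loss(0.001, n), p_at_least_one_loss(0.002, n)
    print(f"{n:>3} volumes: {low:.1%} to {high:.1%} chance of a loss this year")

#   1 volumes: 0.1% to 0.2% chance of a loss this year
#  50 volumes: 4.9% to 9.5% chance of a loss this year
# 500 volumes: 39.4% to 63.2% chance of a loss this year
```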

NOTE: Backups can help. But you should have resilient production storage and backups. Backups should be your last resort, not your first option.

When Should I Plan for Cloud Storage Management?

Now.

It’s time to start thinking about storage management in the cloud.

Today, you’re running cloud applications that don’t need to store data. Or that don’t care if they lose data. Or you’ve just been lucky.

The time to manage cloud storage is coming. You need to start planning now. It can save your applications, your business and your career.

Oh, and don’t spend time worrying about the raffle. They’re always rigged for the customer with the biggest deal pending, anyway.

Merry Misadventures in the Public Cloud

My first Amazon Web Services (AWS) bill shocked and embarrassed me. I feared I was the founding member of the “Are you &#%& serious, that’s my cloud bill?” club. I wasn’t. If you’ve recently joined, don’t worry. It’s growing every day.

The cloud preyed on my worst IT habits. I act without thinking. I overestimate the importance of my work (aka rampaging ego). I don’t clean up after myself. (Editor’s note: These bad habits extend beyond IT). The cloud turned those bad habits into zombie systems driving my bill to horrific levels.

When I joined Nuvoloso, I wanted to prove myself to the team. I volunteered to benchmark cloud storage products. All I needed to do was learn how to use AWS, Kubernetes, and Docker, so I could then install and test products I’d never heard of. I promised results in seven days. It’s amazing how much damage you can do in a week.

Overprovisioning — Acting without Thinking

I overprovisioned my environment by 100x. The self-imposed urgency gave me an excuse to take shortcuts. Since I believed my on-premises storage expertise would apply to cloud, I ran full speed into my first two mistakes.

Mistake 1: Overprovisioned node type.

AWS has dozens of compute node configurations. Who has time to read all those specs? I was benchmarking storage, so I launched 5 “Storage Optimized” instances. Oops. They’re called “Storage Optimized” nodes because they offer better local storage performance. The cloud storage products don’t use local storage. I paid a 50% premium because I only read the label.

Mistake 2: Overprovisioned storage.

You buy on-premises storage in 10s or 100s of TB, so that’s how I bought cloud storage. I set a 4 TB quota of gp2 (AWS’ flash storage) for each of the 5 nodes — 20 TB in total. The storage products, which had been built for on-premises environments, allocated all the storage. In fact, they doubled the allocation to do mirroring. In less than 5 minutes, I was paying for 40 TB. It gets worse. The benchmark only used 40 GB of data. I had so much capacity that the benchmark didn’t measure the performance of the products. I paid a 1000x premium for worthless results!
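For what it’s worth, the homework I skipped takes about five lines today. A sketch with boto3 (the two instance types are just examples) that asks EC2 what a node actually gives you before you pay for the label:

```python
import boto3

# The five minutes of homework I skipped, sketched with boto3: ask EC2
# what an instance type actually gives you before you pay for the label.
# The two instance types are just examples.
ec2 = boto3.client("ec2", region_name="us-west-2")

resp = ec2.describe_instance_types(InstanceTypes=["i3.2xlarge", "m5.2xlarge"])
for it in resp["InstanceTypes"]:
    local = it.get("InstanceStorageInfo", {}).get("TotalSizeInGB", 0)
    print(
        f'{it["InstanceType"]}: '
        f'{it["VCpuInfo"]["DefaultVCpus"]} vCPUs, '
        f'{it["MemoryInfo"]["SizeInMiB"] // 1024} GiB RAM, '
        f'{local} GB local storage'
    )
```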

Just Allocate A New Cluster — Ego

I allocated 4x as many Kubernetes clusters as I needed.

When you’re trying new products, you make mistakes. With on-premises systems, you have to fix the problem to make progress. You can’t ignore your burning tire fire and reserve new lab systems. If you try, your co-workers will freeze your car keys in mayonnaise (or worse).

The cloud eliminates resource constraints and peer pressure. You can always get more systems!

Mistakes 3 & 4: “I’ll Debug that Later” / “Don’t Touch it, You’ll Break It!”

Day 1: Tuesday. I made mistakes setting up a 5-node Kubernetes cluster. I told myself I’d debug the issue later.

Day 2: Wednesday. I made mistakes installing a storage product on a new Kubernetes cluster. I told myself I’d debug the issue later.

Day 3: Thursday. I made mistakes installing the benchmark on yet another Kubernetes cluster running the storage. I told myself that I’d debug the issue later.

Day 4: Friday. Everything worked on the 4th cluster, and I ran my tests. I told myself that I was awesome.

Days 5 & 6: Weekend. I told myself that I shouldn’t touch the running cluster because it took so long to set up. Somebody might want me to do something with it on Monday. Oh, and I’d debug the issues I’d hit later.

Day 7: Monday. I saw my bill. I told myself that I’d better clean up NOW.

In one week, I had created 4 mega-clusters that generated worthless benchmark results and no debug information.

Clicking Delete Doesn’t Mean It’s Gone — Cleaning up after Myself

After cleaning up, I still paid for 40 TB of storage for a week and 1 cluster for a month.

The maxim, “Nothing is ever deleted on the Internet” applies to the cloud. It’s easy to leave remnants behind, and those remnants can cost you.

Mistake 5: Cleaning up a Kubernetes cluster via the AWS GUI.

My horror story began when I terminated all my instances from the AWS console. As I was logging out, AWS spawned new instances to replace the old ones! I shut those down. More new ones came back. I deleted a subset of nodes. They came back. I spent two hours screaming silently, “Why won’t you die?!?!” Then I realized that the nodes kept spawning because that’s what Kubernetes does. It keeps your applications running, even when nodes fail. A search showed that deleting the AWS Auto Scaling Group would end my nightmare. (Today, I use kops to create and delete Kubernetes clusters).
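For the record, here’s roughly what ended it, sketched in Python with boto3 (the group name is hypothetical): scale the Auto Scaling Group to zero so nothing is left to replace the nodes, then delete the group itself.

```python
import boto3

# What finally ended the respawning, sketched with boto3 (the group
# name is hypothetical): scale the Auto Scaling Group to zero so
# nothing is left to replace the nodes, then delete the group itself.
autoscaling = boto3.client("autoscaling", region_name="us-west-2")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-k8s-nodes",   # hypothetical ASG name
    MinSize=0,
    MaxSize=0,
    DesiredCapacity=0,
)
autoscaling.delete_auto_scaling_group(
    AutoScalingGroupName="my-k8s-nodes",
    ForceDelete=True,   # terminate any stragglers along with the group
)
```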

Mistake 6: Deleting Instances does not always delete storage

After deleting the clusters, I looked for any excuse not to log into the cloud. When you work at a cloud company, you can’t hide out for long. A week later, I logged into AWS for more punishment. I saw that I still had lots of storage (aka volumes). Deleting the instances hadn’t deleted the storage! The storage products I’d tested did not select the AWS option to delete the volume when terminating the node. I needed to delete the volumes myself.
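If you inherit the same mess, here’s a sketch of that cleanup with boto3 (region is an example): list the volumes sitting in the “available” state, i.e., attached to nothing, then delete them once you’ve eyeballed the list.

```python
import boto3

# The cleanup I had to do by hand, sketched with boto3: find volumes
# sitting in the "available" state (attached to nothing) and delete
# them. Print first; uncomment the delete once you've checked the list.
ec2 = boto3.client("ec2", region_name="us-west-2")

orphans = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]

for vol in orphans:
    print(f"orphaned: {vol['VolumeId']} ({vol['Size']} GiB)")
    # ec2.delete_volume(VolumeId=vol["VolumeId"])
```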

Mistake 7: Clean Up Each Region

I created my first cluster in Northern Virginia. I’ve always liked that area. When I found out that AWS charges more for Northern Virginia, I made my next 3 clusters in Oregon. The AWS console splits the view by region. You guessed it. While freaking out about undead clusters, I forgot to delete the cluster in Northern Virginia! When the next month’s sky-high bill arrived, I corrected my final mistake (of that first week).
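The habit I learned the hard way, sketched with boto3: sweep every region, because the console only shows you one at a time.

```python
import boto3

# The habit I learned the hard way, sketched with boto3: sweep every
# region, because the console only shows one at a time. Pagination is
# elided for brevity.
ec2 = boto3.client("ec2", region_name="us-east-1")

for region in ec2.describe_regions()["Regions"]:
    name = region["RegionName"]
    regional = boto3.client("ec2", region_name=name)
    instances = sum(
        len(r["Instances"])
        for r in regional.describe_instances()["Reservations"]
    )
    volumes = len(regional.describe_volumes()["Volumes"])
    if instances or volumes:
        print(f"{name}: {instances} instances, {volumes} volumes still alive")
```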

Welcome to the Family

Cloud can feel imaginary until that first bill hits you. Then things get real, solid, and painful. When that happens, welcome to the family of cloud experts! Cloud changes how we consume, deploy, and run IT. We’re going to make mistakes (hopefully not 7 catastrophic mistakes in one week), but we’ll learn together. I’m glad to be part of the cloud family. I don’t want to face those undead clusters alone. Bring your boomstick.