Cloud Didn’t Kill Storage

At a big IT event, I asked, “If you care about storage management, raise your hand.” I saw a smattering of hands.

Next, “Put your hand down if you aren’t a storage admin.”

One hand stayed in the air.

I asked the lone holdout, “Why do you care about storage management?”

“What? No. I need a raffle ticket. There’s a raffle at the end of this talk, right?”

Storage management is the least appreciated part of IT infrastructure… which is saying something. Business users understand server and network issues. Security and backup teams overwhelm them with graphic horror stories. The storage team just gets a budget cut and requirements to store twice as much data.

What is Storage Management?

When you buy storage, you have to consider (at least) five factors:

  1. Capacity — How much data do I want to store?
  2. Durability — How much do I not want to lose my data?
  3. Availability — How long can I go without being able to read or write my data?
  4. Performance — How fast do I need to get at my data?
  5. Cost — How much am I willing to pay?

Different products optimize for different factors. That’s why you see hundreds of storage products on the market. It’s why you see individual vendors sell dozens of storage products. There is no one-size-fits-all product.

Storage management is meeting the storage needs of all the different business applications.

Storage management is also knowing that you’ll always fall short.

Doesn’t Cloud Make Storage Management Go Away?

No.

The same five factors still matter to your applications, even in the cloud.

That’s why cloud providers offer different types of storage. AWS alone offers:

  1. Local Storage (3 types): Hard Drive, Flash, NVMe Flash
  2. Block Storage (4 types): IOPS Flash (io1), General Purpose Flash (gp2), Throughput Optimized Hard Drive (st1), and Cold Hard Drive (sc1)
  3. Object Storage (4 types): Standard S3, Standard Infrequently Accessed S3, One-Zone Infrequently Accessed S3, Glacier

That’s 11 types of storage for one cloud. Has your head started to spin?

It gets worse. You face all the old storage management challenges plus some new ones.

Wait, Cloud Makes Storage Management MORE Important?

Yes.

Cost Overruns — Overprovisioning

It’s easy for application teams to run up a massive cloud storage bill.

On-premises environments built up checks and balances. The application team asks the storage management team for storage resources, and the storage team makes them justify the request. When the storage team needs to buy more hardware, purchasing makes them justify that request, too. The process slows down the application team, but it prevents reckless consumption and business surprises.

Cloud environments wipe away the checks and balances. Application owners pick a type of cloud storage for their application. The cloud provider tells them how much performance they get for each GB of capacity. They see that it costs a dime or less per GB each month. They don’t have to ask anyone for approval. So they buy a big pool of storage. Why not? Turn on the faucet. It’s cheap.

Then they need more capacity. So they buy more. Turn on the faucet again. And it’s cheap.

Then they need more performance. So they buy more. Turn on the faucet again and again and again. And it’s still cheap.

The bill for 100s of TBs of storage comes in at the end of the month. Nobody turned off the faucet. They’ve run up the water bill and flooded the house.

It’s not cheap anymore.
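
A back-of-the-envelope sketch of how the faucet adds up. The price per GB-month is an assumption for illustration (roughly a general purpose flash price; actual prices vary by region and storage type):

```python
# Rough arithmetic only; the per-GB price is an assumption, not a quote.
PRICE_PER_GB_MONTH = 0.10   # illustrative "a dime per GB" price

for tb in (1, 10, 100, 500):
    gb = tb * 1024
    monthly = gb * PRICE_PER_GB_MONTH
    print(f"{tb:>4} TB -> ${monthly:>9,.0f} per month")

# A dime per GB sounds cheap. 500 TB of "cheap" is about $51,200 a month.
```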

Cost Overruns — Storage Silos

It’s easy for application teams to waste money in the cloud.

Twenty years ago, each server had its own storage. Server A used 1% of its storage. Server B used 100% of its storage and ran out. Server B couldn’t use Server A’s storage. Then storage teams adopted shared storage, SAN or NAS, so they could give storage resources to any application or system that needed it. Shared storage eliminated the waste from those islands of server storage.

Today, cloud environments don’t share storage.

When I create a cloud instance (aka server), I buy storage for it.

When I create a second cloud instance, I buy storage for it.

When I create a third cloud instance, I buy storage for it!

When I … you get the idea.

We’ve re-created the “island of storage” problem. Except at cloud scale, application teams end up with island chains of storage. Even Larry Ellison doesn’t have archipelago money.

Data Loss — Trusting the Wrong Type of Storage

It’s easy for application teams to lose their data in the cloud.

On-premises environments use resilient shared storage systems. Application owners don’t think about RAID, mirroring, checksums, and data consistency tools. That’s because the storage management team does.

In the cloud, picking storage is like a “Choose Your Own Adventure” story, where you always lose:

  • You choose local storage. When the node goes away, so does your storage. One day you shut down the node to save money. You lost all your data. Start over.
  • You choose block storage. AWS states, “Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% — 0.2%, where failure refers to a complete or partial loss of the volume.” At some point this year you’ll lose a volume (the sketch below does the math). You lost an application’s data. Start over.
  • You choose S3 object storage. You chose resilient storage! You win! **

** Your application can’t run with such slow performance. You lost your customers. Start over.

NOTE: Backups can help. But you should have resilient production storage and backups. Backups should be your last resort, not your first option.
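
To put that quoted AFR in perspective, here is a small sketch. The AFR range comes from the AWS statement above; the volume counts are my own illustration:

```python
# Expected failures per year = volumes * AFR.
# Probability of at least one failure = 1 - (1 - AFR) ** volumes.
for afr in (0.001, 0.002):          # the 0.1%-0.2% range AWS quotes
    for volumes in (10, 100, 1000):
        expected = volumes * afr
        p_any = 1 - (1 - afr) ** volumes
        print(f"AFR {afr:.1%}, {volumes:>4} volumes: "
              f"expect {expected:.1f} failures/year, "
              f"P(at least one) = {p_any:.0%}")
```

Run an application estate with a thousand volumes and “at some point this year” stops being hyperbole.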

When Should I Plan for Cloud Storage Management?

Now.

It’s time to start thinking about storage management in the cloud.

Today, you’re running cloud applications that don’t need to store data. Or that don’t care if they lose data. Or you’ve just been lucky.

The time to manage cloud storage is coming. You need to start planning now. It can save your applications, your business and your career.

Oh, and don’t spend time worrying about the raffle. They’re always rigged for the customer with the biggest deal pending, anyway.

Why Can’t You Find a Good IT Job?

It hurts to hunt for a job in IT infrastructure right now. Every rejection finds new ways to embarrass and frustrate you. Even the offers carry painful tradeoffs. Cloud has changed the job options for infrastructure engineers. There are no perfect jobs, but there are opportunities.

I’ve seen five types of companies hiring infrastructure engineers. Each has rewards and risks.

Legacy Whale — On-Premises Tech Giants

Legacy infrastructure companies put profit over growth — including yours. Their market may be shrinking, but they still run enterprises’ most important applications. The last company standing will charge a premium for their technology. That’s why the legacy giants need engineers to deliver products for their core markets.

The positives:

  • Salary. They pay good salaries from their profit margins.
  • Enterprise Experience. You learn how to work with a mature product for enterprise customers.

The risks:

  • Layoffs. Profit comes when you earn more than you spend. Products that need incremental development don’t need expensive engineers.
  • Stagnation. You’re working on the same product for the same customers. Everything is incremental. You’re missing sweeping technical and business trends.
  • Left too Late. If you stay too long, interviewers will wonder why. Were you too lazy to move? Too comfortable? Nobody wanted you?

The legacy whales can be a lucrative home, and they teach you how to work with big customers. You just have to ask, “When is the right time to jump ship?”

Legacy Piranha — On-Premises Startups

Legacy piranha companies have to grow fast. The legacy market may be shrinking, but it’s still huge. The legacy whales can’t always move fast enough to block small companies (either with technology or sales). Some piranhas can eat enough of the whales to IPO or get bought.

The positives:

  • System View: You design products from scratch, so you can see new parts of the system.
  • Customer Experience: In a smaller company, you can work directly with customers.
  • Financial Upside: If the company takes off, so does your equity.

The risks:

  • Limited growth: Piranhas need you to do what you’ve done before. The race is on, and they can’t afford to train you on something else.
  • No market: In a shrinking market, everything has to be perfect. The product. The go-to-market. And you need the whale to miss you. For every Pure or Rubrik, there are a dozen Tintri and Primary Data.

The legacy piranhas can be an exciting gamble. You can see the whole system and work with hands-on customers. You just have to ask, “What happens if this fails?”

Killer Whales — The Big 3 in Public Cloud

The killer whales (AWS, Azure, Google Cloud) control the new ocean of IT infrastructure. They’re taking share in the growing market of public cloud. The customers, requirements, and technology are different from the legacy environment. Their scale dwarfs even the largest enterprises. The problems are the same, but the rules are different.

The positives:

  • New Technology: Killer whales mix commodity technology with bleeding edge. They must innovate to stay ahead.
  • New Perspective: The scale is orders of magnitude greater than what we’re used to. The integration of the stack eliminates our siloed view.
  • Growth: The killer whales can afford to pay and give new opportunities.

The risks:

  • Getting Hired: They have their pick of new hires. They may see your experience as a limitation, since they want to build things in a new way.
  • Succeeding: The environment is different. The way you did things won’t work. They’re moving fast. You’re going to be very uncomfortable.
  • Limited Customer Interaction: At their scale, it’s difficult to get direct customer interaction. You’re one of the masses building for the masses.

The killer whales will be an exciting ride that sets you up for the future. You just have to ask, “Am I ready?”

Inside the Blue Whales — Joining IT

Some of the biggest companies in the world build their own IT infrastructure. They create some of the most interesting infrastructure innovation (e.g. Yahoo, Google, Facebook, Medtronic, Tesla). Nothing makes infrastructure requirements more real than building an application on top of it.

The positives:

  • New Technology: You’re building custom technology because vendors’ products don’t work for them.
  • New Perspective: The scale and integration with business applications changes how you view infrastructure.
  • Growth: You could move from infrastructure to building the application.

The risks:

  • Getting Hired and Making a Difference. See “Killer Whales”.
  • You’re a Cost Center: When you build the product, you are the business. When you provide services for the product, you’re a cost center. At Morgan Stanley, an IT member advised me, “Don’t work here. We’re the most innovative technical company on Wall Street, but we’re still the help. The traders are the business. Never be the help.”

The Blue Whales are technology users that push the boundaries of infrastructure. You just have to ask, “Am I comfortable being a cost center?”

Riding the Killer Whales — Building on the Public Cloud

The Killer Whales can’t do everything well. No matter how quickly they hire, they can’t build decades of functionality in a few years. Furthermore, nobody wants to lock into one Killer Whale. They know how that story ends. That’s why companies are adding multi-cloud infrastructure services on top of the public cloud.

The positives:

  • New Technology: You’re riding the new technology, trying to tame it.
  • New Perspective: You learn how companies are trying to use public cloud and what challenges they face. You can see how they’re evolving from legacy to public cloud.
  • Upside: If the company takes off, so do you. You’re the expert in a new market area. Oh, and the financial equity will be rewarding, too.

The risks:

  • No Market: You have the traditional startup concerns (funding, customers, competitors) and more. You worry that the killer whales will add your functionality as a free service. You worry that the killer whales will break your product with their newest APIs. Riding killer whales is scary!
  • Financial Downside: Low salary. Even lower job security.

Some Killer Whale Riders will become the next great technology infrastructure companies. You just have to ask, “How much risk am I comfortable with?”

Conclusion

A decade ago, even incompetent IT infrastructure vendors could grow 10% a year because the market was so strong. No more. Today, there are no infrastructure jobs without risk. Of course, there are still great opportunities.

I’m riding the killer whales because I’d gotten disconnected from new technology and new customer challenges. The risk is terrifying, but I’ve never been happier. The choice was right for me.

What did you choose and why?

Backup Sucks, Why Can’t We Move On?

“Tape Sucks, Move On” (Data Domain)

“Don’t Backup. Go Forward.” (Rubrik)

“Don’t even mention backup in our slogan” (Every other company)

Everybody hates backup — executives, users, and administrators. Even backup companies hate it (at least their slogan writers do). Organizations run backup only because they have to protect the business. I’ve met hundreds of frustrated backup customers who have tried snapshots, backup appliances, cloud, backup as a service, and scores of other “fixes”. They all ask one question:

“Why is backup so painful?!?”

Performance: “I’m Givin’ Her All She’s Got, Captain!”

Backup is painful because it is slow and there is so much data.

Companies expect the backup team to:

  1. Back up PBs of data for thousands of applications every day
  2. Not affect application performance (compute, network, and storage)
  3. Spend less on the backup infrastructure (and team)
  4. Rinse and Repeat next year with twice as much data

Everybody underestimates the cost of backups. While I was at EMC, a federal agency (no way I’m naming this one) complained about their backup performance. In their words, “The data trickles like an old man’s piss.” They were using less than 1% of their Data Domain’s performance. Their production environment, however, was running harder than Tom Cruise (and just as slow). When they set up their application environment, they hadn’t thought about backup. To meet their application and backup SLAs, they had to buy 4x the equipment and run backups 24 hours a day. NOTE: Unless you can pay for IT gear with tax dollars, I would not depend on that approach.

Backups run for a long time and they use a lot of resources. Teams have to balance application performance with backup SLAs across vast oceans of data. It’s an impossible balancing act. That’s why backup schedules are so complex.
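
A quick sketch of why the balancing act fails. The petabyte comes from the expectations above; the 8-hour nightly window and the 10 Gb/s backup network are my own illustrative assumptions:

```python
# Illustrative numbers; your data set and backup window will differ.
data_bytes = 1e15                  # 1 PB to protect
window_seconds = 8 * 3600          # an 8-hour nightly backup window

required_throughput = data_bytes / window_seconds      # bytes/sec
print(f"Sustained throughput needed: {required_throughput / 1e9:.0f} GB/s")

# Compare to a dedicated 10 Gb/s backup network (~1.25 GB/s at best).
link_throughput = 10e9 / 8
print(f"10 Gb/s links required: {required_throughput / link_throughput:.0f}")
```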

Backup will be painful until we solve the performance problem. Imagine that you could make a backup in an instant. You could make a simple schedule (e.g. hourly) and not worry. Users could create extra copies whenever they wanted. Backup would be painless!

That was the promise of snapshots. Of course, they ran into the next problem.

Multiple Offsite Copies: “Scotty, Beam Us Up”

Backup is painful because you need to keep many offsite copies.

Companies expect their backup teams to:

  1. Store daily backups, so they can restore data from any day over the past months or years
  2. Restore the applications if something happens to the hardware, the data center, or the region.
  3. Spend less on the backup infrastructure (and team)

That’s why snapshots were never enough. Customers who lost their production system lost their snapshots. Replicating snapshots to a second array didn’t solve the problem, either…

At NetApp, a sales representative asked me to calm Bear Stearns. The director of IT complained that the backup solution (SnapVault to another NetApp system) cost more than the production environment. “You’re lucky that we don’t have to worry about money at Bear Stearns.” (Good times!) Then, he peppered me with questions about exotic failures, e.g. hash collisions, solar flares, and quantum bit flips. Our salesman had asked me to “distract him” from these phantasms, so I did. “I wouldn’t worry about those issues. We’re way more likely to corrupt data with a software bug. And that would corrupt your production and backup copies.” The blood drained from the customer’s face and he stopped asking questions (Mission accomplished!). As we left, the salesman snarled, “Next time, try to distract the customer by saying something good about our product.”

Companies store backups on alternate media (tape, dedupe disk, cloud) for reliability at a reasonable cost. That’s why backup software translates data into proprietary formats tuned for that media. The side effect is that only your backup software can read those copies. Result: Backup vendor lock-in!

Backup will be painful until we can solve the problems of performance and storing offsite copies. Imagine that you could make a resilient, secure offsite backup in an instant. You could make a simple schedule and recover from anything. Backup would be painless!

Until, of course, you met an application owner.

Silos: “Resistance is Futile”

Backup is painful because you have to connect the backup process to the application teams.

Companies expect their backup teams to:

  1. Work across all applications in the environment
  2. Respond quickly to application requests
  3. Spend less on the backup infrastructure (and team)

As difficult as technology is, connecting people is even more challenging. Application owners don’t trust what they can’t see or control.

At one EMC World, I hosted a session for backup administrators and DBAs. At first, it was a productive discussion. One DBA explained, “If you can’t recover the database, it’s still my application that’s down. That scares me.” The group started brainstorming ways to give DBAs more visibility into the backups. Then a DBA blurted out, “I just can’t trust you guys with my database backups. You became backup admins because you weren’t smart enough to be DBAs. I’m going to keep making my own local database dumps.” After that, we decided to try to solve the wrestling feud between Bret Hart and Shawn Michaels instead. It seemed more productive.

Companies need to manage complex backup schedules and create offsite copies. That’s why we have backup software. Backup software and schedules are so complex that companies hired backup teams to manage them. That extra layer is why business application owners don’t trust the backups.

Backup will be painful until application teams can trust and verify the backups of their applications.

Moving On? “I canna’ change the laws of physics”

Why is backup so painful?

It’s slow and expensive. It locks you into a backup vendor. It creates a backup silo that slows the business down. Other than that, backup is great.

Why have 25 years of innovative companies not eliminated the pain of backup?

Because we couldn’t change the laws of physics in the data center. Too much data. Too expensive to get data offsite. Too hard to connect backup teams and application teams.

Why am I optimistic for the future?

Because the cloud changes the laws of physics for backup. We can stop tweaking backup and finally fix it. We’ll save that mystery for next time.

Merry Misadventures in the Public Cloud

My first Amazon Web Services (AWS) bill shocked and embarrassed me. I feared I was the founding member of the “Are you &#%& serious, that’s my cloud bill?” club. I wasn’t. If you’ve recently joined, don’t worry. It’s growing every day.

The cloud preyed on my worst IT habits. I act without thinking. I overestimate the importance of my work (aka rampaging ego). I don’t clean up after myself. (Editor’s note: These bad habits extend beyond IT). The cloud turned those bad habits into zombie systems driving my bill to horrific levels.

When I joined Nuvoloso, I wanted to prove myself to the team. I volunteered to benchmark cloud storage products. All I needed to do was learn how to use AWS, Kubernetes, and Docker, so I could then install and test products I’d never heard of. I promised results in seven days. It’s amazing how much damage you can do in a week.

Overprovisioning — Acting without Thinking

I overprovisioned my environment by 100x. The self-imposed urgency gave me an excuse to take shortcuts. Since I believed my on-premises storage expertise would apply to cloud, I ran full speed into my first two mistakes.

Mistake 1: Overprovisioned node type.

AWS has dozens of compute node configurations. Who has time to read all those specs? I was benchmarking storage, so I launched 5 “Storage Optimized” instances. Oops. They’re called “Storage Optimized” nodes because they offer better local storage performance. The cloud storage products don’t use local storage. I paid a 50% premium because I only read the label.

Mistake 2: Overprovisioned storage.

You buy on-premises storage in 10s or 100s of TB, so that’s how I bought cloud storage. I set a 4 TB quota of GP2 (AWS’ flash storage) for each of the 5 nodes — 20TB in total. The storage products, which had been built for on-premises environments, allocated all the storage. In fact, they doubled the allocation to do mirroring. In less than 5 minutes, I was paying for 40TB. It gets worse. The benchmark only used 40GB of data. I had so much capacity that the benchmark didn’t measure the performance of the products. I paid a 1000x premium for worthless results!
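
The arithmetic of that mistake, using the same illustrative ~$0.10 per GB-month assumption as before (the capacity numbers are the ones from the story):

```python
PRICE_PER_GB_MONTH = 0.10           # illustrative price, not a quote

provisioned_gb = 4 * 1024 * 5 * 2   # 4 TB quota x 5 nodes x 2 for mirroring = 40 TB
used_gb = 40                        # what the benchmark actually wrote

print(f"Provisioned:   {provisioned_gb:,} GB "
      f"(~${provisioned_gb * PRICE_PER_GB_MONTH:,.0f}/month)")
print(f"Actually used: {used_gb} GB")
print(f"Overprovisioning factor: {provisioned_gb // used_gb}x")
```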

Just Allocate A New Cluster — Ego

I allocated 4x as many Kubernetes clusters as I needed.

When you’re trying new products, you make mistakes. With on-premises systems, you have to fix the problem to make progress. You can’t ignore your burning tire fire and reserve new lab systems. If you try, your co-workers will freeze your car keys in mayonnaise (or worse).

The cloud eliminates resource constraints and peer pressure. You can always get more systems!

Mistakes 3 & 4: “I’ll Debug that Later” / “Don’t Touch it, You’ll Break It!”

Day 1: Tuesday. I made mistakes setting up a 5-node Kubernetes cluster. I told myself I’d debug the issue later.

Day 2: Wednesday. I made mistakes installing a storage product on a new Kubernetes cluster. I told myself I’d debug the issue later.

Day 3: Thursday. I made mistakes installing the benchmark on yet another Kubernetes cluster running the storage. I told myself that I’d debug the issue later.

Day 4: Friday. Everything worked on the 4th cluster, and I ran my tests. I told myself that I was awesome.

Days 5 & 6: Weekend. I told myself that I shouldn’t touch the running cluster because it took so long to set up. Somebody might want me to do something with it on Monday. Oh, and I’d debug the issues I’d hit later.

Day 7: Monday. I saw my bill. I told myself that I’d better clean up NOW.

In one week, I had created 4 mega-clusters that generated worthless benchmark results and no debug information.

Clicking Delete Doesn’t Mean It’s Gone — Cleaning up after Myself

After cleaning up, I still paid for 40TB of storage for a week and 1 cluster for a month.

The maxim, “Nothing is ever deleted on the Internet” applies to the cloud. It’s easy to leave remnants behind, and those remnants can cost you.

Mistake 5: Cleaning up a Kubernetes cluster via the AWS GUI.

My horror story began when I terminated all my instances from the AWS console. As I was logging out, AWS spawned new instances to replace the old ones! I shut those down. More new ones came back. I deleted a subset of nodes. They came back. I spent two hours screaming silently, “Why won’t you die?!?!” Then I realized that the nodes kept spawning because that’s what Kubernetes does. It keeps your applications running, even when nodes fail. A search showed that deleting the AWS Auto Scaling Group would end my nightmare. (Today, I use kops to create and delete Kubernetes clusters).
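
If you ever fight the same hydra, this is roughly the call that finally ends it, sketched with boto3. The region and Auto Scaling group name are placeholders; kops-built clusters use their own naming convention:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

# The nodes keep coming back because the Auto Scaling group's desired
# capacity is still set. Delete the group (ForceDelete terminates its
# remaining instances) and the respawning stops.
autoscaling.delete_auto_scaling_group(
    AutoScalingGroupName="nodes.my-cluster.example.com",  # placeholder name
    ForceDelete=True,
)
```

These days, kops delete cluster tears down the Auto Scaling groups along with everything else kops created, which is the saner path.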

Mistake 6: Deleting Instances does not always delete storage

After deleting the clusters, I looked for any excuse not to log into the cloud. When you work at a cloud company, you can’t hide out for long. A week later, I logged into AWS for more punishment. I saw that I still had lots of storage (aka volumes). Deleting the instances hadn’t deleted the storage! The storage products I’d tested did not select the AWS option to delete the volume when terminating the node. I needed to delete the volumes myself.
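
A minimal boto3 sketch of the cleanup I ended up doing by hand. It finds volumes that are attached to nothing; the region is a placeholder, and you should only run the delete loop if nothing on the list matters:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # placeholder region

# "available" means the volume exists, bills every month, and is
# attached to nothing: exactly what my deleted instances left behind.
orphans = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]

for volume in orphans:
    print(f"Deleting {volume['VolumeId']} ({volume['Size']} GiB)")
    ec2.delete_volume(VolumeId=volume["VolumeId"])

# Setting DeleteOnTermination on the volume attachment avoids the
# problem in the first place.
```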

Mistake 7: Clean Up Each Region

I created my first cluster in Northern Virginia. I’ve always liked that area. When I found out that AWS charges more for Northern Virginia, I made my next 3 clusters in Oregon. The AWS console splits the view by region. You guessed it. While freaking out about undead clusters, I forgot to delete the cluster in Northern Virginia! When the next month’s sky-high bill arrived, I corrected my final mistake (of that first week).
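
Since the console only shows one region at a time, I now sweep every region before I believe I’m done. A rough boto3 sketch (the starting region is arbitrary; any region can list the others):

```python
import boto3

# Ask one region for the full region list, then check each one.
regions = [r["RegionName"] for r in
           boto3.client("ec2", region_name="us-east-1")
                .describe_regions()["Regions"]]

for region in regions:
    ec2 = boto3.client("ec2", region_name=region)
    running = sum(
        len(reservation["Instances"])
        for reservation in ec2.describe_instances(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )["Reservations"]
    )
    volumes = len(ec2.describe_volumes()["Volumes"])
    if running or volumes:
        print(f"{region}: {running} running instances, {volumes} volumes")
```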

Welcome to the Family

Cloud can feel imaginary until that first bill hits you. Then things get real, solid, and painful. When that happens, welcome to the family of cloud experts! Cloud changes how we consume, deploy, and run IT. We’re going to make mistakes (hopefully not 7 catastrophic mistakes in one week), but we’ll learn together. I’m glad to be part of the cloud family. I don’t want to face those undead clusters alone. Bring your boomstick.

Traditional Applications Run Better in Public Cloud

Public Cloud = More Choice + Better Data Protection

Public cloud works. Not just for SaaS, cloud native applications, or test and development. Not just for startups or executives bragging to each other on the golf course. Public cloud works for traditional, stable applications. It can deliver better service levels and reduce costs … even compared to a well-run on-premises environment.

To date, market analysts have focused on cloud disrupting who buys IT infrastructure. Frustrated lines of business pounced on the chance to bypass IT. Cloud let them “Fail Fast or Scale Fast”. They didn’t have to wait for IT approval, change control, hardware acquisition, or governance. Lines of business continue to embrace cloud’s self-service provisioning at a low monthly cost.

Still, conventional wisdom says public cloud can’t compete with a well-run on-premises environment. IT architects argue that public cloud can’t match the performance and functionality of legacy environments. IT administrators can’t tweak low-level knobs. IT directors can’t demand custom releases. How can vanilla cloud handle the complex requirements of legacy applications? Financial analysts note that public cloud charges a premium for its flexible consumption. Stable workloads don’t need that flexibility, so why pay the premium?

Conventional wisdom is wrong. Most traditional workloads don’t need custom-built environments. You don’t need a Formula-1 race car to pick up groceries, and you don’t need specially-made infrastructure to run most applications. Moreover, public cloud’s architectural advantages can reduce IT costs, even with the pricing premium.

In the next stage, public cloud will change how we architect IT infrastructure. Public cloud has two architectural advantages for traditional applications: more price/performance options and on-demand provisioning for data protection.

Public cloud offers more price/performance choices than on-premises infrastructure. Outside of the Fortune 50, most companies don’t get to buy “one of everything” for their infrastructure. Instead, they buy a one-size-fits-all workhorse system to support all the workloads. The public cloud offers more technology choices than even the largest IT shop. It is the biggest marketplace (pun intended) for different technology configurations. Cloud levels the playing field between smaller and bigger companies.*

* NOTE: For this to happen, we need to solve the operational challenges of running different cloud configurations.

Public cloud can improve data protection. For years, IT has struggled to deliver high-performance disaster recovery, backup, and archive. Companies can’t afford to run DR and archive environments for all their applications; maintaining two near-identical sites costs too much. That’s why they pretend that their backups can be DR and archive copies. Unfortunately, when disasters or (even worse) legal issues strike, recovery cannot begin until IT provisions a new environment. Companies collapse before recoveries can complete.

Public cloud’s on-demand provisioning enables cost-effective first-class DR, archive, and backup. Customers don’t waste money on idle standby environments. Nor do they treat “hope that nothing goes wrong” as a strategy. Instead, when necessary, they near-instantly spin up compute and storage in a new location. Then, they near-instantly restore the data and start running.* With public cloud, IT can unify enterprise-class DR, backup, and archive.

Organizations are already moving backup copies to cloud object storage. The next step will be to use those copies for unified data protection.

*NOTE: For this to happen, we must create cost-effective cloud protection storage and build near-instant data recovery mechanisms.
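
As one hedged illustration of that on-demand pattern, here is a boto3 sketch that copies an EBS snapshot into a recovery region and provisions a volume from it only when needed. The regions, zone, and snapshot ID are placeholders, and real recovery tooling would do far more:

```python
import boto3

# Recovery-region client; nothing runs here until you need it.
dr = boto3.client("ec2", region_name="us-west-2")        # placeholder DR region

# Copy a protected snapshot from the production region.
copy = dr.copy_snapshot(
    SourceRegion="us-east-1",                            # placeholder source
    SourceSnapshotId="snap-0123456789abcdef0",           # placeholder snapshot
    Description="DR copy of production volume",
)

# Wait for the copy, then provision a volume on demand, with no idle
# standby array sitting in the DR site all year.
dr.get_waiter("snapshot_completed").wait(SnapshotIds=[copy["SnapshotId"]])
dr.create_volume(
    AvailabilityZone="us-west-2a",                       # placeholder AZ
    SnapshotId=copy["SnapshotId"],
    VolumeType="gp2",
)
```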

Public cloud works for traditional applications. You can run applications on the best configuration, rather than what is available. You can have first-class DR and archive, rather than “best effort” with backup copies. You can replace your hand-crafted environments with something less expensive and more functional. Public cloud should not threaten IT; instead its architecture should help IT to deliver better services. It’s time to stop resisting and start building.