Comparing Clouds: “Day 2” Management Operations

So far in this blog series, I’ve taken a look at how to provision and scale servers using five leading cloud providers. Now, I want to dig into support for “Day 2 operations” like troubleshooting, reactive or proactive maintenance, billing, backup/restore, auditing, and more. In this blog post, we’ll look at how to manage (long-lived) running instances at each provider and see what capabilities exist to help teams manage at scale. For each provider, I’ll assess instance management, fleet management, and account management.

There might be a few reasons you don’t care a lot about the native operational support capabilities in your cloud of choice. For instance:

  • You rely on configuration management solutions for steady-state. Fair enough. If your organization relies on great tools like Ansible, Chef or CFEngine, then you already have a consistent way to manage a fleet of servers and avoid configuration drift.
  • You use “immutable servers.” In this model, you never worry about patching or updating running machines. Whenever something has to change, you deploy a new instance of a gold image. This simplifies many aspects of cloud management.
  • You leverage “managed” servers in the cloud. If you work with a provider that manages your cloud servers for you, then on the surface, there is less need for access to robust management services.
  • You’re running a small fleet of servers. If you only have a dozen or so cloud servers, then management may not be the most important thing on your mind.
  • You leverage a multi-cloud management tool. As companies chase the “multi-cloud” dream, they leverage tools like RightScale, vRealize, and others to provide a single experience across a cloud portfolio.

However, I contend that the built-in operational capabilities of a particular cloud are still relevant for a variety of reasons, including:

  • Deployments and upgrades. It’s wonderful if you use a continuous deployment tool to publish application changes, but cloud capabilities still come into play. How do you open up access to cloud servers and push code to them? Can you disable operational alarms while servers are in an upgrading state? Is it easy to snapshot a machine, perform an update, and roll back if necessary? There’s no one way to do application deployments, so your cloud environment’s feature set may still play an important role.
  • Urgent operational issues. Experiencing a distributed denial of service attack? Need to push an urgent patch to one hundred servers? Trying to resolve a performance issue with a single machine? Automation and visibility provided by the cloud vendor can help.
  • Steady and rapid scale. There’s a good chance that your cloud footprint is growing. More environments, more instances, more scenarios. How does your cloud make it straightforward to isolate cloud instances by function or geography? A proper configuration management tool goes a long way toward making this possible, but cloud-native functionality will be important as well.
  • Audit trails. Users may interact with the cloud platform via a native UI, third party UI, or API. Unless you have a robust log aggregation solution that pulls data from each system that fronts the cloud, it’s useful to have the system of record (usually the cloud itself) capture information centrally.
  • UI as a window to the API. Many cloud consumers don’t ever see the user interface provided by the cloud vendor. Rather, they only use the available API to provision and manage cloud resources. We’ll look at each cloud provider’s API in a future post, but the user interface often reveals the feature set exposed by the API. Even if you are an API-only user, seeing how the Operations experience is put together in a user interface can help you see how the vendor approaches operational stories.

Let’s get going in alphabetical order.

DISCLAIMER: I’m the product owner for the CenturyLink Cloud. Obviously my perspective is colored by that. However, I’ve taught three well-received courses on AWS, use Microsoft Azure often as part of my Microsoft MVP status, and spend my day studying the cloud market and playing with cloud technology. While I’m not unbiased, I’m also realistic and can recognize strengths and weaknesses of many vendors in the space.

Amazon Web Services

Instance Management

Users can do a lot of things with each particular AWS instance. I can create copies (“Launch more like this”), convert to a template, issue power operations, set and apply tags, and much more.

2014.12.19cloud01

AWS has a super-rich monitoring system called CloudWatch that captures all sorts of metrics and is capable of sending alarms.

2014.12.19cloud02
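To make the alarm idea concrete, here is a minimal Python sketch of threshold-based alarming: the alarm fires only when a metric stays above a threshold for several consecutive samples. It does not call CloudWatch at all. The function name, the threshold, and the sample values are hypothetical, and the state labels are simply the ones I chose for this illustration.

```python
from typing import Sequence


def alarm_state(samples: Sequence[float], threshold: float, periods: int) -> str:
    """Return "ALARM" if the last `periods` samples all exceed `threshold`.

    Hypothetical helper for illustration only; not part of any AWS SDK.
    """
    if len(samples) < periods:
        return "INSUFFICIENT_DATA"
    recent = samples[-periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# Illustrative CPU utilization percentages, one per monitoring period.
cpu = [35.0, 42.0, 91.0, 95.0, 97.0]
print(alarm_state(cpu, threshold=90.0, periods=3))  # ALARM
```

Requiring several consecutive breaches is a common way to avoid paging someone over a single spike.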

 

Fleet Management

AWS shows all your servers in a flat, paged list.

2014.12.19cloud04

You can filter the list based on tag/attribute/keyword associated with the server(s). Amazon also JUST announced Resource Grouping to make it easier to organize assets.

2014.12.19cloud06

When you’ve selected a set of servers in the list, you can do things like issue power operations in bulk.

2014.12.19cloud03

Monitoring also works this way. However, Autoscale does not work against collections of servers.

2014.12.19cloud05
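The filter-then-act pattern described above (narrow the list by tag, then issue one power operation against the selection) can be sketched in a few lines of Python. This is an in-memory model only, assuming a made-up `Server` class and an `env` tag; it does not use the AWS API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Server:
    # Hypothetical stand-in for a cloud instance record.
    name: str
    state: str = "running"
    tags: Dict[str, str] = field(default_factory=dict)


def filter_by_tag(fleet: List[Server], key: str, value: str) -> List[Server]:
    """Return the servers whose tag `key` equals `value`."""
    return [s for s in fleet if s.tags.get(key) == value]


def bulk_stop(servers: List[Server]) -> List[str]:
    """Stop every running server in the selection; return the names stopped."""
    stopped = []
    for server in servers:
        if server.state == "running":
            server.state = "stopped"
            stopped.append(server.name)
    return stopped


fleet = [
    Server("web-1", tags={"env": "test"}),
    Server("web-2", tags={"env": "prod"}),
    Server("db-1", tags={"env": "test"}),
]
stopped = bulk_stop(filter_by_tag(fleet, "env", "test"))
print(stopped)  # ['web-1', 'db-1']
```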

It’d be negligent of me to talk about management at scale in AWS without talking about Elastic Beanstalk and OpsWorks. Beanstalk puts an AWS-specific wrapper around an “application” that may be composed of multiple individual servers. A Beanstalk application may have a load balancer, and be part of an Autoscaling group. It’s also a construct for doing rolling deployments. Once a Beanstalk app is up and running, the user can manage the fleet as a unit.

Once you have a Beanstalk application, you can terminate and restart the entire environment.

2014.12.19cloud07

There are still individual servers shown in the EC2 console, but Beanstalk makes it simpler to manage related assets.

OpsWorks is a relatively new offering used to define and deploy “stacks” composed of application layers. Developers can associate Chef recipes with multiple stages of the lifecycle. You can also run recipes manually at any time.

2014.12.19cloud08

Account Management

AWS doesn’t offer any “aggregate” views that roll up your consumption across all regions. The dashboards are service-specific, and are shown on a region-by-region basis. AWS accounts are autonomous, and you don’t share anything between them. Within an account, users can do a lot of things. For instance, the Identity and Access Management service lets you define customized groups of users with very specific permission sets.

2014.12.19cloud09

AWS has also gotten better at showing detailed usage reports.

2014.12.19cloud10

The invoice details are still a bit generic and don’t easily tie back to a given server.

2014.12.19cloud11

There are a host of other AWS services that make account management easier. These include CloudTrail for API audit logs and SNS for push notifications.

CenturyLink Cloud

Instance Management

For an individual virtual server in CenturyLink Cloud, the user has a lot of management options. It’s pretty easy to resize, clone, archive, and issue power commands.

2014.12.19cloud12

Doing a deployment but want to be able to revert any changes? The platform supports virtual machine snapshots for creating restore points.

2014.12.19cloud14
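The “snapshot, change, roll back if needed” workflow is easy to reason about as a small Python sketch. Here the “server” is just a dictionary and the snapshot is a deep copy, so this is a model of the pattern rather than anything the platform exposes. The `restore_point` name and the package versions are made up for illustration.

```python
import copy
from contextlib import contextmanager


@contextmanager
def restore_point(server: dict):
    """Snapshot `server` and roll back to the snapshot if the body raises."""
    snapshot = copy.deepcopy(server)
    try:
        yield server
    except Exception:
        # Revert to the state captured before the change, then re-raise.
        server.clear()
        server.update(snapshot)
        raise


server = {"name": "app-1", "packages": {"openssl": "1.0.1j"}}

try:
    with restore_point(server) as s:
        s["packages"]["openssl"] = "1.0.1k"
        raise RuntimeError("health check failed after upgrade")
except RuntimeError:
    pass

print(server["packages"]["openssl"])  # 1.0.1j
```

If the body completes without an exception, the change is kept and the snapshot is simply discarded.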

Each server details page shows a few monitoring metrics.

2014.12.19cloud13

Users can also bind usage alert and vertical autoscale policies to a server.

 

Fleet Management

CenturyLink Cloud has you organize servers into collections called “Groups.” These Groups – which behave similarly to a nested file structure – are management units.

2014.12.19cloud15

Users can issue bulk power operations against all or some of the servers in a Group. Additionally, you can set “scheduled tasks” on a Group. For instance, power off all the servers in a Group every Friday night, and turn them back on Monday morning.

2014.12.19cloud16
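As a rough illustration of that Friday-to-Monday schedule, here is a Python sketch of the decision a scheduler would make at any given moment. The cut-over hours (20:00 on Friday, 06:00 on Monday) and the function name are my own assumptions, not platform defaults.

```python
from datetime import datetime


def should_be_running(now: datetime, off_hour: int = 20, on_hour: int = 6) -> bool:
    """True unless `now` falls between Friday `off_hour` and Monday `on_hour`."""
    weekday = now.weekday()  # Monday is 0, Sunday is 6
    if weekday == 4:  # Friday: running until the evening cut-off
        return now.hour < off_hour
    if weekday in (5, 6):  # Saturday and Sunday: powered off
        return False
    if weekday == 0:  # Monday: powered back on in the morning
        return now.hour >= on_hour
    return True


print(should_be_running(datetime(2014, 12, 19, 21, 0)))  # Friday night: False
print(should_be_running(datetime(2014, 12, 22, 7, 0)))   # Monday morning: True
```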

You can also choose pre-loaded or dynamic actions to perform against the servers in a Group. These packages could be software (e.g. new antivirus client) or scripts (e.g. shut off a firewall port) that run against any or all of the servers at once.

2014.12.19cloud17

 

The CenturyLink Cloud also provides an aggregated view across data centers. In this view, it’s fairly straightforward to see active alarms (notice the red on the offending server, group, and data center), and navigate the fleet of resources.

2014.12.19cloud18
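One way to picture nested Groups and the alarm roll-up is as a tree: a Group holds servers and child Groups, and a container shows red if anything beneath it is in alarm. The Python below is a sketch of that idea only. The class and field names are hypothetical, and treating the data center as the top-level Group is a simplification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    name: str
    alarm: bool = False


@dataclass
class Group:
    name: str
    servers: List[Server] = field(default_factory=list)
    subgroups: List["Group"] = field(default_factory=list)

    def all_servers(self) -> List[Server]:
        """Collect servers from this Group and every nested Group."""
        found = list(self.servers)
        for child in self.subgroups:
            found.extend(child.all_servers())
        return found

    def has_alarm(self) -> bool:
        """A Group is in alarm if any server anywhere beneath it is."""
        return any(s.alarm for s in self.all_servers())


web = Group("web", servers=[Server("web-1"), Server("web-2", alarm=True)])
db = Group("db", servers=[Server("db-1")])
datacenter = Group("datacenter-a", subgroups=[web, db])

print(datacenter.has_alarm(), web.has_alarm(), db.has_alarm())  # True True False
```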

Finally, the platform offers a “Global Search” where users can search for servers located in any data center.

2014.12.19cloud48

 

Account Management

Within CenturyLink Cloud, there’s a concept of an account hierarchy. Accounts can be nested within one another. Networks and other settings can be inherited (or separated), and user permissions cascade down.

2014.12.19cloud19

Throughout the system, users can see the month-to-date and projected cost of their cloud consumption. The invoice data itself shows costs on a per-server and per-Group basis. This is handy for chargeback situations where teams pay for specific servers or entire environments.

2014.12.19cloud20
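For chargeback, the useful step is rolling per-server line items up to the Group that owns them. A minimal Python sketch, assuming invoice rows arrive as (group, server, cost) tuples; both the row shape and the dollar amounts are invented for illustration:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (group, server, cost in dollars); illustrative numbers only
LineItem = Tuple[str, str, float]


def cost_per_group(items: List[LineItem]) -> Dict[str, float]:
    """Sum server costs by the Group each server belongs to."""
    totals: Dict[str, float] = defaultdict(float)
    for group, _server, cost in items:
        totals[group] += cost
    return {group: round(total, 2) for group, total in totals.items()}


invoice = [
    ("web", "web-1", 41.50),
    ("web", "web-2", 41.50),
    ("db", "db-1", 120.25),
]
print(cost_per_group(invoice))  # {'web': 83.0, 'db': 120.25}
```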

CenturyLink Cloud offers role-based access controls for a variety of personas. These apply to a given account, and any sub-accounts beneath it.

2014.12.19cloud21
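The “roles apply to an account and everything beneath it” behavior can be modeled as a walk up the account tree. The Python below is a sketch under that assumption. The `Account` class, the role names, and the users are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Account:
    name: str
    parent: Optional["Account"] = None
    grants: Dict[str, Set[str]] = field(default_factory=dict)  # user -> roles

    def roles_for(self, user: str) -> Set[str]:
        """Roles granted here plus any inherited from parent accounts."""
        roles = set(self.grants.get(user, set()))
        if self.parent is not None:
            roles |= self.parent.roles_for(user)
        return roles


root = Account("corp", grants={"alice": {"account-admin"}})
team = Account("team-a", parent=root, grants={"bob": {"server-operator"}})

print(team.roles_for("alice"))  # {'account-admin'}
print(root.roles_for("bob"))    # set()
```

Permissions flow downward only: a grant on the sub-account gives Bob nothing in the parent.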

The CenturyLink Cloud has other account administration features like push-based notifications (“webhooks”) and a comprehensive audit trail.

Digital Ocean

Instance Management

Digital Ocean specializes in simplicity targeted at developers, but their experience still serves up a nice feature set. From the server view, you can issue power operations, resize the machine, create snapshots, change the server name, and more.

2014.12.19cloud22

There are a host of editable settings that touch on networking, Linux Kernel, and recovery processes.

2014.12.19cloud23

Digital Ocean gives developers a handful of metrics that clearly show bandwidth consumption and resource utilization.

2014.12.19cloud24

There’s a handy audit trail below each server that clearly identifies what operations were performed and how long they took.

2014.12.19cloud26
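If you wanted a similar per-server trail in your own tooling, the core of it is recording which operation ran and how long it took. Here is a small Python sketch of that idea. The `audited` helper, the log layout, and the server name are my own inventions and have nothing to do with Digital Ocean’s implementation.

```python
import time
from contextlib import contextmanager
from typing import Dict, List

audit_log: List[Dict[str, object]] = []


@contextmanager
def audited(server: str, operation: str):
    """Record the operation and its duration, even if the body fails."""
    started = time.monotonic()
    try:
        yield
    finally:
        audit_log.append({
            "server": server,
            "operation": operation,
            "seconds": round(time.monotonic() - started, 3),
        })


with audited("droplet-1", "power_cycle"):
    time.sleep(0.01)  # stand-in for the real work

print(audit_log[0]["operation"])  # power_cycle
```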

Fleet Management

Digital Ocean focuses on the developer audience and API users. Their UI console doesn’t really have a concept of managing a fleet of servers. There’s no option to select multiple servers, sort columns, or perform bulk activities.

2014.12.19cloud25

Account Management

The account management experience is fairly lightweight at Digital Ocean. You can view account resources like snapshots and backups.

2014.12.19cloud27

It’s easy to create new SSH keys for accessing servers.

2014.12.19cloud28

 

The invoice experience is simple but clear. You can see current charges, and how much each individual server cost.

2014.12.19cloud29

The account history shows a simple audit trail.

2014.12.19cloud30

 

Google Compute Engine

Instance Management

The Google Compute Engine offers a nice amount of per-server management options. You can connect to a server via SSH, reboot it, clone it, and delete it. There is also a set of monitoring statistics clearly shown at the top of each server’s details.

2014.12.19cloud31

Additionally, you can change settings for storage, network, and tags.

2014.12.19cloud32

 

Fleet Management

The only thing you really do with a set of Google Compute Engine servers is delete them.

2014.12.19cloud34

 

Google Compute Engine offers Instance groups for organizing virtual resources. They can all be based on the same template and work together in an autoscale fashion, or, you can put different types of servers into an instance group.

2014.12.19cloud33

An instance group is really just a simple construct. You don’t manage the items as a group, and if you delete the group, the servers remain. It’s simply a way to organize assets.

2014.12.19cloud35
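Put another way, the instance group described here behaves like a label over a set of servers rather than an owner of them. A tiny Python model of that behavior, with made-up instance and group names:

```python
from typing import Dict, Set

# Hypothetical inventory: the instances exist independently of any group.
instances: Set[str] = {"vm-1", "vm-2", "vm-3"}
groups: Dict[str, Set[str]] = {"frontend": {"vm-1", "vm-2"}}


def delete_group(name: str) -> None:
    """Remove the grouping only; member instances are left untouched."""
    groups.pop(name, None)


delete_group("frontend")
print(sorted(instances))  # ['vm-1', 'vm-2', 'vm-3']
```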

Account Management

Google Compute Engine offers a few different types of management roles including owner, editor, and viewer.

2014.12.19cloud36

What’s nice is that you can also have separate billing managers.  Other billing capabilities include downloading usage history, and reviewing fairly detailed invoices.

2014.12.19cloud37

I don’t yet see an audit trail capability, so I assume that you have to track activities some other way.

Microsoft Azure

Instance Management

Microsoft is in transition between its legacy production portal and its new blade-oriented portal. For the classic portal, Microsoft crams a lot of useful details into each server’s “details” page.

2014.12.19cloud38

The preview portal provides even more information, in a more … unique … format.

2014.12.19cloud39

In either environment, Azure makes it easy to add disks, change virtual machine size, and issue power ops.

Microsoft gives users a useful set of monitoring metrics on each server.

2014.12.19cloud40

Unlike the classic portal, the new one has better cost transparency.

2014.12.19cloud41

Fleet Management

There are no bulk actions in the existing portal, besides filtering which Azure subscription to show, and sorting columns. Like AWS, Azure shows a flat list of servers in your account.

2014.12.19cloud42

The preview portal has the same experience, but without any column sorting.

2014.12.19cloud43

Account Management

Microsoft Azure users have a wide array of account settings to work with. It’s easy to see current consumption and how close to the limits you are.

2014.12.19cloud44

The management service gives you an audit log.

2014.12.19cloud45

The new portal gives users the ability to set a handful of account roles for each server. I don’t see a way to apply these roles globally, but it’s a start!

2014.12.19cloud46

The pricing information is better in the preview portal, although the costs are still fairly coarse and not on a per-machine basis.

2014.12.19cloud47

 

Summary

Each of these providers has a distinct take on server management. Whether your virtual servers typically live for three hours or three years, the provider’s management capabilities will come into play. Think about what your development and operations staff need to be successful, and take an active role in planning how Day 2 operations in your cloud will work. Consider things like bulk management, audit trails, and security controls when crafting your strategy!

Author: Richard Seroter

Richard Seroter is Director of Developer Relations and Outbound Product Management at Google Cloud. He’s also an instructor at Pluralsight, a frequent public speaker, the author of multiple books on software design and development, and a former InfoQ.com editor plus former 12-time Microsoft MVP for cloud. As Director of Developer Relations and Outbound Product Management, Richard leads an organization of Google Cloud developer advocates, engineers, platform builders, and outbound product managers that help customers find success in their cloud journey. Richard maintains a regularly updated blog on topics of architecture and solution design and can be found on Twitter as @rseroter.

5 thoughts

  1. One thing to note about the Azure “transition” is that unique capabilities exist in both portals that don’t exist in their complement, so “Day 2” ops requires knowledge and use of both portals. Similarly, their cross-platform tools for Linux/OSX are underdeveloped compared with either the portals or their PowerShell equivalent. And, speaking of PowerShell, it has unique capabilities that the other interfaces don’t. So, if you need to be able to effectively manage an Azure fleet from OSX, you’ll need to use all four interfaces and have a Windows box around.

    Also, the “classic” to “blade” transition has been going on for at least a year at this point with no clear roadmap for completion, so don’t expect this state of affairs to change very soon.
