
Managing your Virtual Datacenter – 4 Key Strategies

While recently studying for the VMware Advanced Professional Datacenter Design (VCAP-DCD) exam, I had the chance to study the documents in the exam blueprint. Some of these documents provide insights into effective strategies to use before, during, and after deploying a virtual infrastructure.

In terms of the “after” piece, one document caught my eye: the Glasshouse whitepaper entitled “Four Keys to Managing a VMware Environment”.

You can find the document on the community website at this location: https://communities.vmware.com/docs/DOC-17397.

While this document comes from a VMware source, I suggest it is equally valid when applied to any virtualization or cloud platform. The same rules typically apply to all when considering monitoring and management strategies.

First things first: let’s talk about monitoring, which should be in place before anything goes anywhere near production.

Step 1: Management and Monitoring: Not the Same Thing!

It is common to see virtual platforms “integrated” into an existing monitoring solution, be it SolarWinds, SCOM or any other. Typically there is a module, add-on or plugin which enables visibility into the virtual world, normally via the management API. Think vCenter API.

There is a huge (almost endless) amount of telemetry that can be surfaced up to any monitoring tool from the likes of vCenter and SCVMM. Anyone who uses PowerShell or VMware PowerCLI will be aware of the vast number of options, fields, methods and actions available.

But the data that is presented is not standardised; what is surfaced, and how, depends on the software vendor.

[ASIDE: Isn’t that one reason why PowerCLI and PowerShell have the word “Power” in their names? It’s the power of customisation and automation, where every problem can be solved by a script.]
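To make that concrete, here is a minimal PowerCLI sketch of pulling telemetry straight from vCenter. The server name is a placeholder, and the counter and interval are just illustrative choices:

```powershell
# Minimal PowerCLI sketch: surfacing raw telemetry from vCenter.
# 'vcenter.example.local' is a placeholder for your own server.
Connect-VIServer -Server vcenter.example.local

# Get-Stat exposes the same counters a monitoring tool consumes via the API.
# Here: average CPU usage for powered-on VMs over the last 24 hours.
$vms   = Get-VM | Where-Object { $_.PowerState -eq 'PoweredOn' }
$stats = Get-Stat -Entity $vms -Stat 'cpu.usage.average' `
                  -Start (Get-Date).AddDays(-1) -IntervalMins 30

# Summarise per VM, to see what your monitoring tool may (or may not) surface.
$stats | Group-Object { $_.Entity.Name } | ForEach-Object {
    [pscustomobject]@{
        VM        = $_.Name
        AvgCpuPct = [math]::Round(($_.Group | Measure-Object -Property Value -Average).Average, 1)
    }
} | Sort-Object AvgCpuPct -Descending | Format-Table -AutoSize
```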

We should state here and now that unless the tool in question has been chosen specifically to meet your requirements, it may not provide the required level of visibility into the environment. Different vendors expose different information, after all.

The result could be that critical information required to understand whether an SLA is being met cannot be easily displayed.

Revisit Design Requirements

So, rolling back to the design stage, it’s important to evaluate the management requirements, both in terms of the metrics that need to be monitored and the access that needs to be provided. By extension, we need to assess whether the existing toolset can meet these requirements.

In most cases, it would be extremely disruptive to “throw out” an existing monitoring tool, but blindly integrating into the existing toolset can expose you to risk in the future monitoring of your environment.

The risk is that the toolset cannot provide the requisite level of visibility into the virtual environment. So monitoring is step 1 of the puzzle, but it is not management of your environment. All it does is provide a green-amber-red view of the world.

In many cases you still need to customise the thresholds and alerts to meet site-specific needs. This seems obvious, but it’s surprising how often it gets missed.

By doing this, you gain visibility of the key metrics that must be monitored to understand performance against your SLAs.
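As a simple illustration, here is a PowerCLI sketch of a site-specific check. The 20% free-space floor is purely an assumed figure; replace it with whatever threshold your SLA actually demands:

```powershell
# Sketch of a site-specific threshold check, assuming a 20% datastore
# free-space floor (an illustrative figure, not a recommendation).
$freeFloorPct = 20

Get-Datastore | ForEach-Object {
    $freePct = [math]::Round(($_.FreeSpaceGB / $_.CapacityGB) * 100, 1)
    $status  = if ($freePct -lt $freeFloorPct) { 'ALERT' } else { 'OK' }
    [pscustomobject]@{
        Datastore = $_.Name
        FreePct   = $freePct
        Status    = $status
    }
} | Sort-Object FreePct | Format-Table -AutoSize
```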

So that conveniently brings us to steps 2, 3 and 4, which relate to the management of your environment:

Step 2: Implement an effective Virtual Machine Cost Model

People not familiar with virtualization and cloud often don’t appreciate that all resources have a cost, both in terms of capital and ongoing operational expenditure.

Virtual machines are not free. They are made up of CPU, memory, network and storage, and carry licensing costs on top, as well as ongoing OPEX.

And as the whitepaper acknowledges, implementing chargeback can be a difficult challenge, particularly within an Enterprise.

However, it is a very worthwhile exercise to try to establish the cost of a virtual machine for all parties.

It also helps you understand whether a private cloud, or a hosted private cloud via Azure, VMware vCloud Hybrid Service, AWS or others, is economically viable and advantageous.

At a minimum, I suggest you consider the replenishment cost of any storage you provide.
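As a starting point, here is a deliberately simple PowerCLI sketch of a per-VM storage cost. The rate of 0.20 (in your currency) per provisioned GB per month is a made-up figure; substitute your own replenishment cost:

```powershell
# Hypothetical per-VM storage cost, assuming an illustrative blended rate
# of 0.20 per provisioned GB per month.
$costPerGBMonth = 0.20

Get-VM | Select-Object Name,
    @{N = 'ProvisionedGB';      E = { [math]::Round($_.ProvisionedSpaceGB, 0) }},
    @{N = 'MonthlyStorageCost'; E = { [math]::Round($_.ProvisionedSpaceGB * $costPerGBMonth, 2) }} |
    Sort-Object MonthlyStorageCost -Descending | Format-Table -AutoSize
```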

In many cases storage is a persistent resource that, once consumed, is no longer available to your future customers.

It can be oversubscribed using thin provisioning, but this is a strategy that requires careful design and monitoring to ensure out-of-space conditions do not occur.
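If you do oversubscribe, it pays to know by how much. Here is a minimal sketch using the datastore summary exposed by the vSphere API (committed plus uncommitted space versus capacity):

```powershell
# Quantify thin-provisioning over-subscription per datastore.
# Provisioned = committed (capacity - free) + uncommitted, per the vSphere API.
Get-Datastore | ForEach-Object {
    $s             = $_.ExtensionData.Summary
    $capacityGB    = $s.Capacity / 1GB
    $provisionedGB = (($s.Capacity - $s.FreeSpace) + $s.Uncommitted) / 1GB
    [pscustomobject]@{
        Datastore     = $_.Name
        CapacityGB    = [math]::Round($capacityGB, 0)
        ProvisionedGB = [math]::Round($provisionedGB, 0)
        OversubPct    = [math]::Round(($provisionedGB / $capacityGB) * 100, 0)
    }
} | Sort-Object OversubPct -Descending | Format-Table -AutoSize
```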

To give an example, an ex-colleague of mine uses a simple strategy that works well.

When approached for large amounts of virtual resources, he asks a simple question…

What can you give me back first?

This has produced good results and is a neat solution to a complex area.

He also finds that when he discusses the performance requirements in more detail, he can sometimes move his customers’ existing virtual machines to a lower storage tier using Storage vMotion.

When you examine the workload, it may not need to live on the Gold tier, and the owner may be happy to shift to your Bronze tier.
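The mechanics are simple; it’s the conversation that matters. Here is a minimal sketch of the Storage vMotion itself, with hypothetical VM and datastore names:

```powershell
# Relocate a VM's disks to a cheaper tier with Storage vMotion.
# 'app01' and 'Bronze-DS01' are hypothetical names for illustration.
$vm     = Get-VM -Name 'app01'
$target = Get-Datastore -Name 'Bronze-DS01'

# Move-VM with -Datastore migrates storage while the VM stays running.
Move-VM -VM $vm -Datastore $target -Confirm:$false
```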

Step 3: Integrate the Virtual Environment into your Configuration and Change Management Processes

You need to ensure that any existing processes that manage change within your environment fully integrate with, and include, your virtual infrastructure.

It is useful to deploy different classes of environment if you can, such as Production, Test, QA and even a Sandpit environment.

There are cost-effective ways to give users a platform for testing, without them becoming tempted to make a change in production. Look out for nested environments, or for vCloud solutions.

Remember that your Virtualization or Cloud Platform could support your entire business.

Adhering to standards requires:

Implementing an agreed design that takes account of key business requirements.

Establishing a Baseline State after deployment.

Ensuring Role-Based Access Control (RBAC) is fully implemented, so the correct people have authority over discrete objects (a minimal sketch follows this list).

Implementing strict change control on Production systems in relation to the agreed baseline.
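On the RBAC point, here is a minimal PowerCLI sketch; the role name, privilege list, folder and principal are all hypothetical and should be tailored to your own access model:

```powershell
# Create a narrowly scoped role: read-only plus console access.
# All names here are placeholders for illustration.
$privs = Get-VIPrivilege -Id 'System.View', 'System.Read',
                             'VirtualMachine.Interact.ConsoleInteract'
New-VIRole -Name 'AppTeam-ConsoleOnly' -Privilege $privs

# Grant the role against a discrete object (a VM folder), not the vCenter root.
$folder = Get-Folder -Name 'AppTeam-VMs'
New-VIPermission -Entity $folder -Principal 'DOMAIN\AppTeam' `
                 -Role 'AppTeam-ConsoleOnly' -Propagate:$true
```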

I recommend that you try to deploy a smaller QA environment that matches your Production configuration exactly. This will ensure any changes can be evaluated there before being moved into Production.

Use features such as Host Profiles or Update Manager to ensure your systems are compliant with defined baselines and target states. Attach desired baseline states and reference profiles to systems, and set up alarms to monitor for divergence from the desired state. Note that these features may only be available at certain licensing tiers.
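For example, assuming Host Profiles are licensed and a profile is already attached, the compliance check can be scripted (the cluster name is a placeholder):

```powershell
# Check hosts in a cluster against their attached Host Profile.
# Requires appropriate licensing; 'Prod-Cluster' is a placeholder name.
$hosts = Get-Cluster -Name 'Prod-Cluster' | Get-VMHost
Test-VMHostProfileCompliance -VMHost $hosts
```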

Ensure all important stakeholders are represented at Change Control reviews, so the impact of a change can be assessed and agreed in advance.

And finally, ensure you understand the dependencies between applications, as this will help to assess the impact of a change within your environment.

Step 4: Develop a Regular Maintenance Schedule

In order to prevent problems such as zombie and orphaned virtual machines, which I have covered in a previous post, you need to perform routine maintenance.

First, define the tasks that need to be executed. Here are some examples:

Checking the health of the environment using RVTools, Veeam ONE, PowerCLI or other tools, looking for snapshots, zombies, port group inconsistencies, over-provisioned datastores, warnings, errors and so on (see the sketch after these examples).

Develop a runbook that defines the operational procedures to follow in exceptional circumstances. This also has the benefit of helping you map out application dependencies and upstream and downstream impacts.

Apply recommended updates to your environment on a cycle that makes sense. Six months after a major update is released is probably a reasonable point to consider deployment in Production. Doing nothing may not actually be the least risky avenue to follow.

Don’t forget Disaster Recovery and Business Continuity. I advocate that customers take the bull by the horns and actually fail systems over to a DR site. This is a great way to find out whether the changes required for a DR switchover to function successfully have been made and captured on both sites, and it will help you improve your Change Control process.
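Coming back to the health-check item above, here is a minimal PowerCLI sketch covering two of the usual suspects. The 7-day snapshot age threshold is an assumption; pick one that suits your site:

```powershell
# Health check 1: snapshots older than 7 days, with size, so they can be
# chased down and removed (the 7-day threshold is illustrative).
Get-VM | Get-Snapshot |
    Where-Object { $_.Created -lt (Get-Date).AddDays(-7) } |
    Select-Object VM, Name, Created, @{N = 'SizeGB'; E = { [math]::Round($_.SizeGB, 1) }} |
    Format-Table -AutoSize

# Health check 2: VMs whose connection state suggests they are orphaned
# or otherwise inaccessible.
Get-View -ViewType VirtualMachine |
    Where-Object { "$($_.Runtime.ConnectionState)" -in 'orphaned', 'inaccessible' } |
    Select-Object Name, @{N = 'State'; E = { $_.Runtime.ConnectionState }}
```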

The last two items also help you test your DR plans and Change Control processes: perform maintenance regularly and it genuinely does become routine.

When carried out frequently, this will not be something to be feared; it will simply become part of business-as-usual activities.

Summary

By taking these considerations into account, you can ensure you have a clear handle on what’s going on in your Virtual Universe.

That will help you and your customers avoid any nasty surprises when you least expect them.