Category Archives: vOperations
The Daily grunt work for every virtual environment
Hello my friends, it has been quite a while since I last blogged, but I wanted to take some time to share some of my experience over the past couple of years. I have had the opportunity to work with some great companies and people, and it has definitely been a very enlightening experience.
I had the privilege of being a part of a special project nearly 5 years ago which began my career in the cloud. I got to engineer and deploy one of the nation’s first GSA clouds, which was a great experience. As time rolled on and cloud was adopted, many things came to light. Being a VMware-savvy guy, I really didn’t have the time to spend learning all the new technologies that were directly competing. At this time, Amazon was getting big, VMware was about to release vRA, and the market stood still… or so it felt.
Microsoft had launched their on-prem cloud, and before we knew it we had to start getting serious about the cost of our delivery and compute. If you have never had the pleasure of working for service providers, let me tell you: it’s all about cost. So we put Azure to the test, compared it, vetted it, and did everything we could to ensure it could be operationally supported. It was a very interesting time and a nice comparison to our existing IaaS architecture. We definitely had our work cut out for us.
Since then the challenges of hybrid cloud have become real. Although some vendors had good solutions like UCS Inter-Cloud Fab or vCloud Connector… (insert whatever else here), we always seemed to have unique enough requirements to disqualify them. Needless to say, we still deployed them, stood them up, and tested them, and although we found great value, it still wasn’t enough to warrant a change. Being a service provider isn’t about offloading to another cloud… it’s about how you can upsell your services and provide more value for customers.
As time went on, people adopted Cisco UCS into their infrastructures, and eventually updating and maintaining infrastructure became critical. The speed of delivery is only hindered by how fast we can adopt new offerings. If we cannot seamlessly update, migrate, or refresh to new, then what can we do?
“It’s so old it’s not even supported!”
“Wow, no new firmware for 5 years?!”
“Support for VMware has lapsed :(”
You can automate this pain away easily. Just because one vendor doesn’t support a feature or a new version does not mean you have to keep burdening your IT staff. If you could standardize operational processes, visibility, integration, and support across your cloud(s), would you?
The biggest challenge is getting out of the old and into the new. Most legacy infrastructure runs on VMware, and you can make that move with Turbonomic and a variety of other tools. One of the benefits of going 3rd party is that you don’t have “lock-in” to any infrastructure or software. You can size it, optimize it, price it, and compare it to ensure things run as they should. Versioning, upgrades, and the like will always be a challenge, but as long as you can ensure compliance, provisioning, optimization, and performance, they won’t be an afterthought. I found that Turbonomic always got the job done and always responded in a way that provided a solution, and more than that… at the push of a button.
Some of the benefits:
– Agnostic Integration with a large set of vendors
– Automated Provisioning for various types of compute
– Easily retrofit existing infrastructure for migration
– Elastic compute models
– Cost comparison, pricing of existing workloads, etc. (e.g. Amazon AWS, Azure)
– Track and exceed your ROI Goals
– Eliminate Resource Contention
– Automate and Schedule Migrations between Compute Platforms (IaaS > DBaaS)
– Assured performance, control, and automated re-sizing
– Not version dependent and can be used in a wide variety of scenarios (I can elaborate if needed)
– Get rolling almost instantly with it…
After 5 years I still think Turbonomic is a great product. I used it extensively in the early days and also worked with it during the vCloud integration piece. The free version is also amazing and very helpful. Spending time checking capacity, double-checking data, ensuring things are proper and standard: all that stuff you can forget about. Configure your clouds (private, public, or dedicated) into Turbonomic quickly.
You just have to trust proven software, especially if it’s been 7 years in the making and exceeds capabilities that most tools require significant configuration for. Also, always keep in mind that Turbonomic can learn your environment, and the value of understanding the platform and providing insight can be huge. You have to admit that some admins may not understand or know other platforms. This simplifies all that by simply understanding the workload and the infrastructure that it runs on.
Other Great Information or References:
Cisco One Enterprise Suite – Cisco Workload Optimization Manager:
CWOM offered with
Turbonomics and BMC:
“Running it Red Hot with Turbonomics”
I am sure this topic has been beaten to death, and I am sure I am not going to say anything I haven’t already said. I don’t want to cover how to perform the upgrade, but I really wanted to provide some feedback on some post-task implementation plans for your upgrade. I hope this is helpful to someone out there. For others this may not be something you have to worry about, because you may use a standard installation of the components. For me, however, this is invaluable because it puts you in a position to know what you can expect when doing the upgrade. In my experience I have run into every component being configured differently, and sad to say it cost time and caused stress…
The Post Checks:
- Back-up the stuff.. 😉
User Accounts and service accounts for the following:
VMware Update Manager
- Service account usually..
- Web Manager Log in
- Appliance Console log in
vCloud Director Cells
- Root passwords
- Connectivity accounts in the configuration of the portal to vCenter
vCenter Server services
- Service Accounts
- I don’t use it but the same goes – get the accounts and IDs.
Database configurations and user names and access
VMware Update Manager Database
- ODBC configuration
- SQL Permissions and Access (Use SQL manager to test!!!)
- SQL service account
vCenter Server Database
- ODBC Configuration (64-bit)
- SQL Permissions and Access (Use SQL manager to test!!!)
- SQL service account
VCD Oracle DB
- I’m not an Oracle dude, but you should just need to make sure it’s backed up.
I was able to clone my vCenter and run a test upgrade on the clone to check all my configurations. I would highly recommend considering a standard configuration; if you follow the VMware documentation you should be fine. Just remember to document things when you have to do them a bit differently. One last thing I would recommend is definitely performing backups, snapshots, and even clones if you DON’T have 100% for-sure backups. This is invaluable…
One Last Thing…
As a side note, I don’t think I would add any of the other features of vCenter 5 unless you’re going to configure them and use them right then. There really is no point installing the other components, because per the VMware recommendations on vCenter sizing each component consumes additional CPU. An example would be installing the Web Client. It’s important to know what you are going to install and how you are going to configure it. If you are not going to configure it, I would not install it. Standardize your implementation and then move forward.
It seems I keep running into things that throw me for a loop, and I am sure someone else out there in the VMTN world knows what I am talking about. I love vCloud and I enjoy the product, but man, some things are still pretty vague. It continues to grow more and more each day, though, and I definitely want to do my part to help contribute. So I will make this a short and quick post.
Well take a peek at this VMware Knowledgebase and it will describe in detail the issue I ran into:
“Upgrading vCloud Director 1.5 with an Oracle database to vCloud Director 1.5.1 fails with the error: CALL create_missing_index()”
Although it is hard to say whether this is a bug in the database or just duplicated entries, most Oracle DBAs can knock it out fairly easily. However, let me explain how this happens and provide some further insight on how to resolve it without having to guess.
Now I want to point out that someone forgot to use the cell-management-tool (which you can find here), but one would think that this is mostly a database issue because it’s referencing a duplicate table entry. Also, this seems to happen when updating vCloud from 1.5 to 1.5.1.
If you run into this issue you need to do the following:
- Restore the Database
- Run the KB Fix of the following:
“DELETE FROM object_condition WHERE object_id IN (SELECT object_id FROM object_condition GROUP BY object_id, object_type, category, condition HAVING count(*) > 1);”
- Then perform your Database upgrade
I know this seems pretty simple, but the KB doesn’t tell you specifically what you need to do. The Oracle piece was vague in that one would think you only need to run the snippet above to fix it. You should know that it is something to be run AFTER the database restore. This may save you some time if you run into it.
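Before running a DELETE like that on a restored database, it helps to see exactly which rows it will touch. The following is a minimal sketch only: it uses an in-memory SQLite table as a stand-in for the real Oracle schema, and the column types and sample rows are assumptions made up for illustration. The DELETE text is the one quoted from the KB; the preview query is simply its subquery with a count added.

```python
import sqlite3

# The statement quoted from the KB, unchanged.
KB_DELETE = (
    "DELETE FROM object_condition WHERE object_id IN ("
    "SELECT object_id FROM object_condition "
    "GROUP BY object_id, object_type, category, condition "
    "HAVING count(*) > 1)"
)

# Read-only preview: the same grouping, showing how many copies each group has.
PREVIEW = (
    "SELECT object_id, count(*) FROM object_condition "
    "GROUP BY object_id, object_type, category, condition "
    "HAVING count(*) > 1"
)


def demo():
    """Run the preview and the KB delete against a toy table (not the real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE object_condition ("
        "object_id TEXT, object_type TEXT, category TEXT, condition TEXT)"
    )
    rows = [
        ("vm-1", "VM", "cat", "ok"),
        ("vm-1", "VM", "cat", "ok"),  # duplicated entry
        ("vm-2", "VM", "cat", "ok"),  # unique entry
    ]
    conn.executemany("INSERT INTO object_condition VALUES (?, ?, ?, ?)", rows)
    duplicates = conn.execute(PREVIEW).fetchall()
    conn.execute(KB_DELETE)
    remaining = conn.execute(
        "SELECT object_id FROM object_condition ORDER BY object_id"
    ).fetchall()
    conn.close()
    return duplicates, remaining


duplicates, remaining = demo()
print("duplicate groups:", duplicates)
print("rows left after delete:", remaining)
```

One thing the toy run makes visible: the statement removes every row for an object_id that has a duplicated group, not just the extra copy, which is one more reason to have that restored backup in hand first.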
So the sad truth.
Regarding my post from a few days ago allow me to elaborate.
VMturbo already does what vCOPS (AKA vROPS 6.0) just introduced. Prior to that, it was based on integration with vCO.
Long story short, VMturbo has been doing this for a long time. It’s time to trust something that can both automate capacity and properly balance any workload based on SLA.
I have mad respect for VMware and VMturbo, but they have been at this for so long. What took me months in vROPS takes me seconds or hours in VMturbo. I will say vROPS 6.0 has NEARLY caught up to VMturbo…
When I decided to trust VMturbo to manage my workloads the ROI became real. In my experience I have worked for service providers and enterprises but VMturbo deserves a fair chance with any business. When you automate DAY 1 operations.. that is pretty darn nice.
I won’t get into the nitty gritty but it comes down to patented analytics that recommend actions based on performance data…In other words KPIs mixed with Super Metrics and “Insert VROPS expert here” tag…
May the vForce be with you..
VMturbo or vROPS?
I’d go with something that easily ties ROI into your capacity. vRealize Operations is nice, but as I continue my journey into tooling and reviewing end-to-end solutions there are various challenges and complexities.
In short, the more you understand about your workloads (or LEARN through AI), the more optimization and efficiency you can attain.
The key distinction I find is the development and ease of use that Turbonomic (VMturbo) brings. It drastically simplifies your compute management by eliminating dependencies and ensuring standard processes and features are executed in the most efficient way.
If you want to stay in tune with industry trends and cutting-edge features, I would definitely give Turbonomic a fair shot.
#CISCOUCS #WINNING #BIGDATA #UCSDIRECTOR
So I was doing some testing in vCloud Director 1.5 and noticed my RHEL Linux 5 vApp wasn’t able to enable Virtual CPU Hot add.
I went in and checked my vCenter settings to see what the deal was:
Changing the setting on my vCenter updated it in my vCloud Director..
The alternative to this workaround would be to change the template version within vCloud Director to RHEL version 6.
You will notice the Virtual CPU hot add becomes available to check. I used this method on existing templates and it did not seem to break the templates.
However, if you are trying to create new RHEL 6 templates with a RHEL 5 OS, you may want to make sure your SCSI controller is correct. Again, changing it on my vApps seemed to have no impact on the OS currently installed.
It’s an apparent bug in vCloud Director, and @Lamw was kind enough to help me out.
So this article is more of an FYI than anything. I just wanted to bring some attention to this, as some may really be puzzled by why the hypervisor stinks at performing large copies. @Lamw can verify as well, especially when working with the VM disk files. I think it is important to highlight the distinction: the CP command is for files (although a VM by definition is a set of files per VMware) but not for the VM disk files. I am sure there can be plenty of conspiracy theories about why this is the case, but it has actually been around for a while. If I were one of the age-old VMware guys out there, this would probably not have caught me off guard, because it has been published since VI3 (ESX 3). So obviously, since I have not finished my Back to the Future DeLorean ride yet, I just didn’t know.
During a particular situation I was copying some data from one ESX host to another. This was basically a copy using the Datastore Browser in the vSphere client. I had staged some files on an NFS mount and wanted to copy them over to the SAN datastores. This NFS mount was read-only, so a storage migration would not work because it would require removing the VMDK files from the NFS mount after the copy. I could do some clones, but only so many at a time. What I decided was to pop open the Datastore Browser and do a copy and paste from the NFS to the SAN datastore. It’s also important to understand that the Datastore Browser uses HTTP GET and PUT, not CP. Keep in mind this is over 10Gb Ethernet (NFS), copying to the SAN over a 4Gb FC HBA. It took a while to do the copy, but I didn’t really notice.
After staging all the data to the new SAN datastore, I then had to turn it over to another ESX host that had yet another datastore, separate from the one hosting all the VMDK files. So there again, another copy… This time I noticed how slow it was really going, even from datastore to datastore. I knew that the copy process would more than likely run over the Management Interface, but even that was on a 10Gb Ethernet connection, so that should have been screaming as well. Not the case… So as a last test I decided to try a copy from datastore to datastore with both mounted to the same host. I still averaged around 20-50kbs, which is pretty terrible.
So no matter how I went about it, performance was terrible. I pretty much knew it had to do with the process at this point, although I wasn’t sure why. Across these scenarios I used different methods: SCP applications, the Datastore Browser, and CP in the shell of ESXi.
Trying a Different Approach
So after talking with VMware support and confirming my suspicion that the issue was the process itself (using CP), we went through the very same scenarios I noted above to rule out any other issues. We tested different-protocol datastores, non-shared datastore copies, shared datastore copies, and local datastore-to-datastore copies, all with the same effect, even when copying just a single disk. Of course at this point the support guy was a little stumped and had to get off the line to go talk to someone else. Usually that means they need someone with a fresh set of eyes or more experience to help out, and sure enough he came back with another suggestion: use cloning and storage migrations as a test. I of course didn’t think of this, but when he mentioned it I pretty much had a Homer Simpson “DOH!” moment. I guess by then my head was hurting from trying to figure this stuff out.
When we did the storage migrations and clones it was actually MUCH faster. In fact, after the support call we did some testing: I could do 10 storage migrations in the time it took to copy 1 VM using the CP command. In some cases it was 10+ to one VM copy. Granted, I now had the additional step of adding the VM guest to inventory, but that wasn’t as bad as taking 1 hour to copy 1 virtual machine. Note: the array was not VAAI capable.
What does this mean?
Yeah, so that is the million dollar question, isn’t it? Well, CP has pretty much been deprecated since VI3, but it’s better said as “not to be used for handling virtual disks”. To better understand, read for yourself: http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1000936
In http://www.vmware.com/pdf/esx_3p_scvcons.pdf page 3
NOTE: notice the words “SIGNIFICANT PERFORMANCE IMPROVEMENTS”
So all this to tell you that CP is not a very good solution for doing mass copies or datastore copies. For me this presents a problem when using other tools like VEEAM SCP, PuTTY SCP, etc. So make sure you know what you want to accomplish beforehand, as you don’t want to end up with the headaches I did. I know that some of you may think it was a waste of a VMware case, but any time I can find information like this and share it with others, it is invaluable to me. To add to my findings, I should also mention that vmkfstools also ensures the integrity of the disk and is more suited for these things by design. I think VMware intentionally focused on vmkfstools as the solution because I don’t think CP was ever intended to be used for this, due to its lack of functionality. It may have something to do with licensing as well.
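To make the CP versus vmkfstools point a little more concrete, here is a small sketch that builds a vmkfstools clone command instead of a plain file copy. It assumes the `-i` (clone disk) and `-d` (destination disk format) options of vmkfstools; the datastore paths and VM name are hypothetical, and the helper only builds and prints the command rather than running anything on a host.

```python
import shlex


def vmkfstools_clone_cmd(src_vmdk, dst_vmdk, disk_format=None):
    """Build the argument list for cloning a virtual disk with vmkfstools -i."""
    cmd = ["vmkfstools", "-i", src_vmdk, dst_vmdk]
    if disk_format:
        # Optional destination format, e.g. "thin".
        cmd += ["-d", disk_format]
    return cmd


# Hypothetical paths: a staged disk on an NFS mount, cloned to a SAN datastore.
cmd = vmkfstools_clone_cmd(
    "/vmfs/volumes/nfs_stage/vm01/vm01.vmdk",
    "/vmfs/volumes/san_ds01/vm01/vm01.vmdk",
    "thin",
)
print(shlex.join(cmd))
```

Point it at the descriptor .vmdk rather than the -flat file; as with the storage migrations above, you would still need to add the VM to inventory afterwards.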
One Last Thing:
This was a huge pain at the time of moving some data between the NFS and SAN because I really didn’t have an automated solution for doing the copies. Many of you know that VEEAM FAST SCP, before the new version, did not have 64-bit support. I didn’t have any 32-bit machines and I didn’t want to waste time hacking away. However, I did want to mention that VEEAM released a new version of the product, which is known as VEEAM free backup; you can get that here. I also did some testing and was very impressed with the copying speeds compared to the CP command. Another nice thing is that even if you have no virtual machines registered in vCenter, it still picks them up in the copy process as VMs. Not to mention you can get statistics and automate or schedule copy jobs with the application. For me and with what I do, this is priceless. Simplicity, automation, and reporting, all free! I love it! Thanks to VEEAM for listening to all those out there wanting an improved solution. They did a good job.
NOTE: Thanks again to @Lamw for pointing this out. The Datastore Browser uses HTTP GET/PUT, not CP. I will correct this in the post later.
Alright, this is going to be difficult for me to really explain, so I will do my best to do it justice. First, I am not a coder and I do not know the ins and outs of the API and code. What I will attempt to explain is how you can reproduce this issue on your VCD instance. I also want to note this is vanilla VCD 1.5 with no updates yet. I currently have a case open with VMware and have yet to resolve it.
Let’s get to the nitty gritty.
First off, I want to say that I am not 100% sure whether any other queries produce the same effect. This issue seems to happen only with the VMadmin query.
First I would recommend reading about connecting to the REST API on Will’s blog over at VMware:
Now that you have read that and understand how to connect to the REST API I will show you an example of a basic VMadmin query.
(Note: you need to have over 128 VCD vApps to reproduce this type of issue)
This showed me that I had 333 results returned; however, on the 1st page I only found 128. Now, the way the script talked to the VCD API was rather plain: it was basically doing this query and dumping it to an XML file. The idea was that this was similar to the 1.0 API, where I could get all the data I wanted dumped into an XML file. This wasn’t the case. It seems I couldn’t get around this 128 limit. So I decided to try the next query:
After running it I still got 333 results returned but only 128 on the single page, EVEN after specifying pageSize=999, so this isn’t the end of it… let’s dig deeper. After further research I actually found documented proof that this was a hard setting somewhere.
Page 212 of the VCD 1.5 API Guide taken from here: http://www.vmware.com/pdf/vcd_15_api_guide.pdf
So it became obvious to me at this point that no matter what your query is it would always default to 128 objects per page. So I tried to also do the following to change this hard setting (at the recommendation of someone) located in a global.properties file in the following directory on the vCloud Director cells:
add/change the following: restapi.queryservice.maxPageSize=1024
I added this to the global.properties file and the VCD cell services were also restarted. Can you guess what happened? Nothing… this didn’t change anything at all. In fact, it still remained broken. Folks, this still wasn’t the worst part about it. Let’s cover the part that I believe is a true bug in the API; someone on Twitter also commented that there is a possible bug in the adminVM query.
Let’s say I do a query with pageSize=135 and my query returns 153 results. We get the usual 128 results per page. Here is an example of the commands I used:
Sort ascending gives me an alphabetical sorting of all my vApp names, so I can find a breaking point for my virtual machines (I know my ABC’s and what should be next, so to speak). So I copy and paste the results into Notepad++ and it shows me 128 entries for the page size of 135 (give or take a few other lines returned that are not relevant to the query). The bug as discussed is evident: it doesn’t show the other 7 entries it should be showing. Remember, we set the page size to 135. So now let’s take a peek at page 2.
So after you run this query you will see the remainder of the 153 results. However, if you take notes you will notice that it is in fact completely missing the 7 other entries. So basically your query takes the 7 it could NOT list and dumps them out somewhere in the Cloud… So what does this mean aside from the fact that there is a bug?
You will need to use a looping construct and not specify a page size greater than 128. (see Will’s comments below)
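A looping construct along those lines could be sketched as follows. Treat it as an outline under stated assumptions, not the script from this post: the host name is hypothetical, the `type`, `sortAsc`, `pageSize`, and `page` query parameters are assumed from the query service described above, and `fetch_page` is a placeholder for whatever actually performs the authenticated GET and parses the records.

```python
from math import ceil
from urllib.parse import urlencode

MAX_PAGE_SIZE = 128  # the per-page cap observed in this post


def page_count(total, page_size=MAX_PAGE_SIZE):
    """Number of pages needed to cover `total` records."""
    return ceil(total / page_size)


def page_urls(base_url, query_type, total, page_size=MAX_PAGE_SIZE):
    """Yield one query URL per page, never asking for more than the cap."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError("page_size must be between 1 and 128")
    for page in range(1, page_count(total, page_size) + 1):
        params = {
            "type": query_type,
            "sortAsc": "name",
            "pageSize": page_size,
            "page": page,
        }
        yield f"{base_url}/api/query?{urlencode(params)}"


def fetch_all(fetch_page, total, page_size=MAX_PAGE_SIZE):
    """Collect every record by walking the pages with a caller-supplied fetcher."""
    records = []
    for page in range(1, page_count(total, page_size) + 1):
        records.extend(fetch_page(page, page_size))
    return records


# Hypothetical host; 333 is the total reported earlier in the post.
for url in page_urls("https://vcd.example.com", "adminVM", 333):
    print(url)
```

The total comes back with the first response, so in practice you would request page 1, read the total, and then loop over the remaining pages.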
This is a bug and I don’t think I could make it any clearer. I wish I could’ve provided some screenshots, but I think if someone does their due diligence they will see what I am talking about. If you have 2000 VCD vApps and you use a page size of 500, you would lose 372 results on each page. No matter how you specify the page size or modify global.properties, it’s just broken, plain and simple. If someone would like to provide some screenshots I would be happy to put them up here to show some better detail.
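The arithmetic behind those losses is easy to check. The sketch below models the behavior as described in this post, which is an assumption about how the service acts rather than anything taken from VMware documentation: each page returns at most 128 records, but the next page still starts where the requested page size says it should.

```python
def records_lost(total, page_size, cap=128):
    """Records that never appear when each page is silently capped at `cap`."""
    lost = 0
    for start in range(0, total, page_size):
        expected = min(page_size, total - start)  # what this page should hold
        lost += max(0, expected - cap)            # anything past the cap is dropped
    return lost


print(records_lost(153, 135))    # the 153-result example with pageSize=135
print(records_lost(2000, 500))   # 2000 vApps with pageSize=500
print(records_lost(333, 128))    # staying at or under the cap
```

Under this model the 153-result query loses 7 records, the 2000-vApp case loses 372 on each of its four pages, and a page size of 128 or less loses nothing.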
If you want to discuss in further detail feel free to comment and I will follow up.
UPDATE: After reviewing some things with VMware I found out this is actually a true bug in the vCloud 1.5 API. The good news is that there is a fix slated to be published in August; perhaps they will allow for a private fix if you really need it. Stay tuned. If anyone has some information aside from this, please provide it and I will link it! Thanks again. Also, this is not related to any type of query parameter; it has more to do with how the Query service works.
Well, I guess I am on a roll this week. I feel like a lot of my themes have been around storage and VMware this week. I don’t think that is a bad thing, but I am seeing some gaps out there as far as considerations and recommendations. My only point in this post is to share my thoughts on what you should consider when facing this after your vSphere 5 upgrade or install. I have to wonder just how many enterprises out there have seriously pushed the envelope of LUN sizing in VMware. One has to think: “If you are carving up large LUNs, does that mean you’re scaling up?” There are so many implications one should consider when designing storage. Some of the more critical pieces are I/Ops, the cluster size, and what your target workload is. With bigger LUNs this is something you have to consider, and I do think it is common knowledge for the most part.
There are so many things one should consider when deciding on a LUN size for vSphere 5. I sincerely believe VMware is sometimes putting us all in a situation of scaling up. The limitations of SDRS and Fast Provisioning have really got my mind thinking. It’s going to be hard to justify a design scenario of a 16 node “used to be” cluster when you are trying to make a call on whether you really want to use some of these other features. Again, you have heard me say this before, but I will say it again: it seems more and more that VMware is targeting Small to Medium sized businesses, while offering features that larger companies (with much bigger clusters) now have to invest even more time in, reviewing their current designs and standards. Hey, that could be a good thing 🙂 . Standards to me are a huge factor for any organization. That part seems to take the longest to define and in some cases even longer to get other teams to agree to. I don’t think VMware thought about some of those implications, but I am sure they did their homework and knew just where a lot of this was going to land…
With that being said I will stop my rambling on about these things and get to the heart of the matter or better yet heart of the storage.
So, after performing an upgrade I have been wondering what LUN size would work best. I believe I have some pretty tough storage and a solid platform (CISCO UCS), so we can handle some I/Ops. I wanted to share some numbers with you that I found very, VERY interesting. I have begun to entertain the notion of utilizing Thin Provisioning even further. However, we are all aware that VMware still has an issue with the UNMAP command, which I have pointed out in previous blogs (here). Being put between a rock and a hard place, I believe update 1 to vSphere 5 at least addressed 1/2 of my concern. The other 1/2 it didn’t address is that I now have to defer to a manual process that involves an outage to reclaim that Thin Provisioned space… I guess that is a problem I can live with given the way we use our storage today. It doesn’t cause us too much pain, but it is a pain nonetheless.
Anyways, so here is my homework on LUN sizing and how to get your numbers (Estimates):
(Note: This is completely hypothetical and not related to any specific company or customer; this will also include Thin Provisioning and Thick)
Factor an Average IOps per LUN (if you can from your storage vendor or from vCenter or an ESXi host)
Take the IOps per all production LUNS and divide it by the number of datastores
Total # IOps / # of Datastores
Gather the average numbers of virtual machines per datastore
Total # VM’s / # of Datastores
Try to use Real World production virtual machines
Decide on the LUN size and use your current baseline as a multiplication factor.
So if you want to use 10TB datastores and you are currently using 2TB datastores, you can take whatever numbers you gathered and scale them:
10TB / 2TB = 5 (this is your multiplication factor for IOps and the VM:Datastore ratio)
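The steps above can be sketched in a few lines. This is only an illustration of the averaging and the multiplication factor; the cluster-wide totals (40 datastores, 640 VMs, 48000 IOps) are made-up numbers, and the function names are mine.

```python
def per_datastore_average(total, datastores):
    """Average of a cluster-wide total across all datastores."""
    if datastores <= 0:
        raise ValueError("need at least one datastore")
    return total / datastores


def size_factor(new_lun_tb, current_lun_tb):
    """Multiplication factor when moving from the current LUN size to a new one."""
    return new_lun_tb / current_lun_tb


# Hypothetical inventory: 40 production datastores of 2TB each.
datastores = 40
avg_iops = per_datastore_average(48000, datastores)  # total IOps / # of datastores
avg_vms = per_datastore_average(640, datastores)     # total VMs / # of datastores
factor = size_factor(10, 2)                          # 10TB / 2TB

print(f"baseline: {avg_vms:.0f} VMs and {avg_iops:.0f} IOps per 2TB datastore")
print(f"factor {factor:g}: {avg_vms * factor:.0f} VMs and {avg_iops * factor:.0f} IOps per 10TB datastore")
```

Feed it real-world production totals from your storage vendor tools or vCenter rather than guesses, and rerun it with peak IOps as well as the average.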
So now let’s use an example to put this to practical use… and remember to factor in free space for maintenance; I always keep 10% free.
Let’s say we have a customer with the following numbers before:
16 VM’s per Datastore
1200 I/Ops Average per Datastore (we will have to account for peak too)
2TB Datastore LUNS
Now for the math (let’s say the customer is moving to 10TB LUNs, so this would be a factor of 5):
16 x 5 = 80 VMs per Datastore (Thick Provisioned)
1200 x 5 = 6000 IOps per Datastore…
Not bad at all, but now let’s seriously take a look at thin provisioning, where the numbers are QUITE different. Let’s say we check our storage software and it tells us that on average a 2TB LUN only really uses 500GB of space for the 16 VMs per datastore. Let’s go ahead and factor some room in here (10% for alerting and maintenance purposes this time around). You can also download RVTools to get a glimpse of actual VM usage versus provisioned for some thin numbers.
16 VMs per 500GB, so times 4 for the 2TB LUN; that makes 64 thin VMs per 2TB datastore.
Multiply that by the new LUN size: 9TB / 2TB = 4.5 (10TB minus 10% reserved for alerting and maintenance; this could also be considered conservative)
64 x 4.5 = 288 VMs on average per 10TB datastore (and that 1TB reserved too!)
We aren’t done yet; here come the IOps, and let’s use 1500 IOps. Since we multiplied the VMs by a factor of 4, we want to do the same for the average IOps:
1500 x 4 = 6000 IOps per 2TB LUN using thin provisioning on VMs
6000 x 4.5 = 27000 IOps per 10TB LUN
So this leaves us with the following numbers for thick and thin:
VM to 10TB Datastore ratios:
80 VMs Thick Provisioning
288 VMs Thin Provisioning
IOps to 10TB Datastore ratios:
6000 IOps Thick Provisioning
27000 IOps Thin Provisioning
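For anyone who wants to rerun the thick and thin estimates with their own numbers, here is a hedged sketch of the same method. The function names are mine, and the thin estimate carries both the 4x density factor (2TB provisioned / 500GB used) and the 4.5x size factor (9TB usable / 2TB) through the VM count and the 1500 IOps figure, which is one reading of the steps above rather than the only possible one.

```python
def thick_estimate(vms, iops, current_tb, new_tb):
    """Scale thick-provisioned VM and IOps averages by the LUN size ratio."""
    factor = new_tb / current_tb
    return vms * factor, iops * factor


def thin_estimate(vms, iops, current_tb, new_tb, used_tb, reserve_pct=10):
    """Scale thin-provisioned averages by density and by usable LUN size."""
    density = current_tb / used_tb                  # 2TB / 0.5TB = 4
    usable_tb = new_tb * (100 - reserve_pct) / 100  # 10TB less 10% = 9TB
    factor = usable_tb / current_tb                 # 9TB / 2TB = 4.5
    return vms * density * factor, iops * density * factor


thick_vms, thick_iops = thick_estimate(16, 1200, current_tb=2, new_tb=10)
thin_vms, thin_iops = thin_estimate(16, 1500, current_tb=2, new_tb=10, used_tb=0.5)

print(f"thick: {thick_vms:.0f} VMs, {thick_iops:.0f} IOps per 10TB datastore")
print(f"thin:  {thin_vms:.0f} VMs, {thin_iops:.0f} IOps per 10TB datastore")
```

Swap in your own used-space average and peak IOps; these are estimates, and the array still has to be able to deliver whatever number comes out.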
So, I hope this brings to light some things you will have to think about when choosing a LUN size. Also note that this is probably more of a service provider type of scenario; as we all know, most may use a single 64TB LUN, though I am not sure I would recommend that. It all comes down to use-case and how it can be applied. This also raises the question: what’s the point of some of those other features if you leverage Thin Provisioning? Here are some closing thoughts and things I would recommend:
- Consider Peak loads for your design; the maximum IOps you may be looking for in some cases
- Get an average/max per VM datastore ratio (locate your biggest Thin VM)
- Consider tiered storage and how it could be better utilized
- Administration and Management overhead; essentially the larger the LUN the less over all provisioning time and so on.
- VAAI capable array for those Thin benefits (running that reclaim UNMAP script..)
- Benchmark, Test using some other tools on that bigger LUN to ensure stability at higher IOps
- Lastly the storage array benchmarks and overall design/implementation
- The more VMs you can scale on a LUN, the more it can affect your cluster design; you may not want to enable your customers to scale that much
- Alerting considerations and how you will manage it efficiently to not be counterproductive.
- Consider other things like SDRS (fast provisioning gets ridiculous with Thin Provisioning)
- Storage latency and things like Queues can be a pain point.
I hope this helps some of those out there who have been wondering about some of this stuff. The LUN size for me dramatically affects my cluster design and what I am looking to achieve. You also want to load test your array or at least get some proven specs on it. I currently work with HDS VSP arrays and these things can handle anything you can throw at them. Whatever additional capacity you need, whether it be capacity, IOps, processing, or whatnot, you can easily scale it out or up. Please share your thoughts on this as well. Here are some great references:
Note: these numbers are hypothetical, but it’s all in the numbers.