Category Archives: CiscoUCS

Cisco UCS Journey – When to update firmware

Don’t update if it’s not broken. If it breaks, then update it. If you are seeing false alerts, you may want to update the firmware; I saw this with 1.4j.

The issue is not with the IOM but with the chassis communication bus (the I2C bus). As a result, the IOM is not detected and the backplane ports never come up. If you are seeing alerts related to PSUs and similar chassis components, you may want to pay attention.

I2C is a bus that provides connectivity between different components in the chassis.

The PCA9541 is an I2C part that helps us control access to the shared devices in a chassis: the chassis serial EEPROMs, power supplies, and fan modules.

The 9541 I2C mux has known hardware/hang issues that can cause failures to access hardware components on the chassis. This can result in failures to read fan and PSU sensor data (such as fan speed and temp), triggering faults to be raised for the component (such as fan inoperable).

Some early PCA9541s that were used have a bug that if they are switched back and forth between IOM1 access and IOM2 access too quickly, they will get stuck and not allow any connection to the devices behind them.

Action Required:

Upgrade the firmware to version 1.4(3q) or above.
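If you want to check a list of chassis against that minimum, a small comparison helper can do it. This is only a sketch: it assumes version strings are written in the "1.4(3q)" style used in this post and that patch letters sort alphabetically. The function names are made up for illustration and are not part of any Cisco tool.

```python
import re

# Matches version strings written like "1.4(3q)".
# Assumption: major.minor(maintenance + optional patch letters).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]*)\)$")


def parse_ucs_version(text):
    """Split a version such as '1.4(3q)' into a sortable tuple."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return (int(major), int(minor), int(maintenance), patch)


# The minimum release named in the action item above.
MINIMUM_FIXED = parse_ucs_version("1.4(3q)")


def has_i2c_fixes(version_text):
    """True when the version is 1.4(3q) or later (hypothetical helper)."""
    return parse_ucs_version(version_text) >= MINIMUM_FIXED


if __name__ == "__main__":
    # Illustrative inputs only; substitute the versions you actually run.
    for candidate in ("1.4(1j)", "1.4(3q)", "2.0(1a)"):
        print(candidate, has_i2c_fixes(candidate))
```

Comparing tuples keeps the logic short, but it relies on the alphabetical patch-letter assumption, so sanity check it against the release notes before trusting the result.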

Workaround to follow before the firmware upgrade:

• Reseat all the PSUs in the chassis, one at a time. Wait 10 minutes after inserting each unit so that it can stabilize.

• Reseat all the fan units at the back of the chassis. Wait 3 minutes before moving on to the next one.

• Reseat both IO modules. Wait 20 minutes before moving on to the next one.

• Verify the I2C counters for the chassis.

• (Requires downtime) Power cycle the chassis to reset all counters and clear the issues in the running version.

• (Requires downtime) Upgrade to firmware version 1.4(3q) or above (2.0 release) for a permanent fix.
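To get a feel for how long the reseat steps take, here is a rough sketch that adds up the wait times. The wait times come from the steps above; the component counts (4 PSUs, 8 fan modules, 2 IO modules) assume a fully populated chassis, so adjust them to match yours.

```python
# Wait times in minutes come from the workaround steps above.
# Component counts are an assumption: a fully populated chassis with
# 4 PSUs, 8 fan modules and 2 IO modules.
RESEAT_PLAN = [
    ("PSU", 4, 10),
    ("fan module", 8, 3),
    ("IO module", 2, 20),
]


def total_wait_minutes(plan):
    """Sum the stabilization waits across every component in the plan."""
    return sum(count * wait for _name, count, wait in plan)


def build_schedule(plan):
    """Return (start_minute, description) tuples, one per reseat."""
    schedule = []
    elapsed = 0
    for name, count, wait in plan:
        for index in range(1, count + 1):
            schedule.append(
                (elapsed, f"reseat {name} {index}, then wait {wait} min")
            )
            elapsed += wait
    return schedule


if __name__ == "__main__":
    for start, step in build_schedule(RESEAT_PLAN):
        print(f"T+{start:3d} min: {step}")
    print("total wait:", total_wait_minutes(RESEAT_PLAN), "minutes")
```

With those counts the waits alone come to 104 minutes, before you count the time spent physically pulling and reinserting parts.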

Please follow the link to download the 1.4(3q) bundle:

http://www.cisco.com/cisco/software/release.html?mdfid=283612660&release=1.4%283q%29&relind=AVAILABLE&softwareid=283655658&rellifecycle=&reltype=all

Related issues with the firmware version in use:

Incorrect behavior of I2C bus or CMC software interpreting I2C transactions?

  1. Fans (count 8 or less) and PSUs (count 4 or less) can be reported as inoperable. The state never clears.
  2. Fans are running at 100% rotation rate.
  3. UCSM cannot retrieve detailed part information for the PSUs and fans.
  4. Transient errors indicate a fan is inoperable, then clear within one minute.
  5. LED state does not match the faults reported in UCSM or the actual health of the system.
  6. Thermal errors are incorrectly reported on blades and chassis.

Fixes that are promised for 1.4(3q):

  1. CSCtl74710 I2C bus access improvements for 9541

PCA9541 (NXP I2C bus multiplexer) workaround to improve bus access for parts built prior to mid-2009. The workaround ensures that if the internal clock fails to start, the start is retried. The change is designed for, and works as expected with, both PCA9541 and PCA9541A parts from NXP. Because of the internal clocking bug, PCA9541 parts had a high number of bus_lost events.

  2. CSCtn87821 Minor I2C driver fixes and instrumentation

The new Linux I2C driver has an optimization to handle synchronization between the I2C controller and slave devices. With the older driver, a simple synchronization error could appear as an uncorrectable device error.

  3. CSCtl77244 Transient FAN inoperable transition

During a UCS (CMC) firmware upgrade and the switch to the new master/slave mode, the CMC erroneously takes information from the slave IOM and evaluates fans as inoperable based on stale data.

  4. CSCtl43716 9541 device error. Fan Modules reported inoperable, running 100%

A bug in a software routine where a single bus_lost event followed by a successful retry results in an infinite loop. As a result, fans are reported as inoperable and are not controlled by the CMC.

  5. ??

Removed an artificial cumulative threshold that turned the LED amber upon reaching 1000 bus_lost events. This was implemented as a monitoring mechanism to simplify identification of the PCA9541 devices. It is no longer needed since a proper software workaround is implemented.
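The CSCtl43716 item above describes a retry path that never exits after a bus_lost event. I have no visibility into the CMC code, so the following is only a toy sketch of what a bounded retry looks like in general. Every name here (BusLostError, read_with_retry, make_flaky_reader) is made up for illustration.

```python
class BusLostError(Exception):
    """Hypothetical error standing in for a bus_lost event."""


def read_with_retry(read_once, max_attempts=3):
    """Call read_once until it succeeds or the attempts run out.

    Returns (value, bus_lost_count). The loop is bounded and returns as
    soon as a read succeeds, so one failure followed by a good retry
    cannot spin forever.
    """
    bus_lost = 0
    for _attempt in range(max_attempts):
        try:
            value = read_once()
        except BusLostError:
            bus_lost += 1
            continue
        return value, bus_lost
    raise BusLostError(f"gave up after {max_attempts} attempts")


def make_flaky_reader(failures, value):
    """Build a fake sensor read that fails `failures` times, then succeeds."""
    remaining = [failures]

    def read_once():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise BusLostError("bus_lost")
        return value

    return read_once


if __name__ == "__main__":
    # One bus_lost event followed by a successful retry.
    print(read_with_retry(make_flaky_reader(1, 4200)))
```

The two points that matter are the attempt cap and the early return on success; the fake reader is only there so the sketch can run without hardware.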

Since this email we have started the update to firmware 2.0. I am going to write a separate blog post about that, because it too was pretty intense, and I will include some additional steps that we performed to lessen the impact. One thing is for certain: don’t expect it to be non-impacting….

***Disclaimer: The thoughts and views expressed on VirtualNoob.wordpress.com  and Chad King in no way reflect the views or thoughts of his employer or any other views of a company. These are his personal opinions which are formed on his own. Also, products improve over time and some things maybe out of date. Please feel free to contact us and request an update and we will be happy to assist. Thanks!~

Blah Blah Cloud… part 1

When you look at cloud today in the context of VMware, what is your biggest concern? For some of us it may be networking, for others storage, and for others a broader set of concerns such as availability, scalability, and BU/DR. Since I have been working with vCloud Director day in and day out, I have been asking deeper technical questions centered on the scalability of storage and other components of the overall design. This technology has challenged me in various ways. Prior to vCloud there was vSphere, and implementing and managing vSphere was much less complex. Cloud brings another level of complexity, especially if your initial “design and management” is poor to begin with. Usually you end up spending more time and money going back to address issues related to simple best practices that most architects and engineers should already know. In some cases it’s a disconnect between the design and infrastructure team and the help desk. This may not always be the case, but in my experience it seems to happen more often than not.

I am sure we could all spend plenty of time talking about operations, procedures, protocols, standards, and blah blah…. but this isn’t the point of this blog…. Even though these things are of the highest importance and the more effort that is put into this the better the results you will get and the less cost you will end up spending. Anyways…

So, as I was saying, vCloud has challenged me in several ways. Now not only do I have to consider the design of vSphere, but I also have to look at the design of vCloud Director and how we manage all these different components. Simply adding vCloud Director doesn’t mean that is the end of it all. More complexity comes with the integration of other applications, application availability, and backup and DR. I have been amazed at how many oversights I see due to a lack of expertise in this area. This is no offense to anyone, but really VMware is still in its infancy when running against other markets, though I strongly sense that VMware is going to hold the majority market share for a while.

Crossing the gaps:
Since I have been studying day in and day out, covering VMware best practices and other (non-VMware) companies’ best practices, I continue to see a lot of disconnects in certain areas (vCloud Director). Storage guys often have no idea about running virtualized workloads on arrays, and oftentimes they don’t care to learn about VMware. Usually they already have plenty to do, but this disconnect will affect the implementation on some level. I honestly think that in most cases the architect should be the one researching and ensuring that all the components that make up the cloud computing stack are standardized and implemented correctly; even so, these gaps still cause setbacks. This leads me to the networking side of things. Network engineers, I see, are beginning to come up to speed more quickly on virtualization. The main factor is Cisco UCS and how it appeals to network administrators and engineers, along with FCoE and CNAs. However, the disconnect once again lies in the knowledge transfer: how the virtual platform works and the best practices designed around VMware. I am the first one to say that many don’t really get a choice, especially if a company just threw them into the fire. Right now, for example, we are looking at giving our network team the keys to the kingdom (Cisco UCS), yet they have nearly ZERO understanding of, or training in, how any of it works…. scary, right? We have to cross these gaps, people. We need to make sure that we have people positioned in these areas who can understand and impart that training, or have someone available as a resource.

My Real Concerns:
vCloud Director was something totally new and alien to me when I first stepped into the cloud. I had to learn, and quickly. Given my background, I go straight to the manuals, read the blogs, get plugged into good sources, learn even more, read books, and start auditing. I start looking at designs that may be questionable and asking “Is it ignorance?” or “What the … was he thinking?”, and quickly find that usually it was the former: simply ignorance. No one is really to blame, because YES, it is a NEW technology. BUT how much more critical is it, then, that we research and ensure that we are implementing a “rock” solid design before rolling it out? Yes, I know deadlines are deadlines, but it is what it is either way. You either spend a lot more money in the long run or spend a little bit more to get it right the first time. We are now having to go back and perform a second phase, and for the past couple of months we have been remediating a lot of different things that could have been done right had a simple template been designed correctly. We now spend countless additional hours updating and working more issues because of this one simple thing. This isn’t even getting into the storage and other concerns I have.

Cloud and What’s Scary?:
Yeah, I know, scary, right? I don’t know about yours, but some of the clouds I have seen are. Here is what scares the heck out of me. ABC customer decides to deploy a truckload of Oracle, MSSQL, IIS, WebLogic, and other virtual machines, all on the fly. Next thing we know, we see some latency on the storage back end and some impact to performance. Come to find out, a bunch of cloning operations are kicking off, I/O is spiking, the VMs are generating many different kinds of I/O, and in a matter of about 12 hours we are having some major issues. This is a matter of “scalability”, or sometimes “elasticity”, whatever you want to call it. Some catalogs host every kind of application, and the majority of the apps are tier 1 virtualized workloads. This isn’t the little stuff most corporations virtualize. They usually put tier 1 workloads off for later, because the need for a high-performance server and old traditional thinking still tell them not to do it (playing it safe). Scaling a cloud to accommodate tier 1 workloads is something I think we are going to be seeing a lot more of. In fact, most vendors provide documentation for implementing solutions on VMware vCloud Director, but they almost NEVER cover the application workloads. I am speaking of storage, networking, and server hardware vendors. This is probably because, given the mixed nature of what you can have in an environment, you should do THOROUGH testing to ensure that you can scale out and run an optimal amount of workloads… some would call it vBlock..

Anyways I didn’t mean to write a blog this long but I have just had a lot on my mind lately and I will continue to write more as I continue my VMware Cloud journey.

Cheers,


CISCO UCS – Benefits of VMXNET3 Driver

Well so I have been at it again.  Attempting to learn enough stuff about CISCO UCS to better understand what it can do.  I already know there is a lot of potential and that we probably don’t utilize it to its capacity.

The other day a colleague and I were talking about slowness in general in cloud environments, and he mentioned that we could improve performance by moving all the VMs from the E1000 to the VMXNET3 adapter. Now, I am fully aware of the benefits and features of VMXNET3, but I have to say I was very reluctant to buy into the idea that EVERY VM now gets a 10Gb link. That terrified me at first. What if a VM all of a sudden decided to GO NUTS and completely saturate the link? That would impact other VMs, would it not? Yes, that could happen on a “RARE” occasion, but you obviously have to understand your design and how Cisco UCS works.

Now on to the other observations and misconceptions I had about VMXNET3. From what I have researched and gathered, most articles point to an increase in overall performance. Others reported that host-to-host communication improved by even more than the percentages seen in outbound traffic. One blog post reported a nearly 300 percent increase, which is very impressive. So now I can confidently say that if you are using Cisco UCS you should definitely consider using the VMXNET3 driver. (NOTE: You cannot use FT with VMXNET3)

So how exactly does all this tie into my CISCO UCS post?
In short it’s this link here.

Excerpt:
“The revolutionary Cisco® UCS M81KR Virtual Interface Card (VIC) helps increase application performance and consolidation ratios, with 38 percent greater network throughput, complementing the latest increases in Cisco Unified Computing System CPU performance and memory capacity. The virtual interface card and the Cisco Unified Computing System together set a new standard for balanced performance and efficiency.”

Now, the VIC seems pretty cool, but what I found a little disappointing is that most companies will only really use something like this for a particular “use case”. It’s also curious that they don’t get into other things, like upstream traffic and how it would affect host-to-host communication. The other disappointing factor was that they tested this using RHEL, which I can understand, but it wasn’t really a real-world test. All they wanted to prove was that by offloading network traffic to UCS you get better performance. This doesn’t mean I wouldn’t still want to know what it is capable of. Even so, they showed how much further traffic improved with the interface card and VMXNET3 together.

Now Down to the nitty gritty:

1)      Limitation on total overall Network Interfaces for VM’s

a)      1/2 height can only have 1 VIC = 128 Virtual Interfaces

b)      Full Height can only have a maximum of 2 VICs = 128-256 Virtual Interfaces

2)      Doesn’t really benchmark Windows, which does matter in the scheme of things considering MOST environments RUN Windows.

3)      Doesn’t really go into detail on how you would bind these NICs between UCS and the vSphere hypervisor, beyond allocating a MAC in UCS and then using VMDirectPath for the NIC. (This is probably simpler than I think.)

4)      They don’t cover host to host, but they do cover chassis to chassis, and it is great to see that kind of performance. But come on, show us host to host!!!

5)      Scenario 3 isn’t really clear on the VM Ethernet interface used. It says “Default enic”, so my guess is they couldn’t use anything else but a VMXNET3; not sure why it says that.

6)      Statistics for how CPU performance was affected per scenario

7)      Does this mean there is no need for Nexus 1000V switching, since you can use the “VIC” to set up your interface within UCS itself? (This would be my biggest reason: hand off to Net Eng = WIN!)

8)      Lastly, VMware vCloud Director uses templates and is automated. How could you creatively design this to work with an automated cloud solution? (I mean, heck, I would love the performance. The only thing I can think of is a vCO plug-in for UCS tied into the vCO/vCD plug-in, maybe? This is why I say “USE CASE”.)
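Going back to point 1, the interface ceiling per blade is simple arithmetic. Here is a quick sketch that takes the 128-per-VIC figure from the list above at face value (I have not verified it against the Cisco documentation). The function and constant names are made up for illustration.

```python
# Figures taken from point 1 above; treat them as assumptions.
INTERFACES_PER_VIC = 128
MAX_VICS = {"half": 1, "full": 2}  # half-height vs full-height blade


def max_virtual_interfaces(form_factor, vics_installed):
    """Upper bound on virtual interfaces for one blade (hypothetical helper)."""
    limit = MAX_VICS[form_factor]
    if not 1 <= vics_installed <= limit:
        raise ValueError(
            f"{form_factor}-height blade takes 1 to {limit} VIC(s), "
            f"got {vics_installed}"
        )
    return vics_installed * INTERFACES_PER_VIC


if __name__ == "__main__":
    print("half-height, 1 VIC:", max_virtual_interfaces("half", 1))
    print("full-height, 2 VICs:", max_virtual_interfaces("full", 2))
```

Keep in mind this is a ceiling on paper; how many interfaces you can actually use depends on the rest of the design.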

Obviously this is a lot of information, but I would honestly like to test this in my own environment and see how well it performs.  Our cloud platform offers everything from WebLogic and Oracle to SQL and more. Anyway, let me know your thoughts; any other information would be greatly appreciated! Yes, I know I am a Noob 🙂 .

