Cisco UCS Journey – When to update firmware

Don’t update if its not broke. If it breaks then update it. If you have issues with false alerts you may want to update firmware. I saw this with 1.4j.

The issue is not with the IOM but with the chassis communication bus(i2c bus) and hence the IOM is not getting detected and backplane ports never come up. If you seeing alerts related to PSU and those types of things then you may want to pay attention.

I2C is a bus that provides connectivity between different components in the chassis.

The PCA9541 is an I2C part that helps us control access to the shared devices in a chassis; the chassis serial eeproms, power supplies, and fan modules.

The 9541 I2C mux has known hardware/hang issues that can cause failures to access hardware components on the chassis. This can result in failures to read fan and PSU sensor data (such as fan speed and temp), triggering faults to be raised for the component (such as fan inoperable).

Some early PCA9541s that were used have a bug that if they are switched back and forth between IOM1 access and IOM2 access too quickly, they will get stuck and not allow any connection to the devices behind them.

Action Required:

Required to upgrade firmware version to 1.4(3q) or above.

Workaround to be followed before going for firmware upgrade:

• Reseat all the PSUs one by one in the chassis. Wait for 10min after inserting one unit ,so that it could stabilize.

• Reseat all the Fan Units on the backside of the chassis. Wait for 3min before going for the next one.

• Reseat both the IO modules. Wait for 20min before going for the next one.

• Verifying the i2c counter for the chassis.

• (Requires Down Time)Power cycle to reset all counters to fix issues in the running version.

• (Requires Down Time)Upgrading to firmware version 1.4(3q) or above (2.0 release) for a permanent fix.

Please follow the link to download the 1.4(3q) bundle:

http://www.cisco.com/cisco/software/release.html?mdfid=283612660&release=1.4%283q%29&relind=AVAILABLE&softwareid=283655658&rellifecycle=&reltype=all

Related Issue with firmware version used:

Incorrect behavior of I2C bus or CMC software interpreting I2C transactions?

  1. Fans (count 8 or less), PSU (count 4 or less) can be reported as inoperable. State never cleared.
  2. Fans are running at 100% rotation rate.
  3. UCSM cannot retrieve the PSU/Fan part detailed information
  4. Transient errors indicating Fan inoperable, cleared in one minute time interval.
  5. LED state does not match faults reported in UCSM and actual health of the system.
  6. Incorrectly reported thermal errors on blades and chassis .

Fixes that are promised for 1.4(3q):

  1. CSCtl74710 I2C bus access improvements for 9541

PCA9541 (NXP I2C bus multiplexor) workaround to improve bus access for parts built prior of mid 2009. The workaround assures that if internal clock fails to #:start it gets retried. The change designed and works as expected for both PCA9541 and PCA9541A parts from NXP. PCA9541 parts due to the internal clocking bug #:had a high number of bus_lost events.

  1. CSCtn87821 Minor I2C driver fixes and instrumentation

New Linux I2C driver has optimization to handle I2C controller and slave devices synchronization. With older driver simple synchronization error could appear as uncorrectable device errors.

  1. CSCtl77244 Transient FAN inoperable transition

During UCS (CMC) firmware upgrade and switching to new master/slave mode CMC erroneously takes information from the slave IOM and evaluates fans as inoperable based on stale data.

  1. CSCtl43716 9541 device error. Fan Modules reported inoperable, running 100%

Software code routine bug where single bus_lost event followed by successful retry will result in an infinite loop. As result Fans are reported as inoperable and are not controlled by CMC.

  1. ??

Removed an artificial cumulative threshold to enable amber color LED upon reaching 1000 bus_lost events. This was implemented as a monitoring mechanism to simplify identification of the PCA9541 devices. This is no longer needed since a proper software workaround is implemented.

Since this email we have started the update to firmware 2.0. This is a separate blog I am going to write because that too was pretty intense. I will provide some additional steps that we performed to lessen the impact. One thing is for certain don’t expect it to NOT be impacting….

***Disclaimer: The thoughts and views expressed on VirtualNoob.wordpress.com  and Chad King in no way reflect the views or thoughts of his employer or any other views of a company. These are his personal opinions which are formed on his own. Also, products improve over time and some things maybe out of date. Please feel free to contact us and request an update and we will be happy to assist. Thanks!~

Advertisements

About Cwjking

I am an IT professional working in the industry for over 10 years. Starting in Microsoft Administration and Solutions I was also a free lance consultant for small businesses. Since I first saw virtualization I have always been fascinated by the concept. I currently specialize in VMware technology. I consult daily on many different types of VMware Solutions. My current role is hands on administration, technical design, and consulting.

Posted on January 24, 2012, in CiscoUCS, Troubleshooting, vSphere 4 and tagged , , , , , . Bookmark the permalink. 2 Comments.

  1. Did you implement the workaround(s) prior to upgrading the firmware or did you just upgrade the firmware? I had a similar issue with my UCS – a bad power supply was not detected and that caused both IOM’s (and thus the chassis) to reboot. I’m going to upgrade to 1.4(3u) – but would rather not have to reseat the power supplies and IOMs.

    • Yeah Jeff, Thanks for the reply first off.

      From Cisco Support
      1. Reseat The parts (being that which was falsely alarming)
      2. Flash Firmware

      What really sucked was that after the firmware update which I didn’t state in the blog to mezzanines had to be replaced because the 230 M1 showed the interfaces as blank – no mezzanine at all. We ended up having to replace this part additionally on 3 UCS blades. After moving the SP’s to another blade in a seperate chassis I had the pleasure of renaming VMnics (YAY!) which we all know on ESXi is never fun. We decided to just go with 2.0 though Cisco recommended 1.4(q) which I am not so sure is the most stable but any updates beyond that addresses these issues. As always though Cisco support was good.. the update process. wow another blog perhaps..

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: