Over this past weekend I completed a large unified communications suite upgrade for a hospital network in Tennessee. The process took over 100 hours of planning and configuration, had two failed attempts at the migration task, and culminated in around 30 hours of work during the maintenance window last Saturday and Sunday. After what feels like a huge mountain of work that should have been more of a molehill, these are my thoughts about what went right, what went wrong, and what I'd do differently.
The Scope #
This project kicked off around a year ago. The plan was originally to migrate to the 12.5 line of products, but another project to move this hospital network's call center from UCCX to PCCE, Cisco's Packaged Contact Center Enterprise, delayed the UC project. Once the PCCE project finished, we decided to switch the version upgrades from 12.5 to 14.
I say migrate specifically to mean not just upgrading the version, but also moving the virtual machines from aging physical hardware on ESXi 5.5 to new UCS servers running ESXi 7. The high-level tasks looked like this:
- Install and configure seven Cisco UCS C series servers across five different locations.
- Upgrade firmware and install ESXi 7 on each server.
- Deploy and configure Cisco Prime Collaboration Deployment (PCD) to automate the migration of the CUCM cluster.
- Upgrade IOS XE on several CUBEs and analog gateways.
- Use PCD to deploy two new Emergency Responder servers.
- Use PCD to deploy two new Unity Connection servers.
- Migrate an Expressway Core server from old ESXi host to new host and upgrade.
- Upgrade an Expressway Edge server and leave on current host.
- Migrate an InformaCast Basic paging server from old ESXi host to new host and upgrade.
- Use PCD to migrate ten CUCM nodes and two IM&P nodes from old ESXi hosts to new hosts and upgrade.
- Use COBRAS to migrate configuration and voicemails from old UCXN servers to new servers.
- Shut down the old UCXN, Expressway-C, and InformaCast VMs and change the temporary IP addresses on the new hosts to the IPs previously occupied by the old VMs.
Sounds like a lot, huh? In theory, it shouldn't have been too bad. Most of the work could be done during normal business hours since the new environment wasn't in production yet. However, due to several failures related to the PCD server, the off-hours maintenance windows ended up being extremely long.
The Prep #
Leading up to the first attempt at the migration, I had to get all of the prep work out of the way.
Their on-site resources racked and cabled the servers and got them on the network for me.
I built out the host machines, staged the VMs, uploaded the installation media, scripted the PCD process for the CUCM migration, and put all of the other pieces in place.
I ran the ciscocm.preUpgradeCheck-00041.cop.sha512 COP file against all nodes in the CUCM cluster and each passed without issue.
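For anyone who hasn't run that particular check before, it's just a COP file install; you can push it from the OS Administration page or from the platform CLI. A rough sketch of the CLI route, with the SFTP details as placeholders:

```
# On each node in the cluster, from the platform CLI (SSH as the OS admin):
utils system upgrade initiate
#  -> choose "Remote Filesystem", point it at the SFTP directory holding
#     ciscocm.preUpgradeCheck-00041.cop.sha512, and confirm the install.
#     The COP runs its checks and reports a pass/fail summary for the node.
```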
I opened a standby case with Cisco TAC to have assistance at the ready in case of big problems.
The plan was to begin on a Saturday around 9AM by kicking off the PCD task to export the data from the existing CUCMs and running the COBRAS export on the old UCXN servers.
I inserted a pause step in the PCD migration task after the data export because we couldn't start taking servers offline and bringing up new ones until around 7PM.
That would give the export step plenty of time to run and leave me the afternoon to do my own thing while keeping an eye on it to make sure nothing went wrong.
Everything was in place and ready to go. We had our final meeting the Friday beforehand, put a change freeze in place, and then...
The Problems Started #
I got started on Saturday morning and everything went as planned at first. I kicked off the PCD migration task for the CUCM migration and the COBRAS data export for UCXN. While those were doing their thing, I built a fresh InformaCast Basic Paging server on the new host, restored it from a backup of the old server, and upgraded it to the latest version. After I wrapped up InformaCast, I went to check on the PCD data export task and found it had failed on the publisher node. What's worse, the publisher's web pages were now only partially loading and CLI access (both via SSH and from the VMware console) was completely broken. I guess it's time to call Cisco TAC.
Fast forward several hours of attempting to gather logs, restart the node, run recovery disks, etc., and we're now at the point where TAC recommends we rebuild the publisher and restore from a backup. We weren't able to collect much information from the CUCM since the CLI was completely inaccessible, and TAC didn't have any cases on file with a similar problem to reference. It was getting rather late into the evening at this point, but I decided to go ahead with the rebuild. Rebuilding the publisher took around 2.5 hours for the install and another hour to find the backup and get it restored, because a cron job had moved the files off the SFTP server CUCM sends them to. By the time I finally had the publisher back in working condition and dbreplication looked good, it was around 5AM and we were out of time.
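For anyone curious, "dbreplication looked good" here just means the standard check from the publisher's CLI came back clean:

```
admin:utils dbreplication runtimestate
# Every node in the output should show a replication setup state of
# (2) Setup Completed before you trust the cluster again; a 3 or 4
# means replication needs to be repaired or reset.
```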
We did have some success; the UCXN imports went well and InformaCast was built. However, we couldn't use the new versions because they weren't compatible with the old CUCM 11.5 cluster we were still running. They would have to sit and wait until the next migration attempt, and I would have to re-run the COBRAS import to pick up any voicemails left in the meantime. The next morning, we met to discuss what had happened the night before, the plan for the next attempt, and to pick a date. We agreed on another Saturday a few weeks later and waited to give it another shot.
The Second Attempt #
I started around 7 or 8 in the morning this time. I kicked off the PCD migration task and the COBRAS export and waited, this time paying closer attention to the migration to catch any potential errors as they happened. After a bit of time, the publisher exports all of its data and we see the lovely Success message next to that task. Now I think I'm in the clear and start to relax a bit, but that was soon to change. The next task, exporting data from the IM&P publisher, succeeds, and so does the export from the first CUCM subscriber. Then the second subscriber fails its export task. A quick check showing the web page only partially loading and the CLI inaccessible proves we've run into the same problem on a different node. Back on the phone with TAC.
We spent the afternoon trying to grab logs from the CUCM publisher and the PCD server. Nothing pointed to any obvious reason why this was happening, so we decided we needed to rebuild the subscriber. Once I had that server rebuilt and restored from backup, the plan was to run the CUCM recovery software against all remaining nodes in the cluster. We ran the checks against all of the subscribers, and a few reported that issues had been corrected. I rebuilt the PCD task from scratch and kicked it off again.
Would it surprise you, dear reader, if I told you that the publisher passed, the IM&P publisher passed, the next TWO subscribers also passed, and then the third one in line failed? I will tell you that I can be a bit of a pessimist, and it did not surprise me at all. The TAC engineer wanted to start the usual log pulling and test running; I was having none of that. I went straight to rebuilding yet another server, and around three hours later the cluster was back in working order. At this point it was again around 2AM, and I didn't see us getting this done that night either.
During all of the waiting this time, COBRAS finished its export and import tasks successfully again, though that wouldn't matter for the same reason as last time. We also got the Expressway-C VM moved to the new ESXi host after a little bit of difficulty. The original plan was to have the customer's server team tie the old and new ESXi hosts into their vCenter environment to vMotion the Expressway-C between them. But, as it turns out, their vCenter couldn't support ESXi 5.5 and 7 at the same time as the versions are too far apart. Instead, with the help of this guide, we transferred the files manually.
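I won't reproduce that guide here, but for anyone hitting the same version gap, one way to move a powered-off VM between standalone hosts is VMware's ovftool; the hostnames, datastore, and port group below are placeholders rather than the customer's actual values.

```
# Export the powered-off Expressway-C from the old 5.5 host to a local OVA...
ovftool "vi://root@old-esxi-host/Expressway-C" expressway-c.ova

# ...then deploy it onto the new 7.0 host.
ovftool --name="Expressway-C" --datastore=datastore1 --network="VM Network" \
  expressway-c.ova "vi://root@new-esxi-host/"
```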
The Third (and Final) Attempt #
The plan this time was a bit different. The TAC engineer thought it might help to rebuild PCD to rule out an issue with that server. During the week leading up to this attempt, I went ahead and rebuilt it, and since an updated version was available, I moved from version 14 to version 15 at the same time. Surely, if there's a problem with the PCD server, a brand new one on a different version will resolve it.
On Saturday morning, I get up early and kick off the PCD migration task and COBRAS export with some hope that PCD will work this time, but plan for a long day. A few hours into the task, several subscribers down the list, the damned thing fails AGAIN, this time on yet another server. I got in touch with my primary customer contact and told him what was going on. I gave him a few options, in my order of preference, on how to proceed, and luckily he agreed to get everyone on board with my preferred plan: rebuild this subscriber and the remaining six on the list that hadn't already been rebuilt.
Up to now, I'd been using PCD to automate the rebuilds. I get the whole thing scripted out, defining all of the networking, username and password combos, certificate information, build order, etc., and get ready to kick off the task. When I get to the step to define the media to install the OS from, the server doesn't show any valid .iso images. I do a quick check in the fresh_install directory and see the file sitting right there. I had skipped an important step... I hadn't checked the release notes for PCD 15 during my rebuild. As it turns out, support for installing version 11 was dropped in this new version. Looks like I'll be rebuilding these CUCM nodes manually, and they have to be staggered two at a time to boot; this is a network of hospitals, after all.
Fast forward, it's now around 10PM and I've finally got ALL CUCM nodes rebuilt and restored from their backups. We've basically got a brand new cluster of 11.5 servers now. I'm pretty confident this time that at least the export stage of the PCD migration is going to work, so we kick it off again and hope for the best.
It. Worked.
We passed the last export step and moved on to the build step, where PCD should shut down the 11.5 publisher and start the install of the 14 publisher with the same IP address. The requirement for the CUCM cluster was for all nodes to keep the same IPs and hostnames as the old servers, apart from one, which was moving from their corporate site to a newer hospital that needed a local node but didn't currently have one. PCD supports two migration types. From the PCD administration guide:
- Using the source node settings for all destination nodes option is called a simple migration.
- Entering new network settings for one or more destination nodes is called a network migration.
So as you can see, there should be no problem changing the IP address of a single node. However, back before the first attempt, I had noticed that when adding a simple migration task, the steps in the process looked something like this:
- Export data from all nodes
- Shutdown 11.5 node X
- Start installation task for node X on version 14
- Repeat the last two steps for all remaining nodes
The network migration tasks look more like this, regardless of whether the node is keeping the same IP address or not:
- Export data from all nodes
- Start installation task for node X on version 14
- Repeat installation task for all remaining nodes
- Shutdown 11.5 node X
- Repeat shutdown task for all remaining nodes
Can you spot the issue here? If the nodes that are keeping the same IP information don't have their 11.5 counterparts shut down first, there's going to be an IP address conflict. I actually noticed this during the prep for the first migration attempt and opened a case with Cisco TAC about it. The engineer I was working with on that case referenced other TAC cases as documentation:
For the new network migration, PCD will not shut down the nodes as no conflict will occur. For same source migration PCD should shutdown the nodes before installing the destination node to avoid any conflicts.
I'm guessing you can see where this is going by now.
The publisher began its install step as expected, but despite the PCD task being told it would be reusing the same IP information, the old node was not shut down and the install failed due to an IP conflict.
Fortunately, this was fairly simple to fix: SSH into the 11.5 server, give it the old utils system shutdown, then reboot the node running the installation, and PCD took over to finish the install from there.
But this meant I needed to remain at my computer for the duration of the installation process for all nodes, and as each new group of servers began their installs, I had to log in and shut down the old ones.
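In practice, the babysitting looked roughly like this for each group (the account and hostnames are placeholders):

```
# As a new group of nodes reaches its install step in PCD:
ssh admin@old-sub-x          # the 11.5 node about to be replaced
utils system shutdown        # free up the IP address it's holding

# Then reset the new VM that tripped over the IP conflict from vSphere
# so its installer retries; PCD picks the task back up from there.
```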
While I was waiting for all of the servers to finish their installs, I completed the IP changes on the UCXN, InformaCast, and Expressway-C servers, shut down the old nodes, and upgraded the Expressway-C & E. Around 9AM, 25 hours from the start of the migration, we were finally done and cursory testing was complete. The two customer engineers and I agreed to go get a few hours of sleep and meet again in the evening to wrap up things like licensing and certificates, which we did. The following day (Monday, if you're keeping score), we monitored the system, took care of some minor issues that came in via their ticketing system, and all seemed well.
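For what it's worth, the IP change on the Unity Connection side is a single command from the platform CLI; the addresses below are examples only, and the node reboots itself as part of the change. InformaCast and Expressway handle readdressing through their own admin interfaces.

```
# On each new UCXN node, swap the temporary address for the old server's IP:
set network ip eth0 10.10.20.21 255.255.255.0 10.10.20.1
```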
Final Thoughts #
I think the biggest issue with PCD is how opaque the utility is. It's very much worth using, especially with large clusters, but you just don't get the same detail of logging out of it as you would running your tasks manually through the CLI of each server. Yes, you can open the console of your VM and watch the install process, and the servers keep their own logs, but say your server gets exploded and the CLI is inaccessible from that point on. You have no way to pull information after the fact, whereas if you had the live output from the process, like you do with a CLI install, you'd have it in the scrollback of your console to reference. You don't even get anything equivalent to a progress bar to know whether a process has hung, just scheduled, in progress, and success messages for each task. I also really take issue with the fact that the network migration task didn't shut down servers to avoid IP conflicts like it clearly should. This would have worked with a simple migration, but because of the single node we had changing, it malfunctioned and defeated the automated installation, which is kind of the whole point.
The process I used for all of the other servers went great. I really couldn't have asked for any of the stuff outside of CUCM & PCD to go any smoother. I learned that sometimes, for the sake of time, it may be better to cut your losses and take the last-resort measure (rebuilding, in this case) sooner rather than troubleshooting in circles. We might have been able to get this done in two tries, or even a single long weekend, if I'd made the decision to rebuild everything more aggressively instead of waiting for TAC to come up with an answer I might like better. I never did get a definitive answer from TAC on what the underlying issue was. The most detailed answer I got was:
In the logs [collected from PCD] we can see below error:
Line 84271: org.apache.axis2.AxisFault: Transport error: 404 Error: Not Found
Line 84290: 2023-11-12 16:13:07,288 WARN [DefaultQuartzScheduler_Worker-15] cisco.ucmap - SoapClientBase - Failure message = Transport error: 404 Error: Not Found
Suggested user to customer rebuilt PCD then restore. Based on the case notes and log analysis, I suspect the recovery disk and rebuild were suggested as database corruption was found on the dbIntegrity log, similar to symptoms seen on bug CSCvy68211, in which server rebuild is required to restore filesystem health, database integrity and proper feature functionality such as upgrade or migration.
Next time I have a migration project on my hands, I'm going to give serious thought to using the Fresh Install with Data Import process offered by v14+. For now, I'm just happy that I don't have to worry about spending another entire weekend fighting with PCD anytime in the near future.
If you've made it to the end of this whole saga, thank you for reading.
Have you had a similar experience with a maintenance window that went over like a lead balloon? Do you have any advice on any of the processes I followed? Shoot me an email using the button at the bottom of this post; I would love to hear from you.