Saturday, 16 June 2012

Updated core service status page

After an extended outage the majority of University IT services are now up and running. All core services are now running with the exception of Iceberg, which we will be testing extensively before restoring to service on Sunday.

If you are experiencing problems with web-based services you may need to refresh your browser.

In the end we didn't employ a workaround. Instead, we initiated a long-winded and complex process to impose the correct configuration upon the problematic servers.

Core Service                               Details
CiCS Service Status page                   available
Google apps                                available
MOLE 2                                     available
Student printing                           available
Novell filestore                           available
Web CMS                                    available
Facility Timetabler                        available
CRIS (passwords and account management)    available
Java Apps                                  available
Library Star Plus                          available

Final Checks Before Service Switch On

As you may have gathered, the IT infrastructure upgrade did not go as well as we had anticipated.
  • The hardware and software upgrades on the filers went well.
  • The moving of filers, servers and HPC cabinets went well.
  • Tests revealed that everything was working OK.
  • However, when powered up, the Solaris-based virtual servers would not talk to the new disc controllers - somewhere in the configuration was an instruction to find the old disc controllers.
Much work concentrated on how we could fix this in collaboration with our consultant. However, we have been unable to isolate and correct the rogue setting.

Work then began on planning workarounds, knowing that each workaround is a compromise that carries risks.

We have now identified and tested a safe workaround. It will bring back services securely, but it is time-consuming to implement. It will also take a lot of work to undo and repair once the consultants have identified and provided a complete fix for the original fault, but this is a problem for another day. Our priority is to restore University services as soon as possible today.

So rigorous testing is taking place before we get the all clear to implement. Once the workaround is implemented, services will start to come back of their own accord. We will manually bring back priority services such as MUSE and those services needed for exam processing.

Problem Encountered During Upgrade

CiCS is disappointed to report that we have hit upon a problem during the IT infrastructure upgrade and will not be in a position to restore most IT services by 9am as planned.

CiCS staff have worked through the night and have completed the hardware and software upgrade. However, testing has revealed a problem and we have not been able to restore the virtual servers used to host and run our services. We are continuing to work to understand and resolve this problem and will publish regular updates. 

However, it is inevitable that University IT services will not be returned to service as early as was planned. CiCS apologises for this extended interruption to service.

Moving Towards Restoring Services

Well, it's taken longer than expected, but it's been a very successful night so far. We couldn't be completely sure that the filers would move. During phase 1 we 'nudged' the filers, which means we tested to see whether these half-tonne cabinets would move. This test was successful and as a result we succeeded in turning round all filers, all servers and all Iceberg cabinets.

Our objective in doing this was to concentrate all server air vents into the same aisle so we could concentrate our air conditioning coolers into a small area for more efficient and reliable cooling. As soon as we began the preliminary powering on of servers the difference was immediately apparent. The Unix team commented 'Wow, you can already feel the new cold aisle area, I'm impressed at the change so far.'

The main driver for the upgrade, replacing the NetApp filer heads with newer, faster, resilient controllers, has been completed in both the Computing Centre and Brunswick. The extensive recabling is complete in Brunswick and the testing is ongoing.

But already we can sense that the upgrade is nearly complete and we are ready to begin restoring services, which is a whole new chapter...

Friday, 15 June 2012

Into Phase 2 of Upgrade

We're now into phase 2 of the upgrade. All services were shut down during the preparations, then in phase 1 we removed the old redundant discs in the Computing Centre and upgraded the filer software in Brunswick.

Now in phase 2 we have already spun round all the cabinets, each weighing around half a tonne! This allows all our servers to vent their hot air into the same aisle, so we can concentrate our cooling into a small area, drastically reducing power costs.

We are currently recabling the Computing Centre filers (minus the old discs) and upgrading the filer software.

After a pizza break we will then mount the new filer controller heads, which is the central theme of this upgrade work.

Phase 1 of Upgrade Well Underway

The preparation phase of the upgrade, in which CiCS services were methodically and cleanly shut down, and all data copied to the backup filer, was completed around 8:15pm.

We were then able to start the actual upgrade phase 1 in earnest. Phase 1 has two parallel strands: one taking place in the Computing Centre machine room, the other in the Brunswick data centre.

The elements of phase 1 are:

Phase 1 - Start Upgrade 
Computing Centre 
  • Finish work to remove old shelves of discs, then test that the old config works with shelves removed
  • Change mains supply
  • Prepare for software upgrade
  • Upgrade NetApp filer software and test
  • Carefully test to see if the NetApp filers can be moved

At around 8:40 we completed the NetApp filer upgrade in Brunswick. Efforts are currently concentrated on being in a position to move the shelves of discs from the filers in the Computing Centre.

The full maintenance schedule can be seen in this blog post:
Maintenance schedule

Eduroam Not Working

Due to unforeseen dependencies, the eduroam wireless network is not working properly.

Our systems are many, varied and complex, and where possible we try to minimise the dependencies between them so that if one service fails it doesn't bring down other services with it. However, this upgrade work has revealed that eduroam was dependent on the VMware virtual servers in a way that we didn't appreciate. As a result, once the filers were shut down, eduroam stopped working properly.

We apologise to anyone who was depending on eduroam to work tonight. Now that we understand this dependency, we will work to eliminate it and thus simplify our systems after the upgrade.
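
For anyone curious how an indirect dependency like this can be spotted ahead of a shutdown, here is a minimal sketch in Python. It assumes a hand-maintained map of what each service needs directly. The map, the service names and the helper functions are all hypothetical illustrations, not our real configuration or tooling.

```python
from collections import deque

# Hypothetical dependency map: each entry lists what a service needs directly.
# Names and edges are illustrative only, not the real CiCS configuration.
DEPENDS_ON = {
    "eduroam": ["vmware-virtual-servers"],
    "vmware-virtual-servers": ["filers"],
    "filers": [],
    "standalone-service": [],
}


def transitive_dependencies(service, graph):
    """Return everything the service relies on, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited, which also guards against cycles
        seen.add(dep)
        queue.extend(graph.get(dep, []))
    return seen


def affected_by(component, graph):
    """Return the services that would be hit if the component is shut down."""
    return {
        service
        for service in graph
        if component in transitive_dependencies(service, graph)
    }


if __name__ == "__main__":
    # Which services are hit when the filers go down?
    print(sorted(affected_by("filers", DEPENDS_ON)))
```

In this toy map eduroam never mentions the filers directly, yet it still appears in the list of affected services because it relies on the virtual servers, which in turn rely on the filers.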