Message of the Day

The Wrigley Field Compute Cluster at Michigan State University.

Current Alerts

Use strong passwords. Do not use your password on cap.pa.msu.edu for any other account.

Current News


Computer Center electrical work is planned for Thursday Dec 21. The CC will be off mains power from 7am until 6pm. Generators will be used to power systems during this time, but there will be no power during the switches to and from generator power. We will keep the compute cluster off during the entire work period. The cluster batch system will be halted approximately 12 hours before the outage to allow queues to flush.

-- TomRockwell - 10 Nov 2006

The cluster is running in BPS 1221. Contact me if you'd like a UserAccount. Currently, we have a couple DzeroReleases and AtlasReleases installed. The BatchSystem is also up.

-- TDR - 20 Sep 2005


Old News

There will be a short downtime tomorrow morning, October 12, to implement a network configuration change. The queues are off until then.

-- TomRockwell - 11 Oct 2006

The computer center router will be serviced from 06:00 to 06:30 October 10. This will result in the cluster being inaccessible from the outside. Processing on the cluster should continue without (much) disruption.

-- TDR - 9 Oct 2006

There are two planned power outages for the Computer Center coming up. The first is the evening of October 6th. The second is all day December 21. The building electrical feed will be upgraded December 21, which will require a long outage. They are going to bring in a generator to run the computer room for that day. The October outage is to test the generator.

The Wrigley cluster will be off for the duration of both of these events. I have told the operations staff that we will remain off to reduce generator load. Note that since we don't have full UPS coverage on the cluster, we would need to shut down the worker nodes at each generator cut-over (beginning and end of the outage); it is simpler to just stay off.

-- TDR - 15 Sep 2006

This downtime has ended. Kernels were upgraded on some machines during the outage.

There is a SAM database downtime scheduled from 9am to 1pm Tuesday Aug 1. SAMGrid jobs will not be started after about 2am so that none will be running during this time. I will likely do some system maintenance at this time. Systems may be rebooted if nobody is doing anything.

-- TDR - 31 Jul 2006

The cluster will be down twice due to power upgrades to the Computer Center room. These will occur Friday July 21 and August 4. The cluster will be turned off at 6 pm and will likely not be back up until 10 am Saturday. The power outage itself is shorter, but we will use a longer window in order to stay out of the way of higher-priority work in the room.

-- TDR - 21 Jun 2006

Dual-core CPUs are being installed in the original nodes, cc001-cc019 and cap. The compute nodes will be taken down one at a time without notice to do this. Cap will be taken down Friday June 23 at 10am. This outage will be about 1/2 hour.

-- TDR - 21 Jun 2006

Cap hung and required a restart this afternoon.

-- TDR - 15 May 2006

I've altered the configuration of the ssh servers on the cluster so that the permissions on your home directory are not checked when using public key logins.

-- TDR - 27 Dec 2005

The cluster move to the Computer Center is complete.

-- TDR - 18 Nov 2005

I am working to bring the new nodes up. The worker nodes will be up and down for the next day or so. The head node should remain available.

-- TDR - 31 Mar 2006

The permissions on your home directory must be 0700 in order for ssh logins to the compute nodes to work. This is required for delivery of output files from batch jobs. I'm investigating the issue. For now, set the permissions on your home directory to 0700 (running gensshkeys will also do this).

-- TDR - 22 Dec 2005
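
The 0700 workaround in the note above can be sketched in Python as follows. This is only an illustration, assuming a Python interpreter is available on the node; `ensure_private_home` is a hypothetical helper name and is not part of the cluster tooling or of gensshkeys.

```python
import os
import stat
from pathlib import Path


def ensure_private_home(home: Path) -> bool:
    """Set `home` to mode 0700 if it is not already; return True if changed.

    Hypothetical helper for illustration only.
    """
    # S_IMODE strips the file-type bits and leaves only the permission bits.
    current = stat.S_IMODE(home.stat().st_mode)
    if current == 0o700:
        return False
    # 0o700 = read/write/execute for the owner, nothing for group or others.
    os.chmod(home, 0o700)
    return True
```

Calling `ensure_private_home(Path.home())` would apply this to your own home directory; it has the same effect as running `chmod 700` on that directory from the shell.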

The cluster move is currently scheduled for Wednesday morning November 16. Until then, the head node is on, but the worker nodes will remain off. The IP address (DNS) change is scheduled to be active today, which means that cap.pa.msu.edu won't resolve correctly until the move is complete.

-- TDR - 14 Nov 2005

The cluster will be relocated to the Computer Center on Monday November 14. The cluster will be down from Sunday evening until the move is completed. The network addresses of the cluster nodes will be changed at this time; the name will remain the same. I'll send an email to all users once the cluster is back online.

-- TDR - 10 Nov 2005

Node cc018 is back online after power supply replacement.

The Ethernet devices have had firmware updates and the networking will be more stable now.

-- TDR - 28 Oct 2005

An upgrade to the firmware for the Broadcom Ethernet devices is needed to fix a network freezing issue. This has been performed on cap and cc019, and testing shows that it is effective. The remaining nodes will be rebooted today or tomorrow in order to perform the update.

-- TDR - 20 Oct 2005

Node cc018 is down for power supply replacement.

-- TDR - 15 Oct 2005
Topic revision: r24 - 16 Oct 2009, TomRockwell
 
