Tech. Team weekly review for 31 July, 2017

Jonathon Anderson

2017-07-31 14:00

This week at the University of Colorado Research Computing Tech. Team...

New HPC Storage Admin, Patricia Egeland

We're bringing on a new team member, Patricia Egeland, as HPC Storage Administrator, starting Tuesday. Patricia will share general system administration duties with the other RC tech. team operational staff, but will carry primary responsibility for RC data storage systems, notably RC Core Storage and the PetaLibrary. She also plans to contribute to ongoing and upcoming software development efforts at RC, and we're looking forward to seeing more interest in that space both internally and in our user community.

Patricia worked most recently as a systems analyst for the Dark Energy Survey (DES) Science Portal, and previously worked as a server, system, and application administrator for the CERN Compact Muon Solenoid (CMS) experiment.

We couldn't be more pleased to have Patricia as a member of the RC tech. team! Please join me in welcoming her if you have an opportunity to work with her.

Final deployment of HPCF UPS

We'll have the final of our three UPS-related HPCF outages starting on Wednesday, and closing out in the afternoon on Friday. During this outage we'll be...

re-routing the power cabling between the UPS infrastructure and the in-row power-distribution infrastructure for future maintainability;
decommissioning the legacy UPS;
installing additional in-row power distribution infrastructure;
and bringing the new UPS into full production.

The new HPCF UPS will provide not only power conditioning (remediating a power quality issue that has led to several past Summit compute outages) but also at least 15 minutes of UPS-backed runtime in the event of a complete utility power outage (which should eventually provide us sufficient time to power the system off in a controlled manner).

PetaLibrary/2 RFP goes live

Research Computing's successful PetaLibrary service is getting a refresh! Or, at least, that's the intent. We're publishing an RFP today for a new, unified infrastructure, which should extend the life of the PetaLibrary, simplify our service offerings, make the infrastructure more maintainable, and eventually allow us to add features and services.

Monday, 31 July 2017: RFP posted online
Friday, 4 August 2017 (09:00): optional pre-bid call
Friday, 11 August 2017: written questions are due
Tuesday, 5 September 2017: RFP responses are due

https://bids.sciquest.com/apps/Router/PublicEvent?CustomerOrg=Colorado

XSEDE SSO hub authentication progress

RMACC Summit is intended, as its full name implies, to be an RMACC resource, not just a CU or CSU resource. We've planned from the beginning to support access to Summit through XSEDE credentials, but this has required additional (though already planned) service development at XSEDE. Those services are ready for beta testing now, and CU is on hand as an early adopter for their new "single sign-on hub for L3 service providers" service (XCI-36). We're working on deploying this now, and will hopefully be able to start bringing on early adopters from the RMACC community soon.

Misc. other things

We're rebuilding the RC login environment. We've been through a few prototype efforts, but the current plan is to start by deploying a new tutorial login node, tlogin1, which will also be the first recipient of new XSEDE authentication services.
We're continuing to develop our internal "curc-bench" automated benchmarking utility for validating the performance of RC HPC resources over time (notably after we make changes). Development is primarily driven by Aaron Holt.
We had to rebuild Sneffels (originally "the viz cluster") after a security incident. That work is largely done, and service has been restored, but OIT is still reviewing viz1 as part of our incident response process.
We're updating the Globus software for our data-transfer service, starting with dtn01. We're also taking this opportunity to rebuild our DTN configuration in general, which should lead to better and more reliable data-transfer performance by correcting a number of networking irregularities on these servers. This work is being done primarily by Dan Milroy.