VC41207 HPC Data Centre Specialist - Cambridge, United Kingdom - University of Cambridge

Posted by:

Tom O'Connor

beBee Recruiter


Description

As an HPC Data Centre Specialist, you will be responsible for planning, integrating and maintaining the data centre infrastructure owned by the Research Computing Services (RCS).

You will work on technical, physical and software infrastructure projects, to ensure that RCS infrastructure is reliable, secure, and efficient.

You will lead and mentor HPC Data Centre Technicians in their daily work and, alongside the HPC Data Centre Infrastructure Manager, ensure that their work is clearly planned, checked and signed off.

You will have experience of working in an agile environment, both maintaining existing HPC infrastructure hardware and building new hardware.

There are two roles available: one is server focused, the other is network focused. The responsibilities of each role are:

Network focused role responsibilities

  • Maintain the configuration of the onsite Ethernet switches, using templated techniques such as configuration deployed via Ansible.
  • Ensure that network hardware is secure, accessible and operational.
  • Data connection design for new integrations.
  • Asset management of network hardware and data connections.
  • HPC fabric network (i.e. InfiniBand) topology design and integration.
  • BoM creation for Ethernet and InfiniBand based on design brief.
  • L2 and L3 network fault escalations.

Server focused role responsibilities

  • Monitor and maintain the power load, including the implications of adding new hardware, phase balancing, PDU selection and configuration.
  • Monitor and maintain cooling systems; experience with RDCs and DLC is a strong advantage.
  • Assist (alongside vendors and the HPC Data Centre Infrastructure Manager) with the design of cooling for new systems.
  • BoM creation for server, cooling and power based on design brief.
  • Initial startup configuration of servers such as BMC (Baseboard Management Controller).
  • Ensure that servers and cabling are finished to a high standard for new integrations.
  • L2 and L3 server fault escalations.
  • Hardware asset management and health monitoring.
  • Planning: integration and maintenance.
  • Services: assist with the smooth running of the fault rota and triage requests from internal customers such as the platforms team and the RSE team.
  • Lead and mentor HPC Data Centre Technicians.
  • Contractor management, such as managing data cablers during hardware integrations.
  • Manage relationships with vendors and service providers to ensure that the infrastructure is supported and maintained.
  • Develop and maintain documentation for the infrastructure and its processes.
  • Keep up to date with emerging technologies and trends in research computing.

The University has a responsibility to ensure that all employees are eligible to live and work in the UK.
