No more applications are being accepted for this job

Senior Site Reliability Engineer - Greater London, United Kingdom - NetMind

NetMind Greater London, United Kingdom

1 week ago

Description

Job Description

Job Title: Senior Site Reliability Engineer / Senior DevOps Engineer

Location: London

Position Type: Full-time, Onsite, 5 days a week

About Us:

Netmind Power is a cutting-edge decentralized computing platform for Machine Learning, AI training, fine-tuning, and inference. We are seeking experienced SRE/DevOps engineers to develop our decentralized Kubernetes infrastructure and maintain its stability, as well as to drive the functionality and capability of our platform to the next level. This would be an excellent opportunity for professionals seeking to advance their careers in a collaborative and forward-thinking environment.

Key Responsibilities:

Deploy and administer Kubernetes clusters both on-prem and in the cloud (AWS).
Collaborate with software engineers to build and enhance the cutting-edge decentralized machine learning infrastructure.
Design, develop, automate, and continuously improve platform services and pipelines, such as monitoring, alerting, logging, tracing, CI/CD, etc.
Enhance the decentralized VPN network while ensuring security and efficiency.
Design and architect containerization solutions to streamline application builds and deployments across our organization.
Improve Kubernetes system efficiency and debug issues related to networking, storage, scheduling, etc.
Evaluate and implement new technologies and tools to enhance the overall reliability of the infrastructure.
Participate in on-call rotations and own, triage, investigate and resolve service issues with an emphasis on broad communications, learning & teaching throughout the process.

Minimum Qualifications:

4+ years of experience in Kubernetes administration, with a solid understanding of containerization and orchestration technologies.
3+ years of experience in Unix/Linux systems from kernel to shell and beyond.
Experience with 2 or more programming languages, such as Python, Go and Bash. Python experience is a plus.
Experience with public cloud platforms such as AWS, Azure, or GCP.
Experience in designing, analyzing, and building automation tools and CI/CD for large scale and complex systems.
Experience in networking technologies such as TCP/IP, BGP, DNS, load balancers, etc.
Strong communication, problem-solving, and teamwork skills.

Preferred Qualifications:

CKA (Certified Kubernetes Administrator) certification.
Experience with Kubernetes CNI deployment and troubleshooting, including (but not limited to) the following CNIs: Cilium, Kube-Router, Calico, Flannel.
Experience in using and contributing to open-source projects in the Kubernetes ecosystem, e.g. Kubespray, CNI, Helm, KubeEdge, Istio/Linkerd, Prometheus, ArgoCD, OPA, Harbor, Envoy, etc.
Experience in setting up secured VPN networks.
Experience in systems development in an IT or data center environment.
Proficient in managing AWS infrastructure and deploying resources using Terraform.
Previous experience with on-call rotations and incident response.

How to Apply:

Please send your resume, cover letter, and any relevant documentation through LinkedIn or to with the subject line "Job Application - [role name] - [name]". We look forward to learning more about how you can contribute to the success of NetMind.AI.
