Technical Duty Officer, Network Operations - London, United Kingdom - Box

Box
London, United Kingdom

2 weeks ago

Posted by:

Tom O'Connor

beBee Recruiter


Description

WHAT IS BOX?
Box is the market leader for Cloud Content Management. Our mission is to power how the world works together.

Box is partnering with enterprise organizations to accelerate their digital transformation by creating a single platform for secure content management, collaboration and workflow.

We have an amazing opportunity to further establish ourselves as leaders in the space, and we need strong advocates to help us achieve that goal.


By joining Box, you will have the unique opportunity to help capture a majority of this developing market and define what content management looks like for the digital enterprise.

Today, Box powers over 97,000 businesses, including 70% of the Fortune 500 who trust Box to manage their content in the cloud.


WHY BOX NEEDS YOU


Box is looking for a dynamic Global Site Reliability Technical Duty Officer (TDO) to help lead our Global Technical Operations and oversee the continuous health, availability, and reliability of our industry-leading platforms and SaaS offerings.

It is the responsibility of the TDO team to lead 24x7 GTOC teams in preventing, monitoring, identifying, troubleshooting, mitigating, and resolving issues that affect the availability and quality of Box's platforms and services.


This role is an integral shift-based leader and the single point of technical escalation within the GTOC organization, assuming accountability for overall production site health and the performance of core customer-facing journeys.

The role maintains total site awareness: detecting metric and service deviations, acting as the final level of change approval, and proactively identifying potential issues and resolving them before they escalate to customer-impacting incidents.

We are building a world-class Operations Center and need the best talent possible to get us there. That's where you come in.


WHAT YOU'LL DO

  • Own and direct livesite Major Incident Management, from detection through identification, escalation, mitigation, and recovery.
  • Triage, refine, and verify the Problem Statement, notify and coordinate the efforts of all appropriate SME resources, and lead cross-functional Incident Bridges to quickly identify and mitigate the problem and restore service. You'll be evaluated on how well you reduce incident response times, from MTTD through MTTR.
  • Ensure accurate, valid and timely communication to key stakeholders and business entities.
  • Lead daily Incident and Change ticket reviews, coordinate and monitor change windows, and coordinate with Problem Management on TopOps Issues and action items.
  • Operate across organizational boundaries (Business, Dev, Ops, CS) to protect our customers, their data, and the availability of all Box services, from internal and external security threats, unanticipated volume surges, and significant performance issues.
  • Troubleshoot and identify critical problems in a SOA/API-based, global hybrid cloud, distributed edge architecture across multiple enterprise and public cloud regions.
  • Provide day-to-day technical expertise and experience to the organization to address issues in globally diverse, high-velocity 24x7 environments, from policy and procedural decisions to key architectural and tooling insights, to improve Box's Incident, Change, and Problem Management engineering capabilities.
  • Lead daily reviews of planned changes (CAB) in Jira; accountable for reviewing and minimizing change risk, ensuring adequate and appropriate change timing and duration, and complete rollout, validation, and rollback plans that are optimized to prevent site or service impact.
  • Ensure all customer-impacting Incident tickets are completely and correctly documented and augmented with appropriate metrics, timelines, actions taken, and actions still pending.
  • Contribute to and review Incident postmortems to ensure adequate documentation and appropriate prioritization of action items related to reducing MTTI, MTTM and MTTR.
  • Participate in Problem Management scrums and postmortems to identify leading organizational and company-wide technical issues, threats, and trends that block the ability of the organization or teams to perform their roles and provide services optimally and reliably.
  • Lead projects to improve tools and processes related to overall site and service manageability, observability, and resiliency.
  • Coordinate regularly with Infosec, Customer Success, Platform and Dev leaders to continuously assess new security and customer onboarding threats and known issues.
  • Continuously mentor and train Global NOC and system engineers.

WHO YOU ARE

  • You have 5+ years of large-scale production/platform operations experience in a large SaaS provider environment, preferably as a TDO/Major Incident Manager, SRE team leader, or Infrastructure (IaaS) or Platform (PaaS) Architecture SME in a Managed Service Provider environment.
  • Experience in bare metal, OpenStack, and Kubernetes (K8s) architectures supporting a large number of SOA/API-based services.
  • Exposure to Open Source Service-Meshes, Pr
