Site Reliability Engineer

Position Ref: SRE0819BW

Staffordshire, Central Manchester



Closing date

September 25, 2019


bet365, one of the world’s leading online gambling companies, is a driving force in the development of enterprise and Internet technology. We have rapidly grown into a global operation, delivering an unrivalled online experience to more than 45 million customers in 20 languages.

The Site Reliability team is looking for a Site Reliability Engineer to join this exciting and vital part of the bet365 family.

The DevOps department is a new function within bet365’s technology business. As a key part of this function we are creating a Site Reliability Engineering (SRE) team. The SRE team will work with Development, Platform Delivery (Networks, Database and Infrastructure), IT Services and other teams in the DevOps department to determine which aspects of applications should be monitored, which alerts should be raised and what tooling or automation should be put in place to aid issue resolution and capacity planning.

Site Reliability Engineers will utilise a range of technologies to complete the development and configuration work needed to produce dashboards, automation and tooling. They will utilise system data to help the business make more informed decisions regarding capacity requirements and application health in the production estate.

We hire people with a broad set of technical skills who are ready to tackle some of technology’s greatest challenges and continue to break new ground in software innovation.


Main Responsibilities:

• Developing bespoke in-house tooling using a range of technologies (e.g. JavaScript, Golang, Python, PowerShell) to help colleagues in IT Operations complete their duties.
• Working with automation and orchestration platforms (e.g. Ansible, Jenkins) to automate manual activity.
• Building sophisticated monitoring dashboards using log data, monitoring and graphing technologies (e.g. Grafana, Nimsoft, Splunk, Kibana).
• Ongoing maintenance and administration of existing monitoring and analytics toolsets.
• Undertaking development and configuration activities to agreed timescales in line with the agreed Software Development Lifecycle for the SRE team.
• QA checking the work of colleagues.
• Mentoring colleagues in the use of new technologies or practices.
• Contributing to the evolution of team processes and approaches.
• Collaborating with DevOps, Software Development, Platform Delivery and IT Services departments to determine requirements and solutions, to solve problems and progress work.
• Supporting other parts of DevOps, Platform Delivery and IT Services regarding software engineering practices.
• Contributing to discussions regarding the suitable architecture and technology choice for SRE software.
• Working with IT Operations to provide and support the use of tooling that will enable them to offer increasing levels of value to the business.
• Working with Incident and Problem Managers to support and enable Incident and Problem Management.
• Taking part in an on-call rota to support systems built by the SRE team.

Essential Skills, Experience and Attributes:

• Working knowledge of contemporary monitoring and analytics tooling and best practice.
• Working knowledge of automation tooling and best practice.
• Working knowledge of SNMP, SQL and Procedural Programming (e.g. JavaScript, Golang, Python, PowerShell).
• Excellent investigative and diagnostic abilities.
• Keen interest in industry trends, particularly DevOps.
• Strong appetite for learning and applying learning on the job.
• Ability to handle and thrive under pressure, often multitasking and dealing with reprioritisation of work.
• Accountability for your own high performance expectations, with the ability to learn from mistakes in an open and transparent way.
• Ability to work with autonomy but also to collaborate well and progress work as part of a cross-functional team.

Desirable Skills, Experience and Attributes:

• Previous experience of administration with CA UIM (aka Nimsoft) and Nagios.
• Experience with automation and orchestration platforms (e.g. Ansible and Jenkins).
• Experience utilising log data to investigate and diagnose issues and build dashboards.
• Experience working directly with infrastructure, networking and application monitoring systems.
• Experience of working within a NOC environment and understanding the challenges faced by an IT Support team.
• Experience working in a large-scale, 24/7 enterprise where system uptime and stability are of paramount importance to the business.

Apply For This Job

If you believe you possess the skills and experience necessary for this role, then please email your CV and Covering Letter, quoting the Position Reference SRE0819BW, to Alternatively, you can send the application by post to Human Resources Department, Hillside (Shared Services 2018) Limited, bet365 House, Media Way, Stoke-on-Trent, England, ST1 5SZ.

By applying to us you are agreeing to share your Personal Data in accordance with our Recruitment Privacy Policy.