Site Reliability Engineer

  • Technology
  • London, United Kingdom

Job description

We are looking for a Senior Systems Engineer with a strong DevOps / Site Reliability Engineering background to join our highly skilled Information Systems team. We want your wealth of experience in maintaining complex global platforms, and your knowledge of telemetry, tooling and analysis. You would take a leading role in our Site Reliability team, which is responsible for the performance and capacity of LMAX's technology and global operations.

About You:

  • You have genuine energy and enthusiasm for Linux and open source technologies.
  • You value automation, and use such tools to reduce incidents and entropy.
  • If you see something that's sub-optimal, you want to spend 10 minutes there and then to make it better.
  • You can quickly grasp "the big picture", predict capacity and growth, and help non-technical colleagues understand the issues and trade-offs.
  • You have a systematic problem solving approach, coupled with a strong sense of ownership and drive.
  • You can step up and provide leadership in a team where needed, and be a mentor for junior staff.
  • You can manage your own projects, because what do you need a project manager for?

You would be joining the Information Systems team, made up of Systems, Networks and Security Specialists whose job is to build and maintain LMAX's infrastructure, enforce regulatory compliance, and provide a stable platform for our state-of-the-art trading engine. We use a blend of Kanban, Agile and DevOps methodologies to manage the diverse needs of the business. This leads to varied work and strong knowledge transfer between our specialists. We're not averse to being on the leading edge and coming up with our own creative solutions.

  • We make infrastructure happen.
  • We run custom optimised kernels.
  • We keep everything running, fast, and reliable.

What you will be doing:

  • Looking at our platforms from a customer's point of view and finding problems before customers do.
  • Introducing new monitoring, metrics, and analysis to improve performance and capacity.
  • Working closely with the development and commercial teams to deliver fast, stable software.
  • Low level TCP debugging for diagnosing errors and latency issues.
  • Investigation and resolution of production incidents.
  • Scripting and automating processes to ensure consistency and repeatability.
  • Puppet development and testing in a continuous deployment process.
  • Pair working to spread knowledge throughout the team.
  • Participating in the on-call rotation.

Requirements

You should have:

  • Linux expertise (essential), ideally with Fedora/Red Hat/CentOS.
  • Experience with a variety of metrics and information stores (eg: ELK, InfluxDB, Prometheus).
  • Knowledge of open source monitoring systems (eg: Nagios, Icinga, Sensu).
  • Experience with configuration management systems (eg: Puppet, Ansible, Chef).
  • Strong experience with virtualization: KVM, libvirt.
  • Demonstrated ability to script in Bash, plus one of Python or Ruby.
  • Understanding of TCP/IP networking, ideally including multicast. You must be comfortable with tcpdump/Wireshark.


It is desirable if you have experience with:

  • Stream/data processing engines (eg: Kapacitor, Kafka, Storm).
  • Kernel tracing and performance tools (eg: perf, SystemTap, ftrace).
  • "Infrastructure as Code" / Test Driven Development / Continuous Delivery.
  • Databases on Linux such as MySQL or Postgres.
  • Running and debugging Java application servers.
  • Networking hardware, ideally including load balancers.
  • Microsoft operating systems.
  • Private cloud infrastructure and automation.