Software Reliability Engineer

Roles & Responsibilities:

Reduce toil, implement identified improvement opportunities, and handle minor enhancements and non-ticketed activities
Define and monitor service-level metrics, including incident management KPIs such as MTTD, MTTR, MTBF, MTTF, unavailability rate, and incident count
Create rules that optimise incident response through metrics, streamlined alert flows, and better collaboration and communication across squads
Proactively identify issues that might disrupt the service in production
Address incoming service requests raised to the support groups or through Jira
Create and maintain alerts
Handle change validation and change planning requests
Assist business stakeholders in determining SLOs or adjusting threshold limits
Manage demand and capacity, and correct SLI/SLO threshold limits as needed
Gather and analyse metrics from both operating systems and applications to assist in performance tuning and fault finding
Partner with development teams to improve services through rigorous testing and release procedures
Participate in system design consulting, platform management, and capacity planning
Create sustainable systems and services through automation and uplifts
Balance feature development speed against reliability using well-defined service level objectives and indicators (SLOs, SLIs)
Debug production issues across services and levels of the stack
Monitor and audit production operations and infrastructure-related policies

Qualifications:

Bachelor’s degree in Software Engineering, Computer Science, or a related field
Software engineering and task automation skills with Bash and Python
Familiarity with the Agile software development lifecycle
Deep background in Linux systems and engineering
Extensive experience engineering and automating on Amazon Web Services (AWS)
Experience supporting web applications running on Java / Apache / Tomcat in a live production environment
Prior experience with IaC tools like Terraform
Prior experience with DevOps tools (Git, GitLab)
Background supporting production at scale in a heavily microservice-based environment
Hands-on engineering and ops expertise in containerization (Docker, Kubernetes/EKS, CNI, and Ingress networking)
Strong understanding of Single Sign-On, SAML, and OAuth (hands-on experience with Okta is a bonus)
Solid expertise in X.509 certificate technology and the basic concepts of encryption
Experience working with relational and NoSQL databases such as PostgreSQL and MongoDB, and with SQL
Advanced exposure to application development, web UI design and development, JSON, and application architecture
Strong experience with observability tools (logging/APM) such as Datadog, CloudWatch, and PagerDuty