Roles & Responsibilities:
- Reduce toil, implement identified improvement opportunities, and handle minor enhancements and non-ticketed activities
- Define and monitor service-level metrics, including incident management KPIs such as MTTD, MTTR, MTBF, MTTF, unavailability rate, and incident count
- Create rules that optimise incident response using metrics, streamline alert flows, and improve collaboration and communication across squads
- Proactively identify issues that might disrupt services in production
- Address incoming service requests through the relevant support groups or Jira
- Create and maintain alerts
- Handle change validation and change planning requests
- Assist business stakeholders in determining SLOs or adjusting threshold limits
- Manage demand and capacity, and make corrections to SLI/SLO threshold limits
- Gather and analyse metrics from both operating systems and applications to assist in performance tuning and fault finding
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and uplifts
- Balance feature development speed and reliability using well-defined service level objectives and indicators (SLOs, SLIs)
- Debug production issues across services and levels of the stack
- Monitor and audit production operations and infrastructure-related policies
Education & Experience Requirements:
- Bachelor’s degree in Software Engineering, Computer Science, or a related field
- Software engineering and task automation skills with Bash and Python
- Familiarity with the Agile software development lifecycle
- Deep background in Linux systems and engineering
- Extensive experience engineering and automating on Amazon Web Services (AWS)
- Experience supporting web applications running on Java / Apache / Tomcat in a live production environment
- Prior experience with IaC tools like Terraform
- Prior experience with DevOps tools (Git, GitLab)
- Background supporting production at scale in a heavily microservice-based environment
- Hands-on engineering and ops expertise in containerization (Docker, Kubernetes/EKS, CNI, and Ingress networking)
- Strong understanding of single sign-on (SSO), SAML, and OAuth (hands-on experience with Okta is a bonus)
- Solid expertise in X.509 certificate technology and the basic concepts of encryption
- Experience working with relational and NoSQL databases such as PostgreSQL and MongoDB, and with SQL
- Advanced exposure to application development, web UI design and development, JSON, and application architecture
- Strong experience using observability tools (logging/APM) such as Datadog, CloudWatch, and PagerDuty