Cellulant is Africa’s No. 1 company in the payments & transfers category (Fintech Awards 2016). We are a PPISP (Payment Platform Infrastructure Service Provider) regulated by the Central Bank of Nigeria (CBN) and insured by the Nigerian Deposit Insurance Corporation (NDIC).
We are recruiting to fill the position below:
Job Title: Senior Site Reliability Engineer (SRE) - Observability
Location: Lagos
Job type: Full time
Department: Technology
Job Description
As a member of the Observability team, you will be responsible for maintaining and developing the observability tooling.
You will use industry best practices and your software development/systems administration knowledge to help design and develop observability processes and tooling.
You will be expected to champion automation efforts within the team, from deploying your own code to identifying opportunities for end-to-end automation in event and incident management.
Core Responsibilities
Your role is to build, scale and manage our observability stack across our multi-tenant infrastructure, including our observability tooling clusters, logging pipelines and telemetry system data.
You will actively engage with our developers and help them improve the monitoring of their services.
You will actively drive initiatives towards better system design and the implementation of new technologies.
You will work to develop additional capabilities on our observability platforms by incorporating additional data types like clickstream data and frontend user interactions.
You will drive key initiatives in modern observability concepts such as SLIs, SLOs, error budgets, distributed tracing and canonical logging.
You will collaborate with architects, leads and managers to foster a data-driven culture based on observability and reliability.
You will be responsible for building machine learning capabilities into the observability systems to enhance signal and reduce noise.
You will participate in the observability on-call rotation to resolve issues affecting the observability systems and to support other technology teams in investigations during major incidents.
Key Relationships
Customer Success
Software Engineering
Platform Engineering
Service Operations
Qualification and Experience
Bachelor's Degree in an appropriate field of study, such as computer science, engineering, information technology, statistics, or a related field, with 5+ years of experience.
Familiarity with programming language concepts (Go, Java, Ruby, Python, JavaScript).
Experience with cloud infrastructure and services, especially AWS.
Experience defining, measuring, and improving Reliability Metrics (SLO/SLI), Observability (Monitoring, Logging, Tracing solutions), Operations Processes (Incident, Problem Management), and Operations Toil Reduction through Automation.
Experience in programming/scripting. Knowledge of Python, MySQL, Bash, Terraform, Ansible, GitLab and other scripting/automation tools.
Experience with distributed systems in a production operations environment.
Multitasking ability and effective oral communication skills.
Good understanding of AWS services (Glue and Athena, Amazon CloudWatch, QuickSight), Kubernetes (EKS), Elasticsearch/OpenSearch, New Relic or similar observability tools, Zabbix, Grafana, PagerDuty.
Proven use of AI (ML/DL) for data management/analysis.
Solid experience in Software Development and/or Systems Administration.
Skills:
Programming and Scripting Skills - PHP, Python, Bash, Perl, Java.