Location: Limerick

Country: Ireland

Category: Development & Product Management

Type: Full Time

Experience: Senior

Language: English

Sustainability that means business

Who we are:

AMCS, a sustainability software specialist, is headquartered in Ireland, with offices in Europe, the USA, and Australasia. With over 1,300 highly skilled employees across 22 countries, we specialise in delivering technology solutions that facilitate a carbon-neutral future.

What we do:

Our innovative SaaS solutions increase efficiency and boost sustainability in resource-intensive industries. Over 5,000 customers across 23 countries already benefit from our Performance Sustainability software, ensuring we deliver practical solutions for improved profitability and environmental resilience across the globe.

Our people:

AMCS offers team members more than just a job: it offers an opportunity to map out a career with a company that is growing, evolving and setting out new ways of working that have a positive impact on the world around us. AMCS was established in Ireland and holds onto those local roots and a 'start-up' mentality with a culture of connection. That connection to our work, our customers, our colleagues and our community creates a working environment that fosters openness, collaboration and creativity.

Job Description:

We are seeking a highly skilled and motivated DevOps/SRE Tech Lead to join our dynamic engineering team. The ideal candidate will have a deep understanding of cloud technologies, a strong technical background and a passion for driving operational excellence. As a Tech Lead, you will mentor and guide our DevOps engineers and participate in architectural and key decision-making forums on our infrastructure and application development processes, keeping the focus on the reliability of our systems and a positive customer experience. You will collaborate with cross-functional teams to ensure the reliability, scalability, and security of our systems and infrastructure.

Key Responsibilities:

Build SLIs, SLOs, and SLAs: Partner with development and business teams to define indicators and objectives that reflect real customer experience.
Incident Response: Lead through complex incidents and continuously improve how quickly we detect, diagnose, and resolve issues — sharpening alerting, tooling, and on-call practices to shorten MTTD and MTTR over time.
Evolve Monitoring and Observability Stack: Continuously improve the observability stack (Prometheus, Grafana, Mimir, Loki, Tempo, OpenTelemetry) through a customer-centric lens, so that our operations become more effective.
Drive RCAs and Postmortems: Run blameless root cause analyses and postmortems that turn incidents into durable improvements, closing the loop between development and operations.
High Availability & Performance: Ensure platform availability and responsiveness meet customer expectations. Identify and remove performance bottlenecks before they impact customers.
AI for Operations: Apply AI/LLM capabilities to incident triage, log/trace analysis, runbook execution, and anomaly detection to shorten MTTR and reduce on-call load.
Optimization for Cost: Right-size workloads, eliminate waste, and design for cost-efficient scaling across our cloud platforms (Azure, AWS, GCP) and container infrastructure (Docker, Kubernetes).
Toil Reduction: Build automation that reduces toil within SRE, such as remediation for known failure modes, so the platform heals itself where possible and escalates to humans only when judgement is genuinely required.
Architectural Oversight: Participate in architectural design and decision-making processes, ensuring that design choices align with organisational goals and best practices.

What Success Looks Like:

High-Signal Alerting: Alerts are accurate and actionable — when something fires, it matters, and the team trusts it. Noise is actively driven down rather than tolerated.
Fewer Production Incidents: The number and severity of customer-impacting incidents trend down over time, as recurring failure modes are addressed at the root rather than worked around.
Tight Product–SRE Feedback Loop: Continuous, two-way feedback between product engineering and SRE — reliability concerns shape what gets built, and operational learnings flow back into product decisions.
Reduced Toil: Engineers spend less time on repetitive operational work and more time on improvements that compound — measured by what gets automated, eliminated, or self-healed away.

Qualifications:

Education: Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
Experience: 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or related fields, with at least 2 years in a leadership or mentoring role.
Cloud Technologies: Deep understanding of cloud providers (Azure, AWS, GCP) and hands-on experience with cloud architecture.
Architectural Design: Proven experience in providing architectural oversight, with a strong ability to make informed decisions that drive system performance and scalability.
Containerization: Proven experience with container orchestration platforms, particularly Kubernetes.
Scripting: Proficiency in scripting languages such as PowerShell, Python or Bash.
Monitoring and Logging: Familiarity with monitoring and logging tools such as Prometheus and the Grafana stack (Grafana, Mimir, Loki, Tempo).
Automation Tools: Experience with automation tools such as Ansible, Terraform, or Chef.
Soft Skills: Strong leadership qualities, excellent communication skills, and a collaborative mindset.

Preferred Qualifications:

Experience with CI/CD pipelines and relevant tools (Azure DevOps, Jenkins, GitLab CI, CircleCI, etc.).
Kubernetes certification (CKA, CKAD) and/or cloud certifications (Azure, AWS, GCP) are highly desirable.
Knowledge of security best practices and compliance standards in cloud environments.
Familiarity with Agile methodologies and project management tools.
