Senior Site Reliability Engineer

EPAM Systems

  • Argentina
  • Permanent
  • Full-time
  • 4 hours ago
We are seeking an experienced Senior Site Reliability Engineer to join our team. This role will cover the LatAm timezone, working collaboratively with a team of SREs and a hands-on Lead SRE, while also coordinating with a European-based SRE team. The position ensures follow-the-sun 24/7 on-call support for a customer platform that includes multiple Java backend services.

Responsibilities
  • Provide 12/7 on-call support for Java backend services, ensuring platform reliability and availability
  • Own API Gateway observability to monitor and maintain service health
  • Prepare and deploy patches to resolve issues in Java code and related service cloud infrastructure
  • Develop and maintain metrics and dashboards to assess and improve platform health
  • Create and enhance runbooks for all EOS backend services to streamline operational processes
  • Monitor Service Level Objectives (SLOs) for backend services, addressing errors and submitting code changes to improve them
  • Troubleshoot complex system issues using logs and telemetry to identify and resolve root causes efficiently
  • Collaborate with cross-functional teams to ensure operational excellence and incident response readiness

Requirements
  • Bachelor's or Master's degree in Computer Science or a related field
  • At least 3 years of experience in Java backend development and DevOps/SRE roles
  • Proficiency with Amazon DynamoDB, Amazon ElastiCache, and other AWS services
  • Experience with Git and Gradle for version control and build automation
  • Strong understanding of observability and troubleshooting in distributed systems
  • Ability to quickly process and apply large amounts of information during on-call efforts
  • Skilled in using logs and telemetry to diagnose and resolve complex systems issues
  • Effective written communication skills to document operational issues during live incident responses
  • Motivation to track and improve SLOs across multiple systems through repeatable processes
  • Fluent English communication skills, both written and spoken, at a B2 level or higher

Nice to have
  • Familiarity with Apache Cassandra for distributed database management
  • Experience with Apache Kafka for real-time data streaming
  • Knowledge of Grafana for building and maintaining observability dashboards
  • Proficiency in Java and Scala for backend development
  • Hands-on experience with Kubernetes for container orchestration
  • Understanding of New Relic for application performance monitoring
  • Knowledge of Terraform for infrastructure as code

We offer/Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
