DevOps Engineer - AIOps

Buenos Aires
Permanent
Full-time

Posted 1 month ago

Company Description

Technology is our how. And people are our why. For over two decades, we have been harnessing technology to drive meaningful change.

By combining world-class engineering, industry expertise and a people-centric mindset, we consult and partner with leading brands from various industries to create dynamic platforms and intelligent digital experiences that drive innovation and transform businesses.

From prototype to real-world impact - be part of a global shift by doing work that matters.

Job Description

We are seeking a hands-on Site Reliability Engineer (SRE) / AI Platform DevOps Engineer to own infrastructure provisioning, CI/CD automation, telemetry pipelines, and production deployment for AI-powered services, agents, and orchestration systems.

This is an SRE-heavy, infrastructure-first role, focused on ensuring AI systems operating in production are:
Reliable
Observable
Scalable
Secure
Cost-efficient
Safe to deploy and operate

You will play a critical role in building and maintaining the platform foundation that enables AI services to run safely and efficiently at scale.

Key Responsibilities

1. Infrastructure Provisioning & Automation
Design and manage cloud infrastructure using Infrastructure as Code (Terraform or similar)
Provision and maintain Kubernetes clusters and supporting services
Automate environment setup across development, staging, and production
Manage networking, IAM, secrets, storage, and compute scaling
Ensure high availability, resilience, and disaster recovery readiness

2. CI/CD & Deployment Engineering
Build and maintain CI/CD pipelines for AI services, agent frameworks, orchestrators, and model artifacts
Implement automated testing and reliability validation gates
Enable blue/green and canary deployments
Build safe rollback mechanisms for services and models
Integrate reliability and health checks into deployment workflows

3. Model & Agent Deployment Governance
Package, version, and deploy models into containerized environments
Manage model artifact storage and promotion across environments
Monitor model performance and detect degradation
Support retraining cycle integration and model refresh workflows
Ensure safe rollout and rollback of model versions
Implement monitoring for inference latency, throughput, and cost

4. Data Pipelines for Telemetry & Observability
Design and maintain data pipelines to ingest, clean, and process high-volume telemetry (logs, metrics, traces, events)
Enable structured telemetry for AI and orchestration workflows
Ensure reliability for real-time and batch processing
Optimize pipeline scalability and performance

5. AIOps Platform Integration
Evaluate, deploy, and integrate AIOps platforms
Improve anomaly detection, correlation, and alert intelligence
Reduce alert noise and improve signal quality
Integrate AIOps outputs into operational workflows and incident management

6. Intelligent Incident Automation
Automate incident detection and remediation workflows
Build self-healing scripts and intelligent runbooks
Reduce MTTD and MTTR through automation
Integrate AI-driven root cause analysis insights into operational tooling
Improve prevention of recurring incidents

7. Production Reliability & SRE Excellence
Define and manage SLIs, SLOs, and error budgets
Implement monitoring, dashboards, and alerting systems
Participate in on-call rotation
Lead incident triage and root cause analysis
Improve resilience, scaling, and failure handling
Implement circuit breakers, rate limits, and failover mechanisms

8. Security & Governance
Implement least-privilege access controls
Manage secrets and credential rotation
Enforce environment isolation
Ensure auditability and compliance for AI systems

Qualifications

Required Experience
5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles
Strong hands-on experience with cloud platforms (AWS, Azure, or GCP)
Proven expertise with Kubernetes and containerized workloads
Experience with Infrastructure as Code (Terraform, CloudFormation, etc.)
Strong CI/CD implementation experience (GitHub Actions, GitLab CI, Jenkins, etc.)
Experience building observability stacks (Prometheus, Grafana, OpenTelemetry, ELK, Datadog, etc.)
Experience defining and managing SLIs/SLOs and error budgets
Hands-on experience with incident response and production support
Strong scripting skills (Python, Bash, or similar)

AI/ML Platform Experience (Strongly Preferred)
Experience deploying and managing AI/ML services in production
Familiarity with model packaging, versioning, and artifact management
Understanding of model lifecycle management and retraining workflows
Experience monitoring inference performance, latency, and cost
Exposure to AIOps tools and intelligent alerting systems

Additional Skills
Strong understanding of distributed systems reliability patterns
Knowledge of security best practices in cloud-native environments
Experience implementing high-availability and disaster recovery strategies
Excellent problem-solving and root cause analysis skills
Strong communication skills and ability to collaborate across engineering and AI teams

Additional Information

Discover some of the global benefits that empower our people to become the best version of themselves:

Finance: Competitive salary package, share plan, company performance bonuses, value-based recognition awards, referral bonus;
Career Development: Career coaching, global career opportunities, non-linear career paths, internal development programmes for management and technical leadership;
Learning Opportunities: Complex projects, rotations, internal tech communities, training, certifications, coaching, online learning platform subscriptions, pass-it-on sessions, workshops, conferences;
Work-Life Balance: Hybrid work and flexible working hours, employee assistance programme;
Health: Global internal wellbeing programme, access to wellbeing apps;
Community: Global internal tech communities, hobby clubs and interest groups, inclusion and diversity programmes, events and celebrations.

At Endava, we're committed to creating an open, inclusive, and respectful environment where everyone feels safe, valued, and empowered to be their best. We welcome applications from people of all backgrounds, experiences, and perspectives, because we know that inclusive teams help us deliver smarter, more innovative solutions for our customers. Hiring decisions are based on merit, skills, qualifications, and potential. If you need adjustments or support during the recruitment process, please let us know.

Endava
