Design and build scalable AI platform capabilities for training, fine-tuning, inference, evaluation, and experimentation.
Develop and operate shared platform services for model serving, vector databases, feature/data access, prompt and agent workflows, GPU workload orchestration, and secrets and configuration management.
Build reusable MLOps/LLMOps pipelines for model packaging, deployment, rollback, versioning, and lifecycle management.
Enable secure deployment and operation of open-source models, commercial model APIs, retrieval-augmented generation systems, and agent-based workloads.
Create internal self-service tooling and templates for AI application teams.
Implement platform controls for authentication and authorization, rate limiting and quota management, audit logging, data protection, policy enforcement, and guardrails.
Build observability for AI workloads, including latency, throughput, token usage, GPU utilization, model/system health, and drift and quality indicators.
Improve reliability and efficiency of AI infrastructure through automation, SRE practices, and performance tuning.
Partner with data scientists, software engineers, architects, security teams, and business stakeholders to translate AI use cases into robust platform capabilities.
Define standards and best practices for AI platform architecture, CI/CD, monitoring, governance, and operations.
Support evaluation and integration of emerging AI infrastructure technologies, frameworks, and tools.
Candidate requirements
Strong experience with Kubernetes and containerized workloads in production
5+ years of experience in production infrastructure as a Platform, SRE, DevOps, or MLOps engineer
Strong scripting and automation skills in Go and/or Python, enough to write a Kubernetes controller or a non-trivial operational tool
Experience with CI/CD pipelines, infrastructure as code, and GitOps: GitLab CI, Terraform, Ansible, Helm, Argo CD or Flux
Solid understanding of Linux, networking, storage, and security
Experience with monitoring and observability tools such as Prometheus, Grafana, OpenTelemetry
Understanding of the ML lifecycle: training, deployment, inference, evaluation, and monitoring
Experience building or operating shared platforms used by multiple teams
Ability to work closely with data scientists, ML engineers, software engineers, and security teams
Depth in at least one of: GPU infrastructure and inference ops (vLLM, NVIDIA GPU Operator, MIG, quantization, inference performance tuning); platform SRE at scale (multi-tenant Kubernetes, 99.9%+ SLOs, capacity planning); MLOps (model registry, deployment pipelines, canary and shadow rollouts, evaluation, Langfuse or equivalent LLM observability); or API gateway operations (Kong, Envoy, or Istio at production scale, plugin development, request-path performance tuning)