
Job Details

Job Title:

Tier 3 Engineer – PCAI & AI Factory Expert (8+ Years of Experience, Bangalore)

Job Description:

We are seeking a Subject Matter Expert (SME) – Admin, Operate & Manage (HPE PCAI & AI Factory Solutions) to manage and optimize HPE’s next-generation AI infrastructure platforms. The ideal candidate will have deep hands-on expertise in AI, HPC, and GPU-accelerated environments, with strong knowledge of HPE Cray, NVIDIA AI Enterprise, containerized workloads, and automation frameworks. This role focuses on the operational stability, lifecycle management, and continuous improvement of large-scale Private Cloud for AI (PCAI) and AI Factory deployments.

Qualifications & Experience:
• Bachelor’s / Master’s degree in computer science, IT, or an equivalent field.
• 8+ years of IT infrastructure administration experience, including 3+ years in AI/HPC or GPU-based environments.
• Proven experience in platform operations, monitoring, and lifecycle management of enterprise-grade AI and HPC environments.
• Hands-on experience in automation and orchestration across bare metal and containerized infrastructure.

Key Responsibilities:

1. Platform Administration
• Administer and maintain HPE PCAI and AI Factory environments, ensuring optimal uptime and performance.
• Manage compute nodes (HPE DL380a, DL325, Cray XD670), GPU clusters (NVIDIA L40S/H100/H200), and InfiniBand NDR networks.
• Administer virtualization and container platforms such as vSphere, RHEL/RHOS, and Kubernetes.
• Perform configuration, patching, version upgrades, and firmware updates across hardware and software layers.
• Manage cluster lifecycle through HPE Performance Cluster Manager (HPCM) and SLURM.

2. AI Platform & Software Operations
• Support NVIDIA AI Enterprise (NVAIE) components, including NeMo Curator, NeMo Customizer, NeMo Evaluator, NVIDIA NIM, NeMo Guardrails, and NeMo Retriever.

3. Operational Monitoring & Incident Management
• Handle alerts, performance anomalies, and incidents across GPU, network, and storage layers.
• Lead root cause analysis (RCA) and corrective action plans to prevent recurring issues.
• Maintain operational documentation, runbooks, and incident logs.

4. Continuous Improvement & Knowledge Enablement
• Optimize automation workflows to reduce manual intervention and improve service response time.
• Drive service health reviews, operational dashboards, and SLA compliance reporting.
• Conduct enablement sessions for L1/L2 teams and act as the final escalation point for operational issues.
• Collaborate with HPE Engineering for patch validation, release readiness, and operational feedback.

Required Skills & Technical Expertise:

Core Infrastructure Skills
• Strong knowledge of the NVIDIA GPU stack, InfiniBand NDR, and Spectrum-X switches.
• Administration of HPE DL380a, DL325, Cray XD670, and GPU-based compute environments.
• Experience in managing VAST, WEKA, or Alletra MP storage systems is an added advantage.

Software & Platform Operations
• Virtualization: vSphere, RHEL
• Containers: Kubernetes
• Automation: Ansible, AWX, SLURM

Preferred Certifications:
• HPE ASE / Master ASE (Compute, Storage, or Ezmeral)
• NVIDIA Certified Professional - NVAIE Certification
• RHCE / Kubernetes Administrator (CKA) / VMware VCP

Soft Skills:
• Strong analytical and troubleshooting capabilities.
• Excellent communication and collaboration skills across global teams.
• Ability to lead operations improvement initiatives and mentor support engineers.
• Focused on reliability, scalability, and service excellence.

For Internal Job Movement:
• Approval of the employee's current manager is required.
• Employees are expected to notify their manager prior to an interview.
• Employees on a Performance Improvement Plan are not eligible to apply.
• Minimum level should be EXP if applying as part of Internal Job Posting.

Job Code:

Job Location:

Bangalore

Experience:

8+ Years

Skill:

AI, HPC, and GPU-accelerated environments; HPE Cray; NVIDIA AI Enterprise