Job Description:
We are seeking a Subject Matter Expert (SME) – Admin, Operate & Manage (HPE PCAI & AI Factory Solutions) to manage and optimize HPE’s next-generation AI infrastructure platforms. The ideal candidate will have deep hands-on expertise in AI, HPC, and GPU-accelerated environments, with strong knowledge of HPE Cray, NVIDIA AI Enterprise, containerized workloads, and automation frameworks. This role focuses on the operational stability, lifecycle management, and continuous improvement of large-scale Private Cloud for AI (PCAI) and AI Factory deployments.
Qualifications & Experience:
• Bachelor’s / Master’s degree in Computer Science, IT, or an equivalent field.
• 8+ years of IT infrastructure administration experience, including 3+ years in AI/HPC or GPU-based environments.
• Proven experience in platform operations, monitoring, and lifecycle management of enterprise-grade AI and HPC environments.
• Hands-on experience in automation and orchestration across bare metal and containerized infrastructure.
Key Responsibilities:
1. Platform Administration
• Administer and maintain HPE PCAI and AI Factory environments, ensuring optimal uptime and performance.
• Manage compute nodes (HPE DL380a, DL325, Cray XD670), GPU clusters (NVIDIA L40S/H100/H200), and InfiniBand NDR networks.
• Administer virtualization and container platforms such as vSphere, RHEL/RHOS, Kubernetes.
• Perform configuration, patching, version upgrades, and firmware updates across hardware and software layers.
• Manage cluster lifecycle through HPE Performance Cluster Manager (HPCM) and SLURM.
2. AI Platform & Software Operations
• Support NVIDIA AI Enterprise (NVAIE) components, including NeMo Curator, NeMo Customizer, NeMo Evaluator, NVIDIA NIM, NeMo Guardrails, and NeMo Retriever.
3. Operational Monitoring & Incident Management
• Handle alerts, performance anomalies, and incidents across GPU, network, and storage layers.
• Lead root cause analysis (RCA) and corrective action plans to prevent recurring issues.
• Maintain operational documentation, runbooks, and incident logs.
4. Continuous Improvement & Knowledge Enablement
• Optimize automation workflows to reduce manual intervention and improve service response time.
• Drive service health reviews, operational dashboards, and SLA compliance reporting.
• Conduct enablement sessions for L1/L2 teams and act as the final escalation point for operational issues.
• Collaborate with HPE Engineering for patch validation, release readiness, and operational feedback.
Required Skills & Technical Expertise:
Core Infrastructure Skills
• Strong knowledge of NVIDIA GPU stack, InfiniBand NDR, and Spectrum-X switches.
• Administration of HPE DL380a, DL325, Cray XD670, and GPU-based compute environments.
• Experience in managing VAST, WEKA, or Alletra MP storage systems is an added advantage.
Software & Platform Operations
• Virtualization: vSphere, RHEL
• Containers: Kubernetes
• Automation: Ansible, AWX, SLURM
Preferred Certifications:
• HPE ASE / Master ASE (Compute, Storage, or Ezmeral)
• NVIDIA Certified Professional - NVAIE Certification
• RHCE / Kubernetes Administrator (CKA) / VMware VCP
Soft Skills:
• Strong analytical and troubleshooting capabilities.
• Excellent communication and collaboration skills across global teams.
• Ability to lead operations improvement initiatives and mentor support engineers.
• Focused on reliability, scalability, and service excellence.
For Internal Job Movement:
• Approval of the employee's current manager is required.
• Employees are expected to notify their manager prior to an interview.
• Employees on a Performance Improvement Plan are not eligible to apply.
• Minimum level should be EXP if applying as part of Internal Job Posting.