Site Reliability Engineer - Data Platforms
Lex
Software Engineering
Hyderabad, Telangana, India
Posted on May 6, 2026
Summary
Join the AI and Data Platforms team at Apple, where we build and manage cloud-based data platforms handling petabytes of data at scale. We are looking for a passionate and independent Software Engineer specializing in reliability engineering for data platforms, with a strong understanding of data and ML systems. If you thrive in a fast-paced environment, love crafting solutions that don't yet exist, and possess excellent communication skills to collaborate across diverse teams, we invite you to contribute to Apple’s high standards in an exciting and dynamic setting.
Description
As part of our team, you will be responsible for developing and operating our big data platform, using open source or other solutions, to support critical applications such as analytics, reporting, and AI/ML apps. This includes optimizing performance and cost, automating operations, and identifying and resolving production errors and issues to ensure the best data platform experience.
Responsibilities
- Develop and operate large-scale big data platforms using open source and other solutions.
- Support critical applications including analytics, reporting, and AI/ML apps.
- Optimize platform performance and cost efficiency.
- Automate operational tasks for big data systems.
- Identify and resolve production errors and issues to ensure platform reliability and user experience.
Qualifications
- 3+ years of experience in software engineering or site reliability engineering roles supporting large-scale data platforms.
- Strong programming skills in at least one of Java, Scala, or Python, with a focus on building tooling, automation frameworks, or reliability platforms.
- Hands-on experience operating and troubleshooting distributed data processing systems (e.g., Apache Spark) in production environments.
- Experience in incident management, including troubleshooting and root cause analysis in complex production environments.
- Experience working with containerized environments and orchestration platforms such as Kubernetes in production.
- Understanding of reliability fundamentals: observability (metrics, logs, tracing), alerting, SLAs/SLOs, and system health monitoring.
- Working knowledge of big data ecosystems, including familiarity with tools such as Apache Hive, Hive Metastore (HMS), and Iceberg REST Catalog (IRC), and how they interact with data lake architectures.
- Experience with table formats and data lake technologies such as Apache Iceberg, with a focus on scalability, reliability, and query performance.
- Contributions to open-source projects or active participation in the data engineering ecosystem.
- Familiarity with cloud platforms (AWS, GCP, Azure) and distributed storage systems.
- Experience with workflow orchestration tools (e.g., Airflow, dbt) and data pipeline reliability patterns.
- Understanding of data modelling, partitioning strategies, and data warehousing concepts.
- Exposure to ML/AI infrastructure (e.g., MLflow, GPUs, LLM workloads).
- Strong grasp of software engineering best practices, including CI/CD, testing, and secure coding.
- Growth mindset and ability to improve system reliability, team practices, and operational maturity.