Senior Python / PySpark Developer
We are seeking a skilled and proactive Python / PySpark Developer to join our data engineering and analytics team. The ideal candidate will be responsible for building scalable data pipelines, performing large-scale data processing, and collaborating with data scientists, analysts, and business stakeholders.
Key Responsibilities:
- Design, develop, and optimize ETL data pipelines using PySpark on big data platforms (e.g., Hadoop, Databricks, EMR).
- Write clean, efficient, and modular code in Python for data processing and integration tasks.
- Work with large datasets to extract insights, transform raw data, and ensure data quality.
- Collaborate with cross-functional teams to understand business requirements and translate them into technical solutions.
- Implement performance tuning and debugging of PySpark jobs.
- Monitor and troubleshoot data workflows and batch jobs in production environments.
- Document solutions and maintain code repositories (e.g., Git).
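The responsibilities above center on the extract-transform-load (ETL) pattern. As a rough illustration only (the data, field names, and quality rule below are hypothetical, and a real PySpark job would replace each step with distributed DataFrame operations reading from storage such as HDFS or S3), the shape of such a pipeline is:

```python
import csv
import io

# Hypothetical raw input; in a PySpark job this would be read from HDFS/S3.
RAW = "id,amount\n1,10.5\n2,\n3,7.25\n"

def extract(raw: str) -> list[dict]:
    """Extract: parse CSV rows into dicts."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: apply a data-quality filter and cast types."""
    return [
        {"id": int(r["id"]), "amount": float(r["amount"])}
        for r in rows
        if r["amount"]  # drop rows with a missing amount
    ]

def load(rows: list[dict]) -> float:
    """Load: aggregate here for brevity; a real job writes to a warehouse."""
    return sum(r["amount"] for r in rows)

print(load(transform(extract(RAW))))  # 17.75 (row 2 dropped by the filter)
```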
Required Skills & Qualifications:
- Proficient in Python with experience in building data-centric applications.
- Strong experience with PySpark and understanding of Spark internals (RDDs, DataFrames, Spark SQL).
- Hands-on experience with the Hadoop ecosystem, Hive, or cloud-based big data platforms like AWS EMR, Azure Databricks, or GCP Dataproc.
- Familiarity with workflow orchestration tools like Airflow, Oozie, or Luigi.
- Good understanding of SQL and relational databases.
- Experience with version control systems like Git.
- Strong problem-solving skills and ability to work independently or in a team.
- Bachelor's degree in Computer Science, Engineering, or a related field.
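The SQL requirement above amounts to being comfortable expressing aggregations the way Spark SQL runs them at scale. A self-contained sketch using the standard-library sqlite3 module (the table and column names are made up for illustration):

```python
import sqlite3

# In-memory database standing in for a relational source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", 5.0), (3, "east", 7.5)],
)
# The kind of GROUP BY aggregation Spark SQL distributes across a cluster.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 17.5), ('west', 5.0)]
```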
Preferred Qualifications:
- Experience with CI/CD pipelines and DevOps practices.
- Knowledge of data warehousing and data modeling.
- Exposure to streaming technologies (e.g., Kafka, Spark Streaming).
- Familiarity with containerization and orchestration tools like Docker or Kubernetes.