Databricks + PySpark
Detailed Job Description for Databricks + PySpark Developer:
· Data Pipeline Development: Design, implement, and maintain scalable and efficient data pipelines using PySpark and Databricks for ETL processing of large volumes of data (a minimal sketch follows this list).
· Cloud Integration: Develop solutions leveraging Databricks on cloud platforms (AWS/Azure/GCP) to process and analyze data in a distributed computing environment.
· Data Modeling: Build robust data models, ensuring high-quality data integration and consistency across multiple data sources.
· Optimization: Optimize PySpark jobs for performance, ensuring the efficient use of resources and cost-effective execution.
· Collaborative Development: Work closely with data scientists, analysts, and other stakeholders to understand data requirements and deliver actionable insights.
· Automation & Monitoring: Automate routine pipeline operations and implement monitoring solutions for data pipeline health, performance, and failure detection.
· Documentation & Best Practices: Maintain comprehensive documentation of architecture, design, and code. Ensure adherence to best practices for data engineering, version control, and CI/CD processes.
· Mentorship: Provide guidance to junior data engineers and help with the design and implementation of new features and components.
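To make the pipeline-development responsibility above concrete, here is a minimal sketch of the kind of PySpark ETL job this role involves. It is an illustration under assumed inputs: the storage paths, the event_id/event_ts columns, and the partitioning scheme are all hypothetical, not part of any actual system.

```python
# Minimal ETL sketch; all paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events-etl-sketch").getOrCreate()

# Extract: raw JSON landed in cloud object storage (placeholder path)
raw = spark.read.json("s3://example-bucket/raw/events/")

# Transform: deduplicate, enforce types, and drop unusable rows
cleaned = (
    raw.dropDuplicates(["event_id"])                      # hypothetical key column
       .withColumn("event_ts", F.to_timestamp("event_ts"))
       .filter(F.col("event_ts").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))   # derived partition column
)

# Load: write partitioned Delta output (Databricks' default table format)
(cleaned.write
        .format("delta")
        .mode("overwrite")
        .partitionBy("event_date")
        .save("s3://example-bucket/curated/events/"))
```

Writing Delta output partitioned by a date column is a common Databricks pattern that keeps downstream reads selective; an equivalent job could register the result as a catalog table via saveAsTable instead of a path-based save.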
Required Skills & Qualifications:
· Experience: 6+ years of experience in data engineering or software engineering roles, with a strong focus on PySpark and Databricks.
· Technical Skills:
· Proficient in PySpark for distributed data processing and ETL pipelines.
· Experience working with Databricks for running Apache Spark workloads in a cloud environment.
· Solid knowledge of SQL and hands-on data wrangling and manipulation skills (see the sketch after this list).
· Experience with cloud platforms (AWS, Azure, or GCP) and their respective data storage services (S3, ADLS, Google Cloud Storage, etc.).
· Familiarity with data lakes, data warehouses, and NoSQL databases (e.g., MongoDB, Cassandra, HBase).
· Experience with orchestration and transformation tools such as Apache Airflow, Azure Data Factory, or dbt.
· Familiarity with containerization (Docker, Kubernetes) and DevOps practices.
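As an illustration of the SQL and data-wrangling skills listed above, the sketch below joins a hypothetical fact table to a small dimension table with a broadcast join (a standard Spark technique for avoiding shuffle-heavy sort-merge joins), then runs the same aggregation through both the DataFrame API and plain SQL. All paths, table names, and columns (dim_id, amount, dim_name, event_date) are assumptions made for the example.

```python
# Wrangling/optimization sketch; every path and column below is hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wrangling-sketch").getOrCreate()

facts = spark.read.format("delta").load("/mnt/example/facts")
dims = spark.read.format("delta").load("/mnt/example/dims")

# Broadcasting the small dimension table avoids shuffling the large fact table.
joined = facts.join(F.broadcast(dims), on="dim_id", how="left")

# Cache only because the joined result is reused by two actions below.
joined.cache()

# Aggregation with the DataFrame API
daily = (joined.groupBy("event_date", "dim_name")
               .agg(F.sum("amount").alias("total_amount")))

# The same aggregation expressed in SQL, for teams that prefer it
joined.createOrReplaceTempView("joined_events")
daily_sql = spark.sql("""
    SELECT event_date, dim_name, SUM(amount) AS total_amount
    FROM joined_events
    GROUP BY event_date, dim_name
""")

# Repartition before writing to control the number and size of output files.
daily.repartition(8).write.format("delta").mode("overwrite").save("/mnt/example/daily_summary")
```

Broadcasting the small side of a join and controlling output file counts with repartition are typical first steps when tuning the kinds of performance bottlenecks described under the Optimization and Problem Solving items.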
· Problem Solving: Strong ability to troubleshoot and debug issues related to distributed computing, performance bottlenecks, and data quality.
· Version Control: Proficient in Git-based workflows and version control.
· Communication Skills: Excellent written and verbal communication skills, with the ability to explain complex technical concepts to both technical and non-technical stakeholders.
· Education: Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).