
Happy almost-six-month anniversary to my first course, PySpark Essentials, which launched on LinkedIn Learning last year! If you’ve always wanted to try your hand at PySpark, this is a great course for beginners to get hands-on with data engineering basics in PySpark.
Here’s the course description:
PySpark is a powerful library that brings Apache Spark’s distributed computing capabilities to Python, making it a key tool for processing large-scale data efficiently. In this course, data engineer and analyst Sam Bail provides a structured and hands-on introduction to PySpark, starting with an overview of Apache Spark, its architecture, and its ecosystem. Learn about Spark’s core concepts, such as the DataFrame API, transformations, lazy evaluation, and actions, before setting up a lab environment and working with a real dataset. Plus, gain insights into how PySpark fits into a broader data engineering ecosystem, and learn best practices for running PySpark in a production environment.
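To give you a flavor of those core concepts, here’s a minimal sketch of my own (not taken from the course materials) showing the DataFrame API, a lazy transformation, and an action. It assumes PySpark is already installed and a local Spark session is available:

```python
# A minimal sketch (not from the course) of the core concepts mentioned
# above: the DataFrame API, transformations, lazy evaluation, and actions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pyspark-essentials-demo").getOrCreate()

# DataFrame API: create a small DataFrame from in-memory data.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    schema=["name", "age"],
)

# Transformations are lazy: this line only builds a query plan;
# nothing is actually computed yet.
over_30 = df.filter(F.col("age") > 30).select("name")

# Actions trigger execution: show() forces Spark to run the plan.
over_30.show()

spark.stop()
```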
Learning objectives:
- Build your understanding of the core concepts of Spark and PySpark.
- Understand how to install PySpark and how to load, manipulate, and analyze large datasets in a notebook environment.
- Gain an understanding of how PySpark fits into a wider data engineering ecosystem.
- Understand best practices for running PySpark in a production environment.
The entire course is around 1 hour and 18 minutes long and includes hands-on exercises in a Google Colab notebook. You can watch it here with a LinkedIn Learning subscription!
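If you want to try a notebook setup yourself before diving in, a typical first cell in a Google Colab notebook looks something like this (my own sketch, not the course’s exercise files):

```python
# Sketch of a first Colab cell (not the course's exercise files):
# install PySpark into the notebook environment, then start a local session.
!pip install pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("colab-demo").getOrCreate()
print(spark.version)  # confirm the installation worked
```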