Spark basics

Spark Architecture

Spark follows a layered architecture that enables efficient, scalable distributed data processing. It consists of three main parts: the driver, the cluster manager, and the worker nodes.

  1. Driver Layer: The driver coordinates and manages the Spark application. It hosts the user program, builds the overall execution plan (a DAG of stages and tasks), and orchestrates the flow of data and computation by submitting tasks to the cluster.
  2. Cluster Manager: The cluster manager acquires and manages resources in the cluster. It allocates CPU and memory to the Spark application and monitors their usage. Supported cluster managers include Hadoop YARN, Kubernetes, Spark's built-in standalone mode, and Apache Mesos (deprecated since Spark 3.2).
  3. Worker Layer: The worker layer consists of worker nodes that execute tasks assigned by the driver. Each worker node runs tasks in parallel inside executors, manages the local resources allocated to it, and reports status back to the driver.
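The division of labor above can be sketched with a toy analogy in plain Python (this is not Spark code; `run_driver` and `square` are made-up names for illustration): the main thread plays the driver, planning one task per data element, while a thread pool stands in for the worker nodes executing tasks in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # A "task": runs on one worker, independently of the others.
    return n * n

def run_driver(data, num_workers=3):
    # The "driver": plans the work (one task per element) and hands
    # the tasks to a pool that stands in for the worker nodes.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(square, data))

print(run_driver([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

In real Spark the workers are separate JVM processes on separate machines and the cluster manager decides where they run, but the coordination pattern is the same.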

Spark Architecture Diagram

         +-----------------+
         |                 |
         |     Driver      |
         |                 |
         +-----------------+
                   |
                   |
            +--------------+
            |              |
            | Cluster      |
            | Manager      |
            |              |
            +--------------+
                   |
                   |
          +-----------------+
          |                 |
          |   Worker        |
          |   Nodes         |
          |                 |
          +-----------------+

The driver, cluster manager, and worker nodes are loosely coupled, allowing Spark to scale horizontally and handle large-scale data processing. This architecture, combined with Spark's in-memory processing and parallel computing capabilities, contributes to its high performance compared to disk-based systems like Hadoop MapReduce.

Spark’s layered architecture and distributed computing model make it a powerful tool for processing and analyzing large amounts of data efficiently.

Life cycle of a Spark application

Spark attributes

Terminologies:

Spark libraries

Languages

Driver node architecture

Worker node architecture

On-heap memory

Off-heap memory

RDD vs DataFrame vs Dataset

Similarities between RDD, DataFrame, and Dataset

Spark - Transformation and action

Transformation:

Action:
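In Spark, transformations (e.g. `map`, `filter`) are lazy: they only record how a new dataset would be derived from an existing one, and no computation happens until an action (e.g. `collect`, `count`) forces evaluation. A toy sketch of that lazy pattern using plain Python generators (not Spark code; `transform` and `action` are illustrative names):

```python
def transform(data):
    # "Transformation": lazy; building the generator does no work yet,
    # much like calling map() on an RDD or DataFrame.
    return (x * 2 for x in data)

def action(lazy_data):
    # "Action": forces evaluation and materializes the results,
    # much like collect().
    return list(lazy_data)

pipeline = transform([1, 2, 3])  # no computation has happened yet
print(action(pipeline))          # [2, 4, 6]
```

Chaining several transformations before a single action lets Spark optimize and execute the whole pipeline at once instead of materializing each intermediate result.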

Types of transformation:

© 2025 Jithendra Yenugula