100 Most Repeated Apache Spark Interview Questions & Answers for 2021


Do you want to put your Apache Spark skills to work in a new job? Then you will first need to get through the interview, and that means questions, lots of them. Don't worry, we are here to help: below are 100 of the top Apache Spark interview questions and answers.

Big data, analytics, and data-driven decision making have made considerable progress in recent years. Worldwide revenues for big data and business analytics (BDA) are expected to grow from $130.1 billion in 2016 to more than $203 billion in 2020 (source: IDC). Preparing these top Apache Spark interview questions will help you take stock of a rapidly evolving big data market, in which companies large and small, global and local, are competing for high-quality big data and Hadoop experts.

Would you like to make a difference in your career? See the Top Technology Trends Report.

As a Big Data professional, it is important to understand the terminology and technologies associated with this field, including Apache Spark, one of the most widely used technologies in the Big Data sector. Go through these Apache Spark interview questions to prepare for job interviews and get started in your big data career:

What can you expect from a Spark interview?

The Apache Spark questions in interviews are primarily technical, aimed at gauging how well you understand Spark's functions and processes. As a Spark practitioner, much of the interview will likely revolve around Spark-specific questions and answers, but you should also be able to handle general interview questions.

Spark questions can involve demonstrating hands-on knowledge of the system, so consider reviewing Apache Spark's programming model and preparing examples of work you have successfully completed. Decide when it makes sense to answer from experience: some questions only need a simple, succinct description, while others call for more detail about what you have done.

The most commonly asked Apache Spark interview questions are as follows:

Question 1: What is Shark?

Most data users know only SQL and are not good at programming. Shark is a tool developed for people who come from a database background; it lets them access Scala MLlib capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.

Question 2: List some use cases where Spark outperforms Hadoop in processing.

  • Sensor Data Processing – Apache Spark's in-memory computing works best here, as data is retrieved and combined from different sources.
  • Real-Time Querying – Spark is preferred over Hadoop for real-time querying of data.
  • Stream Processing – For processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution.

Question 3: Compare Spark vs Hadoop MapReduce

Criteria | Hadoop MapReduce | Apache Spark
Memory | Does not leverage the memory of the Hadoop cluster to the maximum. | Saves data in memory with the use of RDDs.
Disk usage | MapReduce is disk oriented. | Spark caches data in-memory and ensures low latency.
Processing | Only batch processing is supported. | Supports real-time processing through Spark Streaming.
Installation | Is bound to Hadoop. | Is not bound to Hadoop.

Simplicity, Flexibility and Performance are the major advantages of using Spark over Hadoop.

  • Spark is up to 100 times faster than Hadoop for big data processing as it stores the data in-memory, by placing it in Resilient Distributed Datasets (RDDs).
  • Spark is easier to program as it comes with an interactive mode.
  • It provides complete recovery using lineage graph whenever something goes wrong.

Question 4: What is a Sparse Vector?

A sparse vector has two parallel arrays –one for indices and the other for values. These vectors are used for storing non-zero entries to save space.
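
For illustration, here is a minimal Scala sketch (assuming Spark's ML linear algebra package) showing how a sparse vector pairs an indices array with a values array:

import org.apache.spark.ml.linalg.Vectors

// A vector of size 5 whose only non-zero entries are 2.0 at index 1 and 4.0 at index 3.
// Internally it stores the two parallel arrays: indices Array(1, 3) and values Array(2.0, 4.0).
val sv = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
println(sv)   // (5,[1,3],[2.0,4.0])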

Question 5: What is RDD?

RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark that represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters, in a fault tolerant manner. RDDs are read-only, partitioned collections of records that are:

  • Immutable – RDDs cannot be altered.
  • Resilient – If a node holding a partition fails, another node can recompute the data.

Question 6: Explain about transformations and actions in the context of RDDs.

Transformations are functions executed on demand, to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter and reduceByKey.

Actions are the results of RDD computations or transformations. After an action is performed, the data from RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.
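
As a quick illustration, here is a minimal Scala sketch (assuming an existing SparkContext sc) showing lazy transformations followed by actions:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))   // base RDD
val doubled = nums.map(_ * 2)                   // transformation: builds a new RDD lazily
val bigOnes = doubled.filter(_ > 4)             // transformation: still nothing has executed
val total = bigOnes.reduce(_ + _)               // action: triggers the computation, returns 24
val firstTwo = doubled.take(2)                  // action: returns Array(2, 4) to the driver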

Question 7: What are the languages supported by Apache Spark for developing big data applications?

Apache Spark supports the following four languages: Scala, Java, Python and R. Among these, Scala and Python have interactive shells for Spark: the Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark. Scala is the most widely used of them because Spark itself is written in Scala.

Question 8: Can you use Spark to access and analyse data stored in Cassandra databases?

Yes, it is possible if you use Spark Cassandra Connector.

Question 9: Is it possible to run Apache Spark on Apache Mesos?

Yes, Apache Spark can be run on the hardware clusters managed by Mesos.

Question 10: What are benefits of Spark over MapReduce?

Spark has the following benefits over MapReduce:

  • Due to the availability of in-memory processing, Spark implements the processing around 10 to 100 times faster than Hadoop MapReduce whereas MapReduce makes use of persistence storage for any of the data processing tasks.
  • Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation while there is no iterative computing implemented by Hadoop.
  • Unlike Hadoop, Spark provides inbuilt libraries to perform multiple tasks on the same core, such as batch processing, streaming, machine learning and interactive SQL queries. However, Hadoop only supports batch processing.
  • Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.

Question 11: Is there any benefit of learning MapReduce if Spark is better than MapReduce?

Yes. MapReduce is a paradigm used by many big data tools, including Spark. It remains relevant as data grows bigger and bigger, and most tools like Pig and Hive convert their queries into MapReduce phases to optimize them better.

Question 12: Explain about the different cluster managers in Apache Spark

The three cluster managers supported in Apache Spark are:

  • YARN
  • Apache Mesos -Has rich resource scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
  • Standalone deployments – Well suited for new deployments, which are simple to run and easy to set up.

Question 13: How can Spark be connected to Apache Mesos?

To connect Spark with Mesos (a configuration sketch follows this list):

  • Configure the spark driver program to connect to Mesos. Spark binary package should be in a location accessible by Mesos. (or)
  • Install Apache Spark in the same location as that of Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.
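
A minimal Scala configuration sketch for the second option, assuming a hypothetical Mesos master URL and install path:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MesosExample")
  .setMaster("mesos://mesos-master.example.com:5050")   // hypothetical Mesos master URL
  .set("spark.mesos.executor.home", "/opt/spark")       // where Spark is installed on the Mesos agents
val sc = new SparkContext(conf)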

Question 14: What is YARN?

YARN (Yet Another Resource Negotiator) is Hadoop's central resource management platform, and running on it is one of the key features of Spark, delivering scalable operations across the cluster. YARN is a distributed container manager, like Mesos, whereas Spark is a data processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.

Question 15: How can you minimize data transfers when working with Spark?

Minimizing data transfers and avoiding shuffling helps write spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:

  • Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large RDDs.
  • Using Accumulators – Accumulators help update the values of variables in parallel while executing.
  • The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.

Question 16:  Why is there a need for broadcast variables when working with Apache Spark?

Broadcast variables are read-only variables, cached in memory on every machine. When working with Spark, using broadcast variables eliminates the need to ship copies of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory, which enhances retrieval efficiency when compared to an RDD lookup().
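
A minimal sketch, assuming an existing SparkContext sc, of broadcasting a small lookup table so every task reuses the same copy:

val countryNames = Map("IN" -> "India", "US" -> "United States")   // small lookup table
val countryNamesBc = sc.broadcast(countryNames)                    // shipped once per executor, not per task

val users = sc.parallelize(Seq(("alice", "IN"), ("bob", "US")))
val resolved = users.map { case (name, code) =>
  (name, countryNamesBc.value.getOrElse(code, "Unknown"))          // read the broadcast value inside a task
}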

Question 17:  Is it possible to run Spark and Mesos along with Hadoop?

Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.

Question 18: What is lineage graph?

The RDDs in Spark, depend on one or more other RDDs. The representation of dependencies in between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of persistent RDD is lost, the data that is lost can be recovered using the lineage graph information.

Question 19: What is PageRank in GraphX?

PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v's importance by u. For example, if a Twitter user is followed by many others, that user will be ranked highly.

GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank Object. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). GraphOps allows calling these algorithms directly as methods on Graph.
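
A minimal GraphX sketch, assuming an existing SparkContext sc and a hypothetical edge-list file followers.txt containing "srcId dstId" pairs:

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "followers.txt")       // hypothetical input path
val staticRanks = graph.staticPageRank(numIter = 10).vertices   // fixed number of iterations
val dynamicRanks = graph.pageRank(tol = 0.0001).vertices        // iterate until ranks converge within the tolerance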

Question 20: How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.

Question 21: Explain about the major libraries that constitute the Spark Ecosystem

  • Spark MLlib – Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
  • Spark Streaming – This library is used to process real-time streaming data.
  • Spark GraphX – Spark API for graph-parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.
  • Spark SQL – Helps execute SQL-like queries on Spark data using standard visualization or BI tools.

Question 22: What are the benefits of using Spark with Apache Mesos?

It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.

Question 23: What is the significance of Sliding Window operation?

In networking, a sliding window controls the transmission of data packets between computer networks. In Spark, the Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
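
A minimal Spark Streaming sketch of a windowed word count, assuming an existing SparkContext sc and a hypothetical socket source on localhost:9999; the window is 30 seconds long and slides every 10 seconds:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))        // 10-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)    // hypothetical source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))   // 30s window, 10s slide
counts.print()
ssc.start()
ssc.awaitTermination()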

Question 24: What is a DStream?

A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represent a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two types of operations:

  • Transformations that produce a new DStream.
  • Output operations that write data to an external system.

Question 25: When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?

Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

Question 26: Is there an API for implementing graphs in Spark?

GraphX is the Spark API for graphs and graph-parallel computation. Thus, it extends the Spark RDD with a Resilient Distributed Property Graph.

The property graph is a directed multi-graph which can have multiple edges in parallel. Every edge and vertex have user defined properties associated with it. Here, the parallel edges allow multiple relationships between the same vertices. At a high-level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.

To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

Question 27: What is a Parquet file?

Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it to be one of the best big data analytics formats so far.

Parquet is a columnar format, supported by many data processing systems. The advantages of columnar storage are as follows (a short read/write sketch follows the list):

  • Columnar storage limits IO operations.
  • It can fetch specific columns that you need to access.
  • Columnar storage consumes less space.
  • It gives better-summarized data and follows type-specific encoding.
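
A minimal Spark SQL sketch, assuming an existing SparkSession spark and a hypothetical DataFrame df that contains a name column:

df.write.parquet("people.parquet")                   // write the DataFrame as columnar Parquet files
val parquetDF = spark.read.parquet("people.parquet")
parquetDF.select("name").show()                      // only the needed column is read from storage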

Question 28: What is Catalyst framework?

Catalyst framework is a new optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.

Question 29: Name a few companies that use Apache Spark in production.

Pinterest, Conviva, Shopify, and OpenTable.

Question 30: Why is BlinkDB used?

BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time.

Question 31: How can you compare Hadoop and Spark in terms of ease of use?

Hadoop MapReduce requires programming in Java which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time. Spark has interactive APIs for different languages like Java, Python or Scala and also includes Shark i.e. Spark SQL for SQL lovers – making it comparatively easier to use than Hadoop.

Question 32: What are the common mistakes developers make when running Spark applications?

Developers often make the mistake of-

  • Hitting the web service several times by using multiple clusters.
  • Running everything on the local node instead of distributing it.

Developers need to be careful with this, as Spark makes use of memory for processing.

Question 33: How is Streaming implemented in Spark? Explain with examples.

Spark Streaming is used for processing real-time streaming data. Thus it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) used to process the real-time data. Data from sources like Flume and HDFS is streamed and finally written out to file systems, live dashboards and databases. It is similar to batch processing in that the incoming data is divided into small batches before processing.

Question 34: What is the advantage of a Parquet file?

A Parquet file is a columnar format file that helps to:

  • Limit I/O operations
  • Consume less space
  • Fetch only the required columns

Question 35: What are the various data sources available in SparkSQL?

  • Parquet file
  • JSON Datasets
  • Hive tables

Question 36: What are the key features of Apache Spark that you like?

The key features of Apache Spark are as follows:

  • Polyglot
  • Machine Learning
  • Multiple Format Support
  • Speed
  • Lazy Evaluation
  • Hadoop Integration
  • Real Time Computation

Polyglot: Spark provides high-level APIs in Java, Scala, Python and R. Spark code can be written in any of these four languages. It provides a shell in Scala and Python. The Scala shell can be accessed through ./bin/spark-shell and Python shell through ./bin/pyspark from the installed directory.

Machine Learning: Spark’s MLlib is the machine learning component which is handy when it comes to big data processing. It eradicates the need to use multiple tools, one for processing and one for machine learning. Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.

Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra. The Data Sources API provides a pluggable mechanism for accessing structured data though Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.

Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Spark is able to achieve this speed through controlled partitioning. It manages data using partitions that help parallelize distributed data processing with minimal network traffic.

Lazy Evaluation: Apache Spark delays its evaluation till it is absolutely necessary. This is one of the key factors contributing to its speed. For transformations, Spark adds them to a DAG of computation and only when the driver requests some data, does this DAG actually gets executed.

Hadoop Integration: Apache Spark provides smooth compatibility with Hadoop. This is a great boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, and it has the ability to run on top of an existing Hadoop cluster using YARN for resource scheduling.

Real Time Computation: Spark’s computation is real-time and has less latency because of its in-memory computation. Spark is designed for massive scalability and the Spark team has documented users of the system running production clusters with thousands of nodes and supports several computational models.

Question 37: What do you understand by Pair RDD?

Special operations can be performed on RDDs in Spark using key/value pairs, and such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that aggregates data based on each key and a join() method that combines different RDDs together, based on the elements having the same key.
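
A minimal sketch of both methods, assuming an existing SparkContext sc:

val sales = sc.parallelize(Seq(("apple", 2), ("banana", 1), ("apple", 3)))   // pair RDD of (key, quantity)
val prices = sc.parallelize(Seq(("apple", 0.5), ("banana", 0.25)))           // pair RDD of (key, price)

val totals = sales.reduceByKey(_ + _)   // ("apple", 5), ("banana", 1)
val joined = totals.join(prices)        // ("apple", (5, 0.5)), ("banana", (1, 0.25))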

Question 38: Which one will you choose for a project –Hadoop MapReduce or Apache Spark?

The answer to this question depends on the given project scenario. As is well known, Spark makes use of memory instead of network and disk I/O. However, Spark uses a large amount of RAM and requires dedicated machines to produce effective results. So the decision to use Hadoop or Spark varies dynamically with the requirements of the project and the budget of the organization.

Question 39: Explain about the different types of transformations on DStreams?

  • Stateless Transformations – Processing of a batch does not depend on the output of the previous batch. Examples: map(), reduceByKey(), filter().
  • Stateful Transformations – Processing of a batch depends on the intermediary results of the previous batch. Examples: transformations that depend on sliding windows.

Question 40: Name types of Cluster Managers in Spark.

The Spark framework supports three major types of Cluster Managers:

  • Standalone: A basic manager to set up a cluster.
  • Apache Mesos: Generalized/commonly-used cluster manager, also runs Hadoop MapReduce and other applications.
  • YARN: Responsible for resource management in Hadoop.

Question 41: Explain about the popular use cases of Apache Spark

Apache Spark is mainly used for

  • Iterative machine learning.
  • Interactive data analytics and processing.
  • Stream processing
  • Sensor data processing

Question 42: Is Apache Spark a good fit for Reinforcement learning?

No. Apache Spark works well only for simple machine learning algorithms like clustering, regression, classification.

Question 43: What is Spark Core?

It has all the basic functionalities of Spark, like – memory management, fault recovery, interacting with storage systems, scheduling tasks, etc.

Question 44: How can you remove the elements with a key present in any other RDD?

Use the subtractByKey() function.
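
A minimal sketch, assuming an existing SparkContext sc:

val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val rdd2 = sc.parallelize(Seq(("b", 99)))
val result = rdd1.subtractByKey(rdd2)   // keeps only ("a", 1) and ("c", 3)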

Question 45: Name the components of Spark Ecosystem.

  • Spark Core: Base engine for large-scale parallel and distributed data processing
  • Spark Streaming: Used for processing real-time streaming data
  • Spark SQL: Integrates relational processing with Spark’s functional programming API
  • GraphX: Graphs and graph-parallel computation
  • MLlib: Performs machine learning in Apache Spark

Question 46: What is the difference between persist() and cache()

persist() allows the user to specify the storage level, whereas cache() uses the default storage level (MEMORY_ONLY).
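
A minimal sketch, assuming an existing SparkContext sc and hypothetical input paths:

import org.apache.spark.storage.StorageLevel

val cached = sc.textFile("logs.txt").cache()                                      // default level: MEMORY_ONLY
val persisted = sc.textFile("events.txt").persist(StorageLevel.MEMORY_AND_DISK)   // explicit level, spills to disk if needed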

Question 47: What is RDD Lineage?

Spark does not support data replication in memory and thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is the process that reconstructs lost data partitions. The best part is that an RDD always remembers how to build itself from other datasets.

Question 48: What are the various levels of persistence in Apache Spark?

Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on the RDD in case they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both with different replication levels.

The various storage/persistence levels in Spark are:

  • MEMORY_ONLY
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
  • DISK_ONLY
  • OFF_HEAP

Question 49: How Spark handles monitoring and logging in Standalone mode?

Spark has a web based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.

Question 50: What operations does RDD support?

An RDD (Resilient Distributed Dataset) is the main logical data unit in Spark. An RDD is a distributed collection of objects: each RDD is divided into multiple partitions, and each of these partitions can reside in memory or be stored on the disk of different machines in a cluster. RDDs are immutable (read-only) data structures; you can't change the original RDD, but you can always transform it into a different RDD with all the changes you want.

RDDs support two types of operations: transformations and actions. 

Actions: Actions return the final results of RDD computations. An action triggers execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final results to the driver program or write them out to the file system.

Transformations: Transformations create a new RDD from an existing RDD, like the map, reduceByKey and filter we just saw. Transformations are executed on demand, which means they are computed lazily.

Question 51: Does Apache Spark provide check pointing?

Lineage graphs are always useful to recover RDDs from a failure, but this is generally time consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist. However, the decision on which data to checkpoint is left to the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
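
Alongside persisting with replication, Spark also exposes an explicit RDD checkpoint API; a minimal sketch, assuming an existing SparkContext sc and a hypothetical HDFS checkpoint directory:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")                    // fault-tolerant storage for checkpoint files
val rdd = sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 3 == 0)
rdd.checkpoint()                                                  // marks the RDD for checkpointing, truncating its lineage
rdd.count()                                                       // an action materializes the RDD and writes the checkpoint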

Question 52: What is Spark Driver?

Spark Driver is the program that runs on the master node of the machine and declares transformations and actions on data RDDs. In simple terms, a driver in Spark creates SparkContext, connected to a given Spark Master.
The driver also delivers the RDD graphs to Master, where the standalone cluster manager runs.

Question 53: How can you launch Spark jobs inside Hadoop MapReduce?

Using SIMR (Spark in MapReduce) users can run any spark job inside MapReduce without requiring any admin rights.

Question 54: What is Spark Executor?

When SparkContext connects to a cluster manager, it acquires an Executor on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker node. The final tasks by SparkContext are transferred to executors for their execution.

Question 55: Define Actions in Spark.

An action helps in bringing back the data from an RDD to the local machine. An action's execution is the result of all previously created transformations. An action triggers execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final results to the driver program or write them out to the file system.

reduce() is an action that applies the passed function repeatedly until one value is left. take(n) is an action that returns the first n elements of the RDD to the local node.

moviesData.saveAsTextFile("MoviesData.txt")

As shown here, the moviesData RDD is saved into a text file called MoviesData.txt.

Question 56: How does Spark use Akka?

Spark uses Akka primarily for scheduling. All the workers request a task from the master after registering, and the master simply assigns the task. Spark uses Akka for messaging between the workers and the master.

Question 57: How can you achieve high availability in Apache Spark?

  • Implementing single node recovery with local file system
  • Using StandBy Masters with Apache ZooKeeper.

Question 58: Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?

Data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. RDD always has the information on how to build from other datasets. If any partition of a RDD is lost due to failure, lineage helps build only that particular lost partition.

Question 59: Explain about the core components of a distributed Spark application.

  • Driver – The process that runs the main() method of the program to create RDDs and perform transformations and actions on them.
  • Executor –The worker processes that run the individual tasks of a Spark job.
  • Cluster Manager-A pluggable component in Spark, to launch Executors and Drivers. The cluster manager allows Spark to run on top of other external managers like Apache Mesos or YARN.

Question 60: What do you understand by Lazy Evaluation?

Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, but does nothing unless asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
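
A minimal sketch of lazy evaluation, assuming an existing SparkContext sc and a hypothetical log file:

val lines = sc.textFile("server.log")            // nothing is read from disk yet
val errors = lines.filter(_.contains("ERROR"))   // transformation: only recorded in the DAG
val count = errors.count()                       // action: the file is finally read and filtered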

Question 61: How is Spark SQL different from HQL and SQL?

Spark SQL is a special component on top of the Spark Core engine that supports SQL and Hive Query Language without changing any syntax. It is possible to join SQL tables and HQL tables in Spark SQL.

Question 62: Define a worker node.

A node that can run the Spark application code in a cluster is called a worker node. A worker node can have more than one worker, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.

Question 63: Illustrate some demerits of using Spark.

The following are some of the demerits of using Apache Spark:

  • Since Spark utilizes more storage space compared to Hadoop and MapReduce, there may arise certain problems.
  • Developers need to be careful while running their applications in Spark.
  • Instead of running everything on a single node, the work must be distributed over multiple clusters.
  • Spark’s “in-memory” capability can become a bottleneck when it comes to cost-efficient processing of big data.
  • Spark consumes a huge amount of memory when compared to Hadoop.

Question 64: What do you understand by SchemaRDD?

An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.

Question 65: How does Spark use Akka?

Spark uses Akka primarily for scheduling. All the workers request a task from the master after registering, and the master simply assigns the task. Spark uses Akka for messaging between the workers and the master.

Question 66: What file systems does Spark support?

The following three file systems are supported by Spark:

  • Hadoop Distributed File System (HDFS)
  • Local file system
  • Amazon S3


Question 67: List some use cases where Spark outperforms Hadoop in processing.

  • Sensor Data Processing: Apache Spark’s “In-memory” computing works best here, as data is retrieved and combined from different sources.
  • Real Time Processing: Spark is preferred over Hadoop for real-time querying of data, e.g., stock market analysis, banking, healthcare, telecommunications, etc.
  • Stream Processing: For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best solution.
  • Big Data Processing: Spark runs up to 100 times faster than Hadoop when it comes to processing medium and large-sized datasets.

Question 68: What are the various data sources available in Spark SQL?

Parquet file, JSON datasets and Hive tables are the data sources available in Spark SQL.

Question 69: What are the various levels of persistence in Apache Spark?

Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on the RDD in case they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both with different replication levels.

Question 70: What are the disadvantages of using Apache Spark over Hadoop MapReduce?

Apache Spark does not scale well for compute-intensive jobs and consumes a large number of system resources. Apache Spark's in-memory capability at times becomes a major roadblock for cost-efficient processing of big data. Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.

Question 71: Can you use Spark to access and analyze data stored in Cassandra databases?

Yes, it is possible if you use the Spark Cassandra Connector. To connect Spark to a Cassandra cluster, a Cassandra Connector needs to be added to the Spark project. In this setup, a Spark executor talks to a local Cassandra node and only queries for local data. This makes queries faster by reducing the use of the network to send data between Spark executors (which process data) and Cassandra nodes (where the data lives).

Question 72: Is it necessary to install Spark on all the nodes of a YARN cluster while running Apache Spark on YARN?

No, because Spark runs on top of YARN and runs independently of its installation. Spark has options to use YARN when dispatching jobs to the cluster, rather than its own built-in manager or Mesos. Further, there are some configurations to run on YARN, including master, deploy-mode, driver-memory, executor-memory, executor-cores, and queue.

Question 73: Is it possible to run Apache Spark on Apache Mesos?

Yes, Apache Spark can be run on the hardware clusters managed by Mesos. In a standalone cluster deployment, the cluster manager is a Spark master instance. When using Mesos, the Mesos master replaces the Spark master as the cluster manager. Mesos determines what machines handle what tasks. Because it takes into account other frameworks when scheduling these many short-lived tasks, multiple frameworks can coexist on the same cluster without resorting to a static partitioning of resources.

Question 74: How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.

Question 75: What is the significance of Sliding Window operation?

In networking, a sliding window controls the transmission of data packets between computer networks. In Spark, the Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.

Question 76: What do you understand by Executor Memory in a Spark application?

Every Spark application has a fixed heap size and a fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application will have one executor on each worker node. The executor memory is basically a measure of how much of the worker node's memory the application will utilize.

Question 77: Define the functions of Spark SQL.

Spark SQL is capable of:

  • Loading data from a variety of structured sources.
  • Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau. 
  • Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.

Question 78: How can Spark be connected to Apache Mesos?

To connect Spark with Mesos:

  1. Configure the Spark driver program to connect to Mesos. The Spark binary package should be in a location accessible by Mesos; or
  2. Install Apache Spark in the same location as Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.

Question 79: What are broadcast variables?

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Question 80: What does the Spark Engine do?

Spark engine schedules, distributes and monitors the data application across the spark cluster.

Question 81: Define Partitions in Apache Spark.

As the name suggests, a partition is a smaller and logical division of data, similar to a ‘split’ in MapReduce. It is a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data to speed up processing. Spark manages data using partitions that help parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks. Everything in Spark is a partitioned RDD.

Question 82: Explain accumulators in Apache Spark.

Accumulators are variables that are only added through an associative and commutative operation. They are used to implement counters or sums. Tracking accumulators in the UI can be useful for understanding the progress of running stages. Spark natively supports numeric accumulators. We can create named or unnamed accumulators.
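
A minimal sketch using the built-in long accumulator (Spark 2.x API), assuming an existing SparkContext sc:

val badRecords = sc.longAccumulator("badRecords")            // named, so it shows up in the Spark UI
sc.parallelize(Seq("1", "2", "oops", "4")).foreach { s =>
  if (scala.util.Try(s.toInt).isFailure) badRecords.add(1)   // counted in parallel across tasks
}
println(badRecords.value)                                    // read the total back on the driver: 1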

Question 83: What makes Apache Spark good at low-latency workloads like graph processing and machine learning?

Apache Spark stores data in-memory for faster model building and training. Machine learning algorithms require multiple iterations to generate an optimal model, and similarly, graph algorithms traverse all the nodes and edges. These low-latency workloads that need multiple iterations benefit greatly from in-memory execution: less disk access and controlled network traffic make a huge difference when there is a lot of data to be processed.

Question 84: Is it necessary to start Hadoop to run any Apache Spark application?

Starting Hadoop is not mandatory to run any Spark application. Apache Spark has no separate storage of its own; it can use Hadoop HDFS, but that is not mandatory. The data can be stored in the local file system, loaded from the local file system, and processed.

Question 85: What is the default level of parallelism in Apache Spark?

If the user does not explicitly specify it, the number of partitions is taken as the default level of parallelism in Apache Spark.

Question 86: Explain about the common workflow of a Spark program

  • The foremost step in a Spark program involves creating input RDDs from external data.
  • Use various RDD transformations like filter() to create new transformed RDDs based on the business logic.
  • persist() any intermediate RDDs which might have to be reused in future.
  • Launch various RDD actions like first() and count() to begin parallel computation, which will then be optimized and executed by Spark (see the sketch after this list).
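
A minimal end-to-end sketch of that workflow, assuming an existing SparkContext sc and a hypothetical input path:

val raw = sc.textFile("hdfs:///data/events.txt")   // 1. input RDD from external data
val errors = raw.filter(_.contains("ERROR"))       // 2. transformation implementing the business logic
errors.persist()                                   // 3. persist an intermediate RDD for reuse
println(errors.count())                            // 4. actions trigger the optimized parallel computation
println(errors.first())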

Question 87: In a given Spark program, how will you identify whether a given operation is a transformation or an action?

One can identify the operation based on the return type:

i) The operation is an action if the return type is anything other than an RDD.

ii) The operation is a transformation if the return type is an RDD.

Question 88: What, according to you, is a common mistake Apache Spark developers make when using Spark?

  • Not maintaining the required size of shuffle blocks.
  • Mismanaging directed acyclic graphs (DAGs).

Question 89: What operations does an RDD support?

  • Transformations
  • Actions

Question 90: Name the commonly used Spark Ecosystems.

  • Spark SQL (Shark) for developers
  • Spark Streaming for processing live data streams
  • GraphX for generating and computing graphs
  • MLlib (Machine Learning Algorithms)
  • SparkR to promote R programming in the Spark engine

Question 91: What does MLlib do?

MLlib is the scalable machine learning library provided by Spark. It aims at making machine learning easy and scalable with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.

Question 92: What file systems does Apache Spark support?

  • Hadoop Distributed File System (HDFS)
  • Local file system
  • Amazon S3

Question 93: What are Spark DataFrames?

When a dataset is organized into named columns, like a table in a relational database, it is known as a DataFrame.
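
A minimal sketch, assuming an existing SparkSession spark:

import spark.implicits._

val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")   // columns get names and types
df.printSchema()
df.filter($"age" > 26).show()                                  // SQL-like, column-oriented operations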

Question 94: How is Spark different from MapReduce? Is Spark faster than MapReduce?

Yes, Spark is faster than MapReduce. There are a few important reasons why, and some of them are below:

There is no tight coupling in Spark i.e., there is no mandatory rule that reduce must come after map.

Spark tries to keep the data “in-memory” as much as possible.

In MapReduce, the intermediate data will be stored in HDFS and hence takes longer time to get the data from a source but this is not the case with Spark.

Question 95: How do you specify the number of partitions while creating an RDD? What are the functions?

You can specify the number of partitions while creating an RDD either by using sc.textFile or by using the parallelize function, as follows:

val rdd = sc.parallelize(data, 4)

val data = sc.textFile("path", 4)

Question 96: How can you connect Hive to Spark SQL?

The first important thing is that you have to place the hive-site.xml file in the conf directory of Spark.

Then, with the help of the SparkSession object, we can construct a DataFrame as:

result = spark.sql("select * from <table_name>")

Question 97: What is Sliding Window?

In Spark Streaming, you have to specify the batch interval. For example, if your batch interval is 10 seconds, Spark will process whatever data it received during the last 10 seconds, i.e., the last batch interval.
With a sliding window, you additionally specify how many of the last batches have to be processed together: the window length (how many batches make up a window) and the slide interval (how often the window is evaluated).
For example, you may want to process the last 3 batches every time 2 new batches arrive; that defines when the window slides and how many batches are processed in that window.

Question 98: Can the Apache Spark provide checkpointing to the user?

Lineage graphs are provided within Apache Spark to recover RDDs from a failure. However, this is time-consuming if the RDDs have long lineage chains. Spark provides an API for checkpointing, i.e. a REPLICATE flag to persist. The decision on which data to checkpoint is left to the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.

Question 99: How does Apache Spark achieve fault tolerance?

The data storage model in Apache Spark is based on RDDs, which help achieve fault tolerance through lineage graphs. An RDD always stores the information on how it was built from other datasets. If any partition of an RDD is lost due to failure, the lineage helps rebuild only that particular lost partition.

Question 100: Differentiate between Spark SQL and Hive.

The following are the differences between Spark SQL and Hive:

  • Any Hive query can easily be executed in Spark SQL, but vice-versa is not true.

  • Spark SQL is faster than Hive.
  • It is not compulsory to create a metastore in Spark SQL but it is compulsory to create a Hive metastore.
  • Spark SQL is a library while Hive is a framework.
  • Spark SQL automatically deduces the schema while in Hive, the schema needs to be explicitly declared.
