Associate-Developer-Apache-Spark Practice Exam Tests Latest Updated on Jul-2022 [Q88-Q103]


Pass Associate-Developer-Apache-Spark Exam in First Attempt Guaranteed Dumps!


How to Register for the Databricks Associate-Developer-Apache-Spark Exam

  • The on-screen steps will show you how to arrange an exam with our partner.

  • You can see all the available certificate exams by clicking on the Certifications tab.

  • Go to create an account.

  • You can register for the exam by clicking the Register button.

 

NEW QUESTION 88
Which of the following statements about executors is correct?

  • A. Executors stop upon application completion by default.
  • B. Each node hosts a single executor.
  • C. An executor can serve multiple applications.
  • D. Executors are launched by the driver.
  • E. Executors store data in memory only.

Answer: A

Explanation:
Executors stop upon application completion by default.
Correct. Executors only persist during the lifetime of an application.
A notable exception to that is when Dynamic Resource Allocation is enabled (which it is not by default). With Dynamic Resource Allocation enabled, executors are terminated when they are idle, independent of whether the application has been completed or not.
An executor can serve multiple applications.
Wrong. An executor is always specific to one application. It is terminated when the application completes (for the exception, see above).
Each node hosts a single executor.
No. Each node can host one or more executors.
Executors store data in memory only.
No. Executors can store data in memory or on disk.
Executors are launched by the driver.
Incorrect. Executors are launched by the cluster manager on behalf of the driver.
More info: Job Scheduling - Spark 3.1.2 Documentation; How Applications are Executed on a Spark Cluster | Anatomy of a Spark Application | InformIT; and Spark Jargon for Starters | by Mageswaran D | Medium

 

NEW QUESTION 89
Which of the following code blocks returns DataFrame transactionsDf sorted in descending order by column predError, showing missing values last?

  • A. transactionsDf.desc_nulls_last("predError")
  • B. transactionsDf.sort(asc_nulls_last("predError"))
  • C. transactionsDf.orderBy("predError").asc_nulls_last()
  • D. transactionsDf.orderBy("predError").desc_nulls_last()
  • E. transactionsDf.sort("predError", ascending=False)

Answer: E

Explanation:
transactionsDf.sort("predError", ascending=False)
Correct! When using DataFrame.sort() and setting ascending=False, the DataFrame will be sorted by the specified column in descending order, putting all missing values last. An alternative, although not listed as an answer here, would be transactionsDf.sort(desc_nulls_last("predError")).
transactionsDf.sort(asc_nulls_last("predError"))
Incorrect. While this is valid syntax, the DataFrame will be sorted on column predError in ascending order and not in descending order, putting missing values last.
transactionsDf.desc_nulls_last("predError")
Wrong, this is invalid syntax. There is no method DataFrame.desc_nulls_last() in the Spark API. There is, however, a Spark function desc_nulls_last() (see the link below).
transactionsDf.orderBy("predError").desc_nulls_last()
No. While transactionsDf.orderBy("predError") is correct syntax (although it sorts the DataFrame by column predError in ascending order) and returns a DataFrame, there is no method DataFrame.desc_nulls_last() in the Spark API. There is, however, a Spark function desc_nulls_last() (see the link below).
transactionsDf.orderBy("predError").asc_nulls_last()
Incorrect. There is no method DataFrame.asc_nulls_last() in the Spark API (see above).
More info: pyspark.sql.functions.desc_nulls_last - PySpark 3.1.2 documentation and pyspark.sql.DataFrame.sort - PySpark 3.1.2 documentation (https://bit.ly/3g1JtbI, https://bit.ly/2R90NCS)
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/32.html, https://bit.ly/sparkpracticeexams_import_instructions)
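
The ordering described for the correct answer can be modelled in plain Python, without Spark. The sketch below is only an analogy: the list of values is made up, and the sort key imitates what the explanation says sort("predError", ascending=False) does (largest values first, missing values last).

```python
# Hypothetical predError values; None stands in for a missing value.
pred_errors = [3, None, 6, 3, None]


def desc_nulls_last_key(value):
    # False sorts before True, so non-missing values come first;
    # negating the number turns the ascending sort into a descending one.
    return (value is None, 0 if value is None else -value)


ordered = sorted(pred_errors, key=desc_nulls_last_key)
print(ordered)  # [6, 3, 3, None, None]
```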

 

NEW QUESTION 90
The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.
Code block:
transactionsDf.format("parquet").option("mode", "append").save(path)

  • A. save() is evaluated lazily and needs to be followed by an action.
  • B. The code block is missing a reference to the DataFrameWriter.
  • C. The mode option should be omitted so that the command uses the default mode.
  • D. Given that the DataFrame should be saved as parquet file, path is being passed to the wrong method.
  • E. The code block is missing a bucketBy command that takes care of partitions.

Answer: B

Explanation:
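Correct code block:
transactionsDf.write.format("parquet").option("mode", "append").save(path)
The missing piece is the DataFrameWriter, which is accessed through transactionsDf.write. Note that the DataFrameWriter API sets the save mode through its mode() method, so transactionsDf.write.format("parquet").mode("append").save(path) is the form to rely on when appending in practice.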

 

NEW QUESTION 91
Which of the following describes characteristics of the Dataset API?

  • A. The Dataset API does not support unstructured data.
  • B. In Python, the Dataset API mainly resembles Pandas' DataFrame API.
  • C. The Dataset API does not provide compile-time type safety.
  • D. The Dataset API is available in Scala, but it is not available in Python.
  • E. In Python, the Dataset API's schema is constructed via type hints.

Answer: D

Explanation:
The Dataset API is available in Scala, but it is not available in Python.
Correct. The Dataset API uses fixed typing and is typically used for object-oriented programming. It is available when Spark is used with the Scala programming language, but not for Python. In Python, you use the DataFrame API, which is based on the Dataset API.
The Dataset API does not provide compile-time type safety.
No - in fact, depending on the use case, the type safety that the Dataset API provides is an advantage.
The Dataset API does not support unstructured data.
Wrong, the Dataset API supports structured and unstructured data.
In Python, the Dataset API's schema is constructed via type hints.
No, this is not applicable since the Dataset API is not available in Python.
In Python, the Dataset API mainly resembles Pandas' DataFrame API.
The Dataset API does not exist in Python, only in Scala and Java.

 

NEW QUESTION 92
Which of the following code blocks applies the Python function to_limit on column predError in table transactionsDf, returning a DataFrame with columns transactionId and result?

  • A. spark.udf.register("LIMIT_FCN", to_limit)
    spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
  • B. spark.udf.register(to_limit, "LIMIT_FCN")
    spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
  • C. spark.udf.register("LIMIT_FCN", to_limit)
    spark.sql("SELECT transactionId, to_limit(predError) AS result FROM transactionsDf")
  • D. spark.sql("SELECT transactionId, udf(to_limit(predError)) AS result FROM transactionsDf")
  • E. spark.udf.register("LIMIT_FCN", to_limit)
    spark.sql("SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result")

Answer: A

Explanation:
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
Correct! First, you have to register to_limit as a UDF to use it in a SQL statement. Then, you can use it under the LIMIT_FCN name, correctly naming the resulting column result.
spark.udf.register(to_limit, "LIMIT_FCN")
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
No. In this answer, the arguments to spark.udf.register are flipped.
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, to_limit(predError) AS result FROM transactionsDf")
Wrong, this answer does not use the registered LIMIT_FCN in the SQL statement, but tries to access the to_limit method directly. This will fail, since Spark cannot access it.
spark.sql("SELECT transactionId, udf(to_limit(predError)) AS result FROM transactionsDf")
Incorrect, there is no udf method in Spark's SQL.
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result")
False. In this answer, the column that results from applying the UDF is not correctly renamed to result. Here, AS result aliases the table rather than the column.
Static notebook | Dynamic notebook: See test 3

 

NEW QUESTION 93
Which of the following code blocks reads JSON file imports.json into a DataFrame?

  • A. spark.read.json("/FileStore/imports.json")
  • B. spark.read().json("/FileStore/imports.json")
  • C. spark.read.format("json").path("/FileStore/imports.json")
  • D. spark.read().mode("json").path("/FileStore/imports.json")
  • E. spark.read("json", "/FileStore/imports.json")

Answer: A

Explanation:
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/25.html, https://bit.ly/sparkpracticeexams_import_instructions)

 

NEW QUESTION 94
Which of the following describes the difference between client and cluster execution modes?

  • A. In cluster mode, the driver runs on the worker nodes, while the client mode runs the driver on the client machine.
  • B. In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.
  • C. In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.
  • D. In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.
  • E. In cluster mode, the driver runs on the edge node, while the client mode runs the driver in a worker node.

Answer: A

Explanation:
In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.
This is wrong, since execution modes do not specify whether workloads are run in the cloud or on-premise.
In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.
Wrong, since in both cases executors run on worker nodes.
In cluster mode, the driver runs on the edge node, while the client mode runs the driver in a worker node.
Wrong - in cluster mode, the driver runs on a worker node. In client mode, the driver runs on the client machine.
In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.
No. In both modes, the cluster manager is typically on a separate node - not on the same host as the driver. It only runs on the same host as the driver in local execution mode.
More info: Learning Spark, 2nd Edition, Chapter 1, and Spark: The Definitive Guide, Chapter 15.

 

NEW QUESTION 95
The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf. Find the error.
Code block:
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

  • A. UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
  • B. The udf() method does not declare a return type.
  • C. Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
  • D. The operator used for adding the column does not add column predErrorAdded to the DataFrame.
  • E. The Python function is unable to handle null values, resulting in the code block crashing on execution.

Answer: D

Explanation:
Correct code block:
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError"))).show()
Instead of withColumnRenamed, you should use the withColumn operator.
The udf() method does not declare a return type.
It is fine that the udf() method does not declare a return type; this is not a required argument. However, the default return type is StringType. This may not be the ideal return type for numeric, nullable data, but the code will run without a specified return type nevertheless.
The Python function is unable to handle null values, resulting in the code block crashing on execution.
The Python function is able to handle null values; this is what the statement if x is None does.
UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
No, they are available through the Python API. The code in the code block that concerns UDFs is correct.
Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
You may choose to use the transactionsDf.predError syntax, but the col("predError") syntax is fine.
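
The null handling of the Python function can be checked in plain Python before it is wrapped in udf(). This is a minimal sketch with made-up sample values; it exercises only the function from the question, not Spark itself.

```python
def add_2_if_geq_3(x):
    # Pass missing values through unchanged instead of comparing None >= 3.
    if x is None:
        return x
    elif x >= 3:
        return x + 2
    return x


# Made-up sample values, including a missing one.
sample = [None, 2, 3, 5.5]
results = [add_2_if_geq_3(v) for v in sample]
print(results)  # [None, 2, 5, 7.5]
```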

 

NEW QUESTION 96
Which of the following code blocks sorts DataFrame transactionsDf both by column storeId in ascending and by column productId in descending order, in this priority?

  • A. transactionsDf.sort("storeId").sort(desc("productId"))
  • B. transactionsDf.sort("storeId", asc("productId"))
  • C. transactionsDf.sort("storeId", desc("productId"))
  • D. transactionsDf.sort(col(storeId)).desc(col(productId))
  • E. transactionsDf.order_by(col(storeId), desc(col(productId)))

Answer: C

Explanation:
In this question it is important to realize that you are asked to sort transactionsDf by two columns. This means that the sorting of the second column depends on the sorting of the first column.
So, any option that sorts the entire DataFrame (through chaining sort statements) will not work. The two columns need to be channeled through the same call to sort().
Also, order_by is not a valid DataFrame API method.
More info: pyspark.sql.DataFrame.sort - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
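
Why both columns have to go through a single sort can be illustrated with a plain-Python analogy. The rows below are made up, and Python's stable sorted() is not Spark's sort(), so this only shows the general point that a second, separate sort decides the overall order on its own.

```python
# Made-up (storeId, productId) pairs standing in for rows of transactionsDf.
rows = [(25, 1), (2, 2), (25, 3), (3, 2)]

# One sort with a two-part key: storeId ascending, then productId descending.
single_pass = sorted(rows, key=lambda r: (r[0], -r[1]))

# Two chained sorts: the second sort determines the primary order.
chained = sorted(sorted(rows, key=lambda r: r[0]), key=lambda r: -r[1])

print(single_pass)  # [(2, 2), (3, 2), (25, 3), (25, 1)]
print(chained)      # [(25, 3), (2, 2), (3, 2), (25, 1)]
```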

 

NEW QUESTION 97
Which of the following code blocks returns approximately 1000 rows, some of them potentially being duplicates, from the 2000-row DataFrame transactionsDf that only has unique rows?

  • A. transactionsDf.sample(False, 0.5)
  • B. transactionsDf.sample(True, 0.5, force=True)
  • C. transactionsDf.sample(True, 0.5)
  • D. transactionsDf.take(1000).distinct()
  • E. transactionsDf.take(1000)

Answer: C

Explanation:
To solve this question, you need to know that DataFrame.sample() is not guaranteed to return the exact fraction of the number of rows specified as an argument. Furthermore, since duplicates may be returned, you should understand that the operator's withReplacement argument should be set to True. A force= argument for the operator does not exist.
While the take() method returns an exact number of rows, it will just take the first specified number of rows (1000 in this question) from the DataFrame. Since the DataFrame does not include duplicate rows, there is no potential of any of those returned rows being duplicates when using take(), so the correct answer cannot involve take().
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 98
Which of the following code blocks returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that any row can appear more than once in the returned DataFrame?

  • A. transactionsDf.sample(0.15, False, 3142)
  • B. transactionsDf.resample(0.15, False, 3142)
  • C. transactionsDf.sample(0.85, 8429)
  • D. transactionsDf.sample(0.15)
  • E. transactionsDf.sample(True, 0.15, 8261)

Answer: E

Explanation:
Answering this question correctly depends on whether you understand the arguments to the DataFrame.sample() method (link to the documentation below). The arguments are as follows:
DataFrame.sample(withReplacement=None, fraction=None, seed=None).
The first argument, withReplacement, specifies whether a row can be drawn from the DataFrame multiple times. By default, this option is disabled in Spark. But we have to enable it here, since the question asks for a row being able to appear more than once. So, we need to pass True for this argument.
About replacement: "Replacement" is most easily explained with the example of removing random items from a box. When you remove those "with replacement", it means that after you have taken an item out of the box, you put it back inside. So, if you randomly take 10 items out of a box with 100 items, there is a chance you take the same item twice or more. "Without replacement" means that you do not put the item back into the box after removing it. So, every time you remove an item from the box, there is one less item in the box and you can never take the same item twice.
The second argument to the sample() method is fraction. This refers to the fraction of items that should be returned. In the question we are asked for 150 out of 1000 items - a fraction of 0.15.
The last argument is a random seed. A random seed makes a randomized process repeatable. This means that if you re-ran the same sample() operation with the same random seed, you would get the same rows returned from the sample() command. The question does not specify any behavior around the random seed. The varying random seeds are only there to confuse you!
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
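
The box analogy and the role of the seed can be tried out in plain Python. The helper draw() below is hypothetical and not part of Spark. Unlike DataFrame.sample(), which returns only approximately the requested fraction, it returns exactly k items, so it illustrates replacement and repeatability only.

```python
import random


def draw(box, k, with_replacement, seed):
    """Draw k items from box; a hypothetical helper, not part of Spark."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    if with_replacement:
        # Every draw sees the full box, so an item can come up again.
        return rng.choices(box, k=k)
    # Each item can be drawn at most once.
    return rng.sample(box, k)


box = list(range(100))
without = draw(box, 10, with_replacement=False, seed=8261)
with_repl = draw(box, 10, with_replacement=True, seed=8261)

print(len(set(without)) == len(without))       # True: duplicates are impossible
print(with_repl == draw(box, 10, True, 8261))  # True: same seed, same draw
```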

 

NEW QUESTION 99
The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.__1__(__2__).select(__3__, __4__)

  • A. 1. where
    2. col(supplier).contains("Sports")
    3. explode(attributes)
    4. itemName
  • B. 1. where
    2. col("supplier").contains("Sports")
    3. "itemName"
    4. "attributes"
  • C. 1. filter
    2. col("supplier").isin("Sports")
    3. "itemName"
    4. explode(col("attributes"))
  • D. 1. filter
    2. col("supplier").contains("Sports")
    3. "itemName"
    4. explode("attributes")
  • E. 1. where
    2. "Sports".isin(col("Supplier"))
    3. "itemName"
    4. array_explode("attributes")

Answer: D

Explanation:
Output of correct code block:
+----------------------------------+------+
|itemName                          |col   |
+----------------------------------+------+
|Thick Coat for Walking in the Snow|blue  |
|Thick Coat for Walking in the Snow|winter|
|Thick Coat for Walking in the Snow|cozy  |
|Outdoors Backpack                 |green |
|Outdoors Backpack                 |summer|
|Outdoors Backpack                 |travel|
+----------------------------------+------+
The key to solving this question is knowing about Spark's explode operator. Using this operator, you can extract values from arrays into single rows. The following guidance steps through the answers systematically from the first to the last gap. Note that there are many ways of solving the gap questions and filtering out wrong answers. You do not always have to start filtering from the first gap, but can also exclude some answers based on obvious problems you see with them.
The answers to the first gap present you with two options: filter and where. These two are actually synonyms in PySpark, so using either of those is fine. The answer options to this gap therefore do not help us in selecting the right answer.
The second gap is more interesting. One answer option includes "Sports".isin(col("Supplier")). This construct does not work, since Python's string does not have an isin method. Another option contains col(supplier). Here, Python will try to interpret supplier as a variable. We have not set this variable, so this is not a viable answer. Then, you are left with answer options that include col("supplier").contains("Sports") and col("supplier").isin("Sports"). The question states that we are looking for suppliers whose name includes Sports, so we have to go for the contains operator here.
We would use the isin operator if we wanted to filter for supplier names that match any entries in a list of supplier names.
Finally, we are left with two answers that fill the third gap both with "itemName" and the fourth gap either with explode("attributes") or "attributes". While both are correct Spark syntax, only explode("attributes") will help us achieve our goal. Specifically, the question asks for one attribute from column attributes per row - this is what the explode() operator does.
One answer option also includes array_explode(), which is not a valid operator in PySpark.
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
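
What the filter and explode steps do to the sample rows can be mimicked in plain Python. This is an analogy under the assumption that explode() emits one output row per array element, as the explanation describes; it does not call Spark.

```python
# Rows copied from the sample of itemsDf shown in the question.
items = [
    {"itemName": "Thick Coat for Walking in the Snow",
     "attributes": ["blue", "winter", "cozy"],
     "supplier": "Sports Company Inc."},
    {"itemName": "Elegant Outdoors Summer Dress",
     "attributes": ["red", "summer", "fresh", "cooling"],
     "supplier": "YetiX"},
    {"itemName": "Outdoors Backpack",
     "attributes": ["green", "summer", "travel"],
     "supplier": "Sports Company Inc."},
]

# Keep suppliers whose name contains "Sports", then produce one
# (itemName, attribute) pair per element of the attributes array.
exploded = [
    (item["itemName"], attribute)
    for item in items
    if "Sports" in item["supplier"]
    for attribute in item["attributes"]
]

for row in exploded:
    print(row)
```

The six pairs printed correspond to the six rows in the output of the correct code block.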

 

NEW QUESTION 100
In which order should the code blocks shown below be run in order to assign articlesDf a DataFrame that lists all items in column attributes ordered by the number of times these items occur, from most to least often?
Sample of DataFrame articlesDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

Code blocks:
1. articlesDf = articlesDf.groupby("col")
2. articlesDf = articlesDf.select(explode(col("attributes")))
3. articlesDf = articlesDf.orderBy("count").select("col")
4. articlesDf = articlesDf.sort("count",ascending=False).select("col")
5. articlesDf = articlesDf.groupby("col").count()

  • A. 2, 3, 4
  • B. 5, 2
  • C. 4, 5
  • D. 2, 5, 4
  • E. 2, 5, 3

Answer: D

Explanation:
Correct code block:
articlesDf = articlesDf.select(explode(col('attributes')))
articlesDf = articlesDf.groupby('col').count()
articlesDf = articlesDf.sort('count',ascending=False).select('col')
Output of correct code block:
+-------+
|    col|
+-------+
| summer|
| winter|
|   blue|
|   cozy|
| travel|
|  fresh|
|    red|
|cooling|
|  green|
+-------+
Static notebook | Dynamic notebook: See test 2
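
The three steps of the correct code block (explode, group and count, sort by count in descending order) can be retraced in plain Python on the sample data. This is only an analogy using collections.Counter, not Spark; the order among values with the same count is not meaningful here.

```python
from collections import Counter

# attributes arrays copied from the sample of articlesDf in the question.
attributes = [
    ["blue", "winter", "cozy"],
    ["red", "summer", "fresh", "cooling"],
    ["green", "summer", "travel"],
]

# Explode: one entry per array element.
exploded = [value for array in attributes for value in array]

# Group and count: number of occurrences per value.
counts = Counter(exploded)

# Sort by count, most frequent first, and keep only the value.
ordered = [value for value, _ in
           sorted(counts.items(), key=lambda kv: kv[1], reverse=True)]

print(ordered[0])  # summer
```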

 

NEW QUESTION 101
Which of the following code blocks returns a single row from DataFrame transactionsDf?
Full DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

  • A. transactionsDf.filter(col("storeId")==25).select("predError","storeId").distinct()
  • B. transactionsDf.where(col("value").isNull()).select("productId", "storeId").distinct()
  • C. transactionsDf.filter((col("storeId")!=25) | (col("productId")==2))
  • D. transactionsDf.select("productId", "storeId").where("storeId == 2 OR storeId != 25")
  • E. transactionsDf.where(col("storeId").between(3,25))

Answer: A

Explanation:
Output of correct code block:
+---------+-------+
|predError|storeId|
+---------+-------+
|        3|     25|
+---------+-------+
This question is difficult because it requires you to understand different kinds of commands and operators. All answers are valid Spark syntax, but just one expression returns a single-row DataFrame.
For reference, here is what the incorrect answers return:
transactionsDf.filter((col("storeId")!=25) | (col("productId")==2)) returns
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            2|        6|    7|      2|        2|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
transactionsDf.where(col("storeId").between(3,25)) returns
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
transactionsDf.where(col("value").isNull()).select("productId", "storeId").distinct() returns
+---------+-------+
|productId|storeId|
+---------+-------+
|        3|     25|
|        2|      3|
|        2|   null|
+---------+-------+
transactionsDf.select("productId", "storeId").where("storeId == 2 OR storeId != 25") returns
+---------+-------+
|productId|storeId|
+---------+-------+
|        2|      2|
|        2|      3|
+---------+-------+
Static notebook | Dynamic notebook: See test 2
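
The row with a null storeId (transactionId 5) appears in one of the outputs above but not in another. A plain-Python model of SQL's three-valued logic, with None standing in for null, reproduces both results. The helpers sql_or, sql_eq and sql_ne are hypothetical names for this sketch and are not Spark functions.

```python
def sql_or(a, b):
    """OR under three-valued logic; None plays the role of null (unknown)."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def sql_eq(left, right):
    """left == right, yielding None if either side is null."""
    if left is None or right is None:
        return None
    return left == right


def sql_ne(left, right):
    """left != right, yielding None if either side is null."""
    if left is None or right is None:
        return None
    return left != right


# (transactionId, storeId, productId) copied from transactionsDf above.
rows = [(1, 25, 1), (2, 2, 2), (3, 25, 3), (4, 3, 2), (5, None, 2), (6, 25, 2)]

# A filter keeps a row only when the predicate is True (not False, not None).
store_ne_25_or_product_2 = [
    tid for tid, store, product in rows
    if sql_or(sql_ne(store, 25), sql_eq(product, 2)) is True
]
store_2_or_store_ne_25 = [
    tid for tid, store, product in rows
    if sql_or(sql_eq(store, 2), sql_ne(store, 25)) is True
]

print(store_ne_25_or_product_2)  # [2, 4, 5, 6]
print(store_2_or_store_ne_25)    # [2, 4]
```

In the first predicate, row 5 is kept because productId == 2 is true even though the storeId comparison is unknown; in the second, both comparisons involve the null storeId, so the row is dropped.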

 

NEW QUESTION 102
The code block shown below should add a column itemNameBetweenSeparators to DataFrame itemsDf. The column should contain arrays of maximum 4 strings. The arrays should be composed of the values in column itemName which are separated at - or whitespace characters. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-------------------+
|itemId|itemName                          |supplier           |
+------+----------------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |YetiX              |
|3     |Outdoors Backpack                 |Sports Company Inc.|
+------+----------------------------------+-------------------+
Code block:
itemsDf.__1__(__2__, __3__(__4__, "[\s\-]", __5__))

  • A. 1. withColumnRenamed
    2. "itemName"
    3. split
    4. "itemNameBetweenSeparators"
    5. 4
  • B. 1. withColumn
    2. "itemNameBetweenSeparators"
    3. split
    4. "itemName"
    5. 5
  • C. 1. withColumnRenamed
    2. "itemNameBetweenSeparators"
    3. split
    4. "itemName"
    5. 4
  • D. 1. withColumn
    2. "itemNameBetweenSeparators"
    3. split
    4. "itemName"
    5. 4
  • E. 1. withColumn
    2. itemNameBetweenSeparators
    3. str_split
    4. "itemName"
    5. 5

Answer: D

Explanation:
This question deals with the parameters of Spark's split operator for strings.
To solve this question, you first need to understand the difference between DataFrame.withColumn() and DataFrame.withColumnRenamed(). The correct option here is DataFrame.withColumn() since, according to the question, we want to add a column and not rename an existing column. This leaves you with only 3 answers to consider.
The second gap should be filled with the name of the new column to be added to the DataFrame. One of the remaining answers states the column name as itemNameBetweenSeparators, while the other two state it as "itemNameBetweenSeparators". The correct option here is "itemNameBetweenSeparators", since the other option would let Python try to interpret itemNameBetweenSeparators as the name of a variable, which we have not defined. This leaves you with 2 answers to consider.
The decision boils down to how to fill gap 5. Either with 4 or with 5. The question asks for arrays of maximum four strings. The code in gap 5 relates to the limit parameter of Spark's split operator (see documentation linked below). The documentation states that "the resulting array's length will not be more than limit", meaning that we should pick the answer option with 4 as the code in the fifth gap here.
On a side note: One answer option includes a function str_split. This function does not exist in PySpark.
More info: pyspark.sql.functions.split - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
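
The effect of the limit can be approximated with Python's re module. This is a sketch, not Spark: Python's maxsplit counts splits rather than array elements, so limit - 1 is passed to end up with at most limit strings. It also assumes that Spark keeps the unsplit remainder of the string in the last array element.

```python
import re

item_name = "Thick Coat for Walking in the Snow"
limit = 4  # maximum number of array elements asked for in the question

# maxsplit = limit - 1 splits yield at most `limit` strings;
# the last string keeps everything that was not split.
parts = re.split(r"[\s\-]", item_name, maxsplit=limit - 1)
print(parts)  # ['Thick', 'Coat', 'for', 'Walking in the Snow']
```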

 

NEW QUESTION 103
......


What Is the Cost of the Databricks Associate Developer Apache Spark Exam?

The cost of the Databricks Associate Developer Apache Spark Exam is 200 USD per attempt.

 

Databricks Certification Free Certification Exam Material from PDF4Test with 179 Questions: https://testking.pdf4test.com/Associate-Developer-Apache-Spark-actual-dumps.html