After switching from data scientist to big data developer while developing the largest financial database, I miss the Jupyter notebook integration with Scala/PySpark, where you can quickly experiment with new functions and modules. As developers, we typically use Maven to manage dependencies, which makes even a quick Spark test job time-consuming to set up. Here I'm introducing a way to quickly prototype and test Spark jobs locally. This is tested on Windows because that's what I use; it should work on Mac too. The advantages of using sbt to quickly test Spark jobs are:
- easier dependency setup
- run commands interactively and see results immediately
Here are the steps for a quick Spark test with sbt:
1. Install sbt by following https://www.scala-sbt.org/1.0/docs/Setup.html and start sbt from a Windows command prompt.
2. Start a Scala console in sbt with a specific Scala version using the command:
++ 2.11.8 console (if you're going to use Scala version 2.11.8)
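If you prefer, you can run this as two separate sbt commands; the sketch below is equivalent for this purpose, since ++ switches the Scala version for the rest of the session and console then drops you into a REPL with the project's dependencies on the classpath:

++ 2.11.8
console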
3. Manage dependencies in sbt by putting the following in build.sbt. You can do this simply by creating a build.sbt file in an empty folder.
name := "HelloSpark"

version := "0.1"

scalaVersion := "2.11.8"

val sparkVersion = "2.3.1"
val sparkSqlVersion = "2.3.1"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkSqlVersion
)
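A side note on the %% operator used above: it tells sbt to append the project's Scala binary version to the artifact name, which is why the scalaVersion and the Spark artifacts have to line up. The two lines below (an illustration, not part of the build.sbt above) resolve to the same artifact:

// %% appends the Scala binary version (_2.11 here) to the artifact name,
// so these two dependency declarations are equivalent:
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % sparkVersion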
Bonus: you can even preload common imports into the console by adding the following to build.sbt:

initialCommands in console += "import org.apache.spark.sql.SparkSession"
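Going one step further, the console can also build the SparkSession for you on startup. This is a sketch beyond the original setup, and it uses := (replacing any earlier initialCommands) rather than +=:

// Sketch: pre-create a local SparkSession whenever the console starts,
// so every session begins with `spark` already in scope.
initialCommands in console := """
  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder.appName("test").master("local").getOrCreate()
"""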
4. Start writing Spark code in the console:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("test").master("local").getOrCreate()
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val NUM_SAMPLES = 10000
val count = spark.sparkContext.parallelize(1 to NUM_SAMPLES).filter { _ =>
  val x = math.random
  val y = math.random
  x*x + y*y < 1
}.count()
println(s"Pi is roughly ${4.0 * count / NUM_SAMPLES}")
Example screenshot:
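Once the SparkSession is up, the same console is handy for quick DataFrame experiments too. Below is a small, purely illustrative snippet (the table and values are made up) that you could paste in after the Pi example:

// Illustrative only: build a tiny in-memory DataFrame and filter it interactively.
import spark.implicits._
val quotes = Seq(("AAPL", 170.0), ("MSFT", 105.0), ("GOOG", 1080.0)).toDF("symbol", "price")
quotes.filter($"price" > 150).show()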