After switching from data scientist to big data developer while developing the largest financial database, I miss the Jupyter notebook integration with Scala/PySpark, where you can quickly experiment with new functions and modules. As developers, we typically use Maven to manage dependencies, which makes even a quick Spark test job time-consuming to set up. Here I'm introducing a way to quickly prototype and test Spark jobs locally. This is tested on Windows because that's what I use; it should work on Mac too. The advantages of using sbt to quickly test Spark jobs are:
- easier dependency setup
- run commands interactively and see results immediately
Here are the steps for a quick Spark test with sbt:
1. Install sbt by following https://www.scala-sbt.org/1.0/docs/Setup.html and start sbt from a Windows command prompt.
2. Start a Scala console in sbt with a specific Scala version using the command:
++ 2.11.8 console (if you're going to use Scala version 2.11.8)
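If you prefer, you can run this as two separate sbt commands; the sketch below is equivalent for this purpose, since ++ switches the Scala version for the rest of the session and console then drops you into a REPL with the project's dependencies on the classpath:

++ 2.11.8
console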
3. Manage dependencies in sbt by putting the following in build.sbt. You can do this simply by creating a build.sbt file in an empty folder.
name := "HelloSpark"

version := "0.1"

scalaVersion := "2.11.8"

val sparkVersion = "2.3.1"
val sparkSqlVersion = "2.3.1"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkSqlVersion
)
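A side note on the %% operator used above: it tells sbt to append the project's Scala binary version to the artifact name, which is why the scalaVersion and the Spark artifacts have to line up. The two lines below (an illustration, not part of the build.sbt above) resolve to the same artifact:

// %% appends the Scala binary version (_2.11 here) to the artifact name,
// so these two dependency declarations are equivalent:
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % sparkVersion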
Bonus: you can even preload common imports into the console by adding the following to build.sbt:

initialCommands in console += "import org.apache.spark.sql.SparkSession"
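Going one step further, the console can also build the SparkSession for you on startup. This is a sketch beyond the original setup, and it uses := (replacing any earlier initialCommands) rather than +=:

// Sketch: pre-create a local SparkSession whenever the console starts,
// so every session begins with `spark` already in scope.
initialCommands in console := """
  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder.appName("test").master("local").getOrCreate()
"""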
4. Start writing Spark code in the console:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("test").master("local").getOrCreate()
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val NUM_SAMPLES = 10000
val count = spark.sparkContext.parallelize(1 to NUM_SAMPLES).filter { _ =>
  val x = math.random
  val y = math.random
  x*x + y*y < 1
}.count()
println(s"Pi is roughly ${4.0 * count / NUM_SAMPLES}")
Example screenshot:
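Once the SparkSession is up, the same console is handy for quick DataFrame experiments too. Below is a small, purely illustrative snippet (the table and values are made up) that you could paste in after the Pi example:

// Illustrative only: build a tiny in-memory DataFrame and filter it interactively.
import spark.implicits._
val quotes = Seq(("AAPL", 170.0), ("MSFT", 105.0), ("GOOG", 1080.0)).toDF("symbol", "price")
quotes.filter($"price" > 150).show()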