Getting Started with Spark

What is Spark?

From the Apache Spark Documentation:

  • Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.

Installation

  1. Install Spark

    • If you do not currently have the Java JDK (version 7 or higher) installed, download it and follow the steps to install it for your operating system.
    • Visit the Spark downloads page, select a pre-built package, and download Spark. Extract the archive (double-clicking it works on most operating systems) so its contents are ready for use.
    • Move the expanded folder into a location suitable for your experiments!
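    • If you prefer the terminal, the archive can also be extracted with tar; the file name below is only a placeholder for whichever package you downloaded:

      tar -xzf spark-<version>-bin-<hadoop-version>.tgz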
  2. Write some code!

    • Let's write some Scala code and run it on Spark!
    • If you still haven't written any Scala code, look at this previous blog post.
    • The example code that we will run on Spark this time is sketched below.
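    • A LetterCounter application along these lines counts how many lines of a text file contain the letters "a" and "b". This is a minimal sketch, assuming a plain SparkContext and an input file path that you will need to point at a real text file:

      package com.learning.spark

      import org.apache.spark.{SparkConf, SparkContext}

      object LetterCounter {
        def main(args: Array[String]): Unit = {
          // Placeholder path: point this at any text file on your machine.
          val textFile = "YOUR_SPARK_HOME/README.md"

          // No setMaster here: the master is passed on the spark-submit command line.
          val conf = new SparkConf().setAppName("Letter Counter")
          val sc = new SparkContext(conf)

          // Load the file once and cache it, since we traverse it twice.
          val data = sc.textFile(textFile).cache()

          val numAs = data.filter(line => line.contains("a")).count()
          val numBs = data.filter(line => line.contains("b")).count()

          println(s"Lines with a: $numAs, Lines with b: $numBs")

          sc.stop()
        }
      }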
  3. Package a jar containing your application

    • At the root of your project execute: sbt package
    • You should see something like the following in the console: Packaging ... playing-with-spark/target/scala-2.11/learning-with-spark_2.11-1.0.jar
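    • sbt needs a build.sbt at the project root to produce that jar. Here is a minimal sketch, assuming Scala 2.11 and a spark-core dependency (the Spark version shown is an assumption; match it to the release you downloaded):

      name := "learning-with-spark"

      version := "1.0"

      scalaVersion := "2.11.8"

      // The Spark version is an assumption; use the one you downloaded.
      // "provided" keeps Spark itself out of the jar, since spark-submit supplies it at runtime.
      libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"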
  4. Use spark-submit to run your application:

    YOUR_SPARK_HOME/bin/spark-submit --class "com.learning.spark.LetterCounter" --master local[4] target/scala-2.11/learning-with-spark_2.11-1.0.jar

  5. You should see in the output: Lines with a: 14, Lines with b: 9

  6. Keep on learning about the Spark API with the Spark Programming Guide

  7. For running applications on a cluster, see the deployment overview

  8. Spark includes several Scala examples in the examples directory

    • You can use YOUR_SPARK_HOME/bin/run-example EXAMPLE_NAME to run the Scala examples
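    • For instance, the classic SparkPi example, which estimates the value of π, can be run with: YOUR_SPARK_HOME/bin/run-example SparkPi 10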

References