Spark In MapReduce (SIMR) by databricks

What is SIMR?

SIMR provides a quick way for Hadoop MapReduce 1 users to use Apache Spark. It enables running Spark jobs, as well as the Spark shell, on Hadoop MapReduce clusters without having to install Spark or Scala, or have administrative rights. Note that this is for Hadoop MapReduce 1, Hadoop YARN users can the Spark on Yarn method.

After downloading SIMR, it can be tried out by typing
./simr --shell

$ ./simr --shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 0.8.1
      /_/

Using Scala version 2.9.3 (Java HotSpot(TM) 64-Bit VM, Java 1.6.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
Created spark context..
Spark context available as sc.

scala>

Note that the goal of SIMR is to provide a quick way to try out Spark. While this suffices for batch and interactive jobs, we recommend installing Spark for production use.

Download

Download 3 files: simr runtime script, as well as the
simr-<hadoop-version>.jar and
spark-assembly-<hadoop-version>.jar that match the version of Hadoop your cluster is running. If it is not provided, you will have to build it yourself. [See below](#advanced-configuration).

SIMR runtime script: simr
SIMR jar for different Hadoop versions:
- 1.0.4 (HDP 1.0 - 1.2)
  simr-hadoop-1.0.4.jar /
  spark-assembly-hadoop-1.0.4.jar
- 1.2.x (HDP 1.3)
  simr-hadoop-1.2.0.jar /
  spark-assembly-hadoop-1.2.0.jar
- 0.20 (CDH3)
  simr-cdh3.jar /
  spark-assembly-hadoop-cdh3.jar
- 2.0.0 (CDH4)
  simr-cdh4.jar /
  spark-assembly-hadoop-cdh4.jar
Source: GitHub

Place simr, simr-<hadoop-version>.jar, and spark-assembly-<hadoop-version>.jar in a directory and execute simr to get usage information. Try running the shell. If you get stuck, continue reading.


./simr --shell

We've crafted some handsome templates for you to use. Go ahead and continue to layouts to browse through them. You can easily go back to edit your page before publishing. After publishing your page, you can revisit the page generator and switch to another theme. Your Page content will be preserved if it remained markdown format.

Requirements

SIMR automatically includes Scala 2.9.3 and Spark 0.8.1. They are already in the above jars and are thus not required.

Java v1.6
Hadoop versions 1.0.4 (HDP 1.0 - 1.2), 1.2.x (HDP 1.3), 0.20 (CDH3), 2.0.0 (CDH4).

Installation

Ensure the hadoop executable is in the PATH. If it is not, set $HADOOP to point to the binary, or the hadoop/bin directory. Set $SIMRJAR and $SPARKJAR to specify which SIMR and Spark jars to use, otherwise jars will be selected from the current directory.

To run a Spark application, package it up as a JAR file and execute:


./simr jar_file main_class parameters [--outdir=] [--slots=N] [--unique]

jar_file is a JAR file containing all your programs, e.g. spark-examples.jar. Note that this jar file should contain all the third party dependencies that your job has (this can be achieved with the Maven assembly plugin or sbt-assembly).
main_class is the name of the class with a main method, e.g. org.apache.spark.examples.SparkPi
parameters is a list of parameters that will be passed to your main_class.
Important: the special parameter %spark_url% will be replaced with the Spark driver URL.
outdir is an optional parameter which sets the path (absolute or relative) in HDFS where your job's output will be stored, e.g. /user/alig/myjob11.
If this parameter is not set, a directory will be created using the current time stamp in the form of yyyy-MM-dd_kk_mm_ss, e.g. 2013-12-01_11_12_13
slots is an optional parameter that specifies the number of Map slots SIMR should utilize. By default, SIMR sets the value to the number of nodes in the cluster.
This value must be at least 2, otherwise no executors will be present and the task will never complete.
unique is an optional parameter which ensures that each node in the cluster will run at most 1 SIMR executor.

Your output will be placed in the outdir in HDFS, this includes output from stdout/stderr for the driver and all executors. **Important**: to ensure that your Spark jobs terminate without errors, you must end your Spark programs by calling stop() on SparkContext. In the case of the Spark examples, this usually means adding spark.stop() at the end of main().

Examples

Assuming spark-examples.jar exists and contains the Spark examples, the following will execute the example that computes pi in 100 partitions in parallel:


./simr spark-examples.jar org.apache.spark.examples.SparkPi %spark_url% 100

Alternatively, you can launch a Spark-shell like this:


./simr --shell

Configuration

SIMR expects its different components to communicate over the network, which requires opening ports for communication. SIMR does not have a set of static ports, as this would prevent multiple SIMR jobs from executing simultaneously on the same machines. Instead the ports are in the ephemeral range For SIMR to function properly ports in the ephemeral range should be opened in firewalls.

The $HADOOP environment variable should point at the hadoop binary or its directory. To specify the SIMR or Spark jar the runtime script should use, set the $SIMRJAR and $SPARKJAR environment variables respectively. If these variables are not set, the runtime script will default to a simr.jar and spark.jar in the current directory.

By default SIMR figures out the number of task trackers in the cluster and launches a job that is the same size as the cluster. This can be adjusted by supplying the command line parameter --slots=<integer> to simr or setting the Hadoop configuration parameter simr.cluster.slots.

Building Custom Versions

The following sections are targeted at users who aim to run SIMR on versions of Hadoop for which jars have not been provided. It is necessary to build both the appropriate version of simr-<hadoop-version>.jar and spark-assembly-<hadoop-version>.jar and place them in the same directory as the simr runtime script.

Step 1: Building Spark

In order to build SIMR, we must first compile a version of Spark that targets the version of Hadoop that SIMR will be run on.

Download Spark v0.8.1 or greater.
Unpack and enter the Spark directory.
Modify project/SparkBuild.scala. Change the value of DEFAULT_HADOOP_VERSION to match the version of Hadoop you are targeting, e.g.
val DEFAULT_HADOOP_VERSION = "1.2.0"
Run sbt/sbt assembly which creates a giant jumbo jar containing all of Spark in assembly/target/scala*/spark-assembly-<spark-version>-SNAPSHOT-<hadoop-version>.jar.
Copy assembly/target/scala*/spark-assembly-<spark-version>-SNAPSHOT-<hadoop-version>.jar to the same directory as the runtime script simr and follow the instructions below to build simr-<hadoop-version>.jar.

Step 2: Building SIMR

Checkout the SIMR repository from https://github.com/databricks/simr.git
Copy the Spark jumbo jar into the SIMR lib/ directory.
Important: Ensure the Spark jumbo jar is named spark-assembly.jar when placed in the lib/ directory, otherwise it will be included in the SIMR jumbo jar.
Run sbt/sbt assembly in the root of the SIMR directory. This will build the SIMR jumbo jar which will be output as target/scala*/simr.jar.
Copy target/scala*/simr.jar to the same directory as the runtime script simr and follow the instructions above to execute SIMR.

How it works under the hood

SIMR launches a Hadoop MapReduce job that only contains mappers. It ensures that a jumbo jar (simr.jar), containing Scala and Spark, gets uploaded to the machines of the mappers. It also ensures that the job jar you specified gets shipped to those nodes.

Once the mappers are all running with the right dependencies in place, SIMR uses HDFS to do leader election to elect one of the mappers as the Spark driver. SIMR then executes your job driver, which uses a new SIMR scheduler backend that generates and accepts driver URLs of the form simr://path. SIMR thereafter communicates the new driver URL to all the mappers, which then start Spark executors. The executors connect back to the driver, which executes your program.

All output to stdout and stderr is redirected to the specified HDFS directory. Once your job is done, the SIMR backend scheduler has additional functionality to shut down all the executors (hence the new required call to stop()).

About

SIMR was jointly developed by databricks and UC Berkeley AMPLab under the Apache license.