Spark.jl


Julia interface to Apache Spark.

See Roadmap for current status.

Installation

Spark.jl requires Java 7 or newer and Maven to be installed and available on the PATH.

Pkg.clone("https://github.com/dfdx/Spark.jl")
Pkg.build("Spark")
# we also need the latest master of JavaCall.jl
Pkg.checkout("JavaCall")

This will download and build all Julia and Java dependencies. To use Spark.jl, type:

using Spark
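
Note: the Pkg.clone / Pkg.checkout commands above are the Julia 0.6-era package API. On Julia 1.x, the closest equivalent for an unregistered package would be the following sketch (assuming the same repository URL):

using Pkg
Pkg.add(PackageSpec(url="https://github.com/dfdx/Spark.jl"))
Pkg.build("Spark")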

RDD Interface: Examples

All examples below are runnable from the REPL.

Count lines in a text file

sc = SparkContext(master="local")   # connect to a local Spark instance
path = "file:///var/log/syslog"
txt = text_file(sc, path)           # create an RDD of lines
count(txt)                          # number of lines in the file
close(sc)

Map / Reduce on a standalone master, with an application name

sc = SparkContext(master="spark://spark-standalone:7077", appname="Say 'Hello!'")
path = "file:///var/log/syslog"
txt = text_file(sc, path)
rdd = map(txt, line -> length(split(line)))
reduce(rdd, +)
close(sc)

NOTE: currently, named Julia functions cannot be fully serialized, so functions passed to executors must either already be defined there (e.g. in a preinstalled library) or be anonymous functions.
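
For example (a sketch; words_in is a hypothetical helper used only for illustration):

# A named function defined only on the driver may not serialize to executors:
words_in(line) = length(split(line))
# rdd = map(txt, words_in)   # may fail: the named function is not shipped to executors

# The equivalent anonymous function serializes fine:
rdd = map(txt, line -> length(split(line)))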

Map partitions on Mesos and HDFS

sc = SparkContext(master="mesos://mesos-master:5050")
path = "hdfs://namenode:8020/user/hdfs/test.log"
txt = text_file(sc, path)
rdd = map_partitions(txt, it -> filter(line -> contains(line, "a"), it))
collect(rdd)
close(sc)

For the full supported API, see the list of exported functions.
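
One way to inspect that list from the REPL is Julia's standard names function (plain Julia, not specific to Spark.jl):

using Spark
names(Spark)   # symbols exported by the Spark module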

SQL Interface: Examples

All examples assume that you have a file people.json with content like this:

{"name": "Alice", "age": 27}
{"name": "Bob", "age": 32}

Read a dataframe from JSON and collect it to the driver:

spark = SparkSession()
df = read_json(spark, "/path/to/people.json")
collect(df)   # fetch all rows to the driver

Read JSON and write Parquet:

spark = SparkSession()
df = read_json(spark, "/path/to/people.json")
write_parquet(df, "/path/to/people.parquet")
