“Every problem becomes childish when once it is explained to you”

Wednesday, August 5, 2015

Apache Spark- Getting Started with a Java Hello World

Apache Spark™ is a fast, general-purpose engine for large-scale data processing. Spark can be integrated with an existing Hadoop ecosystem.

According to the Spark project, programs can run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Applications can be written in several languages, including Java, Scala and Python.


Quick Start: Java Spark Application in Local Mode

1. Create a simple Maven project with the pom.xml below
 
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.semanticeyes.sparkhelloworld</groupId>
  <artifactId>spark-helloworld</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>Spark Helloworld</name>
  <packaging>jar</packaging>
  <repositories>
    <repository>
      <id>apache</id>
      <url>https://repository.apache.org/content/repositories/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.1.1</version>
      <!-- The "provided" scope is left out so that Spark is on the classpath when the
           program is run from the IDE. Add <scope>provided</scope> here only when the
           jar is launched with spark-submit, which supplies Spark itself. -->
    </dependency>
  </dependencies>
</project>

2. Create a HelloWorld Java class
package com.semanticeyes.sparkhelloworld;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class HelloWorld {
    public static void main(String[] args) {

        // Local mode: Spark runs inside this JVM, no cluster required
        SparkConf sparkConf = new SparkConf().setAppName("HelloWorld").setMaster("local");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);

        // Turn a plain Java list into an RDD
        String[] arr = new String[] { "John", "Paul", "Gavin", "Rahul", "Angel" };
        List<String> inputList = Arrays.asList(arr);
        JavaRDD<String> inputRDD = ctx.parallelize(inputList);

        // Print every element of the RDD
        inputRDD.foreach(new VoidFunction<String>() {

            public void call(String input) throws Exception {
                System.out.println(input);
            }
        });

        // Release Spark resources once the job is done
        ctx.stop();
    }
}

3. That's all. Your first Spark program is ready to run

This program runs in local mode, i.e. you can run it directly from your IDE.
You don't need a cluster to run it.

When creating the Spark configuration, the master is set to "local", so Spark runs inside the same JVM and the program is standalone.
SparkConf sparkConf = new SparkConf().setAppName("HelloWorld").setMaster("local");

An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. In cluster mode, these Spark-specific objects play a significant role in the fast, distributed processing of a dataset.


In the code snippet below, a normal Java collection is converted into a JavaRDD.

JavaRDD<String> inputRDD = ctx.parallelize(inputList);

This conversion is the first phase of a Spark program; it gives the input dataset its distributed nature.
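To make the idea of partitions concrete, here is a minimal plain-JDK sketch of cutting a list into contiguous slices. It is not Spark's implementation: the class name PartitionSketch and the helper slice are hypothetical, and Spark's own partitioning logic may differ. In Spark itself, the number of slices can be requested through the two-argument form ctx.parallelize(inputList, numSlices).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-JDK illustration only: this is NOT how Spark is implemented.
public class PartitionSketch {

    // Hypothetical helper: splits the input into numSlices contiguous
    // chunks of near-equal size, preserving element order.
    static <T> List<List<T>> slice(List<T> input, int numSlices) {
        if (numSlices < 1) {
            throw new IllegalArgumentException("numSlices must be >= 1");
        }
        List<List<T>> partitions = new ArrayList<List<T>>();
        int size = input.size();
        for (int i = 0; i < numSlices; i++) {
            // Integer arithmetic spreads any remainder across the slices.
            int start = (int) ((long) i * size / numSlices);
            int end = (int) ((long) (i + 1) * size / numSlices);
            partitions.add(new ArrayList<T>(input.subList(start, end)));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("John", "Paul", "Gavin", "Rahul", "Angel");
        List<List<String>> partitions = slice(names, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + ": " + partitions.get(i));
        }
    }
}
```

With two slices, the five names end up as [John, Paul] and [Gavin, Rahul, Angel]. On a cluster, each such slice could be processed on a different node.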

Functional programming is a paradigm heavily used by modern distributed systems. Hadoop is an early example: its processing is built on the MapReduce model.
Spark gives developers the power of functional programming more broadly: they can build a program from any functional operations, without limiting their thinking to MapReduce.
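As a rough illustration of chaining functional operations, the sketch below filters, maps and reduces the same list of names. It deliberately uses the JDK 8 java.util.stream API as a local stand-in so that it runs without Spark; JavaRDD offers filter, map and reduce methods of a similar shape, but this is not Spark code, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Local stand-in using JDK 8 streams; requires JDK 8 or later.
public class FunctionalChainSketch {

    // Keeps names longer than four characters, maps each one to its
    // length, and reduces the lengths to a single sum.
    static int totalLengthOfLongNames(List<String> names) {
        return names.stream()
                .filter(name -> name.length() > 4)
                .map(String::length)
                .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("John", "Paul", "Gavin", "Rahul", "Angel");
        // Gavin, Rahul and Angel pass the filter: 5 + 5 + 5
        System.out.println(totalLengthOfLongNames(names)); // prints 15
    }
}
```

Each step is a small function applied to the whole dataset, which is the style that lets a framework distribute the work.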

In the code snippet below, a function object is passed to the foreach method of the input dataset. The call method is applied to each element, so the per-element logic goes inside the call method.



inputRDD.foreach(new VoidFunction<String>() {

    public void call(String input) throws Exception {
        System.out.println(input);
    }
});

VoidFunction is an interface that ships with Spark's Java API, in the org.apache.spark.api.java.function package. That package contains a list of such interfaces, which bring a functional programming style to the Java language. From JDK 8 onwards, Java supports this style natively through lambda expressions.
The aforementioned interfaces can be used with JDK 6 and 7, where Spark relies on Java's anonymous classes to provide the functional style.
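The difference between the two styles can be sketched without Spark on the classpath. In the example below, VoidFn and forEach are hypothetical stand-ins that only mirror the shape of the call method used earlier; they are not Spark classes. Because the callback type has a single method, JDK 8 accepts a lambda wherever JDK 6 and 7 need an anonymous class. With Spark and JDK 8, the equivalent call would presumably be inputRDD.foreach(input -> System.out.println(input)).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CallbackStyles {

    // Hypothetical stand-in: one method, takes a value, returns nothing.
    interface VoidFn<T> {
        void call(T input) throws Exception;
    }

    // Local stand-in for foreach: applies fn to every element, in order.
    static <T> void forEach(List<T> items, VoidFn<T> fn) throws Exception {
        for (T item : items) {
            fn.call(item);
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> names = Arrays.asList("John", "Paul", "Gavin");
        final List<String> seen = new ArrayList<String>();

        // JDK 6/7 style: anonymous inner class.
        forEach(names, new VoidFn<String>() {
            @Override
            public void call(String input) {
                seen.add("anon:" + input);
            }
        });

        // JDK 8 style: a lambda fits any single-method interface.
        forEach(names, input -> seen.add("lambda:" + input));

        System.out.println(seen);
    }
}
```

Both calls do the same work; the lambda simply removes the boilerplate around the call method.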



List of classes in Spark-Java for Functional Programming



This is an introductory tutorial on Spark programming with Java (6 and 7). Cluster mode execution and further topics will be covered in upcoming tutorials.

13 comments:

  1. This is one such interesting and useful article that i have ever read. The way you have structured the content is so realistic and meaningful. Thank you so much for sharing this in here. Keep up this good work and I'm expecting more contents like this from you in future.

  2. Mmm... it doesn't work. I just got this:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/api/java/function/VoidFunction
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:122)
    Caused by: java.lang.ClassNotFoundException: org.apache.spark.api.java.function.VoidFunction
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 3 more

    Replies
    1. Please try rebuilding the Maven project,
      either through the IDE or from the console with mvn clean install.
      It will download all the dependencies for you.

    2. Remove the provided scope from the Spark dependency in pom.xml and build it again.

  3. I built it successfully, but when I tried to run it as a Java application, I got the following output:
    [97;97;98;99;13p [0m

    Should I select the spark-submit method to run the application, if so can you please give the instructions to run in local mode?

  4. I figured it out. I ran it from the command line. Thank you so much for the nice blog. Very helpful!

    ./bin/spark-submit \
    --class HelloWorld \
    --master local[8] \
    /path/to/spark-helloworld-0.0.1-SNAPSHOT.jar
