“From the unreal, lead us to the Real; from darkness, lead us unto Light; from death, lead us to Immortality. Om peace, peace, peace.”

Friday, July 8, 2016

Spark Hello World with Java 8 and Maven





Java 8 is a major release for Java. Support for functional programming, with new additions like lambda expressions and the Stream API, makes Java a feature-rich programming language. Here we will see how these new features can be leveraged to write big data applications using the popular Spark framework.
Spark is written in Scala, which is naturally a functional programming language. Even though most of the Spark libraries can be accessed via its Java API, it wasn't really straightforward to write Spark programs with Java 7, owing to Java 7's lack of functional constructs.
Java 8 makes it easy to write Spark programs with its functional features.

The below program is written in Java 8 on Apache Spark. Those who want to see the same in Java 7 can refer to my previous blog post.

Eclipse IDE is used to create and run this program.

Follow the steps below to quickly set up a sample project.

  • Create a simple Maven project and update pom.xml with the below configuration


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.semanticeyes.sparkhelloworld</groupId>
    <artifactId>spark-helloworld</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>Spark Helloworld</name>
    <packaging>jar</packaging>
    <repositories>
        <repository>
            <id>apache</id>
            <url>https://repository.apache.org/content/repositories/releases</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.2.0</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
    </dependencies>
    <properties>
        <java.version>1.8</java.version>
    </properties>
    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                    <configuration>
                        <source>${java.version}</source>
                        <target>${java.version}</target>
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <version>2.2.2</version>
                    <configuration>
                        <descriptors>
                            <descriptor>src/main/assembly/assembly.xml</descriptor>
                        </descriptors>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>


  • Create a HelloWorld Java class

package com.semanticeyes.sparkhelloworld;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HelloSpark {
    public static void main(String[] args) {

        // Local mode
        JavaSparkContext sc = new JavaSparkContext("local", "HelloSpark");
        String[] arr = new String[] { "John", "Paul", "Gavin", "Rahul", "Angel" };
        List<String> inputList = Arrays.asList(arr);
        JavaRDD<String> inputRDD = sc.parallelize(inputList);

        // Java 8 lambda in place of an anonymous VoidFunction
        inputRDD.foreach(x -> System.out.println(x));

        sc.stop();
    }
}


That's it. Go ahead and execute it.
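If you want to exercise the functional style a little further, here is a minimal sketch chaining transformations with Java 8 lambdas. It assumes the same sc and inputRDD as in the program above; the length filter is purely illustrative.

// Transformations are lazy; collect() triggers the actual computation
JavaRDD<String> longNames = inputRDD
        .filter(name -> name.length() > 4)   // keep names longer than 4 chars
        .map(name -> name.toUpperCase());    // transform each element
longNames.collect().forEach(System.out::println); // GAVIN, RAHUL, ANGEL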






Wednesday, August 5, 2015

Apache Spark- Getting Started with a Java Hello World















Apache Spark™ is a fast and general engine for large-scale data processing. Spark can be integrated with the existing Hadoop ecosystem.

It can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Applications can be written in many languages, including Java, Scala and Python.


Quick Start - Java Spark Application - Local Mode

1. Create a simple Maven project with the below pom.xml
 
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.semanticeyes.sparkhelloworld</groupId>
    <artifactId>spark-helloworld</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>Spark Helloworld</name>
    <packaging>jar</packaging>
    <repositories>
        <repository>
            <id>apache</id>
            <url>https://repository.apache.org/content/repositories/releases</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.1.1</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
</project>

2. Create a HelloWorld Java class
package com.semanticeyes.sparkhelloworld;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class HelloWorld {
    public static void main(String[] args) {

        // Local mode
        SparkConf sparkConf = new SparkConf().setAppName("HelloWorld").setMaster("local");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        String[] arr = new String[] { "John", "Paul", "Gavin", "Rahul", "Angel" };
        List<String> inputList = Arrays.asList(arr);
        JavaRDD<String> inputRDD = ctx.parallelize(inputList);

        // Print each element of the RDD
        inputRDD.foreach(new VoidFunction<String>() {
            public void call(String input) throws Exception {
                System.out.println(input);
            }
        });

        ctx.stop();
    }
}

3. That's all. Your first Spark program is ready to execute.

This program runs in local mode, i.e., you can execute it directly from your IDE.
You don't need a cluster to run this program.

While creating the Spark configuration, the master is set to local to make this a standalone program.
SparkConf sparkConf = new SparkConf().setAppName("HelloWorld").setMaster("local");

An RDD is an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. In cluster mode execution, these Spark-specific objects play a significant role in the fast, distributed processing of the dataset.
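For example, a transformation never modifies an RDD in place; it always returns a new one. A minimal sketch, reusing the inputRDD built in the program above (requires import org.apache.spark.api.java.function.Function):

// map() produces a brand new RDD; inputRDD itself is untouched
JavaRDD<Integer> nameLengths = inputRDD.map(new Function<String, Integer>() {
    public Integer call(String name) {
        return name.length();
    }
});
// inputRDD still holds the original strings; nameLengths is a separate RDD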


In the below code snippet, a normal Java collection is converted into a JavaRDD.

JavaRDD<String> inputRDD = ctx.parallelize(inputList);

The first phase of a Spark program is this conversion, which gives the input dataset its distributed nature.
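parallelize can also be told how many partitions (slices) to split the dataset into; a small sketch, where the partition count of 2 is arbitrary:

// Split the input into two partitions explicitly
JavaRDD<String> inputRDD = ctx.parallelize(inputList, 2);
System.out.println("partitions: " + inputRDD.partitions().size()); // prints 2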

Functional programming is a paradigm heavily utilized by modern distributed systems. Hadoop is the first on that list, using the MapReduce model for its processing.
Spark gives the power of functional programming to its developers: a developer can code with any functional programming method, without limiting their thinking to MapReduce.

In the below code snippet, a function object is passed over the input dataset. For each element, the call method is applied. We can write our transformation/logic inside the call method.



inputRDD.foreach(new VoidFunction<String>() {
    public void call(String input) throws Exception {
        System.out.println(input);
    }
});

VoidFunction is an inbuilt type that comes with Spark's Java package. A list of such types is available in the org.apache.spark.api.java.function package; they bring a functional programming flavour to the Java programming language. From JDK 8 onwards, Java supports functional programming by default, but the aforementioned types can also be used with JDK 6 and 7: Spark utilises Java's anonymous class feature to achieve the functional style.
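To see how the two styles line up, here is the same foreach written both ways; the two are functionally identical, but the lambda form requires JDK 8.

// JDK 6/7: anonymous class implementing Spark's VoidFunction
inputRDD.foreach(new VoidFunction<String>() {
    public void call(String input) {
        System.out.println(input);
    }
});

// JDK 8: the same operation as a lambda expression
inputRDD.foreach(input -> System.out.println(input));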



List of classes in Spark-Java for Functional Programming

The org.apache.spark.api.java.function package includes, among others, Function, Function2, FlatMapFunction, PairFunction, PairFlatMapFunction, DoubleFunction and VoidFunction.

This is an introductory tutorial on Spark programming with Java (6 & 7). Cluster mode execution and further topics will be covered in the next tutorials.

Saturday, July 2, 2011

DRY your code with Hibernate Naming Strategy

DRY, KISS, zero configuration, configuration by default, etc. are terms we frequently hear from the world of agile software development. In modern software development, a lot of significance is attached to productivity, and various approaches, including those above, are applied to achieve it. A naming strategy is one such approach to reduce your code, or put another way, to remove boilerplate code from your project.
Two kinds of mapping are used in Hibernate ORM: XML based and annotation based. Both require some work to map the entity class to the corresponding table. By using a naming strategy, a lot of that work can be avoided while keeping the coding standards at a high level of perfection.

Here I am explaining one naming strategy that I am using: the ImprovedNamingStrategy, which comes from the master, Gavin King.
The naming strategy can be modified by writing your own variant, and in some exceptional cases you can still use either annotations or XML to specify special behaviour for particular fields.

Here I am listing down a few pieces of code to explain the functionality of ImprovedNamingStrategy.


1) hibernate.cfg.xml

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE hibernate-configuration PUBLIC
    "-//Hibernate/Hibernate Configuration DTD 3.0//EN"
    "http://hibernate.sourceforge.net/hibernate-configuration-3.0.dtd">
<hibernate-configuration>
    <session-factory>
        <property name="connection.driver_class">com.mysql.jdbc.Driver</property>
        <property name="connection.url">jdbc:mysql://localhost:3306/naming_strategy</property>
        <property name="connection.username">username</property>
        <property name="connection.password">password</property>
        <!-- JDBC connection pool (use the built-in) -->
        <property name="connection.pool_size">1</property>
        <!-- SQL dialect -->
        <property name="dialect">org.hibernate.dialect.MySQLDialect</property>
        <!-- Enable Hibernate's automatic session context management -->
        <property name="current_session_context_class">thread</property>
        <!-- Disable the second-level cache -->
        <property name="cache.provider_class">org.hibernate.cache.NoCacheProvider</property>
        <!-- Echo all executed SQL to stdout -->
        <property name="show_sql">true</property>
        <!-- Re-create the database schema on startup -->
        <property name="hbm2ddl.auto">create</property>
        <!-- Annotated entity classes -->
        <mapping class="UserInfo"/>
    </session-factory>
</hibernate-configuration>

2) UserInfo.java (Entity class)

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

@Entity
public class UserInfo {

    @Id
    @GeneratedValue
    private Integer id;
    private String email;
    private String firstName;

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getEmail() {
        return email;
    }

    public void setEmail(String email) {
        this.email = email;
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }
}


3) UserInfoDao.java

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.cfg.AnnotationConfiguration;
import org.hibernate.cfg.ImprovedNamingStrategy;

public class UserInfoDao {

    public void save() {
        UserInfo userInfo = new UserInfo();
        userInfo.setEmail("naming@hibernate.com");
        userInfo.setFirstName("Improved Naming Strategy");

        // Plug in ImprovedNamingStrategy before building the SessionFactory
        SessionFactory sessionfactory = new AnnotationConfiguration()
                .configure().setNamingStrategy(new ImprovedNamingStrategy())
                .buildSessionFactory();
        Session session = sessionfactory.getCurrentSession();
        Transaction tx = session.beginTransaction();
        session.save(userInfo);
        tx.commit();
        sessionfactory.close();
    }
}



4) Main.java

public class Main {
    public static void main(String[] args) {
        UserInfoDao dao = new UserInfoDao();
        dao.save();
    }
}
In the above code, you don't see any nasty XML mapping files, and the entity class looks clean and simple in the absence of @Column annotations everywhere. But if you want, you can still write those to override the behaviour in exceptional cases.

The table will be created automatically with the following naming convention.

Table: user_info
Fields: id, email, first_name


This naming can be changed by creating your own naming strategy, implementing Hibernate's NamingStrategy interface.
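For instance, here is a minimal sketch of a custom strategy built on top of ImprovedNamingStrategy; the PrefixedNamingStrategy class and the tbl_ prefix are hypothetical.

import org.hibernate.cfg.ImprovedNamingStrategy;

// Hypothetical example: prefix every generated table name with "tbl_"
public class PrefixedNamingStrategy extends ImprovedNamingStrategy {
    @Override
    public String classToTableName(String className) {
        return "tbl_" + super.classToTableName(className);
    }
}

Plug it in exactly as before with setNamingStrategy(new PrefixedNamingStrategy()), and UserInfo would then map to tbl_user_info.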


The following libraries are used for the above application:
hibernate3.jar
antlr-2.7.6.jar
commons-collections-3.1.jar
dom4j-1.6.1.jar
javassist-3.9.0.GA.jar
jta-1.1.jar
log4j-1.2.16.jar
hibernate-jpa-2.0-api-1.0.0.Final.jar
slf4j-api-1.6.1.jar
mysql-connector-java-5.1.13-bin.jar

References:
1. ImprovedNamingStrategy
2. DZone article
3. DRY
4. KISS

Wednesday, June 29, 2011

Common Uses of Apache Commons-Part2(Builders)

The toString() method is an easy way to debug Java objects, and toString code can be generated easily using IDEs. But when following an agile methodology, it is common to add/remove/rename fields in classes, which requires rewriting the toString method every time. This can be avoided by using the reflection-based toString builder from Apache Commons; equals, hashCode and compareTo can be handled in a similar way. Due to possible security-manager permission and performance issues with reflection, it is good practice to use these builders without reflection where that matters.
They produce clean and maintainable code for you. The following examples explain the usage of such builders.

ToStringBuilder

Using Reflection:-

public String toString() {
    return ReflectionToStringBuilder.toString(this);
}


Without Reflection:-

public String toString() {
    return new ToStringBuilder(this)
            .append("Name", name)
            .append("Age", age)
            .toString();
}


EqualsBuilder

Using Reflection:-

public boolean equals(Object that) {
    return EqualsBuilder.reflectionEquals(this, that);
}


Without Reflection:-

public boolean equals(Object obj) {
    if (!(obj instanceof MyBean)) { // MyBean stands for the enclosing class
        return false;
    }
    MyBean that = (MyBean) obj;
    return new EqualsBuilder()
            .append(this.field1, that.field1)
            .append(this.field2, that.field2)
            .isEquals();
}


CompareToBuilder

Using Reflection:-

public int compareTo(Object rhs) {
    return CompareToBuilder.reflectionCompare(this, rhs);
}


Without Reflection:-

public int compareTo(Object rhs) {
    MyBean that = (MyBean) rhs; // MyBean stands for the enclosing class
    return new CompareToBuilder()
            .append(this.name, that.name)
            .toComparison();
}


HashCodeBuilder

Using Reflection:-

public int hashCode() {
    return HashCodeBuilder.reflectionHashCode(this);
}


Without Reflection:-

public int hashCode() {
    HashCodeBuilder builder = new HashCodeBuilder();
    return builder.append(this.value1)
            .append(this.value2)
            .toHashCode();
}

Monday, June 27, 2011

Common uses of Apache Commons

Apache Commons is a collection of reusable Java components which extend the official Java functionality. These libraries have helped me reduce my code complexity.
Here I am listing some handy methods that I have used in my projects.

1. StringUtils (a short usage sketch follows this list)
a) IsEmpty/IsBlank - checks if a String contains text
b) Trim/Strip - removes leading and trailing whitespace
c) Split/Join - splits a String into an array of substrings and vice versa
d) IsAlpha/IsNumeric/IsWhitespace/IsAsciiPrintable - checks the characters in a String
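A quick sketch of these in action, assuming commons-lang 2.x on the classpath:

import org.apache.commons.lang.StringUtils;

public class StringUtilsDemo {
    public static void main(String[] args) {
        System.out.println(StringUtils.isBlank("   "));      // true
        System.out.println(StringUtils.strip("  hello  ")); // "hello"
        String[] parts = StringUtils.split("a,b,c", ',');   // [a, b, c]
        System.out.println(StringUtils.join(parts, '-'));   // a-b-c
        System.out.println(StringUtils.isNumeric("12345")); // true
    }
}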


2. BeanUtils (see the sketch after this list)
a) copyProperties(Object dest, Object orig) - copies property values from the origin bean to the destination bean for all cases where the property names are the same.
b) describe(Object bean) - returns the entire set of properties for which the specified bean provides a read method.
c) populate(Object bean, Map properties) - populates the JavaBeans properties of the specified bean, based on the specified name/value pairs.
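A sketch of the first two methods, assuming commons-beanutils on the classpath and a simple UserInfo bean with the usual getters and setters (like the one in the naming strategy post above):

import java.util.Map;

import org.apache.commons.beanutils.BeanUtils;

public class BeanUtilsDemo {
    public static void main(String[] args) throws Exception {
        UserInfo orig = new UserInfo();
        orig.setEmail("john@example.com");

        UserInfo dest = new UserInfo();
        BeanUtils.copyProperties(dest, orig);      // note: destination comes first
        System.out.println(dest.getEmail());       // john@example.com

        Map properties = BeanUtils.describe(orig); // property name -> String value
        System.out.println(properties);
    }
}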


3. CollectionUtils (see the sketch after this list)
a) intersection(java.util.Collection a, java.util.Collection b) - returns a Collection containing the intersection of the given Collections.
b) isEmpty(java.util.Collection coll) - null-safe check if the specified collection is empty.
c) subtract(java.util.Collection a, java.util.Collection b) - returns a new Collection containing a - b.
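And a sketch for these, using commons-collections 3.x (which predates generics, hence the raw types):

import java.util.Arrays;
import java.util.List;

import org.apache.commons.collections.CollectionUtils;

public class CollectionUtilsDemo {
    public static void main(String[] args) {
        List a = Arrays.asList("a", "b", "c");
        List b = Arrays.asList("b", "c", "d");
        System.out.println(CollectionUtils.intersection(a, b)); // contains b and c
        System.out.println(CollectionUtils.subtract(a, b));     // contains only a
        System.out.println(CollectionUtils.isEmpty(null));      // true
    }
}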