Setup Spark, Scala and Maven with Intellij IDEA

Setup Spark, Scala and Maven with Intellij IDEA

IntelliJ IDEA supports Scala and Apache Spark perfectly. You're able to browse a complete Spark project built with IntelliJ IDEA on GitHub: https://github.com/vinta/albedo

Useful Plugins:

Initiate a Maven Project

$ mvn archetype:generate
Choose a number: xxx
xxx: remote -> net.alchim31.maven:scala-archetype-simple

ref:
https://docs.scala-lang.org/tutorials/scala-with-maven.html

Example Configurations

The remaining section of this article assumes that you use this pom.xml which should be able to work out of the box.

<!-- in pom.xml -->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>ws.vinta</groupId>
  <artifactId>albedo</artifactId>
  <version>1.0.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>${project.artifactId}</name>
  <description>A recommender system for discovering GitHub repos</description>
  <url>https://github.com/vinta/albedo</url>
  <inceptionYear>2017</inceptionYear>
  <properties>
    <java.version>1.8</java.version>
    <scala.version>2.11.8</scala.version>
    <scala.compactVersion>2.11</scala.compactVersion>
    <spark.version>2.2.0</spark.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <repositories>
    <repository>
      <id>spark-packages</id>
      <name>Spark Packages Repository</name>
      <url>https://dl.bintray.com/spark-packages/maven/</url>
    </repository>
  </repositories>
  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.compactVersion}</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.compactVersion}</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.11 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_${scala.compactVersion}</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.42</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client -->
    <dependency>
      <groupId>org.elasticsearch.client</groupId>
      <artifactId>elasticsearch-rest-high-level-client</artifactId>
      <version>5.6.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.hankcs/hanlp -->
    <dependency>
      <groupId>com.hankcs</groupId>
      <artifactId>hanlp</artifactId>
      <version>portable-1.3.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.github.rholder/snowball-stemmer -->
    <dependency>
      <groupId>com.github.rholder</groupId>
      <artifactId>snowball-stemmer</artifactId>
      <version>1.3.0.581.1</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.6.1</version>
        <configuration>
          <source>${java.version}</source>
          <target>${java.version}</target>
          <encoding>UTF-8</encoding>
        </configuration>
      </plugin>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.1</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-install-plugin</artifactId>
        <version>2.5.2</version>
        <configuration>
          <skip>true</skip>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.1.0</version>
        <executions>
          <execution>
           <phase>install</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                    <exclude>META-INF/*.SF</exclude>
                  </excludes>
                </filter>
              </filters>
              <artifactSet>
                <excludes>
                  <exclude>com.apple:AppleJavaExtensions:*</exclude>
                  <exclude>javax.servlet:*</exclude>
                  <exclude>org.apache.hadoop:*</exclude>
                  <exclude>org.apache.maven.plugins:*</exclude>
                  <exclude>org.apache.parquet:*</exclude>
                  <exclude>org.apache.spark:*</exclude>
                  <exclude>org.scala-lang:*</exclude>
                </excludes>
              </artifactSet>
              <finalName>${project.artifactId}-${project.version}-uber</finalName>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

ref:
https://davidb.github.io/scala-maven-plugin/example_compile.html

Generate a Thin JAR

Thin JAR only contains classes that you created, which means you should include your dependencies externally.

$ mvn clean package -DskipTests

You're able to specify different classes in the same JAR.

$ spark-submit \
--master spark://localhost:7077 \
--packages "mysql:mysql-connector-java:5.1.41" \
--class ws.vinta.albedo.LogisticRegressionRanker \
target/albedo-1.0.0-SNAPSHOT.jar

ref:
https://stackoverflow.com/questions/1082580/how-to-build-jars-from-intellij-properly
https://spark.apache.org/docs/latest/submitting-applications.html

Generate a Fat JAR, Shaded JAR or Uber JAR

CAUTION: DO NOT ENABLE <minimizeJar>true</minimizeJar> in the maven-shade-plugin, it will ruin your day!

<!-- in pom.xml -->
<project>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-install-plugin</artifactId>
        <version>2.5.2</version>
        <configuration>
          <skip>true</skip>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.1.0</version>
        <executions>
          <execution>
            <phase>install</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <artifactSet>
                <excludes>
                  <exclude>com.apple:AppleJavaExtensions:*</exclude>
                  <exclude>javax.servlet:*</exclude>
                  <exclude>org.apache.hadoop:*</exclude>
                  <exclude>org.apache.maven.plugins:*</exclude>
                  <exclude>org.apache.parquet:*</exclude>
                  <exclude>org.apache.spark:*</exclude>
                  <exclude>org.scala-lang:*</exclude>
                </excludes>
              </artifactSet>
              <finalName>${project.artifactId}-${project.version}-uber</finalName>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
# the output jar will be located in "target/albedo-1.0.0-SNAPSHOT-uber.jar"
$ mvn clean install -DskipTests
$ spark-submit \
--master spark://localhost:7077 \
--class ws.vinta.albedo.LogisticRegressionRanker \
target/albedo-1.0.0-SNAPSHOT-uber.jar

ref:
http://maven.apache.org/plugins/maven-shade-plugin/examples/includes-excludes.html
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/missing_dependencies_in_jar_files.html

Run a Spark Application in Local Mode

Follow "Run > Edit Configurations":

  • VM options: -Xms12g -Xmx12g -Dspark.master="local[*]"
  • Before launch:
    • Build

Or

// in YourSparkApp.scala
package ws.vinta.albedo

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object YourSparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .set("spark.driver.memory", "12g")

    val spark = SparkSession
      .builder()
      .appName("YourSparkApp")
      .config(conf)
      .getOrCreate()

    spark.stop()
  }
}

ref:
https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-sparkcontext-creating-instance-internals.adoc
https://stackoverflow.com/questions/43054268/how-to-set-spark-memorystore-size-when-running-in-intellij-scala-console

Run a Spark Application in Standalone Mode

First, start your Spark Standalone cluster:

$ cd ${SPARK_HOME}
$ ./sbin/start-master.sh -h 0.0.0.0
$ ./sbin/start-slave.sh spark://localhost:7077

# print logs from Spark master and workers, useful for debuging
$ tail -f ${SPARK_HOME}/logs/*

Follow "Run > Edit Configurations":

  • VM options: -Dspark.master=spark://localhost:7077 -Dspark.driver.memory=2g -Dspark.executor.memory=12g -Dspark.executor.cores=3
    • Local cluster mode: -Dspark.master="local-cluster[x, y, z]" with x workers, y cores per worker, and z MB memory per worker
    • Local cluster mode doesn't need a real Spark Standalone cluster
  • Before launch:
    • Build
    • Run Maven Goal 'albedo: clean install'

Or

// in YourSparkApp.scala
package ws.vinta.albedo

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object YourSparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://localhost:7077")
      .set("spark.driver.memory", "2g")
      .set("spark.executor.memory", "12g")
      .set("spark.executor.cores", "3")
      .setJars(List("target/albedo-1.0.0-SNAPSHOT-uber.jar"))
      // or
      .setMaster("local-cluster[1, 3, 12288]")
      .setJars(List("target/albedo-1.0.0-SNAPSHOT-uber.jar"))

    val spark = SparkSession
      .builder()
      .appName("YourSparkApp")
      .config(conf)
      .getOrCreate()

    spark.stop()
  }
}

In the end, there are some glossaries which need to be clarified:

  • compile: compile a single .java or .scala into *.class
  • make: compile changed files only
  • build: compile every files in the project

ref:
http://www.jianshu.com/p/b4e4658c459c

Specify a Custom Logging Configuration

$ cd PROJECT_ROOT
$ cp $SPARK_HOME/conf/log4j.properties.template log4j.properties

Follow "Run > Edit Configurations":

  • VM options: -Dlog4j.configuration=file:./log4j.properties

ref:
https://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
https://stackoverflow.com/questions/43054268/how-to-set-spark-memorystore-size-when-running-in-intellij-scala-console
https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Config-log4j-in-Spark/td-p/34968

Upload your Java Artifacts to Maven Central Repository

Upload your Java Artifacts to Maven Central Repository

你需要:

  1. 一個使用 Maven 管理的 Java project(廢話)
  2. 一個 GPG key(deploy 的時候會用來 sign 要提交的 .jar)
  3. 一個 Sonatype JIRA 的帳號
  4. 開一張 JIRA 的 ticket 告訴 Sonatype 的人你要發佈 library,告知他們你的 groupId
  5. 按照 Requirements 的指示完善你的 pom.xml
  6. deploy 到 snapshot repository
  7. deploy 到 staging repository
  8. 在 OSSRH 的 Staging Repositories 把你剛剛 deploy 的 library 給 close 掉,這樣才算是 release
  9. 回到那張 ticket,通知 Sonatype 讓他們把你的 library 同步到 Maven Central Repositir

最後一個步驟只有第一次 release 的時候才需要
之後 release 就會自動同步了

Requirements

http://maven.apache.org/guides/mini/guide-central-repository-upload.html
http://central.sonatype.org/pages/requirements.html
http://central.sonatype.org/pages/ossrh-guide.html
http://central.sonatype.org/pages/apache-maven.html
http://central.sonatype.org/pages/releasing-the-deployment.html

參考 Pangu.java 的 pom.xml
https://github.com/vinta/pangu.java/blob/master/pom.xml

Deployment

You need following plugins:

  • maven-source-plugin
  • maven-javadoc-plugin
  • maven-gpg-plugin
  • nexus-staging-maven-plugin
  • maven-release-plugin

deploy 之前
必須確定你的 local 的程式碼跟 scm 的程式碼是同步的
如果你要發布 1.0.0 版本的話
你的 pom.xml 裡要寫 1.0.0-SNAPSHOT
然後執行:

# deploy to snapshot repository
$ mvn clean deploy

你可以在 https://oss.sonatype.org/ 搜尋到
SNAPSHOT 版本測試都沒問題之後(當然你要先設定讓 Maven 能夠下載 SNAPSHOT 版本的 libraries)
就可以正式 release 了:

# cleanup for the release
$ mvn release:clean

# 要回答一些關於版本號的問題
# 它會自動幫你新增一個 tag 並且把 pom.xml 裡的 `<version>` 改成下個版本
$ mvn release:prepare

# deploy to staging repository
# 然後 Maven 會把上一步新增的 git tag 和 pom.xml 的變更直接 push 到 GitHub
$ mvn release:perform

Maven 會自動在 library 進到 staging repository 的時候把 -SNAPSHOT 字串拿掉

(第一次 release 才需要以下的動作)

然後你就可以在 https://oss.sonatype.org/#stagingRepositories
找到你剛剛 deploy 的 library
通常長得像是 wsvinta-1000(前面是 groupId)
要把它 close
然後再 release

除了第一次 release 要去 ticket 留言之外
之後 release 就會自動同步到 Maven Central Repository
不過通常會需要等一陣子才會在 Maven 上看到

ref:
http://dev.solita.fi/2014/10/22/publishing-to-maven-central-repository.html
http://lkrnac.net/blog/2014/03/deploy-to-maven-central/
http://kirang89.github.io/blog/2013/01/20/uploading-your-jar-to-maven-central/
http://superwang.me/2014/03/22/publish-your-library-to-maven-central-repository-part-1/
http://www.kongch.com/2013/05/deploy-to-central-repo/

如果你在 release 的過程中出了錯
要重新 release 的話
你得 revert 你的 git commit 到執行 mvn release:prepare 之前
然後再重新跑一次

Maven: The De Facto Build Tool for JVM Projects

Maven: The De Facto Build Tool for JVM Projects

Install

# on Mac OS X
$ brew install maven
$ brew install maven-completion

ref:
https://maven.apache.org/index.html

Commands

# create project: interactive mode
$ mvn archetype:generate \
-DarchetypeArtifactId=maven-archetype-quickstart \
-DinteractiveMode=true

# create project: non-interactive mode
$ mvn archetype:generate \
-DarchetypeArtifactId=maven-archetype-quickstart \
-DinteractiveMode=false \
-DgroupId=ws.vinta.pangu \
-DartifactId=pangu

# download dependencies
$ mvn dependency:copy-dependencies

# download dependencies to a specific directory
$ mvn dependency:copy-dependencies -DoutputDirectory=jars

# analyze unused dependencies
$ mvn dependency:analyze

$ mvn compile

$ mvn test

# run a specific class
$ mvn exec:java -Dexec.mainClass="pangu_example.App"

# package a JAR
$ mvn package

# 提交到 central repository 之前可以用這個來測試一下安裝有沒有問題
$ mvn clean install

build lifecycle
http://openhome.cc/Gossip/JUnit/BuildLifeCycle.html

  • src/main/java 放置專案原始碼
  • src/test/java 放置單元測試用原始碼
  • src/main/resources 放置設定檔,例如 log4j.properties
  • src/test/resources 放置測試用設定檔,如同測試程式本身不會被打包進 jar

Configuration

in pom.xml

放 per project 的設定

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.sonatype.oss</groupId>
    <artifactId>oss-parent</artifactId>
    <version>7</version>
  </parent>
  <groupId>ws.vinta</groupId>
  <artifactId>pangu</artifactId>
  <version>1.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>Pangu</name>
  <description>Paranoid text spacing for good readability, to insert whitespace between CJK (Chinese, Japanese, Korean), half-width English, digit and symbol characters automatically.</description>
  <url>https://github.com/vinta/pangu.java</url>
  <inceptionYear>2014</inceptionYear>
  <licenses>
    <license>
      <name>MIT License</name>
      <url>http://www.opensource.org/licenses/mit-license.php</url>
      <distribution>repo</distribution>
    </license>
  </licenses>
  <developers>
    <developer>
      <id>vinta</id>
      <name>Vinta</name>
      <email>[email protected]</email>
      <url>http://vinta.ws/</url>
    </developer>
  </developers>
  <scm>
    <connection>scm:git:[email protected]:vinta/pangu.java.git</connection>
    <developerConnection>scm:git:[email protected]:vinta/pangu.java.git</developerConnection>
    <url>[email protected]:vinta/pangu.java.git</url>
    <tag>HEAD</tag>
  </scm>
  <issueManagement>
    <system>GitHub Issues</system>
    <url>https://github.com/vinta/pangu.java/issues</url>
  </issueManagement>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <distributionManagement>
    <snapshotRepository>
      <id>ossrh</id>
      <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    </snapshotRepository>
    <repository>
      <id>ossrh</id>
      <url>https://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
    </repository>
  </distributionManagement>
  <profiles>
    <profile>
      <id>release</id>
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-gpg-plugin</artifactId>
            <version>1.5</version>
            <executions>
              <execution>
                <id>sign-artifacts</id>
                <phase>verify</phase>
                <goals>
                  <goal>sign</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </profile>
  </profiles>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.2</version>
        <configuration>
          <source>${maven.compile.source}</source>
          <target>${maven.compile.target}</target>
          <optimize>${maven.compile.optimize}</optimize>
          <encoding>UTF8</encoding>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-source-plugin</artifactId>
        <version>2.4</version>
        <executions>
          <execution>
            <id>attach-sources</id>
            <goals>
              <goal>jar-no-fork</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-javadoc-plugin</artifactId>
        <version>2.10.1</version>
        <executions>
          <execution>
            <id>attach-javadocs</id>
            <goals>
              <goal>jar</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.sonatype.plugins</groupId>
        <artifactId>nexus-staging-maven-plugin</artifactId>
        <version>1.6.5</version>
        <extensions>true</extensions>
        <configuration>
          <serverId>ossrh</serverId>
          <nexusUrl>https://oss.sonatype.org/</nexusUrl>
          <autoReleaseAfterClose>true</autoReleaseAfterClose>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-release-plugin</artifactId>
        <version>2.5.1</version>
        <configuration>
          <autoVersionSubmodules>true</autoVersionSubmodules>
          <useReleaseProfile>false</useReleaseProfile>
          <releaseProfiles>release</releaseProfiles>
          <goals>deploy</goals>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

ref:
http://maven.apache.org/pom.html
https://github.com/vinta/pangu.java/blob/master/pom.xml

Maven 的 groupId 基本上只是用來標示這個 artifact 屬於哪一個 group
基本上就是用你的 domain name 就好了
跟 Java 的 package 路徑沒有關係

in settings.xml

放 global 的設定

ref:
http://maven.apache.org/ref/3.2.3/maven-settings/settings.html

maven-source-plugin

maven-javadoc-plugin

How to attach source and javadoc artifacts?
http://maven.apache.org/plugin-developers/cookbook/attach-source-javadoc-artifacts.html

$ mvn source:jar
$ mvn javadoc:jar
# or
$ mvn package

Find packages

ref:
https://search.maven.org/
https://mvnrepository.com/

Issues

中文會是亂碼

<project>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
</project>

ref:
http://maven.apache.org/general.html#encoding-warning

新增額外的 Maven Repositories

in pom.xml

<project>
    <repositories>
        <repository>
            <id>spark-packages</id>
            <name>Spark Packages Repository</name>
            <url>https://dl.bintray.com/spark-packages/maven/</url>
        </repository>
    </repositories>
</project>

ref:
https://maven.apache.org/guides/mini/guide-multiple-repositories.html

允許下載 SNAPSHOT 版本的 libraries

in ~/.m2/settings.xml

<settings>
  <profiles>
    <profile>
      <id>allow-snapshots</id>
      <activation>
        <activeByDefault>true</activeByDefault>
      </activation>
      <repositories>
        <repository>
          <id>ossrh-snapshots-repo</id>
          <url>https://oss.sonatype.org/content/repositories/snapshots</url>
          <releases>
            <enabled>false</enabled>
          </releases>
          <snapshots>
            <enabled>true</enabled>
          </snapshots>
        </repository>
      </repositories>
    </profile>
  </profiles>
</settings>

你可以新增多個 snapshots repo 的來源

Could not find artifact com.sun:tools:jar

ref:
http://maven.apache.org/general.html#tools-jar-dependency