Play with GitHub Archive dataset on BigQuery

Google BigQuery is a web service for interactive analysis of massive datasets: it can scan billions of rows in seconds.
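
For instance, counting a month of GitHub events by type is a single short query (a sketch; the githubarchive tables are described below):

SELECT type, COUNT(*) AS events
FROM `githubarchive.month.201708`
GROUP BY type
ORDER BY events DESC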

ref:
https://www.githubarchive.org/#bigquery
https://bigquery.cloud.google.com/table/githubarchive:month.201612

See also:
http://ghtorrent.org/

Show repository information (1)

WITH repo_info AS (
  SELECT repo.id AS id, repo.name AS name, JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.description') AS description
  FROM `githubarchive.month.2017*`
  -- FROM `githubarchive.year.2016`
  -- FROM `githubarchive.year.*`
  WHERE type = "PullRequestEvent"
)

SELECT repo_info.name, ANY_VALUE(repo_info.description) AS description
FROM repo_info
WHERE
  repo_info.description IS NOT NULL AND
  repo_info.description != ""
GROUP BY repo_info.name
ORDER BY repo_info.name

ref:
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#json-functions
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#any_value

Show repository information (2)

WITH repo_info AS (
  SELECT repo.id AS id, repo.name AS name, JSON_EXTRACT_SCALAR(payload, '$.description') AS description
  FROM `githubarchive.month.201708`
  WHERE type = "CreateEvent"
)

SELECT repo_info.name, ANY_VALUE(repo_info.description) AS description
FROM repo_info
WHERE
  repo_info.description IS NOT NULL AND
  repo_info.description != ""
GROUP BY repo_info.name
ORDER BY repo_info.name

Show repository information (3)

SELECT name, description
FROM `ghtorrent-bq.ght_2017_04_01.projects`
WHERE
  forked_from IS NULL AND
  description IS NOT NULL AND
  description != ""

Show starred repositories by a specific user

Note that starring a repo is recorded as a WatchEvent, so that's the event type to filter on:
https://developer.github.com/v3/activity/events/types/#watchevent

SELECT repo.name, created_at
FROM TABLE_QUERY([githubarchive:month], 'LEFT(table_id, 4) IN ("2017", "2016", "2015")')
WHERE type = "WatchEvent" AND actor.login = 'vinta'
GROUP BY repo.name, created_at
ORDER BY created_at DESC
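
The query above uses legacy SQL's TABLE_QUERY. A rough standard SQL equivalent filters on _TABLE_SUFFIX instead (a sketch, assuming the same githubarchive.month tables; it also collapses repeat stars into the latest one):

SELECT repo.name, MAX(created_at) AS starred_at
FROM `githubarchive.month.*`
WHERE type = "WatchEvent" AND
  actor.login = 'vinta' AND
  SUBSTR(_TABLE_SUFFIX, 1, 4) IN ("2017", "2016", "2015")
GROUP BY repo.name
ORDER BY starred_at DESC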

Show starred repositories per user who has 10+ starred repositories

WITH stars AS (
  SELECT DISTINCT actor.login AS user, repo.name AS repo
  FROM `githubarchive.month.2017*`
  WHERE type = "WatchEvent"
),
repositories_stars AS (
  SELECT repo, COUNT(*) AS c
  FROM stars
  GROUP BY repo
  ORDER BY c DESC
  LIMIT 1000
),
users_stars AS (
  SELECT user, COUNT(*) AS c
  FROM stars
  WHERE repo IN (SELECT repo FROM repositories_stars)
  GROUP BY user
  HAVING c >= 10
  LIMIT 10000
)
SELECT user, repo
FROM stars
WHERE repo IN (SELECT repo FROM repositories_stars) AND
  user IN (SELECT user FROM users_stars)

ref:
https://gist.github.com/jbochi/2e8ddcc5939e70e5368326aa034a144e

Use Makefile as the task runner for arbitrary projects

Pass arguments

# in Makefile

.PHONY: something
something:
ifeq ($(var),foo)
    @echo $(var) "bar"
else
    @echo "others"
endif

$ make something var=foo
foo bar

$ make something
others

ref:
https://stackoverflow.com/questions/2214575/passing-arguments-to-make-run

Detect OS

# in Makefile

ifeq ($(shell uname),Darwin)
    UPDATE_COMMAND = brew update
else
    UPDATE_COMMAND = apt-get update
endif
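
The variable can then be used in a target; a minimal sketch (the update target name is arbitrary):

.PHONY: update
update:
    $(UPDATE_COMMAND)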

Check whether a file or directory exists

# in Makefile

ifneq ($(wildcard /usr/local/HAL-9000/bin/hal),)
    UP_COMMAND = /usr/local/HAL-9000/bin/hal up
else
    UP_COMMAND = docker-compose -f docker-compose.yml -f docker-compose.linux.yml up
endif

up:
    $(UP_COMMAND)

install:
    pip install -r requirements_dev.txt

ref:
http://stackoverflow.com/questions/20763629/test-whether-a-directory-exists-inside-a-makefile

What does .PHONY mean?

Let's assume you have an install target, which is very common in Makefiles. If you do not use .PHONY and a file named install exists in the same directory as the Makefile, then make install will do nothing: make sees an up-to-date file named install and skips the recipe.
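
A minimal sketch: declaring the target as .PHONY makes make always run the recipe, whether or not a file named install exists.

# in Makefile

.PHONY: install
install:
    pip install -r requirements.txt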

ref:
http://stackoverflow.com/questions/2145590/what-is-the-purpose-of-phony-in-a-makefile

Call another target in the same Makefile

Say that coverage depends on clean.

.PHONY: clean
clean:
    find . -regex "\(.*__pycache__.*\|.*\.py[co]\)" -delete

.PHONY: coverage
coverage: clean
    docker exec -i -t streetvoice_django_1 python -m coverage run manage.py test --failfast
    docker exec -i -t streetvoice_django_1 python -m coverage html -i

ref:
http://stackoverflow.com/questions/13337727/how-do-i-make-a-target-in-a-makefile-invoke-another-target-in-the-makefile

Setup Spark, Scala and Maven with Intellij IDEA

You can browse a complete IntelliJ IDEA project on GitHub:
https://github.com/vinta/albedo

Plugins:

  • Scala

Initiate a Maven project

$ mvn archetype:generate
Choose a number: xxx
xxx: remote -> net.alchim31.maven:scala-archetype-simple

ref:
https://docs.scala-lang.org/tutorials/scala-with-maven.html

Basic configurations

// in pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>ws.vinta</groupId>
    <artifactId>albedo</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <name>${project.artifactId}</name>
    <description>A recommender system for discovering GitHub repos</description>
    <inceptionYear>2017</inceptionYear>
    <properties>
        <java.version>1.8</java.version>
        <scala.version>2.11.8</scala.version>
        <scala.compactVersion>2.11</scala.compactVersion>
        <spark.version>2.2.0</spark.version>
    </properties>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.compactVersion}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.compactVersion}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.11 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.compactVersion}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.42</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11 -->
        <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>spark-avro_${scala.compactVersion}</artifactId>
            <version>3.2.0</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.1</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

ref:
https://davidb.github.io/scala-maven-plugin/example_compile.html

// in YourSparkApp.scala
package ws.vinta.albedo

import org.apache.spark.sql.SparkSession

object YourSparkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("YourSparkApp").getOrCreate()

    println("FUCK YEAH")

    spark.stop()
  }
}

ref:
https://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started

Run a Spark application (via Spark JARs)

You could also specify a custom log4j.properties:

$ cd PROJECT_ROOT
$ cp $SPARK_HOME/conf/log4j.properties.template log4j.properties
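
For example, you might reduce logging noise by editing the copied file (a sketch; the template's default root category is INFO):

# in log4j.properties
log4j.rootCategory=WARN, console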

Follow "Run > Edit Configurations":

  • VM options: -Dspark.master=local[*] -Dlog4j.configuration=file:./log4j.properties
  • Program arguments: -u vinta
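
Inside the application, those program arguments arrive as the args array of main(); a minimal sketch (this -u handling is hypothetical, not taken from the actual project):

// in YourSparkApp.scala, inside main()
// hypothetical parsing of a -u <username> flag
val i = args.indexOf("-u")
val username = if (i >= 0 && i + 1 < args.length) args(i + 1) else "unknown"
println(s"username: $username")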

ref:
https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Config-log4j-in-Spark/td-p/34968

Run a Spark application (via spark-submit)

TODO

Generate a JAR

$ mvn clean package

Or

Follow "File > Project Structure > Project Settings > Artifacts -> + > JAR > From modules with dependencies":

  • Main Class: ws.vinta.albedo.GitHubCorpusTrainer
  • JAR files from libraries: Extract to the target Jar

You can either check "Include in project build" or manually click "Build > Build Artifacts" every time.

# specify different classes in the same JAR
$ spark-submit \
--master spark://localhost:7077 \
--class ws.vinta.albedo.LogisticRegressionTrainer \
out/artifacts/albedo_jar/albedo.jar -u vinta

$ spark-submit \
--master spark://localhost:7077 \
--class ws.vinta.albedo.GitHubCorpusTrainer \
out/artifacts/albedo_jar/albedo.jar

ref:
https://stackoverflow.com/questions/1082580/how-to-build-jars-from-intellij-properly
https://spark.apache.org/docs/latest/submitting-applications.html