Access SparkSession
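A minimal sketch of getting (or creating) a session; the app name is illustrative:

    from pyspark.sql import SparkSession

    # Reuse the running session if there is one, otherwise create it
    spark = (SparkSession.builder
             .appName("example")
             .getOrCreate())
    sc = spark.sparkContext  # the underlying SparkContext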
Access custom configurations
ref:
https://stackoverflow.com/questions/31115881/how-to-load-java-properties-file-and-use-in-spark
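A sketch, assuming custom keys were passed via spark-submit --conf; the key names and default value below are illustrative:

    # spark-submit --conf spark.myapp.input_path=/tmp/data ...
    input_path = spark.conf.get("spark.myapp.input_path", "/tmp/default")
    # or read it from the SparkContext's SparkConf
    input_path = spark.sparkContext.getConf().get("spark.myapp.input_path")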
Create an RDD
Read data from MySQL
ref:
https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html
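A minimal JDBC read, assuming the MySQL Connector/J jar is on the classpath (e.g. via --jars); host, database, table and credentials are placeholders:

    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/mydb?useSSL=false")
          .option("driver", "com.mysql.jdbc.Driver")
          .option("dbtable", "my_table")
          .option("user", "username")
          .option("password", "password")
          .load())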
Write data to MySQL
ref:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.jdbc
https://stackoverflow.com/questions/2993251/jdbc-batch-insert-performance/10617768#10617768
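A sketch of a JDBC write; rewriteBatchedStatements=true (see the second reference) lets MySQL batch the inserts. Connection details are placeholders:

    (df.write
       .format("jdbc")
       .option("url", "jdbc:mysql://localhost:3306/mydb"
                      "?useSSL=false&rewriteBatchedStatements=true")
       .option("driver", "com.mysql.jdbc.Driver")
       .option("dbtable", "my_table")
       .option("user", "username")
       .option("password", "password")
       .mode("append")
       .save())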
Read data from SQLite
ref:
https://github.com/xerial/sqlite-jdbc
java.text.ParseException: Unparseable date: "2016-04-22 17:26:54"
https://github.com/xerial/sqlite-jdbc/issues/88
Read data from parquet
ref:
https://community.hortonworks.com/articles/21303/write-read-parquet-file-in-spark.html
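A minimal sketch; the paths are illustrative:

    df = spark.read.parquet("/path/to/input.parquet")
    df.write.mode("overwrite").parquet("/path/to/output.parquet")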
Create a DataFrame
Create a DataFrame with explicit schema
ref:
https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md
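A sketch with an illustrative schema and rows:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("name", StringType(), nullable=True),
        StructField("age", IntegerType(), nullable=True),
    ])
    df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], schema=schema)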
Create a nested schema
ref:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types
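A sketch with an illustrative struct column ("address") and array column ("scores"):

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   IntegerType, ArrayType)

    schema = StructType([
        StructField("name", StringType()),
        StructField("address", StructType([
            StructField("city", StringType()),
            StructField("zip", StringType()),
        ])),
        StructField("scores", ArrayType(IntegerType())),
    ])
    df = spark.createDataFrame(
        [("Alice", ("Taipei", "100"), [1, 2, 3])], schema=schema)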
Change schema of a DataFrame
Get the number of partitions
Split a DataFrame into chunks (partitions)
ref:
https://stackoverflow.com/questions/24898700/batching-within-an-apache-spark-rdd-map
https://stackoverflow.com/questions/35370826/using-spark-for-sequential-row-by-row-processing-without-map-and-reduce
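One way to do this, per the references: repartition and process each partition with mapPartitions. The partition count and the handler below are illustrative:

    def handle_partition(rows):
        batch = list(rows)   # all rows of one partition
        # ... process the batch ...
        return [len(batch)]  # e.g. return the chunk size

    chunk_sizes = (df.repartition(8)
                     .rdd
                     .mapPartitions(handle_partition)
                     .collect())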
Show a DataFrame
Create a column with a literal value
Return a fraction of a DataFrame
Show distinct values of a column
ref:
https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Rename columns
Convert a column to double type
ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast
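A sketch; the column name is illustrative:

    from pyspark.sql.types import DoubleType

    df = df.withColumn("price", df["price"].cast(DoubleType()))
    # the string form df["price"].cast("double") works too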
Update a column based on conditions
ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.when
http://stackoverflow.com/questions/34908448/spark-add-column-to-dataframe-conditionally
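A sketch using when/otherwise; the column names and threshold are illustrative:

    from pyspark.sql import functions as F

    df = df.withColumn(
        "label",
        F.when(df["score"] >= 0.5, "positive").otherwise("negative"))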
Drop columns from a DataFrame
ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropna
DataFrame subtract another DataFrame
ref:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.subtract
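A sketch; df1 and df2 are assumed to share the same schema:

    # rows of df1 that do not appear in df2
    diff_df = df1.subtract(df2)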
Convert a DataFrame column into a Python list
Concatenate (merge) two DataFrames
Convert a DataFrame to a Python dict
Compute (approximate or exact) median of a numerical column
ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.approxQuantile
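A sketch; the column name is illustrative. The third argument of approxQuantile is the relative error, and 0.0 computes the exact quantile at a higher cost:

    approx_median = df.approxQuantile("price", [0.5], 0.25)[0]
    exact_median = df.approxQuantile("price", [0.5], 0.0)[0]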
Find frequent items for columns
ref:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.freqItems
Broadcast a value
ref:
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
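A sketch; the lookup dict is illustrative:

    # ship a read-only value to every executor once
    lookup = spark.sparkContext.broadcast({"a": 1, "b": 2})

    rdd = spark.sparkContext.parallelize(["a", "b", "a"])
    counts = rdd.map(lambda k: lookup.value.get(k, 0)).collect()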
Broadcast a DataFrame in join
ref:
http://stackoverflow.com/questions/34053302/pyspark-and-broadcast-join-example
https://chapeau.freevariable.com/2014/09/improving-spark-application-performance.html
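A sketch; df_small, df_large and the join key are illustrative. The broadcast() hint ships the small DataFrame to every executor so the join avoids shuffling the large one:

    from pyspark.sql.functions import broadcast

    joined = df_large.join(broadcast(df_small), on="id", how="inner")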
Cache a DataFrame
ref:
http://stackoverflow.com/questions/38056774/spark-cache-vs-broadcast
Show query execution plan
Use SQL to query a DataFrame
ref:
https://sparkour.urizone.net/recipes/using-sql-udf/
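A sketch; the view name and query are illustrative:

    df.createOrReplaceTempView("people")
    adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")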
WHERE ... IN ...
ORDER BY multiple columns
Aggregate
SELECT COUNT(DISTINCT xxx) ...
ref:
http://stackoverflow.com/questions/40888946/spark-dataframe-count-distinct-values-of-every-column
SELECT MAX(xxx) ... GROUP BY
ref:
http://stackoverflow.com/questions/30616380/spark-how-to-count-number-of-records-by-key
SELECT COUNT(*) ... GROUP BY
You may want to use approx_count_distinct.
GROUP_CONCAT a column
GROUP_CONCAT multiple columns
SELECT ... RANK() OVER (PARTITION BY ... ORDER BY)
ref:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#window-functions
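A sketch; the partition and order columns are illustrative:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # rank rows within each department by salary, highest first
    w = Window.partitionBy("department").orderBy(F.desc("salary"))
    ranked = df.withColumn("rank", F.rank().over(w))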
Left anti join / Left excluding join
ref:
https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
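A sketch; the join key is illustrative:

    # keep rows of df1 whose key has no match in df2
    unmatched = df1.join(df2, on="id", how="left_anti")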
Outer join
ref:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-joins.html
Cross join
ref:
http://stackoverflow.com/questions/5464131/finding-pairs-that-do-not-exist-in-a-different-table