{"id":454,"date":"2017-10-25T01:23:14","date_gmt":"2017-10-24T17:23:14","guid":{"rendered":"https:\/\/vinta.ws\/code\/?p=454"},"modified":"2026-03-17T01:18:52","modified_gmt":"2026-03-16T17:18:52","slug":"spark-best-practices","status":"publish","type":"post","link":"https:\/\/vinta.ws\/code\/spark-best-practices.html","title":{"rendered":"Spark best practices"},"content":{"rendered":"<h2>Architecture<\/h2>\n<p>A Spark application is a set of processes running on a cluster; all of these processes are coordinated by the driver program. The driver program is 1) the process where the <code>main()<\/code> method of your program runs and 2) the process running the code that creates a SparkSession, RDDs, DataFrames, and stages up or sends off transformations and actions.<\/p>\n<p>The processes that run computations and store data for your application are executors. Executors 1) return computed results to the driver program and 2) provide in-memory storage for cached RDDs\/DataFrames.<\/p>\n<p>Execution of a Spark program:<\/p>\n<ul>\n<li>The driver program runs the Spark application, which creates a SparkSession upon start-up.<\/li>\n<li>The SparkSession connects to a cluster manager (e.g., YARN\/Mesos) which allocates resources.<\/li>\n<li>Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. 
Actions happen on the workers.<\/li>\n<li>The driver program sends your application code (e.g., functions applied in <code>map()<\/code>) to the executors.<\/li>\n<li>Transformations and actions are queued up and optimized by the driver program, sent to executors to run, and then the executors send results back to the driver program.<\/li>\n<\/ul>\n<p>Terminology:<\/p>\n<ul>\n<li>Driver program: the master node; it creates the SparkSession, and your Spark app's <code>main()<\/code> runs on it.<\/li>\n<li>Worker node: every machine in the cluster other than the driver; these are the machines that actually execute the distributed computation. Basically, one machine is one worker (a.k.a. slave).<\/li>\n<li>Executor: one of the processes on a worker node; typically one core corresponds to one executor.<\/li>\n<li>Cluster manager: manages resources, usually YARN. The driver program and the workers communicate through the cluster manager.<\/li>\n<\/ul>\n<p>ref:<br \/>\n<a href=\"https:\/\/spark.apache.org\/docs\/latest\/cluster-overview.html\">https:\/\/spark.apache.org\/docs\/latest\/cluster-overview.html<\/a><br \/>\n<a href=\"https:\/\/databricks.com\/blog\/2016\/06\/22\/apache-spark-key-terms-explained.html\">https:\/\/databricks.com\/blog\/2016\/06\/22\/apache-spark-key-terms-explained.html<\/a><\/p>\n<h2>Resilient Distributed Dataset (RDD)<\/h2>\n<p>RDDs are divided into partitions: each partition can be considered as an immutable subset of the entire RDD. 
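A rough sketch of the pieces described above (the object name, app name, and local master URL are illustrative, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    // The driver program creates the SparkSession on start-up;
    // in production the master URL would point at a cluster manager.
    val spark = SparkSession.builder()
      .appName("partition-demo")
      .master("local[*]")
      .getOrCreate()

    // parallelize() splits the data into partitions,
    // each of which is processed by an executor.
    val rdd = spark.sparkContext.parallelize(1 to 1000)
    println(rdd.getNumPartitions)

    spark.stop()
  }
}
```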
When you execute your Spark program, each partition gets sent to a worker.<\/p>\n<p>Operations on an RDD fall into two kinds:<\/p>\n<ul>\n<li>Every operation that returns an RDD is a transformation\n<ul>\n<li>It lazily builds a new RDD, e.g. <code>.map()<\/code>, <code>.flatMap()<\/code>, <code>.filter()<\/code><\/li>\n<\/ul>\n<\/li>\n<li>Every operation that does not return an RDD is an action\n<ul>\n<li>It eagerly executes, e.g. <code>reduce()<\/code>, <code>count()<\/code>, <code>.collect()<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Take <code>someRDD.foreach(println)<\/code> as an example: <code>foreach<\/code> is an action because it returns <code>Unit<\/code>. <code>println<\/code> actually prints to the executor's stdout, so you will see nothing at all in the driver program, unless you change it to <code>someRDD.collect().foreach(println)<\/code>.<\/p>\n<p>Take <code>someRDD.take(10)<\/code> as an example: <code>take(10)<\/code> is actually computed on the executors, but the 10 results are sent back to the driver program.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/zhangyi.gitbooks.io\/spark-in-action\/content\/chapter2\/rdd.html\">https:\/\/zhangyi.gitbooks.io\/spark-in-action\/content\/chapter2\/rdd.html<\/a><\/p>\n<p>Take <code>logRDD.filter(line =&gt; line.contains(&quot;ERROR&quot;)).take(10)<\/code> as an example: the benefit of <code>filter()<\/code> being lazy is that Spark knows only <code>take(10)<\/code> is needed at the end, so once the filter has collected 10 matching lines it can stop executing; there is no need to filter the entire dataset.<\/p>\n<p>Take <code>logRDD.map(_.toLowerCase).filter(_.contains(&quot;error&quot;)).count()<\/code> as an example: although <code>map()<\/code> and <code>filter()<\/code> are two separate operations, because both are lazily evaluated, Spark can decide at the <code>count()<\/code> stage that it can apply <code>.toLowerCase<\/code> and <code>.contains(&quot;error&quot;)<\/code> together while reading each log line.<\/p>\n<p>By default, RDDs are recomputed each time you run an action on them! Spark allows you to control what should be cached in memory.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.coursera.org\/learn\/scala-spark-big-data\/lecture\/0GZV7\/evaluation-in-spark-unlike-scala-collections\">https:\/\/www.coursera.org\/learn\/scala-spark-big-data\/lecture\/0GZV7\/evaluation-in-spark-unlike-scala-collections<\/a><\/p>\n<h2>Shuffling<\/h2>\n<p>Moving data from one node to another across the network is called &quot;shuffling&quot;. A shuffle can occur when the resulting RDD depends on other elements from the same RDD or from another RDD. You can also figure out whether a shuffle has been planned via 1) the return type of certain transformations (e.g. 
<code>ShuffledRDD<\/code>); 2) using <code>toDebugString<\/code> to see its execution plan.<\/p>\n<p>Operations that might cause a shuffle:<\/p>\n<ul>\n<li>cogroup<\/li>\n<li>groupWith<\/li>\n<li>join<\/li>\n<li>leftOuterJoin<\/li>\n<li>rightOuterJoin<\/li>\n<li>groupByKey<\/li>\n<li>reduceByKey<\/li>\n<li>combineByKey<\/li>\n<li>distinct<\/li>\n<li>intersection<\/li>\n<li>repartition<\/li>\n<li>coalesce<\/li>\n<\/ul>\n<p>Narrow dependencies:<\/p>\n<ul>\n<li>map<\/li>\n<li>mapValues<\/li>\n<li>flatMap<\/li>\n<li>filter<\/li>\n<li>mapPartitions<\/li>\n<li>mapPartitionsWithIndex<\/li>\n<\/ul>\n<p>Wide dependencies:<\/p>\n<ul>\n<li>cogroup<\/li>\n<li>groupWith<\/li>\n<li>join<\/li>\n<li>leftOuterJoin<\/li>\n<li>rightOuterJoin<\/li>\n<li>groupByKey<\/li>\n<li>reduceByKey<\/li>\n<li>combineByKey<\/li>\n<li>distinct<\/li>\n<li>intersection<\/li>\n<\/ul>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.coursera.org\/learn\/scala-spark-big-data\/lecture\/bT1YR\/shuffling-what-it-is-and-why-its-important\">https:\/\/www.coursera.org\/learn\/scala-spark-big-data\/lecture\/bT1YR\/shuffling-what-it-is-and-why-its-important<\/a><\/p>\n<h2>Partitioning<\/h2>\n<p>Partitions never span multiple machines; data in the same partition is guaranteed to be on the same machine. The default number of partitions is the total number of cores on all executor nodes.<\/p>\n<p>Customizing partitions is only possible when working with a Pair RDD, because partitioning is done based on keys. 
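A sketch of custom partitioning on a Pair RDD (assuming an existing SparkSession named `spark`; the keys and the partition count are made up):

```scala
import org.apache.spark.HashPartitioner

// Assumes an existing SparkSession named `spark`.
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Re-partition by key, then persist so the shuffle is not repeated
// every time an action runs on `partitioned`.
val partitioned = pairs.partitionBy(new HashPartitioner(8)).persist()
println(partitioned.partitioner)
```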
The most important thing is that you must <code>cache()<\/code> or <code>persist()<\/code> your RDDs after re-partitioning.<\/p>\n<p>The following operations hold on to (and propagate) a partitioner:<\/p>\n<ul>\n<li>cogroup<\/li>\n<li>groupWith<\/li>\n<li>groupByKey<\/li>\n<li>reduceByKey<\/li>\n<li>foldByKey<\/li>\n<li>combineByKey<\/li>\n<li>partitionBy<\/li>\n<li>join<\/li>\n<li>sort<\/li>\n<li>mapValues (if parent has a partitioner)<\/li>\n<li>flatMapValues (if parent has a partitioner)<\/li>\n<li>filter (if parent has a partitioner)<\/li>\n<\/ul>\n<p>All other operations (e.g. <code>map()<\/code>) will produce a result without a partitioner.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.coursera.org\/learn\/scala-spark-big-data\/lecture\/Vkhm0\/partitioning\">https:\/\/www.coursera.org\/learn\/scala-spark-big-data\/lecture\/Vkhm0\/partitioning<\/a><\/p>\n<p>You can chain a call to <code>.repartition(n)<\/code> after reading the text file to specify a different, larger number of partitions. For example, you might set this to match the number of cores in your cluster.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.safaribooksonline.com\/library\/view\/learning-apache-spark\/9781785885136\/ch01s04.html\">https:\/\/www.safaribooksonline.com\/library\/view\/learning-apache-spark\/9781785885136\/ch01s04.html<\/a><br \/>\n<a href=\"https:\/\/umbertogriffo.gitbooks.io\/apache-spark-best-practices-and-tuning\/content\/sparksqlshufflepartitions_draft.html\">https:\/\/umbertogriffo.gitbooks.io\/apache-spark-best-practices-and-tuning\/content\/sparksqlshufflepartitions_draft.html<\/a><\/p>\n<p>It's really important to repartition your dataset if you are going to cache it and use it for queries. 
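A sketch of repartitioning before caching (the path and the number 200 are illustrative only, and an existing SparkSession named `spark` is assumed):

```scala
// Assumes an existing SparkSession named `spark`.
val df = spark.read.parquet("hdfs:///some/dataset")  // illustrative path
println(df.rdd.getNumPartitions)

// Repartition, then cache so that subsequent queries
// reuse the repartitioned data instead of re-shuffling.
val repartitioned = df.repartition(200).cache()
```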
The optimal number of partitions is around 4-6 per executor core; with 40 nodes and 6 executor cores each, we use 1000 partitions for best performance.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/databricks.com\/blog\/2015\/10\/13\/interactive-audience-analytics-with-apache-spark-and-hyperloglog.html\">https:\/\/databricks.com\/blog\/2015\/10\/13\/interactive-audience-analytics-with-apache-spark-and-hyperloglog.html<\/a><br \/>\n<a href=\"https:\/\/stackoverflow.com\/questions\/40416357\/spark-sql-difference-between-df-repartition-and-dataframewriter-partitionby\">https:\/\/stackoverflow.com\/questions\/40416357\/spark-sql-difference-between-df-repartition-and-dataframewriter-partitionby<\/a><\/p>\n<h2>Dataset<\/h2>\n<p>The Dataset transformation API is split into untyped and typed variants. Untyped APIs return a DataFrame and lose the type information; typed APIs return a Dataset. For an API like <code>select()<\/code>, adding explicit type casts, e.g. <code>ds.select($&quot;name&quot;.as[String], $&quot;age&quot;.as[Int])<\/code>, makes it return a Dataset instead of a DataFrame.<\/p>\n<p>Untyped transformations:<\/p>\n<ul>\n<li>Almost all of the transformations available on DataFrame, e.g. <code>groupBy()<\/code><\/li>\n<\/ul>\n<p>Typed transformations:<\/p>\n<ul>\n<li>map()<\/li>\n<li>flatMap()<\/li>\n<li>reduce()<\/li>\n<li>groupByKey()<\/li>\n<li>agg()<\/li>\n<li>mapGroups()<\/li>\n<li>flatMapGroups()<\/li>\n<li>reduceGroups()<\/li>\n<\/ul>\n<p>Datasets don't get all of the optimizations that DataFrames get!<\/p>\n<p>Prefer the Spark SQL style (i.e. relational operations) such as <code>filter($&quot;city&quot;.as[String] === &quot;Boston&quot;)<\/code> or <code>select($&quot;age&quot;.as[Int])<\/code>, and avoid higher-order functions (i.e. functional operations) such as <code>filter(p =&gt; p.city == &quot;Boston&quot;)<\/code> or <code>map()<\/code>: Catalyst cannot optimize the latter well.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.coursera.org\/learn\/scala-spark-big-data\/lecture\/yrfPh\/datasets\">https:\/\/www.coursera.org\/learn\/scala-spark-big-data\/lecture\/yrfPh\/datasets<\/a><br \/>\n<a href=\"https:\/\/www.51zero.com\/blog\/2016\/2\/24\/type-safety-on-spark-dataframes-part-1\">https:\/\/www.51zero.com\/blog\/2016\/2\/24\/type-safety-on-spark-dataframes-part-1<\/a><\/p>\n<h2>Encoders<\/h2>\n<p>Encoders are what convert your data between JVM objects and Spark SQL's specialized internal representation. They're required by all Datasets.<\/p>\n<p>Two ways to introduce Encoders:<\/p>\n<ul>\n<li>Automatically via <code>import spark.implicits._<\/code><\/li>\n<li>Explicitly via <code>org.apache.spark.sql.Encoder<\/code><\/li>\n<\/ul>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.coursera.org\/learn\/scala-spark-big-data\/lecture\/yrfPh\/datasets\">https:\/\/www.coursera.org\/learn\/scala-spark-big-data\/lecture\/yrfPh\/datasets<\/a><\/p>\n<h2>Broadcast<\/h2>\n<p>If a DataFrame is small enough to fit on a single machine, you can broadcast it.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/umbertogriffo.gitbooks.io\/apache-spark-best-practices-and-tuning\/content\/when_to_use_broadcast_variable.html\">https:\/\/umbertogriffo.gitbooks.io\/apache-spark-best-practices-and-tuning\/content\/when_to_use_broadcast_variable.html<\/a><\/p>\n<h2>Spark UI<\/h2>\n<p>If your dataset is large, you can try repartitioning to a larger number to allow more parallelism on your job. 
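The broadcast idea from the previous section can be sketched with the `broadcast` function from Spark SQL (all table names and data here are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-demo").master("local[*]").getOrCreate()
import spark.implicits._

val events = Seq(("us", 100), ("tw", 42)).toDF("country_code", "clicks")  // pretend this is large
val countries = Seq(("us", "United States"), ("tw", "Taiwan")).toDF("country_code", "name")  // small

// Hint Spark to ship `countries` to every executor
// instead of shuffling both sides of the join.
val joined = events.join(broadcast(countries), Seq("country_code"))
joined.show()
```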
A good indication of this in the Spark UI is when you don't have a lot of tasks, but each task is very slow to complete.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/databricks.com\/blog\/2015\/06\/22\/understanding-your-spark-application-through-visualization.html\">https:\/\/databricks.com\/blog\/2015\/06\/22\/understanding-your-spark-application-through-visualization.html<\/a><br \/>\n<a href=\"https:\/\/databricks.com\/blog\/2016\/10\/18\/7-tips-to-debug-apache-spark-code-faster-with-databricks.html\">https:\/\/databricks.com\/blog\/2016\/10\/18\/7-tips-to-debug-apache-spark-code-faster-with-databricks.html<\/a><br \/>\n<a href=\"https:\/\/docs.databricks.com\/spark\/latest\/rdd-streaming\/debugging-streaming-applications.html\">https:\/\/docs.databricks.com\/spark\/latest\/rdd-streaming\/debugging-streaming-applications.html<\/a><\/p>\n<h2>Configurations<\/h2>\n<p>ref:<br \/>\n<a href=\"https:\/\/spark.apache.org\/docs\/latest\/configuration.html\">https:\/\/spark.apache.org\/docs\/latest\/configuration.html<\/a><br \/>\n<a href=\"https:\/\/spark.apache.org\/docs\/latest\/tuning.html#memory-tuning\">https:\/\/spark.apache.org\/docs\/latest\/tuning.html#memory-tuning<\/a><\/p>\n<h3>My spark-defaults.conf<\/h3>\n<pre class=\"line-numbers\"><code class=\"language-txt\">spark.driver.maxResultSize       2g\nspark.jars.packages              com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41,org.apache.httpcomponents:httpclient:4.5.2,org.elasticsearch.client:elasticsearch-rest-high-level-client:5.6.2\nspark.kryoserializer.buffer.max  1g\nspark.serializer                 org.apache.spark.serializer.KryoSerializer<\/code><\/pre>\n<h2>Allocate Resources (Executors, Cores, and Memory)<\/h2>\n<p>When submitting a Spark application via <code>spark-submit<\/code>, you may specify the following options:<\/p>\n<ul>\n<li><code>--driver-memory MEM<\/code>: Memory for driver (e.g. 
1000M, 2G) (Default: 1024M).<\/li>\n<li><code>--executor-memory MEM<\/code>: Memory per executor (e.g. 1000M, 2G) (Default: 1G).<\/li>\n<\/ul>\n<p>YARN and Spark standalone only:<\/p>\n<ul>\n<li><code>--executor-cores NUM<\/code>: Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)<\/li>\n<\/ul>\n<p>YARN only:<\/p>\n<ul>\n<li><code>--driver-cores NUM<\/code>: Number of cores used by the driver, only in cluster mode (Default: 1).<\/li>\n<li><code>--num-executors NUM<\/code>: Number of executors to launch (Default: 2).<\/li>\n<\/ul>\n<p>Example calculation:<\/p>\n<p>driver: 2 cores, 7.5GB x 1<br \/>\nworker: 8 cores, 30GB x 4<\/p>\n<ul>\n<li>actual memory for driver = driver memory - (10% overhead x 2)\n<ul>\n<li>7.5 - (7.5 x 0.1 x 2) = 6<\/li>\n<li><code>spark.driver.memory=6g<\/code><\/li>\n<\/ul>\n<\/li>\n<li>cores per executor = 5\n<ul>\n<li>for good HDFS throughput<\/li>\n<li><code>spark.executor.cores=5<\/code><\/li>\n<li><code>--executor-cores 5<\/code><\/li>\n<\/ul>\n<\/li>\n<li>total cores in cluster = (cores per node - 1) x total nodes\n<ul>\n<li>leave 1 core per node for YARN daemons<\/li>\n<li>(8 - 1) x 4 = 28<\/li>\n<\/ul>\n<\/li>\n<li>total executors in cluster = (total cores in cluster \/ cores per executor) - 1\n<ul>\n<li>leave 1 executor for the ApplicationMaster<\/li>\n<li>(28 \/ 5) - 1 = 4<\/li>\n<li><code>spark.executor.instances=4<\/code><\/li>\n<li><code>--num-executors 4<\/code><\/li>\n<\/ul>\n<\/li>\n<li>executors per node = (total executors in cluster + 1) \/ total nodes\n<ul>\n<li>(4 + 1) \/ 4 = 1<\/li>\n<\/ul>\n<\/li>\n<li>memory per executor = (memory per node \/ executors per node) - (10% overhead x 2) - 3\n<ul>\n<li>(30 \/ 1) - (30 x 0.1 x 2) - 3 = 21<\/li>\n<li><code>spark.executor.memory=21g<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>ref:<br \/>\n<a 
href=\"https:\/\/spark.apache.org\/docs\/latest\/submitting-applications.html\">https:\/\/spark.apache.org\/docs\/latest\/submitting-applications.html<\/a><br \/>\n<a href=\"https:\/\/spark.apache.org\/docs\/latest\/configuration.html\">https:\/\/spark.apache.org\/docs\/latest\/configuration.html<\/a><br \/>\n<a href=\"https:\/\/spoddutur.github.io\/spark-notes\/distribution_of_executors_cores_and_memory_for_spark_application\">https:\/\/spoddutur.github.io\/spark-notes\/distribution_of_executors_cores_and_memory_for_spark_application<\/a><br \/>\n<a href=\"https:\/\/www.slideshare.net\/cloudera\/top-5-mistakes-to-avoid-when-writing-apache-spark-applications\/21\">https:\/\/www.slideshare.net\/cloudera\/top-5-mistakes-to-avoid-when-writing-apache-spark-applications\/21<\/a><br \/>\n<a href=\"http:\/\/blog.cloudera.com\/blog\/2015\/03\/how-to-tune-your-apache-spark-jobs-part-2\/\">http:\/\/blog.cloudera.com\/blog\/2015\/03\/how-to-tune-your-apache-spark-jobs-part-2\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A Spark application is a set of processes running on a cluster; all of these processes are coordinated by the driver program. 
The driver program is 1) the process where the main() method of your program runs and 2) the process running the code that creates a SparkSession, RDDs, DataFrames, and stages up or sends off transformations and actions.<\/p>\n","protected":false},"author":1,"featured_media":455,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[112],"tags":[108,109],"class_list":["post-454","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-about-big-data","tag-apache-spark","tag-scala"],"_links":{"self":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts\/454","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/comments?post=454"}],"version-history":[{"count":0,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts\/454\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/media\/455"}],"wp:attachment":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/media?parent=454"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/categories?post=454"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/tags?post=454"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}