Spark troubleshooting

Spark troubleshooting

Apache Spark 2.x Troubleshooting Guide
https://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications
https://www.slideshare.net/jcmia1/apache-spark-20-tuning-guide

Check your cluster UI to ensure that workers are registered and have sufficient resources

PYSPARK_DRIVER_PYTHON="jupyter" \
PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip 0.0.0.0" \
pyspark \
--packages "org.xerial:sqlite-jdbc:3.16.1,com.github.fommil.netlib:all:1.1.2" \
--driver-memory 4g \
--executor-memory 20g \
--master spark://TechnoCore.local:7077
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

可能是你指定的 --executor-memory 超過了 worker 的 memory。

你可以在 Spark Master UI http://localhost:8080/ 看到各個 worker 總共有多少 memory 可以用。如果每台 worker 可以用的 memory 容量不同,Spark 就只會選擇那些 memory 大於 --executor-memory 的 workers。

ref:
https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application

SparkContext was shut down

ERROR Executor: Exception in task 1.0 in stage 6034.0 (TID 21592)
java.lang.StackOverflowError
...
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(55,1494185401195,JobFailed(org.apache.spark.SparkException: Job 55 cancelled because SparkContext was shut down))

可能是 executor 的記憶體不夠,導致 Out Of Memory (OOM) 了。

ref:
http://stackoverflow.com/questions/32822948/sparkcontext-was-shut-down-while-running-spark-on-a-large-dataset

Container exited with a non-zero exit code 56 (or some other numbers)

WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1504241464590_0001_01_000002 on host: albedo-w-1.c.albedo-157516.internal. Exit status: 56. Diagnostics: Exception from container-launch.
Container id: container_1504241464590_0001_01_000002
Exit code: 56
Stack trace: ExitCodeException exitCode=56:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Container exited with a non-zero exit code 56

可能是 executor 的記憶體不夠,導致 Out Of Memory (OOM) 了。

ref:
http://stackoverflow.com/questions/39038460/understanding-spark-container-failure

Exception in thread "main" java.lang.StackOverflowError

Exception in thread "main" java.lang.StackOverflowError
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    ...

解決辦法:

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("./spark-data/checkpoint")

// 因為 sc.setCheckpointDir() 就會啟用 checkpoint 了
// 所以可以不用特別指定 checkpointInterval
val als = new ALS()
  .setCheckpointInterval(2)

ref:
https://stackoverflow.com/questions/31484460/spark-gives-a-stackoverflowerror-when-training-using-als
https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk

Randomness of hash of string should be disabled via PYTHONHASHSEED

解決辦法:

$ cd $SPARK_HOME
$ cp conf/spark-env.sh.template conf/spark-env.sh
$ echo "export PYTHONHASHSEED=42" >> conf/spark-env.sh

ref:
https://issues.apache.org/jira/browse/SPARK-13330

It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

因為 spark.sparkContext 只能在 driver program 裡存取,不能被 worker 存取(例如那些丟給 RDD 執行的 lambda function 或是 UDF 就是在 worker 上執行的)。

ref:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#passing-functions-to-spark
https://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/

Spark automatically creates closures:

  • for functions that run on RDDs at workers,
  • and for any global variables that are used by those workers.

One closure is send per worker for every task. Closures are one way from the driver to the worker.

ref:
https://gerardnico.com/wiki/spark/closure

Unable to find encoder for type stored in a Dataset

Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases. someDF.as[SomeCaseClass]

解決辦法:

import spark.implicits._

yourDF.as[YourCaseClass]

ref:
https://stackoverflow.com/questions/38664972/why-is-unable-to-find-encoder-for-type-stored-in-a-dataset-when-creating-a-dat

Task not serializable

Caused by: java.io.NotSerializableException: Settings
Serialization stack:
    - object not serializable (class: Settings, value: [email protected])
    - field (class: Settings$$anonfun$1, name: $outer, type: class Settings)
    - object (class Settings$$anonfun$1, <function1>)
Caused by: org.apache.spark.SparkException:
    Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)

通常是你在 closure functions 裡使用了 driver program 裡的某個 object,因為 Spark 會自動 serialize 那個被引用的 object 一起丟給 worker node 執行,所以如果那個 object 或是 class 沒辦法被 serialize,就會出現這個錯誤。

ref:
https://www.safaribooksonline.com/library/view/spark-the-definitive/9781491912201/ch04.html#user-defined-functions
http://www.puroguramingu.com/2016/02/26/spark-dos-donts.html
https://stackoverflow.com/questions/36176011/spark-sql-udf-task-not-serialisable
https://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
https://mp.weixin.qq.com/s/BT6sXZlHcufAFLgTONCHsg

如果你只有在 Databricks Notebook 裡遇到這個錯誤,因為 Notebook 的運作機制跟一般的 Spark application 稍微有點不同,你可以試試 package cell。

ref:
https://docs.databricks.com/user-guide/notebooks/package-cells.html

java.lang.IllegalStateException: Cannot find any build directories.

java.lang.IllegalStateException: Cannot find any build directories.
    at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248)
    at org.apache.spark.launcher.AbstractCommandBuilder.getScalaVersion(AbstractCommandBuilder.java:240)
    at org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:194)
    at org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:117)
    at org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:39)
    at org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:45)
    at org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:63)
    at org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:51)
    at org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:145)
    at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)

可能的原因是沒有設置 SPARK_HOME 或是你的 launch script 沒有讀到該環境變數。

IPython: Another Python REPL interpreter

IPython: Another Python REPL interpreter

IPython is a neat alternative of Python's builtin REPL (Read–Eval–Print Loop) interpreter, also a kernel of Jupyter.

ref:
https://ipython.org/
https://jupyter.org/

Useful Commands

# cheatsheet
%quickref

# show details of any objects (including modules, classes, functions and variables)
random?
os.path.join?
some_variable?

# show source code of any objects
os.path.join??

# show nothing for a function that is not implemented in Python
len??

# run some shell commands directly in IPython
pwd
ll
cd
cp
rm
mv
mkdir new_folder

# run any shell command with ! prefix
!ls
!ping www.google.com
!youtube-dl

# assign command output to a variable
contents = !ls
print(contents)

ref:
http://ipython.readthedocs.io/en/stable/interactive/tutorial.html

Magic Functions

# list all magic functions
%lsmagic

# run a Python script and load objects into current session
%run my_script.py

# run a profiling for multi-line code
%%timeit
array = []
for i in xrange(100):
    array.append(i)

# paste multi-line code
%paste
%cpaste

# explore objects
%pdoc some_object
%pdef some_object
%psource some_object
%pfile some_object

# if you call it after hitting an exception, it will automatically open ipdb at the point of the exception
%debug

# the In object is a list which keeps track of the commands in order
print(In)

# the Out object is a dictionary mapping input numbers to their outputs
pinrt(Out)
print(Out[2], _2)

ref:
http://ipython.readthedocs.io/en/stable/interactive/magics.html
https://www.safaribooksonline.com/library/view/python-data-science/9781491912126/ch02.html

Python linters

ESLint: a pluggable and extensible linter for JavaScript

ESLint: a pluggable and extensible linter for JavaScript

ESLint is a pluggable and extensible linter for JavaScript, supports ECMAScript 6 perfectly.

ref:
https://eslint.org/

Installation

$ npm install \
eslint \
eslint-plugin-import \
eslint-config-airbnb-base \
--save-dev

It's convenient to use the ESLint plugin from Airbnb instead of configurating your own preferences.
https://github.com/airbnb/javascript/tree/master/linters

Configuration

// .eslintrc
{
  "extends": ["airbnb-base"],
  "env": {
    "browser": true,
    "es6": true,
    "mocha": true,
    "node": true,
    "shared-node-browser": true
  },
  "rules": {
    "class-methods-use-this": 0,
    "curly": 2,
    "indent": [2, 2],
    "max-len": 0,
    "no-continue": 0,
    "no-plusplus": ["error", {"allowForLoopAfterthoughts": true}],
    "no-useless-escape": 0,
    "object-curly-spacing": 0,
    "quote-props": 0,
    "quotes": [1, "single"]
  }
}

You should put a .eslintrc in your project root as the global linting settings, also you can put a .eslintrc in any subfolders to specialize linting configurations.

ref:
http://eslint.org/docs/user-guide/configuring
http://eslint.org/docs/rules/

Tools for profiling your Python projects

Tools for profiling your Python projects

The first aim of profiling is to test a representative system to identify what's slow, using too much RAM, causing too much disk I/O or network I/O. You should keep in mind that profiling typically adds an overhead to your code.

In this post, I will introduce tools you could use to profile your Python or Django projects, including: timer, pycallgraph, cProfile, line-profiler, memory-profiler.

ref:
https://stackoverflow.com/questions/582336/how-can-you-profile-a-script
https://www.airpair.com/python/posts/optimizing-python-code

timer

The simplest way to profile a piece of code.

ref:
https://docs.python.org/3/library/timeit.html

pycallgraph

pycallgraph is a Python module that creates call graph visualizations for Python applications.

ref:
https://pycallgraph.readthedocs.org/en/latest/

$ sudo apt-get install graphviz
$ pip install pycallgraph
# in your_app/middlewares.py
from pycallgraph import Config
from pycallgraph import PyCallGraph
from pycallgraph.globbing_filter import GlobbingFilter
from pycallgraph.output import GraphvizOutput
import time

class PyCallGraphMiddleware(object):

    def process_view(self, request, callback, callback_args, callback_kwargs):
        if 'graph' in request.GET:
            config = Config()
            config.trace_filter = GlobbingFilter(include=['rest_framework.*', 'api.*', 'music.*'])
            graphviz = GraphvizOutput(output_file='pycallgraph-{}.png'.format(time.time()))
            pycallgraph = PyCallGraph(output=graphviz, config=config)
            pycallgraph.start()

            self.pycallgraph = pycallgraph

    def process_response(self, request, response):
        if 'graph' in request.GET:
            self.pycallgraph.done()

        return response
# in settings.py
MIDDLEWARE_CLASSES = (
    'your_app.middlewares.PyCallGraphMiddleware',
    ...
)
$ python manage.py runserver 0.0.0.0:8000
$ open http://127.0.0.1:8000/your_endpoint/?graph=true

cProfile

cProfile is a tool in Python's standard library to understand which functions in your code take the longest to run. It will give you a high-level view of the performance problem so you can direct your attention to the critical functions.

ref:
http://igor.kupczynski.info/2015/01/16/profiling-python-scripts.html
https://ymichael.com/2014/03/08/profiling-python-with-cprofile.html

$ python -m cProfile manage.py test member
$ python -m cProfile -o my-profile-data.out manage.py test --failtest
$ python -m cProfile -o my-profile-data.out manage.py runserver 0.0.0.0:8000

$ pip install cprofilev
$ cprofilev -f my-profile-data.out -a 0.0.0.0 -p 4000
$ open http://127.0.0.1:4000

cProfile with django-cprofile-middleware

$ pip install django-cprofile-middleware
# in settings.py
MIDDLEWARE_CLASSES = (
    ...
    'django_cprofile_middleware.middleware.ProfilerMiddleware',
)

Open any url with a ?prof suffix to do the profiling, for instance, http://localhost:8000/foo/?prof

ref:
https://github.com/omarish/django-cprofile-middleware

cProfile with django-extension and kcachegrind

kcachegrind is a profiling data visualization tool, used to determine the most time consuming execution parts of a program.

ref:
http://django-extensions.readthedocs.org/en/latest/runprofileserver.html

$ pip install django-extensions
# in settings.py
INSTALLED_APPS += (
    'django_extensions',
)
$ mkdir -p my-profile-data

$ python manage.py runprofileserver \
--noreload \
--nomedia \
--nostatic \
--kcachegrind \
--prof-path=my-profile-data \
0.0.0.0:8000

$ brew install qcachegrind --with-graphviz
$ qcachegrind my-profile-data/root.003563ms.1441992439.prof
# or
$ sudo apt-get install kcachegrind
$ kcachegrind my-profile-data/root.003563ms.1441992439.prof

cProfile with django-debug-toolbar

You're only able to use django-debug-toolbar if your view returns HTML, it needs a place to inject the debug panels into your DOM on the webpage.

ref:
https://github.com/django-debug-toolbar/django-debug-toolbar

$ pip install django-debug-toolbar
# in settiangs.py
INSTALLED_APPS += (
    'debug_toolbar',
)

DEBUG_TOOLBAR_PANELS = [
    ...
    'debug_toolbar.panels.profiling.ProfilingPanel',
    ...
]

line-profiler

line-profiler is a module for doing line-by-line profiling of functions. One of my favorite tools.

ref:
https://github.com/rkern/line_profiler

$ pip install line-profiler
# in your_app/views.py
def do_line_profiler(view=None, extra_view=None):
    import line_profiler

    def wrapper(view):
        def wrapped(*args, **kwargs):
            prof = line_profiler.LineProfiler()
            prof.add_function(view)
            if extra_view:
                [prof.add_function(v) for v in extra_view]
            with prof:
                resp = view(*args, **kwargs)
            prof.print_stats()
            return resp

        return wrapped

    if view:
        return wrapper(view)

    return wrapper

@do_line_profiler
def your_view(request):
    pass

ref:
https://djangosnippets.org/snippets/10483/

There is a pure Python alternative: pprofile.
https://github.com/vpelletier/pprofile

line-profiler with django-devserver

ref:
https://github.com/dcramer/django-devserver

$ pip install git+git://github.com/dcramer/django-devserver#egg=django-devserver

in settings.py

INSTALLED_APPS += (
    'devserver',
)

DEVSERVER_MODULES = (
    ...
    'devserver.modules.profile.LineProfilerModule',
    ...
)

DEVSERVER_AUTO_PROFILE = False

in your_app/views.py

from devserver.modules.profile import devserver_profile

@devserver_profile()
def your_view(request):
    pass

line-profiler with django-debug-toolbar-line-profiler

ref:
http://django-debug-toolbar.readthedocs.org/en/latest/
https://github.com/dmclain/django-debug-toolbar-line-profiler

$ pip install django-debug-toolbar django-debug-toolbar-line-profiler
# in settings.py
INSTALLED_APPS += (
    'debug_toolbar',
    'debug_toolbar_line_profiler',
)

DEBUG_TOOLBAR_PANELS = [
    ...
    'debug_toolbar_line_profiler.panel.ProfilingPanel',
    ...
]

memory-profiler

This is a Python module for monitoring memory consumption of a process as well as line-by-line analysis of memory consumption for Python programs.

ref:
https://pypi.python.org/pypi/memory_profiler

$ pip install memory-profiler psutil
# in your_app/views.py
from memory_profiler import profile

@profile(precision=4)
def your_view(request):
    pass

There are other options:
http://stackoverflow.com/questions/110259/which-python-memory-profiler-is-recommended

dogslow

ref:
https://bitbucket.org/evzijst/dogslow

django-slow-tests

ref:
https://github.com/realpython/django-slow-tests