Solution Resort

Learn, try and share…

Python CMD app like a CRON job in Docker

This GitHub repository is a boilerplate of a simple Python script called repeatedly in a Docker container.

Recently I deployed one of my home projects to ECS (AWS Elastic Container Service) using this approach, and it has worked seamlessly for over 3 weeks, so I decided it’s time to share it. My use case was that I needed to fetch some data every minute from a third-party API and then send the data over to Kinesis. Google Functions combined with Google AppEngine is obviously one of the approaches, but I think this approach using Docker is probably equally straightforward or even simpler.

The idea is very simple: a bash script calls a Python script at a fixed interval. The reason for using a shell script is that it’s not limited to just Python; anything that can be executed from the command line will work.
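A minimal sketch of such a runner script (the file name run.sh, the script name main.py, and the 60-second interval are my own placeholders, not taken from the repo):

```shell
# write the runner script: it calls the Python script, sleeps, and repeats
cat > run.sh <<'EOF'
#!/bin/sh
while true; do
  python3 main.py
  sleep 60
done
EOF
chmod +x run.sh
```

In the Docker image, a script like this would typically be the container’s CMD or ENTRYPOINT.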

OK, give it a try and let me know how you get on with it 🙂

Install Python 3.6 on Ubuntu 16.04

This should save you some time finding the right commands to run:
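A sketch of the commands, assuming sudo access and the deadsnakes PPA (the usual source of Python 3.6 packages for Ubuntu 16.04):

```shell
# add the deadsnakes PPA, then install Python 3.6 alongside the system Python
sudo apt-get update
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install -y python3.6

# confirm the install
python3.6 --version
```

Note this installs Python 3.6 as `python3.6`; the system `python3` stays untouched.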

Install Docker on Ubuntu 16.04

Feel free to grab:
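A sketch following the official Docker install route for Ubuntu 16.04 (xenial); assumes sudo access:

```shell
# prerequisites for using Docker's apt repository over HTTPS
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common

# add Docker's GPG key and the xenial stable repository
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable"

# install Docker CE and smoke-test it
sudo apt-get update
sudo apt-get install -y docker-ce
sudo docker run hello-world
```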

How to rebase and squash git commits
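A self-contained demonstration of squashing commits with `git rebase -i`, driven non-interactively so it runs unattended (a configured git and GNU sed are assumed; the throwaway repo and commit messages are placeholders):

```shell
set -e
# build a throwaway repo with four commits
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
for i in 1 2 3 4; do
  echo "$i" >> file.txt
  git add file.txt
  git commit -qm "commit $i"
done

# squash the last three commits into one: mark all but the first
# todo entry as "squash"; GIT_EDITOR=true accepts the combined message
GIT_SEQUENCE_EDITOR="sed -i '2,\$s/^pick/squash/'" GIT_EDITOR=true \
  git rebase -i HEAD~3

git log --oneline   # two commits remain instead of four
```

Interactively, you would just run `git rebase -i HEAD~3` and change `pick` to `squash` in your editor.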

Configure PyCharm CE to work with Apache Spark

This guide should help you set up PyCharm CE to work with Python 3 and Apache Spark (tested with version 2.1).

First, Create a new Pure Python PyCharm project.

Now copy the content of the example into your project. Your IDE should complain about the following line

from pyspark.sql import SparkSession

because it doesn’t know where pyspark.sql, which is part of the Python Spark library, is located.

In order to tell PyCharm where the Python Spark libraries are, go to Preferences -> Project -> Project Structure and add the zip files under $SPARK_HOME/python/lib to the content root, where $SPARK_HOME is the location of your Apache Spark directory. If you haven’t downloaded Apache Spark, you can download it here.

Next, go to Run -> Edit Configurations

and create a new configuration using the default Python configuration profile and add the following environment variables,

SPARK_HOME=<your spark home dir>
PYTHONPATH=<your spark home dir>/python

Then specify the name of your main .py script and the location of your text file where you want the words to be counted.

Finally, run your new configuration and it should do a word count job using Apache Spark.

Have fun!

Boilerplate – Apache Spark with Spring profile

This is a boilerplate you can try out to get started with Apache Spark (version 2.10) with Spring profile quickly:

This should allow you to configure environment specific properties (i.e. path to read some input file) really easily.

Boilerplate – Groovy and Spock

This is a boilerplate you can try out to get started with Groovy and Spock quickly

I created this because it can be tricky to find all the right dependencies & plugins to get started with Spock. This should get you started in no time.

Submit Spark Streaming job without waiting for it to finish

A Spark Streaming job is different from an ordinary Spark job. It runs 24/7 and never stops until you tell it to. Oozie is a really good tool for scheduling and orchestrating Spark jobs, but when it comes down to also making it work with Spark Streaming jobs, things get a bit tricky.

Consider the following workflow, an example taken from a disaster recovery process:
1. Run an ordinary Spark job (i.e. an ETL process) to recover the data from a backup.
2. Run some checks to make sure the recovered data is correct.
3. Start the Spark Streaming job.

By default, when a job is submitted via spark-submit, the submission process is blocked until the actual Spark job finishes. This is not ideal for Spark Streaming, because it means the workflow itself will never finish. And that’s not the only problem: Oozie itself consumes quite a bit of resources. From my experience, Oozie needs around 2 CPU cores and 2 GB of RAM as a minimum to run any Spark job (1 core and 1 GB of RAM per process; it uses one process for Oozie itself, and another for the submission of the Spark job).

Well, the good news is there is an option to tell spark-submit not to wait when it submits a job, which is
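The option in question is, to the best of my knowledge, `spark.yarn.submit.waitAppCompletion`, which applies when submitting to YARN in cluster mode. A sketch of a submission using it (the job script name is a placeholder):

```shell
# fire-and-forget: spark-submit returns as soon as YARN accepts the application,
# instead of blocking until the (never-ending) streaming job finishes
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  my_streaming_job.py
```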

When it’s set to false, the spark-submit process exits with code 0 as soon as the job is submitted, and of course, the Oozie job exits as well.

Be careful not to use this option blindly everywhere. In the disaster recovery example above, only the Spark Streaming step should use it, not the recovery jobs themselves, or you will end up starting the streaming job without the disaster recovery having been done at all.

This option is not an obvious one, and I only came across it when reading around. Hope this helps others having similar issues.

Scalable reporting solutions with Elasticsearch

See my talk at the Elastic London Meetup about our experience building scalable reporting systems using Elasticsearch (especially if you have a legacy platform)

MySQL in Docker without losing data after rebuild

When it comes down to running database services, or anything stateful, in Docker containers, the first question is often “what about my data?” once the container is destroyed or rebuilt.

The simple answer is you can use Docker Data Volumes.

After reading a few articles as well as trying it out myself, the easiest and cleanest way I found is to create a Data Container with a volume first, and then tell your MySQL container to use the data volume on that container.

This can be done with two commands:

# create the volume container with a volume at /var/lib/mysql
docker create -v /var/lib/mysql --name mysql_data_store busybox /bin/true

# start the MySQL container and tell it to use the volume on the
# mysql_data_store container (do not use the -d option on CoreOS)
docker run --volumes-from mysql_data_store --name mysql -e MYSQL_ROOT_PASSWORD=<your long password> -d mysql:latest

Now if you kill and remove the MySQL container and recreate it, mounting the same data volume again, none of your data is lost, because the data volume itself was never destroyed. The MYSQL_ROOT_PASSWORD option is redundant the second time you run the container, as the MySQL data directory has already been initialized.
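For example, assuming the two containers above are running, the rebuild looks like this:

```shell
# destroy the MySQL container; the data container and its volume survive
docker rm -f mysql

# recreate it against the same volume: existing databases are still there,
# and MYSQL_ROOT_PASSWORD is no longer needed
docker run --volumes-from mysql_data_store --name mysql -d mysql:latest
```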

I’ve not tried this on production yet but will do soon on some hobby projects.

More reading:

