Solution Resort

Learn, try and share…

Submit Spark Streaming job without waiting for it to finish

A Spark Streaming job is different from an ordinary Spark job. It runs 24/7 and never stops until you tell it to. Oozie is a really good tool for scheduling and orchestrating Spark jobs, but when it comes to making it work with Spark Streaming jobs, things get a bit tricky.

Consider the following workflow which is an example taken from a disaster recovery process,
1. Run an ordinary Spark job (e.g. an ETL process) to recover the data from a backup.
2. Run some checks to make sure the recovered data is correct.
3. Start the Spark Streaming job.

By default, when a job is submitted via spark-submit, the submission process is blocked until the actual Spark job finishes. This is not ideal for Spark Streaming, because it means the workflow itself will never finish. And that’s not the only problem: Oozie itself consumes quite a bit of resources. From my experience, Oozie needs around 2 CPU cores and 2G of RAM as a minimum to run any Spark job (1 core and 1G of RAM per process; it uses one process for Oozie itself, and another one for the submission of the Spark job).

Well, the good news is there is an option that tells spark-submit not to wait when it submits a job in YARN cluster mode: spark.yarn.submit.waitAppCompletion.

When it’s set to false, the spark-submit process will exit with code 0 as soon as the job is submitted, and of course, the Oozie action will finish as well.
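As a sketch, the submission would look something like this; the flag only applies to YARN cluster mode, and the class and jar names here are placeholders, not from a real job:

```shell
# Submit the streaming job and return immediately instead of
# blocking until the (never-ending) streaming application finishes.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --class com.example.MyStreamingJob \
  my-streaming-job.jar
```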

Be careful not to blindly use this option everywhere. In the disaster recovery example above, only the Spark Streaming step should use this option, not the recovery job itself, or you will end up starting the streaming job without the disaster recovery having been done at all.

This option is not an obvious one, and I only came across it by chance. Hope this helps others having similar issues.

Scalable reporting solutions with Elasticsearch

See my talk at the Elastic London Meetup about our experience building scalable reporting systems using Elasticsearch (especially if you have a legacy platform)

MySQL in Docker without losing data after rebuild

When it comes to running database services, or anything stateful, in Docker containers, the first question is often “what happens to my data?” after the container is destroyed or rebuilt.

The simple answer is you can use Docker Data Volumes.

After reading a few articles as well as trying it out myself, the easiest and cleanest way I found is to create a Data Container with a volume first, and then tell your MySQL container to use the data volume on that container.

This can be simply done with two commands,

# creates the volume container and mount /var/lib/mysql

docker create -v /var/lib/mysql --name mysql_data_store busybox /bin/true
# start the mysql container and tell it to use the created volume on the mysql_data_store container. Do not use the -d option if you are running it with CoreOS

docker run --volumes-from mysql_data_store --name mysql -e MYSQL_ROOT_PASSWORD=<your long password> -d mysql:latest

Now if you kill and remove the MySQL container and recreate it, mounting the same data volume again, your data won’t be lost because the data volume itself has not been destroyed. The MYSQL_ROOT_PASSWORD option is redundant the second time you run the container, as the database has already been initialized.
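To illustrate the rebuild (using the container and volume names created above; a sketch, not a full recipe):

```shell
# Kill and remove the MySQL container; the volume lives on mysql_data_store
docker rm -f mysql

# Recreate the container against the same data volume -- the data survives
docker run --volumes-from mysql_data_store --name mysql -d mysql:latest
```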

I’ve not tried this on production yet but will do soon on some hobby projects.


Run Docker and Docker Compose in a Vagrant box

I recently created a project that provisions a Vagrant VM (Ubuntu Trusty) with everything necessary installed to run Docker & Docker Compose.

The main reason I created this is that it gives you an isolated environment to run things without going through the hassle of installing Docker & Docker Compose, which can be quite annoying if you are running e.g. Mac OS X and want to use your own Docker Hub repo.

All you need to do is install Vagrant, which can be much easier than installing Docker & Docker Compose depending on the OS you use.

Simply check it out and follow the instructions.

How to: Install a Virtual Apache Hadoop Cluster with Vagrant and Cloudera Manager on a Mac

Feel free to skip some of the steps if you already have certain packages installed

Get Cask
brew install caskroom/cask/brew-cask

Get Vagrant & Vagrant plugins
brew cask install virtualbox
brew cask install vagrant
brew cask install vagrant-manager
vagrant plugin install vagrant-hostmanager

Install Hadoop
git clone git@github.com:richardhe-awin/vagrant-hadoop-cluster.git
cd vagrant-hadoop-cluster
vagrant up

Configure Cloudera Manager (mostly referenced from an external guide)

  1. Go to http://hadoop-master:7180/ (you might have to wait a few minutes for the service to boot up before this is available) and log in with admin/admin
  2. Choose to use the Express version and continue
  3. When you are asked to enter the host names, enter hadoop-node1 and hadoop-node2 and click search. You should see the two hosts come up; confirm them.
  4. Keep using the default options until you get to the page asking “Login to all hosts as”. Change this to “Another user” and enter “vagrant” as the username, then “vagrant” again for the password fields. Click next and it should start installing (this will take a while).
  5. On the “Cluster Setup” page, choose “Custom Services” and select the following: HDFS, Hive, Hue, Impala, Oozie, Solr, Spark, Sqoop2, YARN and ZooKeeper. Click Continue.
  6. On the next page, you can select what services end up on what nodes. Usually Cloudera Manager chooses the best configuration here, but you can change it if you want. For now, click Continue.
  7. On the “Database Setup” page, leave it on “Use Embedded Database.” Click Test Connection (it says it will skip this step) and click Continue.
  8. Click Continue on the “Review Changes” step. Cloudera Manager will now try to configure and start all services.
  9. Done!


Example of caching MVC response using Filesystem cache in ZF2

The code is available on my Github:


The filesystem cache is configured within Application/module.config.php, and construction of the cache adapter is therefore delegated to Zend\Cache\Service\StorageCacheAbstractServiceFactory.

Response Caching

Caching of the MVC response is done through event listeners. This separates concerns and makes the code more decoupled and reusable.

Check out the two methods loadPageCache() and savePageCache() within Application/Module.php.

savePageCache() is attached to the MvcEvent::EVENT_RENDER event with a very low priority. This makes sure $e->getResponse()->getContent() is populated before adding it to the cache.

loadPageCache() is attached to the MvcEvent::EVENT_ROUTE event with a low priority. This allows all other attached listeners to run first before loading the response data from the cache. If the response data is in the cache, $e->getResponse()->setContent() will be called and the response object will be returned. This stops all subsequent listeners attached to the same event from executing. You might wonder why the savePageCache() method no longer gets run either, given that it’s attached to a different event (EVENT_RENDER). The trick is actually done within Zend\Mvc\Application::run() by the following block of code:

$result = $events->trigger(MvcEvent::EVENT_ROUTE, $event, $shortCircuit);
if ($result->stopped()) {
    $response = $result->last();
    if ($response instanceof ResponseInterface) {
        $events->trigger(MvcEvent::EVENT_FINISH, $event);
        $this->response = $response;
        return $this;
    }
}
You can see that $result->stopped() returns true in this case and that the $result object is an instance of Zend\EventManager\ResponseCollection. The last result is the response object with the data retrieved from the cache!

Mount Shared Folder to Ubuntu if auto-mount doesn’t work

I had some issues with mounting shared folders with VirtualBox because they didn’t appear anywhere. I finally figured out that the shared drive was already there; it’s just not shown anywhere.

Before continuing, make sure you have installed the latest version of Guest Additions and that your user is added to the vboxsf group.

First, add a shared folder under Devices -> Shared Folder Settings with the following properties (I am using Windows in this example),

Folder Path: C:\your-folder
Folder Name: shared-name-on-ubuntu

Go to a folder, e.g. ~/ and run

mkdir ~/shared-name-on-ubuntu

Finally mount it

sudo mount -t vboxsf shared-name-on-ubuntu ~/shared-name-on-ubuntu

And now you should be able to see everything shared within Ubuntu.
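If you want the mount to come back after a reboot, one commonly used option is an /etc/fstab entry along these lines (assuming the vboxsf kernel module is available at boot; adjust the paths to your setup):

```shell
# /etc/fstab
shared-name-on-ubuntu  /home/<username>/shared-name-on-ubuntu  vboxsf  defaults  0  0
```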


Create a Zend Framework 2 Module for Composer

I am sure everyone has been using Zend Framework 2 (ZF2) modules created by many others via Packagist, but it can be a little tricky when it comes to creating your own ZF2 modules to work with Composer, especially when using your own private VCS repository. I found the documentation for setting this up rather difficult to find, so I thought it might be useful to pull everything together.

Let’s get started.

First, we need to create a ZF2 module. To save a bit of time, I have already created a skeleton ZF2 module on Github which you can just grab. This is a typical ZF2 module with an additional composer.json file, which is required by Composer to specify the configuration options. The minimum you’ll need is “name”, “require” and “autoload”.

The “name” section tells Composer what your module is called and is a unique identifier for your module. Please use a unique namespace to avoid conflicting with other projects. The “require” section is used to specify dependencies. For example, if your module depends on ZF2, you’ll need to include it there. Finally, the “autoload” section is used to define the PHP namespace for your module. This is explained very well by Composer. Without specifying the autoload section, Composer won’t generate an autoloader for the root namespace of your module and therefore ZF2 won’t know where to find it.

In addition to the “psr-0” section, I have also added a “classmap” section including the path to “Module.php”. This is required because Module.php within a ZF2 module lives outside the src folder; to make ZF2 aware of it, we need to add it separately.
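To make this concrete, here is a minimal composer.json along the lines described above; the vendor, package and namespace names are placeholders, not the actual ones from my skeleton module:

```json
{
    "name": "yourvendor/your-zf2-module",
    "require": {
        "zendframework/zendframework": "2.*"
    },
    "autoload": {
        "psr-0": {
            "YourModule": "src/"
        },
        "classmap": [
            "Module.php"
        ]
    }
}
```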

We have created our first ZF2 module that works with Composer. Now we need to see it in action, just like any other open source Composer package. The key difference is that our ZF2 module is likely stored in a private VCS. To use it with Composer, a special section needs to be added to the composer.json file of our ZF2 application.

"repositories": [
 "type": "vcs",
 "url" : "[email protected]:shenghuahe/zf2-composer-module.git"

I am using a Git repository on Github here. This section is required to tell Composer where to find your ZF2 module. And then, you’ll need to include it under the “require” section,

"require": {
 "shenghuahe/zf2-composer-module" : "dev-master"

“dev-master” points to your master branch. You can always create a tag and require that instead.

That’s it. Run php composer.phar update and see it in action!


Other things to keep in mind

  • If you are using a private repository, you’ll need to authenticate with the server. I haven’t done this with SVN, but with Git, all you have to do is generate an SSH public/private key pair on your machine and copy & paste the public key to your Git server. The instructions can vary depending on the host you choose (Bitbucket, for example, documents this well).

Configure AngularJS server

In the Start an Angular project with Yeoman (Ubuntu 13.04) article (please read that one first or you may have problems following this article), I explained how to start an Angular project with Yeoman as well as how to use Grunt to manage the workflow of your application. However, you may wonder, when making HTTP requests to the server (back-end), where that server is and how it should be configured. I’ll cover all of that in this article.

You can use anything for your back-end code: Node.js with Express, Zend Framework, Rails or anything else you prefer. The concept is the same. However, if you are using Apache as the web server, you will have to add a virtual host for your back-end code. In this article, I’ll use Node.js with Express as the example.

Luckily there is a Yeoman generator for Node.js with Express. To install it, simply run the following command in your terminal (I usually install generators in the global scope for convenience):

sudo npm install -g generator-node-express

My personal preference is to keep the Angular client-side code and the server-side code in the same repository. However, they do need to be separated because they are two separate applications. I typically have the following folder structure,

YourApp/client # AngularJS client-side code
YourApp/server # Node.js with Express server-side code

To generate the AngularJS client-side scaffolding, you’ll need to run yo angular within the client folder. To create the server-side code using Yeoman, simply run yo node-express within the server folder.

You’ll be asked a few questions when installing node-express, but because we don’t need any Angular stuff nor the SASS parser, I recommend unticking those options when installing. If you have issues installing, please refer to the official documentation on NPM.

Now we need to start the node server. If you don’t want the faff, just go to YourApp/server and type in

node app.js

This will start your HTTP server on localhost:3000. If you wish to automate the workflow (i.e. automatically restart the server when any watched files are modified), use grunt restart instead.

We’ve now got to the point where we have a server: open a browser at http://localhost:3000 and you can see our server-side home page. How exciting! However, remember our AngularJS app has been set up on port 9000. Making calls from port 9000 to port 3000 is going to cause problems, because cross-domain traffic is blocked by default unless you deliberately allow it.

To solve this problem, we need to set up a proxy. In other words, when the AngularJS app on port 9000 makes an API call, we’ll tell the dev server to forward it to the back-end on port 3000 behind the scenes.

This is done by using the grunt-connect-proxy module, which has very good documentation. Although the instructions are indeed very good, I do have a few things to point out, because I got stuck on this for a few hours due to a lack of understanding of Grunt.

  • Under the connect:proxies section, the context in the example has a value of ‘cortex’. This means only requests under the /cortex route are proxied, not the top-level route. I thought it applied to the top-level route but apparently it didn’t.
  • The next thing: it’s really useful to use the --verbose option when running grunt server, i.e. grunt server --verbose. This is going to give you very detailed output when starting the server and, on top of that, you’ll also see access logs in the terminal, e.g.
Proxied request: /api/home -> http://localhost:3000/api/home
    "host": "",
    "connection": "keep-alive",
    "content-length": "2",
    "accept": "application/json, text/plain, */*",
    "origin": "",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36",
    "content-type": "application/json;charset=UTF-8",
    "referer": "",
    "accept-encoding": "gzip,deflate,sdch",
    "accept-language": "en-GB,en-US;q=0.8,en;q=0.6",
    "cookie": "_ga=GA1.4.194784887.1384818518"

That’s pretty much it. Grab it and have fun!



Start an Angular project with Yeoman (Ubuntu 13.04)

This guide uses Yeoman to scaffold an Angular application. It should get you started in no time.


Please note: DO NOT use sudo to run any of the npm commands. npm will try to use the ~/tmp folder to create temporary files; this folder should belong to the user currently logged in, and it can cause all sorts of permission issues if you use sudo to run any npm tasks. If you have done this in the past, the best thing to do is to recursively reset everything within your home directory to the correct ownership, such as

sudo chown -R <username> ~
sudo chgrp -R <username> ~

(Using ~ rather than ~/* means hidden folders such as ~/.npm, which is usually the actual culprit, are included.)


Install Yeoman

npm install -g yo

Install Angular generator plug-in

npm install -g generator-angular

Install Ruby and Compass [optional: if you want to use SASS].

sudo apt-get install ruby -y
sudo gem install compass

*There is also a good article talking about why SASS is good.

Create the project

mkdir <project-folder-name> && cd $_
yo angular

*This will only generate the bare minimum skeleton for your application; if you need to generate controllers, views or templates, read the guide on Github.

To build the project for production

grunt build

To preview the project

grunt server

*This is the genius part. When you start the server, all changes you make to the application will be instantly reflected in the browser. This is done by configuring the ‘watch’ section within Gruntfile.js.

At last, I strongly suggest you watch the introduction videos on the Yeoman website.
