6(66) Nightmares of Microservice Testing

Tests are clearly an extremely important part of a developer’s day-to-day. They provide us with a safety net as we hack away at the rotten roots of our codebase, allowing us to feel more like a real engineer and less like we’re walking through an antiques shop wearing a 12th century suit of armour.

But as Dante said - there’s a special place in hell for evil-doing programmers, where the tests run ever so slowly, constantly flake and the coffee never ever brews! - or at least I’m pretty sure that’s what Inferno describes, having never actually read it. In reality, we’ve all glimpsed this hell and its myriad mini tortures …

i) Distributed monolith

One of the most fundamental principles of a microservice architecture is independent deployment. There are numerous benefits that result from this, but at the most basic level it’s about simplification, speed and tiny increments.

So why - then - do we often insist on testing our whole suite of services in one huge gulp?

If you have to manage every single moving part in one massive test deployment then you no longer have a set of microservices but rather a painfully distributed monolith. Needless to say, this is going to be extremely difficult to consistently spin up, debug and develop for. Any activity at this level should really be kept to the few smoke or end-to-end tests that can verify your system.

Contract testing frameworks, such as Pact, can be a saviour here.
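As a rough illustration of the idea, here is a sketch of a consumer-side contract test using the Pact JVM JUnit 5 support. The service names, endpoint and fields are made up, and package and annotation locations shift between Pact versions, so treat this as the shape of the technique rather than a drop-in recipe.

import au.com.dius.pact.consumer.MockServer;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import static org.junit.jupiter.api.Assertions.assertEquals;

// The consumer ("order-service") records exactly what it needs from the provider
// ("user-service"). The resulting pact file is verified against the real provider
// in that team's own build - no combined mega-deployment required.
@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "user-service")
class UserContractTest {

    @Pact(consumer = "order-service")
    RequestResponsePact userExists(PactDslWithProvider builder) {
        return builder
            .given("user 42 exists")
            .uponReceiving("a request for user 42")
                .path("/users/42")
                .method("GET")
            .willRespondWith()
                .status(200)
                .body("{\"id\": 42, \"name\": \"Ada\"}")
            .toPact();
    }

    @Test
    void consumerCanReadTheFieldsItDependsOn(MockServer mockServer) throws Exception {
        // In a real test you would exercise your actual client class here;
        // a raw HTTP call just keeps the sketch self-contained.
        HttpResponse<String> response = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create(mockServer.getUrl() + "/users/42")).GET().build(),
            HttpResponse.BodyHandlers.ofString());
        assertEquals(200, response.statusCode());
    }
}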

ii) Tending to special unicorns

Tests of any category should be simple to check out, kick off and develop/debug. However, far too often, it can involve a Herculean effort of trial and error just to get the damn things running in the first place.

Wrestling with pages of undocumented configuration, rivaling the bulk of War & Peace, can make you want to quit before you’ve even begun.

Machines that need some special pampering and preening before they even think about playing ball inevitably result in a team fearful of touching ‘the sacred server’ in case they piss off the testing gods.

Those that cannot be easily executed from a development environment are doomed to a future of bit rot and ever increasing padding with inventive variations on “sleep(1000)” to get the build to “please just go green again!”

A consistent, repeatable way to deploy and control the test environment is the antidote here. Tooling such as Arquillian can give Java developers that peace of mind. Or alternatively, if you’re able to encapsulate that “specialness” within a Docker container you can use docker-compose (and friends) to orchestrate in a consistent way and move your tests in the right direction.

iii) Too much tea

If, more often than not, when you run your tests you have enough time to make a(nother) cup of tea then there’s a bit of an issue. Slow tests mean a far longer feedback loop, less pace and more importantly loss of focus.

It’s understandable that a certain small class of test - those that run end-to-end - will take their fair share of time. But every effort should be made to keep the bulk of your tests running lightning quick.

The greatest crime of all is using those dreaded blocking waits to control flow within a test. These stack up really quickly and breed even faster - one sleep usually leads to enough instability to require a number of peers. Tests should be strictly event driven, reacting directly to triggers in the system under test.
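Where you can’t hook directly into an event, a pragmatic halfway house is a bounded wait on the condition itself rather than a blind sleep. Below is a minimal sketch using the Awaitility library; the queue and event name are placeholders for whatever your system under test actually emits.

import static org.awaitility.Awaitility.await;

import java.time.Duration;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class OrderConfirmationTest {

    // stand-in for wherever the system under test publishes its events
    private final Queue<String> receivedEvents = new ConcurrentLinkedQueue<>();

    void waitForConfirmation() {
        // Instead of Thread.sleep(1000) and hoping the event has arrived,
        // poll for the condition with an upper bound so a genuine failure
        // still fails quickly rather than padding every build.
        await()
            .atMost(Duration.ofSeconds(5))
            .until(() -> receivedEvents.contains("ORDER_CONFIRMED"));
    }
}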

iv) An unbalanced portfolio

We’re all aware of the sacred ‘test pyramid’ demonstrating the range of tests from fast and cheap to slow and costly. There are many variations on this metaphor these days, but they are all based on the same principle: cover as much of your functionality using the lower levels of the pyramid and only extend further up the structure as needed to cover the rest.

Avoid covering functionality in your heavier testing layers when it can be taken care of by the simpler and snappier ones.

v) Testing in paradise

Little in life is ever certain. However, although in the old monolith world you couldn’t ever be completely confident in each and every action, you could be fairly sure a call between one module and another was pretty much guaranteed.

In the distributed systems domain of microservices very little is certain. If your system has a large enough footprint it is almost never going to be 100% healthy. It’s foolish to test without taking this into account, or at the very least without mitigating these issues via more modern chaos engineering strategies.

Tools to check out in this space would include Saboteur, Hoverfly and the Simian Army.

vi) Thinking that testing ends when you ship code

Traditional testing strategy is very much focussed on ‘the product’. You design it, develop it, run a whole bunch of tests (hopefully automated) before burning it on a disc and kicking it out the door.

Can’t we take the same approach with a microservice deployment? Well many have certainly tried and inevitably struggled to make it succeed. The inherent complexity in a more advanced distributed system makes catering for every eventuality an enormous task that quickly becomes insurmountable.

By throwing in the towel and stepping back it becomes clearer that testing is really a probability game. We want to get as near to 100% coverage as we can without spending all our resources on getting there.

A smarter investment may be to pick up much of the low-hanging fruit with traditional testing, and then invest the rest in improving our ability to QA in production. Strategies such as:

  • canary deployments - routing a small selection of users onto a fresh deployment to check its stability.
  • shadowing - duplicating traffic from the live environment onto a new deployment, and comparing the results to assert correctness.

This clearly requires a certain amount of maturity within the services themselves. You cannot go about deploying potentially breaking changes into production without the ability to roll back easily, degrade gracefully and provide alternatives/fallbacks in the case of an issue.

Definitely a goal worth pursuing.


[Bits 1] Year of the Dog


Happy new year! … again

As I seem to only post blog entries at the start of a fresh year, here’s the first of what I hope will become a fairly frequent round-up of articles, books etc that I’ve found particularly interesting during the preceding week or so.

If all goes to plan then maybe I’ll get my next post out before the Mayan calendar flips over in July.


Articles

Introducing capsule networks

A brief, and understandable, overview of a new neural network architecture from the godfather of deep learning Geoffrey Hinton.

Lessons from optics, the other deep learning

A case for making deep learning more intuitive and approachable by comparing its layered nature to that of the field of optics.

How neural networks learn distributed representations

An intuition for how a deep learning net can capture representations and models across its distributed parts.

Game-theory insights into asymmetric multi-agent games | DeepMind

From the masters of reinforcement learning and AI game playing (DeepMind) comes a technique to quickly and easily identify the Nash equilibrium of an asymmetric multi-agent game.

Quantum computers ‘one step closer’

BBC article on some recent developments in the field of quantum computing.

Why the Web 3.0 matters and what you should know about it

Data privacy and governance are increasingly important topics in today’s world. Web 3.0 - the decentralized web - is approaching fast, thanks mainly to blockchain. This post brings you up to speed on many of the areas of our lives this may have the power to affect.


Books

Deep Learning with Python

A fantastic dive into the practical aspects of deep learning. The only real downside is that there’s such a wealth of information present that it’ll take a fair few iterations to fully absorb all of the content.


Technology Radar Jan '18

The Thoughtworks Technology Radar should be a familiar sight to any active technologist. The breadth of coverage across hundreds of interesting technologies, both new and old, has made its release something of an event in the software developer’s calendar.

In an interesting post by one of ThoughtWorks’ key employees - Neal Ford - it is suggested that not only should enterprises produce their own version of the radar, but so should each individual software developer.

You need two radars: one for yourself, to help guide your career decisions, and one for your company.

In an attempt to capture the direction of my career and hobbyist efforts I have taken this advice and produced my first personal radar.

TL;DR

First a quick overview of my “Intro to 2018” list, which seems to be split into three broad categories.

Near Term Increments
Items related to Java, microservices and containerization would fall under this category. Technologies that I use on a day to day basis, and in which I should be fairly fluent, whilst continuing to improve my understanding.

Longer Term Investments
Technologies that are more aspirational in scope - e.g. AI and machine learning topics - but are likely to become much more prevalent in the industry within the next 5 years or so.

Skill Diversification
Skills that would be considered a “parallel path” to my everyday work, but of which I should have some decent appreciation - even if only at a very basic level. For example, Android or Alexa development.

Now for the radar in full.

Techniques & Theory

Too Many Cucumbers!

hold

BDD, and Cucumber most specifically, are great tools when used in the right context. However having way too many tests, or generating unfocussed tests, can result in a slow, brittle and unmaintainable test suite. In addition to this, not having a key business stakeholder involved with the definition and understanding of the scenarios defeats the point somewhat. I’ll be personally looking to use this technique/tool more judiciously in future.

Deep Learning

trial

Deep learning, and related techniques, have made neural networks cool again! Not to jump on any kind of bandwagon, but I’ll be looking to understand this class of algorithm more deeply, as I believe they will become much more prevalent in the daily life of a software developer in the upcoming decade and beyond.

Service Mesh

assess

A microservice architecture, although in the singular quite simple, in the aggregate can become quite an unmanageable behemoth. Managing communication, security, monitoring and other similar concerns that are orthogonal to the core business can form a significant part of this complex distributed system. Service meshes look to externalize and manage much of this trickery as a third-party concern, allowing you to focus solely on what you do best.

Chaos Engineering

assess

If things can go wrong they usually will. Add to this the fact that microservices increase the surface area of things that can go wrong (specifically in between those services), and writing robust, resilient and scalable systems means we must probe at the corners of possibilities. A mature development team should be looking at chaos engineering practices to try and tease out any problematic edge cases before they occur in the wild.

Lightweight Architecture Decision Records

assess

Ever wonder why a decision was made, or the context in which it was made? Lightweight architecture decision records are a technique for capturing those important decisions in a simple text format that can be kept in SCM alongside the source code itself.

Consumer Driven Contract Testing

trial

An essential part of the microservice testing toolkit, contract tests enable independent service deployments whilst maintaining a solid contract between any two neighbouring services.

Reinforcement Learning

assess

A relatively old machine learning technique that utilizes a feedback loop to train a software agent in a specific, realtime environment. More recently, when combined with deep learning, it has shown some very interesting applications - e.g. the famous DeepMind Atari-playing agents.

Data Structures & Algorithms

adopt

As a software developer, unless you’re very lucky, you spend much of your time putting things into databases and then pulling them out again in the future to show to an end user. As you can imagine, this doesn’t involve a ton of good old Computer Science, and inevitably the knowledge of those core algorithms slowly diminishes. This is a “note to self” to visit some of those topics, and keep that knowledge fresh.

Statistics

adopt

The continued rise of machine learning and data science requires a good understanding of statistical methods and similar such topics. This is likely to become even more true as the field expands into everyday software development.

Linear Algebra

adopt

Machine learning algorithms are very much based upon core linear algebra concepts. To truly understand the magic inside some of those algorithms, a solid grounding in linear algebra is essential.

Domain Driven Design

adopt

Designing microservices that are internally cohesive but well decoupled can be quite a challenge to implement, as anyone who has worked on such a system will understand. An understanding of DDD is key to ensuring that the right design decisions are made.

Tools & Frameworks

Tensorflow

trial

TensorFlow is an open-source software library for dataflow programming across a range of tasks. It is a symbolic math library, and also used for machine learning applications such as neural networks.

Scikit Learn

trial

Scikit-learn is a free software machine learning library. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Keras

assess

Keras is a high-level interface in Python for building neural networks. Keras is open source and runs on top of either TensorFlow or Theano. It provides an amazingly simple interface for creating powerful deep-learning algorithms to train on CPUs or GPUs

Jupyter

adopt

Increased interest in machine learning — along with the emergence of Python as the programming language of choice for practitioners in this field — has focused particular attention on Python notebooks, and the most popular of these is Jupyter.

Open Tracing

assess

A vendor-neutral open standard for distributed tracing

Apache Spark

hold

Apache Spark™ is a fast and general engine for large-scale data processing. Although still an excellent tool in the right circumstances, I’ve decided to focus my attention on smaller-scale data science frameworks.

Pandas

adopt

pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Numpy

adopt

NumPy is the fundamental package for scientific computing with Python.

Platforms

AWS

adopt

Clearly AWS isn’t anything new, but I’ve personally neglected getting more involved with the platform for way too long. Here’s a “note to self” to dig in and understand it a little more.

Serverless

trial

The use of serverless architecture has very quickly become an accepted approach for organizations deploying cloud applications, with a plethora of choices available for deployment.

Android

trial

In a similar vein to my AWS entry, Android (or mobile development in general) should be understood at a basic level by any mature software developer - another todo for personal projects.

Alexa

trial

Voice platforms are high in popularity at the moment. Whether a fad, or real trend, it’s worth getting stuck in to a personal project on the Alexa platform for some diversification of skills.

Sage Maker

assess

Amazon SageMaker is a fully-managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. Amazon SageMaker removes all the barriers that typically slow down developers who want to use machine learning.

Kaggle

adopt

Kaggle is a platform for predictive modelling and analytics competitions in which statisticians and data miners compete to produce the best models for predicting and describing the datasets uploaded by companies and users.

Kubernetes

adopt

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.

Docker

adopt

Docker is the world’s leading containerization platform - not much to add myself, just that it’s currently a must-have for server-side development.

Languages

Kotlin

trial

Kotlin is a statically typed programming language for modern multiplatform applications. It has overtaken Scala as my personal JVM alt-language of interest, and I’m looking to play with it a fair bit more this year.

Python 2

hold

Python 2 has been king for as long as I can remember, but with Python 3 getting some great new features and wider community support, it has finally come time for its abdication.

Python 3

adopt

See Python 2

Scala

hold

For many years after my initial Clojure obsession I saw Scala as my functional saviour from a frustrating and verbose Java. Despite many fun times with the language, its inherent complexity and (seeming) drop-off in adoption have led me to consider the growing alternatives - see Kotlin.

R

hold

A great data science and machine learning language. However, my larger personal experience with Python, and Python’s continued dominance in the data science community, have led me to trim down my options and focus on deeper rather than broader personal research.

Effective Java 8/9

adopt

Java has (fairly) recently obtained some significant improvements to the core language. However, as with everything, addition adds complexity. Some best practices around these new features are key, and where better to study them than in Joshua Bloch’s revised book.


Spring Cloud in 10 Bad Cartoons

A quick tour of the combined Spring Cloud / Netflix OSS microservice stack through some pretty terrible drawings, inspired by John Carnell’s book Spring Microservices in Action (the subject, that is, not the awful pictures)


Building well designed applications using microservices can require a great deal of maturity. Aspects such as service discovery, load balancing and gracefully handling failure are effectively mandatory, but can be painful to implement well.

Spring Cloud pulls together a number of well worn tools that help make a number of the core patterns of distributed systems simpler to wire up and manage. More specifically this can involve technology such as Consul, Zookeeper and the Netflix OSS stack.

We’ll now check out some of the patterns made available to you by utilizing Spring Cloud, and related tooling, through the medium of terrible drawings.

Netflix OSS

A ton of the functionality provided here is backed by the Netflix OSS stack. Service discovery, load balancing, fault tolerance and gateway routing features are all supported by Netflix’s toolset, although the full stack does much more than this. In the picture below, I’ve marked out the specific libraries we’ll be checking out later, with a few notes as to their general purpose.


Configuration Management with Spring Cloud

At the tail end of the last century, NASA sent an orbiter to Mars with the intention of surveying the red planet to understand its water history and to search for traces of evidence that life had once existed there. The spacecraft arrived, after a grueling ten month journey, only for disaster to strike: it burned up in the Martian atmosphere after flying almost 105 miles closer to the planet’s surface than intended. The reason for this, it turned out afterwards, was a misunderstanding between two separate development teams as to the units of force used throughout the system. On one hand, propulsion engineers at Lockheed Martin had used their standard expression of force in pounds. However, in space engineering, the commonly used units are newtons, and NASA engineers hadn’t thought to question any mismatch when integrating components. One pound of force is around 4.45 newtons, providing enough of a difference to cause the disaster.

So how does this relate at all to configuration management? Well, it’s a fairly crude example of the importance of a single source of truth, and how miscommunication across components in a system can result in a catastrophic outcome. The principles at work here apply to the services within a distributed system - most specifically, in this case, related to configuration of those services, and the concept of configuration drift. Let’s look at a definition of this:

Configuration Drift is the phenomenon where running servers in an infrastructure become more and more different as time goes on, due to manual ad-hoc changes and updates, and general entropy.

Instances of microservices should be totally unremarkable, and completely replaceable. Any chance of a unique configuration creeping into one instance but not others could cause unexpected issues within a production setting - and any kind of property/configuration file tied to a single instance of a service provides exactly that chance.

This is where a centralized configuration strategy can help. All services point at that single source of truth, making divergence of configuration across them much less likely. As a bonus, the ability to change that one piece of information and affect all dependents instantly can streamline the general management of configuration within your distributed application.
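A minimal sketch of what this looks like on the client side with Spring Cloud Config (the property name greeting.message is made up): the service asks the central config server for its properties at startup, and @RefreshScope beans can pick up changes without a redeploy.

import org.springframework.beans.factory.annotation.Value;
import org.springframework.cloud.context.config.annotation.RefreshScope;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// The service carries no local property file for this value; it is served by
// the central config server (pointed at via spring.cloud.config.uri or
// spring.config.import, depending on the Spring Cloud version in use).
@RestController
@RefreshScope  // re-reads the value when a refresh event is triggered
public class GreetingController {

    @Value("${greeting.message}")
    private String message;

    @GetMapping("/greeting")
    public String greeting() {
        return message;
    }
}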

Service Discovery with Ribbon & Eureka!

Another important aspect of a distributed system is how you actually connect all those moving parts together in the first place! Of course, it’s easy to statically configure a set of addresses on service boot, but what if one of those endpoints disappears or becomes unhealthy?

Eureka and its partner, Ribbon, were designed to help solve this problem. As a service starts it registers itself with the central Eureka service. This allows any dependent service to find out who to talk to via this central point.

Eureka keeps tabs on a service instance by prodding its health-check API to ensure that it is available and happy to serve. If an instance is found to be unavailable or reporting issues, it is removed from the working list.

Ribbon keeps the client side of this arrangement simple. It is a request-side library that keeps in touch with Eureka to keep track of the addresses that serve a certain function. It abstracts away the physical addresses behind a location-transparent reference which we can use within our code to decouple our service from any of those upstream.
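In Spring Cloud terms this typically looks something like the sketch below: the RestTemplate is marked @LoadBalanced, so the logical name user-service (an illustrative registration name) is resolved against Eureka’s registry and requests are spread across healthy instances by Ribbon.

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.client.loadbalancer.LoadBalanced;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Configuration
class ClientConfig {

    // Ribbon intercepts calls made through this RestTemplate and swaps the
    // logical service name for a concrete host:port taken from Eureka.
    @Bean
    @LoadBalanced
    RestTemplate restTemplate() {
        return new RestTemplate();
    }
}

@Service
class UserLookup {

    @Autowired
    private RestTemplate restTemplate;

    String userName(long id) {
        // "user-service" is the Eureka registration name, not a physical address
        return restTemplate.getForObject("http://user-service/users/{id}/name", String.class, id);
    }
}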

Failing Successfully with Hystrix

Netflix’s Hystrix is a fault tolerance library designed to prevent cascading failures across a distributed system - a place where failure is almost certainly going to occur at some point.

Application architecture is generally well enough designed to cater for large-scale failure, and by this I mean situations like a full server outage. Databases are replicated so that they can lose a cluster member and still remain unhindered. API calls are often load balanced across a number of identical instances of an application to avoid any single point of failure.

However, smaller scale failure, or a downward spiral of QoS, are generally less well handled. Specifically aspects such as intermittent failure and ever-increasing latency of upstream responses are not well catered for and so requests eventually back up, and overwhelm the system.

Circuit Breakers

A circuit breaker functions in a manner similar to its electrical counterpart. But rather than detecting an electrical surge, it tries to prevent a situation where a struggling upstream service becomes increasingly stressed under an overwhelming number of requests. It does this by monitoring the lifecycle of remote service calls. If latency begins to creep up, the connection is cut, protecting the struggling dependency.

Once disconnected, the behaviour of the circuit breaker changes somewhat. As calls continue to enter the service to which the circuit breaker belongs, the upstream endpoint is periodically tested until we see that good service has resumed, at which point the circuit breaker is closed and requests are allowed to flow freely once again.

As you can imagine, if the circuit breaker is open the requests are unable to successfully complete. Although beneficial to the upstream service being protected, it’s not great for the client that made the request in the first place.

This is where an extension of this pattern becomes useful.

Fallbacks

Instead of just allowing the request to crash ‘n’ burn, we can fail more gracefully by providing a fallback to our incomplete API call. This can be provided by a cache, an alternative or even just plain old stubbed data. The important thing is that to an outsider it looks just like the real thing.

For example - let’s say you are a service providing some personalized recommendations, but the recommendation engine has been circuit broken. By providing some general pre-cached recommendations an end user wouldn’t notice the difference, unless they really started poking around.
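With the Javanica annotations that Spring Cloud Netflix wires in, the recommendation example might be sketched roughly like this; the service, URL and method names are made up.

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
class RecommendationService {

    private final RestTemplate restTemplate = new RestTemplate();

    // The call is wrapped in a circuit breaker; if the recommendation engine
    // times out, errors, or the circuit is open, Hystrix invokes the fallback.
    @HystrixCommand(fallbackMethod = "cachedRecommendations")
    List<String> recommendationsFor(String userId) {
        String[] recs = restTemplate.getForObject(
                "http://recommendation-engine/recommendations/{id}", String[].class, userId);
        return List.of(recs);
    }

    // Pre-cached, non-personalized suggestions: good enough that the end user
    // won't notice the difference unless they really start poking around.
    List<String> cachedRecommendations(String userId) {
        return List.of("Top sellers", "Staff picks");
    }
}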

Bulkhead

Have you ever experienced a performance issue where a slow-running resource, be it a database or API call, has caused requests to back up and eventually consume all the threads in your app? The reason for this is that your app is acting like a big hollow rowing boat - one leak and water eventually consumes the whole thing.

The bulkhead pattern (in reference to a ship’s bulkheads) is a way to isolate different remote calls into their own thread pools. If one remote resource causes requests to queue up then the problem is isolated to that single resource, allowing the rest of the application to carry on as normally as possible.
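Hystrix supports this directly: giving a command its own threadPoolKey isolates its calls in a dedicated, bounded pool, so a misbehaving dependency can only exhaust its own threads. A rough sketch follows; the pool name, sizes and backend call are illustrative.

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;

class ReportClient {

    // Calls to the slow reporting backend run in their own small pool;
    // if they back up, only this pool saturates, not the whole application.
    @HystrixCommand(
        threadPoolKey = "reportingPool",
        threadPoolProperties = {
            @HystrixProperty(name = "coreSize", value = "10"),
            @HystrixProperty(name = "maxQueueSize", value = "20")
        },
        fallbackMethod = "emptyReport"
    )
    String monthlyReport(String accountId) {
        return slowReportingBackendCall(accountId);
    }

    String emptyReport(String accountId) {
        return "{}";
    }

    private String slowReportingBackendCall(String accountId) {
        // stand-in for the real remote call
        return "{\"account\":\"" + accountId + "\"}";
    }
}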

Zuul

Zuul (as in that nightmarish dog-monster thing from Ghostbusters) acts as a gatekeeper to your full suite of microservices. This single entrypoint for all requests allows you to manage several cross-cutting concerns in one place. Aspects such as security, monitoring and logging, to name but a few.

Zuul can intercept a request at three separate points in its lifecycle, allowing you to decorate it with additional functionality as appropriate.

  • pre filters add custom logic to process the request as it enters your “domain”
  • post filters are the final stop as a response leaves your platform. For example, to log the completion of the request.
  • route filters intercept the request before it travels upstream and give you the chance to alter its destination. Great for managing A/B testing, and similar strategies.

In addition, Zuul integrates seamlessly with the Eureka service discovery engine to dynamically determine healthy upstream resources.
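A pre filter, for example, is just a small class registered as a bean; the sketch below (the header name is made up, and method signatures vary slightly between Zuul versions) tags every incoming request with a correlation id before it is routed upstream.

import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;
import java.util.UUID;
import org.springframework.stereotype.Component;

@Component
class CorrelationIdFilter extends ZuulFilter {

    @Override
    public String filterType() {
        return "pre";          // run before the request is routed upstream
    }

    @Override
    public int filterOrder() {
        return 1;              // position relative to other pre filters
    }

    @Override
    public boolean shouldFilter() {
        return true;           // apply to every request
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        ctx.addZuulRequestHeader("X-Correlation-Id", UUID.randomUUID().toString());
        return null;           // the return value is ignored by Zuul
    }
}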

Event Based Architecture

Often microservices communicate via RESTful API calls. REST implies synchronicity because the request and response are naturally tied together, and as such it causes a fairly tight coupling between the two services.

Unfortunately, this tight coupling adds complexity to managing communication, in aspects such as fault tolerance (hence the Hystrix library we discussed earlier). Synchronous communication is also much more affected by general slowness, making graceful degradation a tricky prospect.

By decoupling services through some kind of message bus we gain many advantages through asynchronous messaging. The ability to easily scale, to cope with outages and downtime, and to evolve your architecture to support additional consumers, are all great benefits that can make your system much more resilient and flexible.

Zipkin

With all of this technology chatting away in a distributed fashion it can make debugging a production issue quite a challenge, to say the least.

The OpenTracing initiative aims to alleviate this problem by providing a vendor-, language- and framework-independent solution - of which Zipkin is a member.

A single request flow, or trace, is started at our gateway (e.g. Zuul) and propagates through the full traversal. Each trace is broken down into a number of spans which represent some service processing step such as a database call.

Traces are captured and logged to a central service for only a small sample of requests (by default 10%). The final outcome is a visual representation of the lifecycle of a request through your system, accompanied by some key metrics at each stage (span) allowing you to track down those sneaky areas of concern.

Summary

Spring Cloud can greatly simplify the development of a suite of robust, cleanly integrated microservices. However, there is an open question as to how well this integration stretches to non-Spring services - and, of course, part of the microservice mantra is to use the right tool for each job, which may lead to a more diverse technology footprint across many teams. I suppose this may be the case in which a service mesh makes most sense. But if you are starting out with a handful of Java based services, you could do much worse than adopting the Spring Cloud framework and its associates.


Python 3 Asyncio


Asynchronous programming has become a core practice over the last decade. One-thread-per-request architectures have disappeared for the most part and have been replaced by non-blocking functionality, whether a Servlet 3 style @Suspend, NIO actor support via Akka or coding at a lower level against the Netty framework. Of course there are many advantages to asynchronous programming, most centered around resource efficiency and avoiding CPU context switches - but that’s a topic that has already been very well covered many times over, so I’ll avoid a detour right now.

Python has had some great asynchronous options for a long time. Gevent, Twisted and Tornado have all had, and continue to have, a great community to support them. Each being a standalone library, they take their own opinionated approach that is somewhat incompatible with the others. For example, Tornado is specifically targeted at web application programming whereas Gevent and Twisted are more general concurrency libraries for interleaving non-blocking ‘threads’.

Enter async/await. Python 3 seems to finally be gaining the traction it deserves to help it overthrow the long-standing Python 2 dominance. One major evolution, added fairly recently (late 2015), is the async/await syntax on top of the native asyncio library, giving us coroutine support directly within the core language.

Let’s spend a little time checking out examples of this exciting new feature, but first - what the heck are coroutines?

Programming with Coroutines

So what are coroutines? They’re not something I’d ever come across in the Java world, and they take a little getting used to. Here’s the formal definition:

Coroutines are computer program components that generalize subroutines for non-preemptive multitasking, by allowing multiple entry points for suspending and resuming execution at certain locations. Coroutines are well-suited for implementing more familiar program components such as cooperative tasks, exceptions, event loops, iterators, infinite lists and pipes.

Umm, … so … did that make sense? I gather that it’s something about pausing and resuming execution, but that’s about it.

Let’s try a real world analogy - picking up some dinner from your favourite takeaway.

Arriving at the takeaway you politely join the queue of customers (the coroutines) waiting in line to get served by the cashier. For simplicity’s sake, let’s assume there is only one cashier able to serve you. This person represents the main process, or event loop.

Each customer is served by the cashier one at a time. He/she takes your order, money, and gives you a little ticket representing the meal being prepared behind the scenes. The process of cooking your food represents external I/O, such as a network call or filesystem access.

Now, in a blocking I/O world, you and the cashier would awkwardly eyeball each other for the full length of time it took to cook your food - the other customers looking on in frustration. By using coroutines, rather than keeping the poor cashier occupied, you put your interaction on hold and go wait quietly in the corner so that another customer can be served in the meantime. When your food is ready to go the cashier is notified by the kitchen and resumes your interaction to complete the process.

In a similar way coroutines yield their control of the main thread when they encounter a blocking operation and then resume execution once that operation completes. This allows for the core process to remain active as long as there is valuable work to be done, rather than blocking important tasks whilst waiting on external factors.

Now for some specific examples.

Hello World of Asyncio

Let’s start off with the obligatory ‘hello world’.

import asyncio

async def hey():
    print("Hello World!")

loop = asyncio.get_event_loop()
loop.run_until_complete(hey())
loop.close()

The function itself is pretty mundane although it is marked with the async keyword to denote its nature. To get access to the event loop to execute this special type of routine, we grab it from asyncio and ask it to run our coroutine to completion.

The example below is slightly more interesting, showing how coroutines can interact with and utilize one another.

import asyncio

async def excl():
    return "!"

async def world():
    return "World" + await excl()

async def hello():
    print("Hello " + await world())

loop = asyncio.get_event_loop()
loop.run_until_complete(hello())
loop.close()

Here we are introduced to the await keyword. This allows one coroutine to yield its program flow until we get the result from another asynchronous operation, at which point we can resume operation. Rather simply, here we are just chaining the invocation and resolution of a series of coroutines.

Fibonacci Time and Async Decorators

Now for something a bit meatier - recursion and decoration!

import asyncio

def memo(fn):
    cache = {}
    async def wrap(n):
        if n in cache:
            print("found %s in cache" % (n))
            return cache[n]
        res = await fn(n)
        print("calculated %s" % (n))
        cache[n] = res
        return res
    return wrap

@memo
async def fibn(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    return await fibn(n-2) + await fibn(n-1)


loop = asyncio.get_event_loop()
c = loop.run_until_complete(asyncio.gather(fibn(4), fibn(4), fibn(4)))
print(c)
loop.close()

First, take a look at the fibn function. This is a good old recursive fibonacci, however the delegate calls are made asynchronously. Of course, this gives no real benefit, but does prove we can invoke coroutines in a recursive fashion.

More interestingly we have implemented a memoizing decorator that will cache the result of an asynchronous call. We have to be careful to declare the wrapper function (as created within the decorator) as async so that we can await completion of the decorated function.

But Who Cares About Async Fibonacci!?

Playing with these toy functions is all very well, but what bearing does this have on helping us solve real world problems?

TBH, the list of libraries supporting async/await functionality right now (at least at the time of writing) seems to be fairly limited. Obviously the intention is that over time this will become a much more richly supported and standardized way of writing asynchronous code with the core Python language. Although, having said that, as long as you’re using fairly common technology you’ll likely find at least one driver that supports this approach.

In this final example we can see how easy it is to wire up an asynchronous application end-to-end - a REST service that reads and writes asynchronously to mongodb.

import json
from aiohttp import web
from motor.motor_asyncio import AsyncIOMotorClient

client = AsyncIOMotorClient('mongodb://mongodb:27017')
db = client.demo

async def retrieve(req):
    id = req.match_info.get('id')
    result = await db.stuff.find_one({'_id': id})
    if result:
        return web.Response(body=json.dumps(result))
    return web.Response(status=404)

async def save(req):
    id = req.match_info.get('id')
    json = await req.json()
    json['_id'] = id
    result = await db.stuff.replace_one({'_id': id}, json, upsert=True)
    status = 200 if result.matched_count >= 1 else 201
    return web.Response(status=status)

app = web.Application()
app.router.add_get('/stuff/{id}', retrieve)
app.router.add_put('/stuff/{id}', save)
web.run_app(app)

There are a few notable points in this small application.

  • First, the handling of requests themselves: our handlers, retrieve and save, are specified as coroutines so that the HTTP server doesn’t have to block a thread whilst waiting for the end-to-end response to complete.
  • Even trivial I/O, such as reading the request payload from the socket (await req.json()), is handled in an efficient way.
  • As expected, any database access (the find_one and replace_one calls via the Motor driver) is handled in a non-blocking fashion.

So pretty simple stuff, but it all seems to fit together in a sensible and consistent manner. It’ll be interesting to see how this feature of Python evolves both within the language and across the ecosystem over the next few years.

Happy non-blocking!


Microservices Part 2: Breaking Up That Monolith


This post is part of a larger series on the challenges commonly encountered whilst adopting and running a microservice style architecture. For further entries in this series please check out the following links:


One of the most important problems you’ll encounter whilst developing and evolving a microservice architecture is that of dividing up an existing monolith or domain into a number of well defined and decoupled entities.

So how can we divide and conquer in a sensible manner? First let’s look at one of the approaches taken by Domain Driven Design - specifically that of Bounded Contexts.

Bounded Context

When developing any application you spend much of your time modeling the real world that it is designed to serve. The terminology that emerges out of this process generally becomes accepted across the development team as a whole, forming the unified domain model for your business. These same concepts get encoded as objects with various states and behaviors inside your application.

However, once the application reaches a certain size it becomes increasingly difficult to stretch these models to cover all aspects of the business domain. For example, an ‘account’ will likely mean something very different to the billing department than it does to a team geared towards managing security. This can lead to a confusion of responsibilities within the model, and whilst teams will generally develop using the same terminology, in reality they often mean very different things.

A solution proposed by the Domain Driven Design methodology is to divide up our unified model. This approach improves upon the above by chopping these conflicting concerns into a number of separate areas - each a bounded context. This allows for coherent discussion and clear modeling to take place within the bounds of a context, adhering more closely to the single responsibility principle. It also allows us to map out the relationships between each bounded context so that the interactions between them are more clearly defined.

To continue our ‘account’ example, we would split both ‘billing’ and ‘security’ into different bounded contexts. We are then able to reason about the concerns of the model separately - the billing team concerning themselves more with payments, whilst security with any rights or permissions given to an account.

Bounded Context Example

It follows from this that the way we break up our application into services corresponds very naturally to the bounded contexts we define. Taking this approach allows us to reduce the amount of knowledge that any one team has to keep in mind, as the focus only has to be within a specific context. This also leads to more cleanly separated entities, which should reduce external dependencies and simplify over-complicated chatter.

In short, a service should correspond to a single business domain, and not cross boundaries.

Things that Change Together Stay Together

The ability to separate business concerns into neat little packages that can be managed and worked on separately is what enables all that dividing to actually conquer your monster scalability problem. Let’s quickly review two fairly simple but fundamental concepts of good software design:

Cohesion refers to the degree to which the elements inside a module belong together. Thus, cohesion measures the strength of relationship between pieces of functionality within a given module. For example, in highly cohesive systems functionality is strongly related.

Coupling is the degree of interdependence between software modules; a measure of how closely connected two routines or modules are; the strength of the relationships between modules.

After revisiting these concepts it becomes clearer that microservices should be highly cohesive but loosely coupled. Concepts and functionality that are strongly related need to be kept within the bounds of the same service. Conversely, more weakly related concepts should tend to exist within separate modules.

An architecture that keeps correlated concerns together and pushes unrelated concepts apart allows for more robust development and deployment strategies. Due to this loose coupling of components, the impacts of one service on another should be simplified to allow for both parallel development (with fewer blocking elements between teams) and an independent deployment lifecycle - i.e. avoiding the need to orchestrate delivery between whole suites of services.

If you find that the addition of a feature to your application requires a tough co-ordination effort, then you should consider whether your microservices really are cohesive and loosely coupled, or whether abstractions and business logic are crossing several boundaries. If several services are always changing in step with one another then the real question is whether they’re partitioned well enough. If the answer seems to be no, then it’s time to start merging and refactoring those components to achieve a smarter separation.

Strangler Pattern

It’s very rare that a greenfield project lands in your lap. Unfortunately for us developers, most of our days are spent maintaining and evolving existing systems. This is definitely something that microservices (done well) can help you tackle more easily in future, but let’s come back to today’s reality. How can we sensibly refactor an existing ball-of-mud application into a more manageable architecture?

Beware the Big Rewrite

After decades of high profile failures it has generally become a natural intuition of software professionals to avoid “The Big Rewrite”, but it’s worth a cautionary mention anyhow.

Second System Syndrome refers to the common outcome of replacing a profitable, but flawed, system with a complete rewrite that generally misses much of the point of what made the original system successful in the first place.

When it seems to be working well, designers turn their attention to a more elaborate second system, which is often bloated and grandiose and fails due to its over-ambitious design. In the meantime, the first system may also fail because it was abandoned and not continually refined.

What is the ‘Strangler Pattern’?

The Strangler Pattern is a method of slowly wrapping and replacing an existing system, usually a monolith, in a slow and methodical fashion. It is named after the strangler fig vines found in tropical climates. These vines slowly grow upon an existing tree, eventually covering (and effectively replacing) the host.

This same pattern can be applied in evolving a piece of software. One by one, each part of the application (potentially identified by a bounded context) is refactored into a new service and spun out on its own. A façade provides the single entry point to your API disguising those parts of the app that have been migrated vs those which are still waiting in line for attention.

This iterative process gives us many benefits, including:

  • Keeping each refactoring manageable due to its small, well-defined context
  • Constant validation of the new functionality vs the old in a real-world, production, scenario
  • Ability to handle failure more gracefully due to a rollback being as simple as redirecting the façade’s requests

Next Time

Next time, we’ll look into the communication patterns available to connect your microservices together in a maintainable and robust way.

References


Microservices (Mind the Gap) Part 1: An Introduction


This post is part of a larger series on the challenges commonly encountered whilst adopting and running a microservice style architecture. For further entries in this series please check out the following links:


Introduction (aka ‘The Positive Bit’)

The microservice architectural style seems to be continuing with its ever-rising popularity, and with good reason. There is a lot to be gained from adopting this model to ensure that your large application can be developed at a fast pace, scaled appropriately, and delivered to production with higher frequency and less overall risk.

Let’s discuss some of the key benefits of a microservice architecture.

Scaling Your Development Team

Monolithic applications aren’t actually a bad thing (despite the constant bad press), but once your product reaches a certain size you will likely hit several scalability challenges. The first of these will likely come as an organizational challenge rather than anything grounded in technology.

Many of us have worked on one of those humongous Java applications that contain hundreds of thousands of lines of code, get deployed into an enterprise application server (such as JBoss or WebSphere) and are supported by some large RDBMS. To keep up with the competition teams will need to develop features on this monolith in parallel, and here’s where the problems creep in. Merge car crashes, leaky modules and an unwieldy test framework soon make the application a nightmare platform on which to develop.

Microservices to the rescue! When we no longer have one gigantic project, but instead many small cleanly partitioned pieces, then work on various features is also partitioned. This gives us cleaner development in an isolated and much more manageable development environment.

A common, but excellent way, to picture this difference is to imagine a number of workers chipping away at a large boulder. In the monolithic case they struggle to gather around the surface well enough to get their work done without bashing away at the hands of the person next to them. When broken down into smaller pieces the surface area is greatly increased, and the workers can happily chip away in much greater comfort without bloodied hands and broken thumbs.

Scaling Your Application

When your killer app does eventually go viral and needs to be scaled horizontally with a monolith you have no option but to ramp up ‘everything and the kitchen sink’ regardless of whether it’s a bottleneck or not. This can clearly result in a great deal of wasted resource to support the redundant parts of the deployment that have hitched along for the ride.

In the case of microservices, because they’re already nicely chopped up into sensible parts, then we can scale out only the bits that we need, allowing us to much more finely tune our platform to the traffic hitting it.

Reducing Deployment Risk

In a similar vein to the above, the divide and conquer approach of microservices can also, if done correctly, make your deployment simpler and more stable.

In the case of a monolith any upgrade will, by its nature, include all changes made to the monolith since the last release. If something were to go wrong during the upgrade then the cause could come from any number of new features, not to mention the sheer complexity you hit in just getting a large application up and running in the first place.

By splitting out the moving parts we can also separate out their deployment. Smaller, more controlled deployments result in less surprise and less uncertainty, and make quickly reverting back to a known state a more welcoming prospect than rolling back a huge upgrade.

Upcoming: The Spooky Part

But I’m not here to blindly praise the wonders of a microservice architecture - and admittedly there are many potential benefits available to you - but rather I’m here to start the discussion around the scary bits, the parts that can keep you up at night in a cold sweat and make your work-day feel like herding wildcats around a data centre.

A microservice system is a distributed system, and distributed systems are hard! Over the next few posts in this series we’ll consider a number of different areas in which you must become proficient to avoid falling into one of the myriad traps surrounding a microservice architecture. More helpfully, we’ll also consider a number of common strategies to help you traverse this landscape safely. Specifically these topics will include subjects such as: strategies for breaking up your services, evolving them independently, and handling failure elegantly.

Next Time

We’ll look at arguably the most important step in a microservice architecture - strategies to effectively break up your application into a number of well designed microservices, whether decomposing that monolith or starting over with a clean slate.

References


Python Development with Vagrant & IntelliJ

I’m mostly a JVM language developer who loves the toolset that comes from the JVM world, especially that of a good IDE. Don’t get me wrong I’m partial to a bit of emacs, but I spend most of my time, these days, staring into the awesome Darcula black of IntelliJ.

So when it comes to developing a bit of Python I want to try to keep my foundations as comfortable as possible without compromising the new environment. More specifically this comes from developing a Linux-based Python app whilst on a Windows machine. With the power of Python stemming from its C-based libraries you lose some of that platform independence that you otherwise come to take for granted.

Enter Vagrant (let’s ignore Otto for now). Vagrant enables users to create and configure lightweight, reproducible, and portable development environments. This is great for development targeted at one platform whilst coding on another - a flexible sandbox that you can build up and tear apart with the execution of a single command.

Also, helpfully, IntelliJ has some great bindings to allow us to develop with ease on a Vagrant environment. This is what we will quickly explore.

vagrant up

To get started with Vagrant you just need to have it and VirtualBox (or similar) installed. Then a simple configuration Vagrantfile is the only hard work we need to put in. Here’s one I made earlier:

Vagrant.configure(2) do |config|

  config.vm.box = "bento/centos-6.7"

  config.vm.network "forwarded_port", guest: 8080, host: 8080

  config.vm.provider "virtualbox" do |vb|
    vb.gui = false
    vb.name = "Intellij Vagrant Test"
  end

  config.vm.provision "shell", path: "setup.sh"

end

Nothing much to it: just pick an image, forward some ports, change a little VM config and run the setup script once the environment has booted.

All the setup.sh script does is a little work to smooth out Python development.

rpm -iUvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum -y install python python-pip gcc python-devel git
pip install --upgrade setuptools
pip install gevent

Now we just run vagrant up, make some tea, and we’re in!

Hooking in IntelliJ

Connecting up IntelliJ is actually pretty simple too. Just create a new project as usual and when the time comes create a new “Remote Interpreter”.

Choose the project base (the folder containing Vagrantfile) as your Vagrant Host URL and the rest should be done automatically.

Once hooked in we can see details about our guest environment, such as the libraries that we have installed. To do this specifically navigate to Tools -> Manage Python Packages

Throwing Together a Quick App

Now let’s build a very small gevent wsgi app to demonstrate how the environment works - we’ll just create a very lightweight “hello world” service.

First thing, setting up the tests. As we haven’t used this environment before, some of the supporting tools, mocking frameworks, etc, will be missing from the environment. That’s ok, IntelliJ makes it easy to just highlight the red parts of your code and automatically install those libraries. Check it out below with the mock framework.

Now for some example test code:

import unittest
import mock
import hello.server

class TestServer(unittest.TestCase):

    def test_default_response(self):
        response = hello.server.hello_world({}, mock.MagicMock())
        output = next(response)
        self.assertEqual("<h1>Hello World!</h1>", output)

To run the test is no different than usual. Just kick off as a unit test and the remote interpreter we configured earlier will take care of the details of running the code “remotely”.

Of course, our test fails, as it should, at this point

Once we’ve developed our wsgi handler the test then passes.

from gevent import pywsgi

def hello_world(environ, start_response):
    # 'text/html' is the intended content type (the original 'text-html' was a typo)
    start_response('200 OK', [('Content-Type', 'text/html')])
    yield "<h1>Hello World!</h1>"

server = pywsgi.WSGIServer(
    ('', 8080), hello_world
)

if __name__ == "__main__":
    server.serve_forever()

Finally, what about running the application and debugging it? - again, IntelliJ does a great job of hiding all those abstractions.

Just run/debug a program as you usually would, and thanks to our port mapping we can play around with the app as if it was running natively on the host rather than in a Linux VM.


#7FaveGames - Creating a Word Cloud from 346,859 Tweets

Over the last few days the hashtag #7FaveGames has been trending on
twitter. As you can probably guess you’re supposed to list your top seven games of
all time and put them out to the world to be judged!

Of course, I jumped straight into this trend and submitted my own faves - if
you’re interested, my choices are in the banner of this post.

Whilst trawling my own gaming history, I started to
wonder about the most popular games overall. A great excuse to
pull out the R analytics toolkit and do some amateur data science.

Gathering all those Tweets

The first step was to gather as many #7FaveGames tweets as
possible without crossing the boundaries set by the twitter API (180
requests per 15 minutes).

I decided to store all tweets with their metadata in a CSV file. This
makes the data much easier to explore and reload - specifically for
the numerous cases where I accidentally trashed my in-memory copy.

Initialize CSV Store

Initially we want to set some parameters and pull the first
batch. This initial chunk is used to create the CSV file to which we
will append during the rest of the process.

# twitteR provides searchTwitter; plyr provides rbind.fill
library(twitteR)
library(plyr)

# define batch size
batch <- 500
sleepPeriod <- 10

# pull first batch and initialize output file
tweets <- searchTwitter("#7FaveGames", n=batch, retryOnRateLimit=batch)
parsed <- do.call("rbind.fill", lapply(tweets, as.data.frame))
write.table(parsed, file="7favegames.csv", sep=",", append=TRUE, col.names=TRUE)

Grab the Rest of the Data

Next we just paginate, in descending chronological order, through
the rest of the data using the maxID key to shift the query
window. The last id in the current batch is used as the maxID in the
next to ensure a contiguous set of tweets.

We just keep looping through until we hit a batch smaller than
expected, or an NA id.

Admittedly it’s a pretty basic approach, but it does the job … in a
few short hours …

# while we still have tweets to read, pull batches
# (with respect to twitter so I don't get kicked)
pull <- TRUE
while (pull) {
  # grab the last id of the previous batch
  id <- parsed$id[batch]
  print(c("Getting batch with id ", id))

  # perform a paginated search on twitter
  tweets <- searchTwitter("#7FaveGames", n=batch+1, retryOnRateLimit=batch, maxID=id)
  parsed <- do.call("rbind.fill", lapply(tweets, as.data.frame))

  # strip the first record to avoid duplicates, as it was the last in the previous batch
  parsed <- parsed[-(1:1),]
  write.table(parsed, file="7favegames.csv", sep=",", append=TRUE, col.names=FALSE)
  pull <- (nrow(parsed) > 1) || (!is.na(id))

  # just pause for a moment to avoid hammering the twitter api
  # TBH the 'retryOnRateLimit' would probably do the job better anyway
  Sys.sleep(sleepPeriod)
}

Building a Word Cloud

Which Libraries?

First we import a bunch of libraries that we will need.

  • tm and SnowballC for text mining and transformation
  • RColorBrewer to pick a color palette
  • wordcloud to plot the cloud itself
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

Change Encoding

We read in the data we collected in the last step. For some unknown reason the
data is coming back in an odd
encoding. We need to force it to UTF-8 to avoid the “tm” algorithms
crashing out half-way through.

data <- read.csv("7favegames.csv", row.names=NULL, stringsAsFactors = FALSE)
tweets <- data$text
tweets <- enc2utf8(tweets)

Convert “Odd” Characters and Numbers

Many of the video game titles include roman numerals (e.g. Final
Fantasy X) or accented characters (e.g. Pokémon). We want to simplify
these down to a common form so we don’t end up with duplicates
(such as Pokémon vs Pokemon).

The roman numeral converter is pretty primitive but, I think, fine for this
purpose. The odd ordering of the numbers is so that we’re greedy in
our matching of independent roman numerals (i.e. we don’t match a
substring of a numeral and corrupt the value).


replaceRoman <- function(vect) {
  numerals <- c("xviii", "xvii", "xiii", "viii", "xiv", "xvi", "xix", "vii", "iii", "xii", "ii", "iv", "vi", "ix", "xi", "xv", "xx", "i", "v", "x")
  numbers <- c(18, 17, 13, 8, 14, 16, 19, 7, 3, 12, 2, 4, 6, 9, 11, 15, 20, 1, 5, 10)
  for (i in c(1:length(numerals))) {
    vect <- gsub(paste("\\b", numerals[i], "\\b", sep=""), numbers[i], vect)
    vect <- gsub(paste("\\b", toupper(numerals[i]), "\\b", sep=""), numbers[i], vect)
  }
  return(vect)
}

tweets <- replaceRoman(tweets)
# remove accented characters
tweets <- iconv(tweets, to='ASCII//TRANSLIT')

Perform Corpus Transformations

This is where the bulk of the transformations occur. We first create a
text corpus from our tweets and …

  • Convert to lower case
  • Remove any punctuation
  • Remove stop words, such as “the” and “and”, plus those specific to tweets (“rt” and “lol”) and to video games (e.g. “super”, to avoid counting a word common to all those SNES games).
corpus <- Corpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c("7favegames", "amp", "rt", "lol", "follow", "twitter", "series", "new", "super", stopwords("english")))

Plot Word Cloud

Finally, now that we have our simplified corpus, we can plot our word
cloud.

We choose a fairly solid color scheme, avoiding too many of the weaker
shades that would be hard to read. You can see the full selection of palettes by running display.brewer.all()

# plot word cloud
pal <- brewer.pal(8,"Dark2")
wordcloud(corpus, max.words = 200, random.order=TRUE, scale=c(5, .5), colors = pal)

The End Result

And here’s the end result:

7favegames Word Cloud

Clearly Mario and Pokemon are the most popular - no real surprise there -
but it’s interesting to see other titles such as Dark Souls, Bioshock
and Bayonetta pop up. There are a couple of newer titles that make an appearance too, such as
Overwatch and Rocket League, but this may be partially due to some
kind of recency bias.

Admittedly, the model does have some fundamental flaws, including:

  • Not accounting for the difference between a single game and a
    series. For example, the Final Fantasy saga will be over-represented
    due to the sheer number of releases.
  • Titles with more complicated names, or names commonly shortened to
    acronyms, will be under-represented because the algorithm cannot tie
    the variants together.
  • Tweets that repeat a single game over and over will add unfairly to
    its weight.

But regardless, it’s a fun experiment to run, and does give us some
idea of the relative popularity of the various titles.


Zero Downtime Upgrade Problems and Patterns


In the modern software landscape it is increasingly important, even expected, that services can be upgraded without causing downtime for the end user. In an industry such as IPTV, making the service unavailable for any period of time, even in the middle of the night, is generally unacceptable.

This page will survey the problems that various technologies face in this regard, and detail common patterns for tackling them.


It’s worth noting here that most of these approaches are essentially just variations on the Blue Green Deployment strategy described by Martin Fowler.

Stateless Systems / Considering Only the Application Layer

A stateless system, that is one without any kind of persistent store, is the simplest case we can consider for a zero downtime upgrade. Considering the application layer in isolation, it also covers many of the principles that can be reused in all the other cases.

Remove-Upgrade-Add

The most obvious strategy, and the one that is probably most widely used, is that of removing the service from a load balancer, upgrading that service, and once upgraded re-enabling it. Of course this assumes you have some redundancy in your platform to enable you to take parts of it down whilst still serving traffic.

Remove Add Upgrade Animation

Unfortunately, this is a pretty coarse approach to upgrading, and is generally carried out manually by some poor engineer in the middle of the night. Since manual is usually synonymous with error-prone, automating this would be desirable, but that isn’t trivial in a traditional bare-metal deployment: managing running services and state over potentially many nodes is quite tricky in itself.

Fixing the Restart Problem

Master - Workers Model

World class http servers (e.g. nginx, unicorn) and process managers (e.g. node’s pm2) have inbuilt features that allow for zero-downtime upgrades within the context of a single machine, avoiding the tedium of the above remove-upgrade-add cycle but achieving the same end goal (remember we’re still talking application layer only, so no state to worry about).

The core architecture that allows them to achieve this is a master-workers model. In this model we have one core process performing all privileged operations (such as reading configuration or binding to ports) and a number of workers and helper processes (the app itself). The point here is that the workers can be switched out behind the scenes, with the master effectively acting as a software load balancer.
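
To make this concrete, here’s a minimal, hedged sketch of the pre-fork idea in Python (not nginx’s or pm2’s actual implementation, and Unix-only since it relies on fork): the master binds the listening socket once, the forked workers all accept connections on it, and a real master would react to signals by forking fresh workers running the new code and retiring the old ones.

import os
import socket

def worker(listener):
    # each worker accepts connections on the shared listening socket
    while True:
        conn, _ = listener.accept()
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        conn.close()

if __name__ == "__main__":
    # the master performs the privileged work once: binding the port
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", 8080))
    listener.listen(128)

    for _ in range(2):
        if os.fork() == 0:
            worker(listener)  # the child becomes a worker and never returns

    # the master only supervises; a real implementation would handle signals
    # here to spawn workers running the new code and gracefully retire the old
    while True:
        os.wait()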

Inside Nginx - how we designed for performance/scale

Nginx on the fly upgrade

“Nginx’s binary upgrade process achieves the holy grail of high-availability – you can upgrade the software on the fly, without any dropped connections, downtime, or interruption in service.

The binary upgrade process is similar in approach to the graceful reload of configuration. A new NGINX master process runs in parallel with the original master process, and they share the listening sockets. Both processes are active, and their respective worker processes handle traffic. You can then signal the old master and its workers to gracefully exit.”

There are tools that are able to achieve this natively for your own project (the aforementioned pm2) and also as a generic solution across various platforms (socketmaster, einhorn).

You could also achieve the equivalent of this architecture by running both versions of your app in parallel on the same node, and using an nginx reload to seamlessly switch the configuration between the two.

so_reuseport

The SO_REUSEPORT option was introduced in Linux kernel 3.9. As long as the first server sets this option before binding its socket, then any number of other servers can also bind to the same port, provided they also set the option beforehand - basically no more “Address already in use” errors. Once enabled, the SO_REUSEPORT implementation will distribute connections evenly across all of the processes.

Of course, the implication here is that we can simply spin up multiple versions of our application, an old and a new, drain the old of requests and, once drained, shut it down.
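
As a rough sketch (assuming Python 3 on a Linux 3.9+ kernel, and not tied to any particular framework), each version of the app just sets the option before binding; start the script twice and both processes will share port 8080:

import socket
import socketserver

class Handler(socketserver.BaseRequestHandler):
    def handle(self):
        # trivial response, just enough to see which process served the request
        self.request.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")

class ReusePortServer(socketserver.TCPServer):
    def server_bind(self):
        # both the old and the new process set SO_REUSEPORT before bind(),
        # so they can listen on the same port side by side
        self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        super().server_bind()

if __name__ == "__main__":
    ReusePortServer(("", 8080), Handler).serve_forever()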

Stateful Systems / Evolving the Database

It’s when we get to stateful systems that we begin to hit problems with zero downtime upgrades, or live migration. There is much to consider here, as we have various data-store properties and numerous different use cases, especially in a micro-service model where each service is tuned in a very specific way. Inevitably, most of the available patterns do not fit all circumstances.

First let’s consider the various dimensions along which this problem can vary:

Mutability of Data

  • Read only systems - metadata retrieval, or notification systems
  • Write only systems - immutable data stores, event sourcing systems
  • Read-Write systems - account management, bookmarking, etc

Data Store Properties

  • Schema based DB system - Oracle, Cassandra
  • Schema-less - MongoDB, Redis
  • Archive (essentially non-interactive) data store - e.g. HDFS with Avro

Evolution Type

  • Addition of a field
  • Removal or rename of a field
  • Removal of a record
  • Refactoring of a record - e.g. splitting into multiple

Polymorphic Schema / Lazy Migration

A schema-less database, such as MongoDB, can support a pattern known as a “Polymorphic Schema”, also known as a “single-table model”. Rather than all documents in a collection being identical, and so adhering to an implicit schema, the documents can be similar, but not identically structured. These schemas, of course, map very closely to object-oriented programming principles. This style of schema feeds well into a schema evolution plan.

MongoDB Applied Design Patterns

A traditional RDBMS will evolve its schema through a set of migration scripts. When the system is live, these migrations can become complex or require database locking operations, resulting in periods of extended downtime. This model is also possible in MongoDB and similar stores, but we have alternatives that may suit us better.

By using a polymorphic schema we can account for the absence of a new field through defaults. Similar behavior can be implemented for field removal also, and these kinds of transformations are handled elegantly by Mongo ORM libraries if you want to avoid coding this logic yourself.

By applying these transformations lazily, as records are naturally read and written back to the database, the schema is slowly transformed into the latest version. By running through all records in a collection, we would be able to force a full upgrade of the whole collection.
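
As a minimal sketch of this lazy approach (assuming pymongo and a hypothetical users collection that is gaining a new marketing_opt_in field), the default is applied on read and the upgraded shape is persisted whenever the document is naturally written back:

from pymongo import MongoClient

users = MongoClient("mongodb://localhost:27017")["app"]["users"]

def load_user(user_id):
    doc = users.find_one({"_id": user_id})
    if doc is None:
        return None
    # account for the absent field with a default rather than a migration script
    doc.setdefault("marketing_opt_in", False)
    return doc

def save_user(doc):
    # writing the document back completes the migration for this record
    users.replace_one({"_id": doc["_id"]}, doc)

def migrate_all():
    # forcing a read-write of every record completes the migration eagerly
    for doc in users.find({"marketing_opt_in": {"$exists": False}}):
        doc["marketing_opt_in"] = False
        save_user(doc)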

Of course there are other, more specific, challenges depending on the type of transformation required:

  • How do we perform collection refactoring exercises (e.g. splitting out a single collection into multiple related collections) - would a _version field on an object allow us to trigger migrations more effectively in this case?

  • How would we perform queries based upon that new field? - by forcing a read-write of all records, and therefore completing the transformation whilst the system was still alive?

Forward Only Migrations

For database systems that enforce a schema - though the principle is equally applicable to all persistence stores - there is the concept of forward only migrations.

Never roll back == No messy recovery

In this practice, every database migration performed should be compatible with both the new version of the code and the previous one. Then, if you have to roll back a code release, the previous version of the code is perfectly happy running against the new version of the schema.

As you can imagine, this will require some strict adherence to convention. For example, dropping a column:

Dropping a Column

  • Release a version of the code that doesn’t use the column.
  • Ensure that it is stable, and won’t be rolled back.
  • Do a second release that has a migration to remove the column now that it isn’t being used.

Rename a Column

  • Release a version that adds a new column with the new name, changing the code to write to both but read from the old.
  • Employ a batch job to copy data from the old column to the new column - one row at a time to avoid locking the database.
  • Release a version that reads from and writes to the new column.
  • In the next release, drop the old column.

This technique seems most successfully employed in a continuous delivery environment. A rough sketch of the batch-copy step follows.
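
Here is a hedged sketch of that batch-copy job, using sqlite3 and a hypothetical users table in which fullname is replacing name; the releases before and after this job write to both columns and then to the new column only, exactly as in the recipe above.

import sqlite3

def backfill_fullname(db_path, batch_size=500):
    conn = sqlite3.connect(db_path)
    while True:
        # copy a small batch at a time so we never hold a long lock
        rows = conn.execute(
            "SELECT id, name FROM users WHERE fullname IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            "UPDATE users SET fullname = ? WHERE id = ?",
            [(name, row_id) for row_id, name in rows],
        )
        conn.commit()

if __name__ == "__main__":
    backfill_fullname("app.db")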

Expansion / Upgrade / Contraction Scripts

This method is similar to the last, but it’s organized slightly differently. Here we use two different sets of migration scripts:

Expansion Scripts

  • Safely apply changes to the documents that do not break backwards compatibility with the existing version of the application.
  • e.g. adding, copying, merging or splitting fields in a document.

Contraction Scripts

  • Clean up any database schema that is not needed after the upgrade.
  • e.g. removing fields from a document

The process we perform here is:

  1. Expand - Run expansion scripts before upgrading the application
  2. Upgrade - Upgrade the cluster one node at a time
  3. Contract - Run the contraction scripts once the system has been completely upgraded and deemed stable; typically these can be run days or weeks after complete validation.

Again, this does not require any DB rollback. The motivation is that reversing DB changes can lead to data loss or leave your system in an inconsistent state. It is safer to roll back the application without needing to roll back DB changes, since expansions are backwards compatible.

One Big Caveat for the Above

Most of the above solutions require that upgrades are nice and discrete. That is, they move from version to version in an incremental way, not skipping any releases so that we can control schema compatibility in a pair-wise fashion.

This is all fine for a continuous delivery, or cloud-based, shop where upgrades are obviously linear and wouldn’t be applied any other way. However, what if you have a user base of several parallel deployments, all leap-frogging over various versions?

One solution is simply to enforce your upgrade path more clearly. So, for example, any app-only change would be a simple revision bump. But any schema change that needs to be handled by one of the above methods would be a larger bump.

It would have to be mandatory to hit every one of these larger version numbers to preserve this incremental schema-change model.

Immutable Infrastructure - Docker and Friends

In the world of containers, deployments are immutable, so a number of the above techniques are not applicable. But the same blue-green principles apply.

HAProxy is a web proxy and load balancer that allows you to change the configuration, restart the process, and (most importantly) have the old process die automatically when there is no more traffic running through it. So a simple upgrade would go as follows:

  • Start a new version of the container
  • Tell HAProxy to direct all new traffic to the new container whilst maintaining existing connections to the previous.
  • All new users get access to the new code immediately, whilst existing users are not cut-off from their in-progress requests.

Of course, this can all be automated, but that’s for another time.



Survivorship Bias and Negotiating Tech Hype

Bombers with Bullet Holes

During the dark days of World War II the American military presented
their Statistical Research Group with a problem. They wanted to add additional armour to
their bombers, but clearly couldn’t put the armour everywhere because
of the additional weight it would add to the planes. The group was
tasked with working out how much armour to allocate to the various regions
of the aircraft to maximize defense whilst minimizing any effect on
fuel consumption and agility.

Engineers inspected a number of bombers that had seen some action. These
planes had bullet holes that were distributed mainly across the wings
and body. Comparatively the engines and cockpit had much less
damage. This had led the commanders to draw the obvious, but foolish, conclusion that
they should enhance armour on areas that had been hit most
frequently, namely the fuselage and wings.

One of the many geniuses of the group, Abraham Wald, realised that they
were looking at the problem from completely the wrong angle. It wasn’t
that planes weren’t being hit as frequently on the engines and cockpit,
but rather that those that had been hit there never returned to tell the
tale! These were the parts of the airplane that needed enhancement, not
the areas that could take a battering and still survive.

Survivorship Bias

How do you really evaluate a success when the failures are nowhere
to be seen?


Countless articles, books and documentaries have been produced about
successful people and how to capture the principles of their success to
improve your own fortune.

Consider Steve Jobs - frequently heralded as one of the greatest
geniuses of our time - how do we emulate his success? Clearly dropping
out of college, spending time at meditation retreats and starting a
business from your parent’s garage is the way to go. But what about
the hundreds of thousands of budding Apple founders for whom this
strategy never quite worked out?

Books aren’t usually written about failed enterprises,
just the rare, billion-dollar, success stories.

Choosing Tech Thoughtfully

As well as driving the latest diet fads and the exaggerated
performance of mutual funds, this bias can be seen lurking in certain
corners of the software development world.


How often do you see a wave of enthusiasm for the next high
throughput NoSQL
system or a push for complicated elastic scaling technology? Companies
such as Twitter and Netflix present their wild successes but we don’t usually see qualifications on the size and
scale of the teams implementing these solutions.

It’s worth keeping in mind the potential for a mass of silent teams,
struggling under the weight of overpowered, over-engineered, “web-scale”
technologies inspired by the industry front-runners. Most of us mere mortals just don’t
have the resources, skills, or (most importantly) even the need for
such high class deployments. Most of the time it’s just better to keep
things simple and known.

Similarly, businesses push on with “Big Data” for fear of missing
out, but without any real understanding of what they really
need. Solutions are commissioned that aspire to the heights of Facebook and Google
whilst, in reality, fumbling for a use-case that provides real business value.

Don’t get me wrong, I’m 100% all for learning new paradigms, languages
and frameworks. This is just a reminder, as much to myself as anyone
else, to take a moment to think past the biases that may lead us to
make some regrettable, albeit well-intentioned and over-excited, choices.



Augeas the Missing Manual

Lately I’ve been using Augeas to configure some json files, and phew! it’s not been easy! So here’s a little guide to help other unfortunate souls lost in the Augeas wilderness.

1) First let’s start with a pretty basic json file located at /tmp/test.json:

{ "a": 1 }

2) Load

Loading the file with the json lens is easy enough once you know how:

augtool> set /augeas/load/Json/lens Json.lns
augtool> set /augeas/load/Json/incl /tmp/test.json
augtool> load
augtool> print /files/tmp/test.json

/files/tmp/test.json
/files/tmp/test.json/dict
/files/tmp/test.json/dict/entry = "a"
/files/tmp/test.json/dict/entry/number = "1"

3) Set

As is changing the existing property:

augtool> set /files/tmp/test.json/dict/entry[. = "a"]/number 2
augtool> save
Saved 1 file(s)

resulting in
{ "a" : 2 }

4) Set Type

Of course you can change this to a string or boolean property as required

augtool> rm /files/tmp/test.json/dict/entry[. = "a"]/number
augtool> set /files/tmp/test.json/dict/entry[. = "a"]/string hello
augtool> save
Saved 1 file(s)

to create
{ "a": "hello" }

5) Add Sub-object

And finally, adding some sub-objects with properties:

augtool> set /files/tmp/test.json/dict/entry[. = "b"] b
augtool> set /files/tmp/test.json/dict/entry[. = "b"]/dict/entry[. = "c"] c
augtool> set /files/tmp/test.json/dict/entry[. = "b"]/dict/entry[. = "c"]/dict/entry[. = "d"] d
augtool> set /files/tmp/test.json/dict/entry[. = "b"]/dict/entry[. = "c"]/dict/entry[. = "d"]/string "world"
augtool> save
Saved 1 file(s)

to create a json structure with depth:
{
  "a": "hello",
  "b": {
    "c": {
      "d": "world"
    }
  }
}
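
If you’d rather script this than drive augtool by hand, the same paths work through the python-augeas bindings. A minimal sketch, assuming the library is installed and the Json lens is available on your system:

import augeas

aug = augeas.Augeas()
aug.set("/augeas/load/Json/lens", "Json.lns")
aug.set("/augeas/load/Json/incl", "/tmp/test.json")
aug.load()

# change the existing property, just like the augtool `set` above
aug.set('/files/tmp/test.json/dict/entry[. = "a"]/number', "2")
aug.save()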
