So, you want to write software?: November 2010

Wednesday 24 November 2010

Using Cucumber with Scala and SBT

Updated 22nd June 2012: I've now released an updated plugin (version 0.5.0). This is a significant update that switched from using the Ruby version of Cucumber (using JRuby) to the new Cucumber-jvm implementation. Please see my github repository for full details: https://github.com/skipoleschris/xsbt-cucumber-plugin. The remainder of this post, while still serving as an overview, has now been deprecated by this new plugin version.

In an earlier post I talked about my thoughts on doing BDD using the Cucumber framework. At the time, one of the things I was struggling with was how to get Cucumber working within an SBT environment, writing step definitions in Scala. My ultimate solution was to write my own SBT plugin. Here are the details.

Cucumber on the JVM

The Cucumber framework (http://cukes.info/) comes from the world of Ruby and is distributed as a Ruby Gem. Fortunately the author of the Cucumber framework has also developed an additional library called cuke4duke (https://github.com/aslakhellesoy/cuke4duke). This library allows Cucumber to be run on the JRuby platform (JRuby is an implementation of Ruby that runs on the JVM). It includes a Ruby Gem that allows step definitions to be written in JVM languages, plus library classes to support step definition development in languages such as Java, Scala, Groovy and Clojure. For my particular purpose the Scala support is what I am after.

It's relatively easy to install JRuby and then from there install the Ruby Gems for Cucumber and cuke4duke. This gives good command-line support for running Cucumber features with step definitions written and compiled using Scala and existing Scala tools.

Scala Tools Support

My current build tool of choice for Scala is the excellent SBT (http://code.google.com/p/simple-build-tool/). If you aren't using this for your Scala projects then I suggest you take a look. A quick search identified a possible sbt plugin for running cuke4duke (written by rubbish and available at https://github.com/rubbish/cuke4duke-sbt-plugin). Unfortunately this plugin hasn't been updated since June and it is running against very old versions of Cucumber and cuke4duke. My first solution was to clone this plugin and try to update it to the newest versions. Unfortunately I was unable to get it to work properly. I was also aware that this plugin was a very early version and was lacking a number of the features that I required. Time to write my own.

My Cucumber SBT Plugin

Using some of the basic concepts from rubbish's solution, plus my own ideas about how it should work, I put together my cucumber-sbt-plugin. This is an SBT plugin project and is implemented in Scala. Full details (and code) can be obtained from my GitHub repository: https://github.com/skipoleschris/cucumber-sbt-plugin.

Some of the main features of the plugin include:

Automated dependency management of JRuby and cuke4duke
Automated install of all required gems into the lib_managed directory
Support for multiple cucumber goals offering: console, console (detailed), html and pdf output
Extensive configuration options through overrides in the SBT Project file
Support for 'tag' and 'name' parameters being passed to each cucumber task

Project Setup

In the plugin definition file (project/plugins/Plugin.scala), add the cucumber-sbt-plugin dependency:

import sbt._

class Plugins(info: ProjectInfo) extends PluginDefinition(info) {
  val templemoreRepo = "templemore repo" at "http://templemore.co.uk/repo"
  val cucumberPlugin = "templemore" % "cucumber-sbt-plugin" % "0.4.1"
}

In your project file (e.g. project/build/TestProject.scala), mix in the CucumberProject trait:

import sbt._
import templemore.sbt.CucumberProject

class TestProject(info: ProjectInfo) extends DefaultWebProject(info) with CucumberProject {

  // Test Dependencies
  val scalatest = "org.scalatest" % "scalatest" % "1.2" % "test"

  //...
}

Writing Features

Features are written in text format and are placed in .feature files inside the 'features' directory. For more info on writing features please see the Cucumber website. For example:

Feature: Cucumber
  In order to implement BDD in my Scala project
  As a developer
  I want to be able to run Cucumber from within SBT

  Scenario: Execute feature with console output
    Given A SBT project
    When I run the cucumber goal
    Then Cucumber is executed against my features and step definitions

Writing Step Definitions in Scala

Step definitions can be written in Scala, using the cuke4duke Scala DSL. More information on this API can be obtained from the cuke4duke wiki page for Scala. For example:

import cuke4duke.{EN, ScalaDsl}
import org.scalatest.matchers.ShouldMatchers

class CucumberSteps extends ScalaDsl with EN with ShouldMatchers {

  private var givenCalled = false
  private var whenCalled = false

  Given("""^A SBT project$""") {
    givenCalled = true
  }

  When("""^I run the cucumber goal$""") {
    whenCalled = true
  }

  Then("""^Cucumber is executed against my features and step definitions$""") {
    givenCalled should be (true)
    whenCalled should be (true)
  }
}

Using The Plugin Commands

Just run one of the cucumber actions to run all of the cucumber features. Features go in a 'features' directory at the root of the project. Step definitions go in 'src/test/scala'. The following actions are supported:

cucumber - Runs the cucumber tool with pretty output to the console and source and snippets turned off
cucumber-dev - Runs the cucumber tool with pretty output to the console and source and snippets turned on
cucumber-html - Runs the cucumber tool and generates an output cucumber.html file in the target directory
cucumber-pdf - Runs the cucumber tool and generates an output cucumber.pdf file in the target directory

There are also parameterised versions of each of these tasks (see IMPORTANT NOTE below):

cuke
cuke-dev
cuke-html
cuke-pdf

Each of these tasks also accepts parameter arguments. E.g.:

cuke @demo,~@in-progress

would run features tagged as @demo and not those tagged as @in-progress. Also:

cuke "User admin"

would run features with a name matched to "User admin". Multiple arguments can be supplied and honour the following rules:

arguments starting with @ or ~ will be passed to cucumber using the --tags flag
arguments starting with anything else will be passed to cucumber using the --name flag
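
The two rules above amount to a simple classification of each argument. The following is a minimal sketch of that mapping in Scala. The object and method names are hypothetical and the code is not taken from the plugin's source, which may well do this differently.

```scala
// Illustrative only: CukeArguments and toCucumberOptions are hypothetical names,
// not part of cucumber-sbt-plugin.
object CukeArguments {

  // Arguments starting with @ or ~ become --tags options;
  // everything else becomes a --name option.
  def toCucumberOptions(args: Seq[String]): List[String] =
    args.toList.flatMap { arg =>
      if (arg.startsWith("@") || arg.startsWith("~")) List("--tags", arg)
      else List("--name", arg)
    }
}
```

Under this sketch, the arguments @demo,~@in-progress and "User admin" would be turned into a --tags option followed by a --name option, in the order they were given.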

IMPORTANT NOTE: The current design of sbt prevents tasks with parameters (method tasks) being run against the parent project in a multi-module sbt project. This is why there are separate tasks with parameters. To use a parameter task you must first select a child project. The non-parameter tasks can be run against the parent project or a selected child.

Saturday 20 November 2010

The Best Way to Apply BDD

I am currently in the process of rewriting a simple e-commerce application for one of my customers. The new version is implemented using Scala and the excellent Lift framework and is a rewrite of a six year old struts/JSP version. The core application is largely identical but I am adding a wide range of new and improved administration functions.

Typically I build a simple, well understood application such as this using ordinary TDD principles to create unit tests and drive out the code. I'd also usually add some fairly basic UI tests using something like Selenium or even just HtmlUnit to validate the front-end. However, I like to use these sorts of projects to explore better ways of doing things. I've therefore decided to use a Behaviour Driven Development approach using Aslak Hellesøy's excellent Cucumber framework. (I'll post more details on installing and using Cucumber and cuke4duke with sbt once I get it all figured out!)

In this post I want to explore some of my thinking about exactly how and where to use BDD within the project and the places where I want to integrate the Cucumber features and scenarios.

What To Use BDD For

I am a strong advocate of the TDD approach, and I intend to follow this within this development for the low level details (i.e. unit and component tests). I'm therefore looking at using BDD as a way of capturing the higher-level business requirements and acceptance criteria. My aim is to use BDD and Cucumber to ensure that the application I am building actually does what the business expects it to. Lower level tests will still be written using ScalaTest to ensure the code that I am writing does what I expect it to do and also to drive out the architectural and implementation details.

Where To Integrate BDD

Given this positioning for BDD, what I have been struggling with is choosing the correct level in my application at which to target the Cucumber step definitions (step definitions are the code that actually executes the scenario and conditions described by the behaviour). The following sections describe the different options I have considered and the advantages and disadvantages of each:

BDD at the User Interface - In this approach the step definitions are implemented to execute against the running web application. This is usually achieved using a technology such as Selenium or HtmlUnit. Each scenario loads web pages, clicks links, fills and submits forms and makes assertions about the HTML documents and associated resources that are returned.

Advantages:

The behaviour is asserted across the complete application.
The BDD approach can be used to drive the development of the UI as well as the application logic.

Disadvantages:

You can only execute the feature steps against a built and deployed application - which may make development slower.
A web-based UI is usually the least stable part of a web application and thus feature steps may be broken as the UI is tweaked, making it appear that behaviour is being lost. Step definitions may become quite fragile.
Using BDD at this level may not fit well when user experience or visual designers are involved, as this often results in rapid changes to the UI.
Generally testing at this level requires a lot of boilerplate code to be written - which makes writing BDD step definitions more complex.
Testing of web-based user interface may be better suited to a suite of UI tests that can focus just on verifying the UI display and interaction.

BDD at the Web Protocol - In this approach the step definitions are written to directly exercise the web protocol used to interface to the application. For example, the steps make HTTP calls to the application and assert that the returned results are correct and contain the expected content.

Advantages:

Does not need to worry about the complexities of web pages, javascript and so on that are perhaps better tested in a dedicated suite of UI tests.
Very well suited to RESTful or other web services that return structured data instead of HTML documents.

Disadvantages:

You can still only execute the feature steps against a built and deployed application.
Building complex sequences of requests to simulate things such as Ajax may be more complex than testing against the UI.
Asserting against returned HTML content may be more complex than using assertions in a framework like Selenium at the UI layer.
Building and testing at this layer requires that you also support a suite of UI tests.
For a web application, the interface at the HTTP layer may change quite frequently, requiring frequent changes to the step definitions. This will be particularly common in an Ajax heavy application.

BDD at the Controllers - In this approach, we ignore the UI and view parts of the application. Instead we wire up the whole application from the controllers/snippets layer down. Step definitions are then written that invoke the controllers/snippets with pre-defined requests and assert that the response action and data that would be used to generate a view is correct.

Advantages:

Features can be tested as part of the test phase of a build - no need to deploy.
Testing at this level tends to be more stable and less fragile as the application emerges.
It is usually much simpler to invoke and assert at a code level rather than a UI level.

Disadvantages:

A good set of testing features is required by the chosen web framework, so that requests and responses can be easily simulated and asserted against.
Tests at this level tend to spend a significant portion of logic dedicated to the interactions with the UI rather than asserting against business rules.
A suite of UI tests is required in order to test the UI and these will also generally exercise the same controllers/snippets in order to drive the UI.

BDD at the Services - In this final approach we test behaviour at the level directly below the controller/snippet. This is usually testing against service level code, but may also require writing some step definitions against domain level objects. At this level we are much more interested in testing the business behaviour of the application rather than the behaviour of how the application interacts with the user.

Advantages:

Features can be tested as part of the test phase of a build - no need to deploy.
Code at this level will usually be more stable than at the higher levels, thus making the tests less brittle.
Step definitions at this level will usually be simpler to write and maintain than those written against the UI or controllers.
We are testing against actual business rules rather than how a user might interact with the application.
We require less testing specific tools and framework support.

Disadvantages:

Testing at this level exercises less of the application flow than the tests at higher levels.
A suite of tests is additionally required to validate the UI and the controllers to ensure that the user interaction with the system calls the services with the correct data in the correct order and correctly displays the results.
The business is more likely expecting the behaviour of how they interact with the application to be verified.
Developers must be VERY disciplined to ensure that no business rules, logic or behaviour is implemented in the controllers layer.
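
To make the service-level option a little more concrete, here is a minimal sketch of the kind of business rule a step definition could exercise directly, with no deployed application involved. The Basket and PricingService types and the free-delivery rule are invented purely for illustration, and the Cucumber wiring is left out so that the example stands alone.

```scala
// Hypothetical domain model, invented for this example.
case class Basket(itemPrices: List[BigDecimal]) {
  def subtotal: BigDecimal = itemPrices.foldLeft(BigDecimal(0))(_ + _)
}

// Hypothetical service holding a business rule:
// delivery is free once the subtotal reaches a threshold.
class PricingService(freeDeliveryFrom: BigDecimal, deliveryCharge: BigDecimal) {
  def total(basket: Basket): BigDecimal =
    if (basket.subtotal >= freeDeliveryFrom) basket.subtotal
    else basket.subtotal + deliveryCharge
}

// At this level the body of a Given/When/Then trio is just plain calls and assertions.
object ServiceLevelScenario {
  def main(args: Array[String]): Unit = {
    val pricing = new PricingService(freeDeliveryFrom = BigDecimal(50), deliveryCharge = BigDecimal(5))
    val smallOrder = Basket(List(BigDecimal(20), BigDecimal(10))) // Given a basket worth 30
    val total = pricing.total(smallOrder)                          // When the total is calculated
    assert(total == BigDecimal(35))                                // Then delivery is charged
  }
}
```

Nothing here needs a web framework or a running server, which is the main attraction of this level. Equally, nothing here proves that a controller ever calls PricingService correctly, which is the gap the disadvantages above describe.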

So, Where From Here?

After looking at all the possible places to use BDD and Cucumber, I have to draw the conclusion that there is no outright winner. It feels to me that the best mix is a combination of step definitions that test behaviour at the service layer (to verify business rules and behaviour) plus another set that test behaviour of the UI. By combining both of these approaches it should be possible to cover the application sufficiently to have confidence that its behaviour is correct.

This then leads me on to think about the best way to create the feature descriptions of the behaviour. Do we create a single description of the desired behaviour and then write two step definitions - one for the service layer and one for the UI? Or do we write a feature definition describing the required behaviour for the business services and then have an alternative behaviour specific for the user interaction with the system?

I'm still unsure of the correct answers to these questions. Anyone out there got any thoughts or experience? In the mean time I'm going to build my application trying both approaches and see which works out best.

Wednesday 17 November 2010

Why Releases Should be Automated and Environments Equivalent

Last weekend I was involved in a fairly major release for one of my customers. Ultimately we were unable to complete the release due to a number of factors (some within our control and others outside). However, the release was still deemed successful as we were able to cleanly abort and roll back to the previous working version with an absolutely minimal service outage. The release did however highlight a number of places where the release process could be improved.
So, where did this release go wrong? There were three main problems that were encountered:

A switch failure in one of the data centres that temporarily rendered some servers inaccessible.
Mistakes in the manual actions in the release plan.
A data inconsistency between two different data centres.

Unexpected Events During a Release

Murphy's Law: Anything that can go wrong, will go wrong.
Unfortunately, no matter how well you plan, there will always be things that you can't anticipate. In this particular release we had a hardware switch fail right in the middle of the release. An erroneous configuration in the data centre caused this failure to render a number of our servers unreachable. We were therefore mid-release with no way to go forward and an arduous rollback process to restore us back to a working state.
Fortunately due to diligent disaster recovery work undertaken previously we were able to fail the entire site over to the alternative data centre. A pragmatic decision was eventually taken that we couldn't wait on the chance that we MIGHT be able to complete the release so we rolled back. This was well planned and went smoothly enough that within 30 minutes of a new switch coming online we were back up and running on the original software versions. Nice work.
However, even flushed with the success of our miraculous escape from the clutches of Murphy, we have to consider what might have been. Our rollback was a largely manual process and although tested in the test environment, our production is just different enough as to add a level of uncertainty to the rollback process. There's also the greater chance of errors in manual steps when under pressure of a live situation.
So how could we be more certain of our rollback plan?

Untested Manual Release Plans

Our release had a very detailed plan that had been well reviewed beforehand. A variation of the plan had been executed in test but, as I already said, the production environment is just different enough as to require a different (and slightly more complex) plan. Additionally, this release had some additional complexities requiring a number of extra manual steps.
What we discovered during the release was that some of the steps specific to production were in fact not quite correct. Couple this with the chance of making errors when executing manual steps and we have a number of places with a high potential for failure.
While the plan did run pretty smoothly, how could we reduce the chances of errors and mistakes?

Problems That Can't be Found Before Production

The final problem encountered on the release was one of database consistency. Databases in two different data centres were thought to be consistent. It turns out they were not! With hindsight, we should have checked this and have been continually monitoring these over time. We live and learn.
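
One way such a check could look is sketched below: take a per-table summary (for example a row count or a checksum) from each data centre and report any table whose values differ. How the summaries are gathered is left out, and all the names and figures used are hypothetical.

```scala
// Illustrative sketch only; not the check used on the release described here.
object ConsistencyCheck {
  // Table name -> summary value (e.g. row count or checksum) reported by one data centre.
  type Summary = Map[String, Long]

  // Returns, in alphabetical order, the tables whose summaries differ,
  // including tables that are present on only one side.
  def inconsistentTables(primary: Summary, secondary: Summary): List[String] =
    (primary.keySet ++ secondary.keySet).toList.sorted.filter { table =>
      primary.get(table) != secondary.get(table)
    }
}
```

Run on a schedule, a check of this shape would flag drift between the two sites well before a release depends on them agreeing.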
What should really have happened is that we detected the possibility of inconsistent databases before we got to production. Unfortunately all the lower environments only have a single data centre model for the database in question so there was no way that anything could ever get out of sync in those cases. Hence, no way to detect the problem before reaching production.
How can we find these sorts of problems earlier?

Automate Releases, Equivalent Environments

Fortunately, all of the above problems can be fixed with two solutions. I wanted to say SIMPLE solutions, but unfortunately they are not particularly simple and require a fair amount of work to get right. These solutions are:

Automate the release and rollback pipelines.
Ensure Integration and Test environments are ~~identical~~ equivalent to Production.

Automated Releases

In a modern computing environment there is really no need to include manual steps in a release. Even when releases have lots of complexity it should be possible to script all the required steps. An automated script can be tested, fixed and re-tested many times in order to ensure it is correct before running it against a production environment. This is much better than a manual process that is difficult to test and open to human error.
Don't get me wrong, creating automated releases is hard, but the benefit of smoother, faster and more frequent releases to production gives significant enough business benefit to make the effort worthwhile.
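
As a rough illustration of what scripting a release together with its rollback can mean, the sketch below models each step as a pair of actions, one that applies the step and one that undoes it. If any step fails, the steps that already completed are undone in reverse order. The ReleasePipeline and Step names are hypothetical and real steps would call out to deployment tooling. This is a sketch of the idea, not a description of any particular release system.

```scala
import scala.util.{Failure, Success, Try}

object ReleasePipeline {
  // A scripted release step: an action to apply and an action that undoes it.
  case class Step(name: String, run: () => Unit, undo: () => Unit)

  // Runs the steps in order. If a step throws, the steps that had already
  // completed are undone in reverse order and the failed step's name is returned.
  def release(steps: List[Step]): Either[String, Unit] = {
    def loop(remaining: List[Step], completed: List[Step]): Either[String, Unit] =
      remaining match {
        case Nil => Right(())
        case step :: rest =>
          Try(step.run()) match {
            case Success(_) => loop(rest, step :: completed)
            case Failure(_) =>
              // 'completed' is held newest-first, so this walks back in reverse order
              completed.foreach(_.undo())
              Left(step.name)
          }
      }
    loop(steps, Nil)
  }
}
```

Because the whole sequence is ordinary code, both the forward path and the rollback path can be rehearsed repeatedly in a lower environment before they are ever run against production.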

Equivalent Environments

A manufacturer building a product wouldn't create a prototype and then go straight into mass production with a design that varied significantly from that prototype. Companies building software shouldn't do so either. The purpose of Continuous Integration is to find problems early when they are quicker and cheaper to fix. In order to work well, the lower environments need to match production in terms of applications, server structure, configuration and so on.
Additionally, if we are to build automated deployment plans then we need production like environments to develop and test them against. An unproven automated plan is of no more value than a fully manual one.

Conclusions

By automating releases and having equivalent environments we give ourselves the best possible chance of undertaking regular, successful and stress-free releases. When Murphy does come along, knowing we have an automated way to get back to a working system saves a whole lot of worry and stress. Yes, it's hard to get to this point, but the long term benefit to a business from doing so is worth every penny and every hour spent.

Wednesday 10 November 2010

Introduction to SBT

I'm at the London Scala User Group monthly meeting. We're here at the Skillsmatter Exchange and we are learning about Simple Build Tool (sbt). Here's my live blog entry on the session. Sbt can be found at http://code.google.com/p/simple-build-tool/

Running/Using SBT

Once SBT is installed, you can run SBT in a new directory. It will ask if you want to create a new project. This will ask you a few questions and then set up the basic project template:

/Users/chris/temp/sbt 6% sbt
Project does not exist, create new project? (y/N/s) y
Name: demo
Organization: skipoleschris
Version [1.0]: 
Scala version [2.7.7]: 2.8.1
sbt version [0.7.4]:

Template structure follows maven directory conventions:

lib - directory for adhoc libraries
project - sbt project files
src - main/scala, main/resources, test/scala, test/resources ala maven
target - where all the built stuff goes

Start up sbt and you can now build, test, package and run:

> run
[info] 
[info] == copy-resources ==
[info] == copy-resources ==
[info] 
[info] == compile ==
[info]   Source analysis: 1 new/modified, 0 indirectly invalidated, 0 removed.
[info] Compiling main sources...
[info] Compilation successful.
[info]   Post-analysis: 2 classes.
[info] == compile ==
[info] 
[info] == run ==
[info] Running Demo 
Hello World
[info] == run ==
[success] Successful.
[info] 
[info] Total time: 5 s, completed 10-Nov-2010 18:53:12

We can turn off some of the verboseness by setting the log level:

> warn
Set log level to warn

To customise our project we create a scala project file in the project/build directory:

import sbt._

class DemoProject(info: ProjectInfo) extends DefaultProject(info) {

  // Test dependencies
  val scalatest = "org.scalatest" % "scalatest" % "1.2" % "test"
}

You can see that dependencies are specified in a very similar way to the maven groupId/artifactId/version/scope model. We then reload the scala project and update to pull down the dependencies:

> reload
[info] Recompiling project definition...
[info]    Source analysis: 1 new/modified, 0 indirectly invalidated, 0 removed.
[info] Building project demo 1.0 against Scala 2.8.0
[info]    using DemoProject with sbt 0.7.4 and Scala 2.7.7
> update
[info] 
[info] == update ==
[info] :: retrieving :: skipoleschris#demo_2.8.0 [sync]
[info]  confs: [compile, runtime, test, provided, system, optional, sources, javadoc]
[info]  1 artifacts copied, 0 already retrieved (1742kB/44ms)
[info] == update ==
[success] Successful.

Some other important/useful sbt commands:

clean-cache - remove cache files
test - compile & run tests
test-quick - compile & run only tests that have changed
console - starts a scala console

It is also possible to prefix any command with ~, which runs the same command each time changes to source files are detected. Very useful with test-quick.

sbt can also build mixed Java and Scala projects and with a little bit of configuration (github.com/szeiger/junit-interface) can also run JUnit tests as part of the build process (by default it runs just scalatest and specs tests).

Another useful command for web developers is: jetty-run, which starts up a Jetty server and deploys the application to it:

> jetty-run
[info] 
[info] == compile ==
[info]   Source analysis: 0 new/modified, 0 indirectly invalidated, 0 removed.
[info] Compiling main sources...
[info] Nothing to compile.
[info]   Post-analysis: 2 classes.
[info] == compile ==
[info] 
[info] == copy-resources ==
[info] == copy-resources ==
[info] 
[info] == prepare-webapp ==
[info] == prepare-webapp ==
[info] 
[info] == jetty-run ==
2010-11-10 19:11:10.505:INFO::Logging to StdErrLog::DEBUG=false via org.eclipse.jetty.util.log.StdErrLog
[info] jetty-7.0.2.RC0
[info] NO JSP Support for /, did not find org.apache.jasper.servlet.JspServlet
[info] Started SelectChannelConnector@0.0.0.0:8080
[info] == jetty-run ==

Web apps are supported with a simple change to the scala project build file:

import sbt._

class DemoProject(info: ProjectInfo) extends DefaultWebProject(info) {

  // Test dependencies
  val jetty7 = "org.eclipse.jetty" % "jetty-webapp" % "7.0.2.RC0" % "test"
  val scalatest = "org.scalatest" % "scalatest" % "1.2" % "test"
}

SBT Plugins

SBT supports the development and use of plugins. To make a new plugin, the project definition extends PluginProject rather than DefaultProject or DefaultWebProject.

For example, we can create a Scala trait that extends BasicScalaProject and do things, such as register new test listeners for generating reports and so on. Plugins must be developed using Scala 2.7.7 as this is the version that SBT is compiled in. Good tip: develop the trait in a normal project and then when it's working make it into a plugin project.

giter8

A tool that generates files and directories from templates published on GitHub. Can be found at https://github.com/n8han/giter8. Provides similar (but obviously much better) functionality to the maven archetype plugin. Currently a limited set of templates, but this should grow over time. In particular there is a template for building Android projects in sbt. E.g. g8 gseitz/android-sbt-plugin

You can also build your own giter8 templates and upload them to your github repository.

IDE Integration

There are SBT plugins that generate project files for most IDEs. There are also plugins for most IDEs that integrate with SBT, although many developers actually keep SBT open in a separate window using the ~ command.

Well, that's about it for the SBT overview. Hope you found something of use.

Friday 5 November 2010

Functional Programming Challenge: Most Frequent Multiple Occurring List Item - SOLUTIONS

In an earlier post I set out a challenge to come up with the best solution to the following problem: how to find the most frequent multiply occurring item in a list.

I had created my own solution, but felt that a better one was possible. The criteria specified were simple code, elegance and performance.

I have had two submissions to the problem so far and this post offers a comparison of them along with my original solution. My comparisons were carried out using the following two test lists:

val random = new java.util.Random(2L)
val list1 = (for ( i <- 1 to 100000 ) yield random.nextInt).toList.distinct
val list2 = (for ( i <- 1 to 100000 ) yield random.nextInt(1000)).toList

The first list contains no duplicate elements, so the result should always be None. The second list will contain multiple duplicate elements.
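
How the timings were taken is not shown here. A minimal sketch of how such a comparison might be timed is given below. The Timing.time helper is hypothetical, and a single wall-clock measurement like this ignores JIT warm-up and garbage collection, so any numbers it produces should be treated as rough.

```scala
// Hypothetical helper for rough, one-off timings; not the code used for the results in this post.
object Timing {
  // Evaluates the block once and returns its result together with
  // the elapsed wall-clock time in milliseconds.
  def time[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}
```

A call such as Timing.time(highestMultipleFrequency1(list1)) would then give back both the answer and how long it took, which is enough to compare the solutions below against each other on the same machine.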

My Solution

My solution to the problem was based around building a map of list items to frequency count and then finding the most frequent multiply occurring item in that map. My implementation was achieved with two small functions and two foldLeft operations:

def highestMultipleFrequency1[T](items: List[T]): Option[T] = {
  type Frequencies = Map[T, Int]
  type Frequency = Pair[T, Int]

  def freq(acc: Frequencies, item: T) = acc.contains(item) match {
    case true => acc + Pair(item, acc(item) + 1)
    case _ => acc + Pair(item, 1)
  }
  def mostFrequent(minOccurs: Int)(acc: Option[Frequency], item: Frequency) = acc match {
    case None if item._2 >= minOccurs => Some(item)
    case Some((value, count)) if item._2 > count => Some(item)
    case _ => acc
  }
  items.foldLeft(Map[T, Int]())(freq).foldLeft[Option[Frequency]](None)(mostFrequent(2)) match {
    case Some((value, count)) => Some(value)
    case _ => None
  }
}

In the tests, my solution performs fairly consistently across both test lists, the second being slightly faster due to the smaller size of the map being used by the second fold function.

Solution from Nilanjan R

This solution sent by Nilanjan R is significantly simpler and shorter in the amount of code it uses. It takes the approach of grouping items by themselves and then sorting the resulting lists by their size and taking the first item:

def highestMultipleFrequency2[T](items: List[T]): Option[T] = {
  items.groupBy(x => x).values.toList.sortWith((f, s) => f.size > s.size) match {
    case x :: xs if x.size >= 2 => Some(x.head)
    case _ => None
  }
}

Under test, this solution performs significantly slower than my solution on the list without duplicates. However, when the list does contain many multiply occurring elements it performs slightly better than my solution. To me this tends to indicate that the sort function is the major performance factor in this solution. I do really like this solution for its very clean, simple and easy to understand code. It's certainly my choice for the Noughts and Crosses application where simple code is more important than raw performance and where I am only dealing with very small lists.
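
If the sort is indeed the dominant cost, one possible variation keeps the groupBy but replaces the full sort with a single pass that finds the largest group. This is a sketch only: it is not part of Nilanjan's submission, it has not been benchmarked against the test lists above, and it assumes a Scala version whose collections provide maxBy. The object and function names are made up for this example.

```scala
object MaxByVariant {
  def highestMultipleFrequencyMaxBy[T](items: List[T]): Option[T] = {
    val groups = items.groupBy(x => x).values
    if (groups.isEmpty) None               // maxBy throws on an empty collection
    else {
      val largest = groups.maxBy(_.size)   // one pass over the groups instead of a full sort
      if (largest.size >= 2) Some(largest.head) else None
    }
  }
}
```

The code is a few lines longer than the sorted version, so whether the trade is worthwhile depends on how much the list sizes matter.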

Solution from Antonin Brettsnajdr

The solution offered by Antonin follows a similar approach to my solution in that it builds a map of item frequencies and then finds the most frequently occurring element using this map. However, this solution implements the whole algorithm in a single tail-recursive function:

def highestMultipleFrequency3[T](items: List[T]): Option[T] = {
  def mff(l: List[T], m: Map[T, Int]): Option[T] = l match {
    case element :: tail => if (m contains element) mff(tail, m.updated(element, m(element) + 1))
                            else mff(tail, m + (element -> 1))
    case Nil => val occurences = (m map { item => item._2 }) 
                if (occurences forall (_ == 1)) None 
                else {
                  val maxOccur = occurences.max
                  m find ( letter => letter._2 == maxOccur) match {
                    case Some(x) => Option(x._1)
                    case None => None
                  }
                }
  }
  mff(items, Map())
}

This solution is by far the best performing. It is two to three times faster than my solution. There is a significant performance benefit when dealing with a list that contains multiple duplicated items. While the code is not as simple as the previous solution, it is very readable and is a nice example of tail recursion.
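
As a side note on the tail recursion: the Scala compiler can be asked to verify that a function really is tail-recursive by annotating it with @tailrec, in which case compilation fails if the recursive call cannot be turned into a loop. The sketch below shows this for just the counting half of the algorithm. The names TailRecCount and countOccurrences are hypothetical and the code is not part of Antonin's submission.

```scala
import scala.annotation.tailrec

object TailRecCount {
  // Builds the item -> frequency map. @tailrec makes the compiler check
  // that the recursive call is in tail position.
  @tailrec
  def countOccurrences[T](items: List[T], counts: Map[T, Int] = Map.empty[T, Int]): Map[T, Int] =
    items match {
      case Nil => counts
      case head :: tail =>
        countOccurrences(tail, counts.updated(head, counts.getOrElse(head, 0) + 1))
    }
}
```

Using getOrElse with a default of 0 also removes the need for the separate contains check on each element.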

A big thanks to Nilanjan R and Antonin Brettsnajdr for their solutions. I found it very interesting looking at the different ways people approach a problem such as this. Can you do better? Let me know your solution.

Monday 1 November 2010

Functional Programming Challenge: Most Frequent Multiple Occurring List Item

A little functional programming challenge for those who like that sort of thing. This one came up while I was improving my Noughts and Crosses example application.

The challenge is this: Given a list of elements of type T, find the element with the most multiple occurrences.

For example, the following should apply:

List("a", "b", "c", "a", "b", "a") 
    -> should return "a" as this occurs the most in the list
List("a", "b", "c")                
    -> should return nothing as no elements occur multiple times
List("a", "b", "a", "b")           
    -> can return either "a" or "b" as they both occur the same number of times

The simplest solution I have come up with so far (written in Scala) is:

def highestMultipleFrequency[T](items: List[T]): Option[T] = {
  def freq(acc: Map[T, Int], item: T) = acc.contains(item) match {
    case true => acc + Pair(item, acc(item) + 1)
    case _ => acc + Pair(item, 1)
  }
  def mostFrequent(frequencies: Map[T, Int], minCount: Int = 2) = {
    frequencies.find(_._2 == frequencies.values.max) match {
      case Some((value, count)) if count >= minCount => Some(value)
      case _ => None
    }
  }
  mostFrequent(items.foldLeft(Map[T, Int]())(freq))
}

While this works and is fairly elegant I can't help feeling that there must be a better way. Traversing the entire list to build a map of item to frequency and then searching all the values in that map to find the one occurring the most just seems a tad clunky.

I also played around with an alternative implementation for the mostFrequent function, but I'm not sure this is any better or more efficient:

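def mostFrequent(frequencies: Map[T, Int], minCount: Int = 2) = {
  // Order on the count only: taking max of the swapped (count, value) pairs
  // would need an Ordering for the generic type T, and would fail on an empty map
  frequencies.toList.sortWith(_._2 > _._2).headOption match {
    case Some((value, count)) if count >= minCount => Some(value)
    case _ => None
  }
}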

Finally I came up with this solution, which requires less traversal through the map than the previous ones, even if the code is slightly longer and the generics for the foldLeft methods are less readable:

private def highestMultipleFrequency[T](items: List[T]): Option[T] = {
  type Frequencies = Map[T, Int]
  type Frequency = Pair[T, Int]

  def freq(acc: Frequencies, item: T) = acc.contains(item) match {
    case true => acc + Pair(item, acc(item) + 1)
    case _ => acc + Pair(item, 1)
  }
  def mostFrequent(acc: Option[Frequency], item: Frequency) = acc match {
    case None if item._2 >= 2 => Some(item)
    case Some((value, count)) if item._2 > count => Some(item)
    case _ => acc
  }
  items.foldLeft(Map[T, Int]())(freq).foldLeft[Option[Frequency]](None)(mostFrequent) match {
    case Some((value, count)) => Some(value)
    case _ => None
  }
}

Anyone want to offer a different (better) solution? Use the functional programming language of your choice. Points awarded for simplicity, elegance and efficiency. The winning solution will be published here and (if the provider agrees) will be put into my Noughts and Crosses example in place of the current implementation.