So, you want to write software?: November 2011

Friday 25 November 2011

Is Scala Like EJB2?

In this post and then in this followup, Stephen Colebourne (@jodastephen) states his criticisms of the Scala language and how it feels to him similar to the same feeling of when he was working with EJB2. A colleague of mine also tweeted the following:

"This is why I will never use Scala: def ++ [B >: A, That] (that: TraversableOnce[B])(implicit bf: CanBuildFrom[List[A], B, That]) : That"

Here's my thoughts and experience with the above.

I started out as a C and C++ developer, but joined the Java set right at the very beginning. I've been fortunate enough to have been working in Java since 1998. Or unfortunate, if you include the early days of the Servlet API & JSP, EJB1 and then EJB2! About 2007 I really started to feel the need for a different language, one that would allow me to express my ideas more concisely and without writing so much boilerplate code. I looked at Groovy, but found it too easy to keep dropping back and writing Java. I then learned Ruby. This taught me how expressive a good language can be and how powerful closures can be. I dabbled with a few more languages before finally finding Scala at the beginning of 2009. These days I mostly code a mix of Scala, Java and Groovy. Next challenge is Clojure.

Changing Mindset

Now, I'm an engineer at heart and my typical approach is to not only understand how to use something but also (and usually at the same time) understand how and why it works. This approach always served me well when learning languages like Java and Ruby. These languages have a fairly small surface area and a relatively simple type system. It's easy to read up on them, understand how to use them and understand exactly how they work. It's easy to look at the standard and third-party libraries and see how these work as well. For me, Scala was different and involved me making a fundamental shift in mindset about how I learned the language.

When I started out I approached Scala in the same way I did other languages: I read a book and tried to build some test applications while at the same time delving into the library and language to see how they worked. This was daunting! I can fully appreciate Stephen's position that it feels a bit like EJB2. Let's put that in context. EJB2 was a horrid mess of unnecessary complexity and complication. It aimed to simplify building enterprise apps involving persistent entities and business services, but it just added way more complexity and mind effort than it saved.

Now, back to Scala. It's a big language with a very extensive and powerful type system. If you attack trying to understand it as a whole I can see how it could be mistaken as another EJB2: there seems to be loads of complexity, a lot of which you can't see the reason for and a lot of stuff just requires so much mind effort to understand. Much of it initially seems unecessary.

It was after my initial failure to understand how the Scala language and libraries actually were built and worked that I took a step back and reconsidered my learning style. I could already see some value to the language and I wanted to explore more, but how? I then made a fundamental change in my approach. I decided that it was okay for me to start using the language without necessarily understanding exactly how all the details worked. As I used the language more I would gradually try to understand the how and why, one small step at a time.

I started learning to write Scala code. When I wanted to achieve something I researched how to do it until I was familiar with the use of a particular language feature or method. If there was some syntax, implementation detail or concept that I didn't get at that point I just noted it down as something to come back to. Now after a year or so writing Scala I look back at what I didn't initially understand and it mostly makes sense. I still occasionally encounter things that don't make total sense initially, so I just learn enough to use them and then come back later to learn the why or how. It's a system that seems to work incredibly well.

When you attack looking at and learning Scala in this more gradual way, first understanding how to use it then gradually, and incrementally, delving more deeply into the type system, you begin to realised that it's not so bad. Yes, there is complexity there, but it's not the unnecessary complexity of EJB2, it's moderate complexity that helps you be expressive, write safe code and get things done. As my knowledge of Scala has grown I've yet to find any features that I haven't been able to see the value of. It's a very well thought out language, even if it doesn't quite seems so at first.

Switching from a mindset where I have to have a full understanding or how something works before I use it to one where I'm comfortable with how to use it even if I don't know why it works was a very disconcerting change in my thinking. I'm not going to say it was an easy change to make, because it wasn't. However the effort was worth it for me because it got me using a language that I now find to be many more times more productive than Java. I think I also understand the language more deeply because I have put it to use and then come back later to understand more deeply exactly how and why it worked. I would never have got to this depth of understanding with my approach to learning other languages.

An Example

Consider the method:


  def ++ [B >: A, That] (that: TraversableOnce[B])
                        (implicit bf: CanBuildFrom[List[A], B, That]) : That

There's a lot of implementation detail in there. However, in order to use this method all I really need to understand is that it allows me to concatenate two immutable collections, returning a third immutable collection containing the elements of the other two. That's why the Scala api docs also include a use case signature for this method:


  def ++ [B](that: GenTraversableOnce[B]): List[B]

From this I understand that on a List I can concatenate anything that is GenTraversibleOnce (Scala's equivalent of a Java Iterable) and that I will get a new list returned. Initially I need understand nothing more in order to use this method. I can go away and write some pretty decent Scala code.

Some time later I learn that the >: symbol means 'same or super type of' and understand this concept. I can then refine my understanding of the method so that I now know that the collection that I want to concatenate must have elements that are the same (or a supertype) as the collection I am concatenating onto.

As I later progress and learn type classes I can see that CanBuildFrom is a type class that has a default implementation for concatenating lists. I can then take this further and create my own CanBuildFrom implementations to vary the behaviour and allow different types to be returned.

I'm first learning how to use the code, then gradually refining my understanding as my knowledge of the language grows. Trying to start out by learning exactly what the first signature meant would have been futile and frustrating.

Conclusion

If you start using Scala and you take the approach of trying to see it all, understand it all and know how and why it all works then it's a scary starting place. The size and complexity of the language (particularly its type system) can seem daunting. You could easily be confused in feeling it similar to EJB2, where reams of needless complexity just made it plain unusable.

However, by switching mindset, learning to use the language and then growing understanding of how it works over time it becomes much more manageable. As you do this you gradually realise that the perceived initial complexity is not as bad as it first seemed and that it all has great value. That initial hurdle from wanting to understand it all to being happy to understand just enough is very hard to get over, but if you do then a very powerful, concise and productive language awaits you.

Monday 21 November 2011

The Luhny Bin Challenge

Last week Bob Lee (@crazybob), from Square, set an interesting coding challenge. All the details can be found here: http://corner.squareup.com/2011/11/luhny-bin.html. I decided this sounded like a bit of fun a set to work on a Scala implementation. I'm pleased with my final result, but also learnt something interesting along the way. My full solution is available on my github account: https://github.com/skipoleschris/luhnybin.

The challenge was an interesting one, and more difficult that initial reading would suggest. In particular, the ability to deal with overlapping card number and the need to mask them both added significantly to the difficulty of the problem.

My Algorithm

The approach I selected was fairly simple. Work through the input string until a digit character was encountered. Walk forward from there consuming digit, space of hyphen characters until either the end of the string was reached, an different character was encountered or 16 digits had been collected. Next try to mach the LUHN, first on 16 digits, then 15 and finally 14. Record an object describing the index and how many characters to mask. Repeat for the next digit character and so on. After visiting all characters in the string, apply all the masks to the input.

There were some optimisations along the way. For example, if encountering 14 or less characters then there was no chance of an overlapping card number so you could safely skip over these remaining digits after the first one had been evaluated.

My Implementation

I decided to implement in Scala as this is my language of choice at the moment. I considered both a traditional imperative approach as well as a more functional programming solution. In the end I went for a very functional implementation, based around immutable data. My main reason for this choice was because I'm working on improving my functional programming skills and I though it looked like an interesting problem to try and solve in this way.

The implementation consist of for main pieces of code. Each implements a transformation of input into output, is fairly self contained. I fell that by writing them in this way I have created code that is possible to reason about and also reuse in other scenarios. I tried for elegant code rather then optimising for the most efficient and minimal code base. I think I achieved my goals.

Conclusion

While I am very happy with my solution and had great fun building it, it was very interesting to look at other solutions that had been attempted. In particular it's quite clear that the ~900ms that I was able to active on my mid-2010 MBP was a number of times slower than the fastest solutions. The inevitable conclusion being that in some cases adopting a more mutable and imperative approach may be preferable. This is a typical example of this case, when we are in effect creating a logging filter, and speed of operation could easily be argued as more important than adhering to the principles of immutability. An interesting conclusion.

Tuesday 1 November 2011

Concise, Elegant and Readable Unit Tests

Just recently I was pair programming with another developer and we were writing some unit tests for the code we were about to write (yes, we were doing real TDD!). During this exercise we came to a point where I didn't feel that the changes being made to the code met my measure of what makes a good test. As in all the best teams we had a quick discussion and decided to leave the code as it was (I save my real objections for times when it is more valuable). However, this got me thinking about what actually I was uncomfortable with, and hence the content of this post.

As a rule I like my code to have three major facets. I like it to be concise: there should be no more code than necessary and that code should be clear and understandable with minimum contextual information required to read it. It should be elegant: the code should solve the problem at hand in the cleanest and most idiomatic way. Also, it must be readable: clearly laid out, using appropriate names, following the style and conventions of the language it is written in and avoiding excessive complexity.

Now, the test in question was written in Groovy, but it could just as well be in any language as it was testing a fairly simple class that maps error information into a map that will be converted to Json. The test is:

@Test
void shouldGenerateAMapOfErrorInformation() {
    String item = "TestField"
    String message = "There was an error with the test field"
    Exception cause = new IllegalStateException()

    ErrorInfo info = new ErrorInfo()
    info.addError(item, message, cause)

    Map content = info.toJsonMap()
    assert content.errors.size() == 1
    assert content.errors[0].item == item
    assert content.errors[0].message == message
    assert content.errors[0].cause == cause.toString()
}

Now, to my mind this test is at least three lines too long - the three lines declaring variables that are then used to seed the test and in the asserts. These break my measure of conciseness by being unnecessary and adding to the amount of context needed in order to follow the test case. Also, the assertions of the map contents are less than elegant (and also not very concise), which further reduces the readability of the code. I'd much rather see this test case look more like this:

@Test
void shouldGenerateAMapOfErrorInformation() {
    ErrorInfo = new ErrorInfo()
    info.add("TestField", "There was an error with the test field", 
                 new IllegalStateException())

    Map content = info.toJsonMap()
    assert content.errors.size() == 1
    assert hasMatchingElements(contents.errors[0], 
          [ item : "TestField",
             message : "There was an error with the test field",
             cause : "class java.lang.IllegalStateException" ])
}

This test implementation is more concise. There are less lines of code and less context to keep track of. To my eye I find this much more elegant and far more readable.

"But, you've added duplication of the values used in the test", I hear you cry! That is true and it was totally intentional. Let me explain why…

Firstly, as we have already mentioned, duplicating some test values actually makes the code more concise and readable. You can look at either the test setup/execution or the assertions in isolation without needing any other context. Generally, I would agree that avoiding duplication in code is a good idea, but in cases such at this, the clarity and readability far out way the duplication of a couple of values!

Secondly, I like my tests to be very explicit about what they are expecting the result to be. For example, in the original test, we asserted that the cause string returned was the same as the result of calling toString() on the cause exception object. However, what would happen if the underlying implementation of that toString() method changed. Suddenly my code would be returning different Json and my tests would know nothing about it. This might event break my clients - not a good situation.

Thirdly, I like the intention of my code to be very clear, especially so in my tests. For example, I might create some specific, but strange, test data value to exercise a certain edge case. If I just assigned it to a variable, then it would be very easy for someone to come along, think it was an error and correct it. However, if the value was explicitly used to both configure/run the code and in the assertion then I would hope another developer might think more carefully as to my intent before making a correction in multiple places.

My final, and possibly most important reason for not using variables to both run the test and assert against is that this can mask errors. Consider the following test case:

@Test
void shouldCalculateBasketTotals() {
    Basket basket = new Basket()
    LineItem item1 = new LineItem("Some item", Price(23.88))
    LineItem item2 = new LineItem("Another item", Price(46.78))
 
    basket.add(item1)
    basket.add(item2)
 
    assert basket.vatExclusivePrice == 
                       item1.vatExclusivePrice + item2.vatExclusivePrice
    assert basket.vat == item1.vat + item2.vat
    assert basket.totalPrice == item1.price + item2.price
}

In this test, what would happen if the LineItem class had an unnoticed rounding error in its VAT calculation that was then replicated into the basket code? As we are asserting the basket values based on the fixture test data we may never notice as both sides of the assertion would report the same incorrectly rounded value. By being more explicit in our test code we not only create code that is more concise, elegant and readable but we also find more errors:

@Test
void shouldCalculateBasketTotals() {
    Basket basket = new Basket()
    LineItem item1 = new LineItem("Some item", Price(23.88))
    LineItem item2 = new LineItem("Another item", Price(46.78))
 
    basket.add(item1)
    basket.add(item2)
 
    assert basket.vatExclusivePrice == 58.88
    assert basket.vat == 11.78
    assert basket.totalPrice == 70.66
}

Finally, a couple more points that people might raise and my answers to them (just for completeness!):

Q: Can't we move the constant variables up to the class level and initialise them in a setup to keep the tests methods concise?
A: This makes an individual method contain less lines of code but fails my consiseness test as there is context outside of the test method that is required to understand what it does. You actually have to go and look elsewhere to find out what values your test is executing against. This is also less readable!

Q: Isn't is better to share test fixtures across test methods in order to keep the test class as a whole more concise?
A: This might be appropriate for some facets, such as any mocks that are used in the test or if there is a particularly large data set required for the test which is not part of the direct subject of the test. However, I'd rather see slightly longer individual test methods that are explicit in their test data and assertions, even if this means some duplication or a slightly longer overall test class.