Friday, December 23, 2016

Fuzz testing Hexml with AFL

Summary: Hexml 0.1 could read past the end of the buffer for malformed documents. Fuzz testing detected that and I fixed it in Hexml 0.2.

I released Hexml, my fast DOM-based XML parser, and immediately Austin Seipp got suspicious. Here was a moderately large piece of C code, taking untrusted inputs, and poking around in the buffer with memcpy and memchr. He used American Fuzzy Lop (AFL) to fuzz test the Hexml C code, and came up with a number of issues, notably a buffer read overrun on the fragment:

<a b=:fallback

With a lot of help from Austin I set up AFL, fixed some issues with Hexml and with how AFL was being run, released Hexml 0.2 fixing these issues, and incorporated AFL into my Travis CI builds.

If you want to actually follow all the steps on your computer, I recommend reading the original GitHub issue from Austin. Alternatively, check out Hexml and run sh afl.sh.

Building and installing AFL

The first step was to build and install AFL from the tarball, including the LLVM pieces and libdislocator. The LLVM mode allows faster fuzzing, and libdislocator is a library that places all allocations next to a page boundary - ensuring that if there is a buffer read overrun it results in a segfault that AFL can detect.

An AFL test case

To run AFL you write a program that takes a filename as an argument and "processes" it. In my case that involves calling hexml_document_parse - the full version is online, but the salient bits are:

#include "hexml.c"
... other imports ...

int main(int argc, char** argv)
{
    __AFL_INIT();
    ... read file from argv[1] ...
    document *doc = hexml_document_parse(contents, length);
    hexml_document_free(doc);
    return 0;
}

Here I statically #include the hexml.c codebase and have a main function that calls __AFL_INIT (to make testing go faster), reads from the file, then parses/frees the document. If this code crashes, I want to know about it.

The original AFL driver code used __AFL_LOOP to speed things up further, but that results in a huge number of spurious failures, so I removed it.

Running AFL

To run AFL on my code requires compiling it with one AFL tool, then running it through another. The steps are:

AFL_HARDEN=1 afl-clang-fast -O2 -Icbits cbits/fuzz.c -o $PWD/hexml-fuzz
AFL_PRELOAD=/usr/local/lib/afl/libdislocator.so afl-fuzz -T hexml -x /usr/local/share/afl/dictionaries/xml.dict -i $PWD/xml -o $PWD/afl-results -- $PWD/hexml-fuzz @@

I compile with AFL_HARDEN to detect more bugs, producing hexml-fuzz. I run with libdislocator loaded so that my small buffer overrun turns into a fatal segfault. I give afl-fuzz a dictionary of common XML fragments and a few simple XML documents, then let it run over hexml-fuzz. The interactive UI shows bugs as they occur.

Fixing the bugs

Running AFL on Hexml 0.1 produced lots of bugs within a few seconds. Each bug produced an input file, which I then ran through a debugger. While there were a few distinct bug locations, they all shared a common pattern. Hexml parses a NUL-terminated string, and in some cases I looked at a character that was potentially NUL and consumed it in the parsing. That might consume the terminating NUL, meaning that any further parsing was reading past the end of the string. I audited all such occurrences, fixed them, and reran AFL. Since then I have been unable to find an AFL bug despite lots of compute time.

Running on CI

I run all my code on Travis CI to ensure I don't introduce bugs, and to make accepting pull requests easier (I don't even need to build the code most of the time). Fortunately, running on Travis isn't too hard:

AFL_PRELOAD=/usr/local/lib/afl/libdislocator.so timeout 5m afl-fuzz -T hexml -x /usr/local/share/afl/dictionaries/xml.dict -i $PWD/xml -o $PWD/afl-results -- $PWD/hexml-fuzz @@ > /dev/null || true
cat afl-results/fuzzer_stats
grep "unique_crashes *: 0" afl-results/fuzzer_stats

I pipe the output of AFL to /dev/null since it's very long. I run for 5 minutes with timeout. After the timeout hits, I display the fuzzer_stats file and then grep for 0 crashes, failing if it isn't there.

Conclusions

Writing C code is hard, especially if it's performance orientated, and if it's not performance orientated you might want to consider a different language. Even if you don't want to use your code on untrusted input, sooner or later someone else will, and even tiny bugs can result in complete exploits. AFL does a remarkable job at detecting such issues and has made Hexml the better for it.

Monday, December 12, 2016

New XML Parser, Hexml

Summary: I've released a new Haskell library, Hexml, which is an incomplete-but-fast XML parser.

I've just released Hexml, a new C/Haskell library for DOM-style XML parsing that is fast, but incomplete. To unpack that a bit:

  • Hexml is an XML parser: you give it a string representing an XML document, it parses that string, and returns either a parse error or a representation of that document. Once you have the document, you can get the child nodes/attributes, walk around the document, and extract the text.

  • Hexml is really a C library, which has been designed to be easy to wrap in Haskell, plus a Haskell wrapper on top (a sketch of such a binding appears below). It should be easy to use Hexml directly from C if desired.

  • Hexml has been designed for speed. In the very limited benchmarks I've done it is typically just over 2x faster at parsing than Pugixml, where Pugixml is the gold standard for fast XML DOM parsers. In my uses it has turned XML parsing from a bottleneck to an irrelevance, so it works for me.

  • To gain that speed, Hexml cheats. Primarily it doesn't do entity expansion, so &amp; remains as &amp; in the output. It also doesn't handle CData sections (but that's because I'm lazy) and comment locations are not remembered. It also doesn't deal with most of the XML standard, ignoring the DOCTYPE stuff.

If you want a more robust version of Hexml then the Haskell pugixml binding on Hackage is a reasonable place to start, but be warned that it has memory issues that can cause segfaults. It also requires C++, which makes use through GHCi more challenging.
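The C entry points are deliberately small, so a Haskell binding is mostly mechanical FFI. As a rough sketch (the function names are the ones shown in the fuzzing post above; the argument and result types here are assumptions for illustration, not the actual hexml binding):

{-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls #-}

import Foreign.Ptr (Ptr)
import Foreign.C.Types (CInt(..))
import Foreign.C.String (CString)

-- Opaque handle to the C-side document structure
data Document

-- Parse a buffer of the given length, returning a handle to the parsed document
foreign import ccall unsafe "hexml_document_parse"
    c_document_parse :: CString -> CInt -> IO (Ptr Document)

-- Free the document once it is no longer needed
foreign import ccall unsafe "hexml_document_free"
    c_document_free :: Ptr Document -> IO ()

The real Haskell wrapper then layers a safe, garbage-collected API on top of pointers like these.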

Speed techniques

To make Hexml fast I first read the chapter on fast parsing with Pugixml, and stole all those techniques. After that, I introduced a number of my own.

  • I only work on UTF8, which, for the bits of UTF8 I care about, is the same as ASCII - so I don't need to do any character decoding.

  • Since I don't do entity expansion, all strings are available in the source document, so everything simply provides offsets into the input string. In the Haskell API I use constant-time bytestring slices into the source string to present a nice API.

  • The memory model for a document is an array of attributes, an array of nodes, and a root node from the list of nodes. To make sure that scanning a document is fast, each node describes its attributes and direct child nodes in terms of a start and length within the attribute and node arrays (see the sketch after this list). For example, the root node might have attributes 1..5 in the attribute array, and direct children 4..19 in the node array. When scanning the child nodes there are no linked-list operations and everything is cache friendly.

  • To keep the memory compact for attributes, I just have an array and reallocate/copy as necessary. By always doubling the number of attributes on exhaustion I ensure a worst-case of 1-copy per attribute on average.

  • Keeping the memory compact for nodes is a bit more complex, as the direct child nodes are not necessarily allocated consecutively, since child nodes may themselves have child nodes. The solution is to have an array of nodes, with contiguous allocation of used child nodes starting at the beginning. To ensure the child nodes are contiguous I first put the nodes at the end of the array, then copy them after a child is complete -- in effect using the end of the array as a stack. By always doubling the number of nodes on exhaustion I ensure a worst-case of 2-copies per node on average.

  • When parsing the text in the body of a document, since I don't care about &, the only character that is of any interest is <. That allows me to process much of the document with the highly-optimised memchr.

  • I initially allocate a single buffer that contains the document, a small number of attributes and a small number of nodes, in a single call to malloc. If more attributes/nodes are required, I allocate a fresh buffer and just ignore the initially provided one. That ensures that small documents don't pay for multiple malloc calls, at the cost of wasting the initial attribute/node allocation on larger documents (which are more memory heavy anyway - so it doesn't matter).

  • I'm pretty sure Hexml could be optimised further. Specifically, I have a recursive descent parser, and it should be a single function with goto. I also process some characters multiple times, mostly to ensure predictable abstraction barriers around the parsing functions, but that could be eliminated with a goto-based approach.
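To make the node/attribute layout concrete, here is a small Haskell sketch of the index-based representation described in the memory model bullet above - an illustration of the idea only, not Hexml's actual C structs (which use flat C arrays rather than lists):

-- Everything is an offset: strings are slices of the input document, and each
-- node refers to a contiguous run of entries in the attribute and node arrays.
data Str  = Str  { strStart :: Int, strLength :: Int }
data Attr = Attr { attrName :: Str, attrValue :: Str }
data Node = Node { nodeName     :: Str
                 , nodeAttrs    :: (Int, Int)   -- (start, length) into the attribute array
                 , nodeChildren :: (Int, Int) } -- (start, length) into the node array

-- Scanning the direct children is just a slice of the node array - no linked lists
childrenOf :: [Node] -> Node -> [Node]
childrenOf allNodes n = let (start, len) = nodeChildren n in take len (drop start allNodes)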

Installing the Haskell Network library on Windows

Summary: This post describes how to install the Haskell network library on Windows, again.

I recently bought a new computer, and tried to install GHC 8.0.1 then upgrade the network library using Cabal. As I have come to expect, it didn't work. Using Git Bash, I got the error:

$ cabal install network-2.6.3.1
Resolving dependencies...
Configuring network-2.6.3.1...
Failed to install network-2.6.3.1
Build log ( C:\Users\Neil\AppData\Roaming\cabal\logs\network-2.6.3.1.log ):
Configuring network-2.6.3.1...
configure: WARNING: unrecognized options: --with-compiler
checking for gcc... C:\ghc\GHC-80~1.1┼║
checking whether the C compiler works... no
configure: error: in `C:/Neil':
configure: error: C compiler cannot create executables
See `config.log' for more details
cabal: Leaving directory '.'
cabal.exe: Error: some packages failed to install:
old-time-1.1.0.3 failed during the configure step. The exception was:
ExitFailure 77

Running -v3 shows the CC variable is being set to C:\ghc\GHC-80~1.1┼║, which looks like a buffer corruption or encoding issue. I tried my previous solution, but it didn't work. My new solution is:

$ cabal unpack network-2.6.3.1
$ cd network-2.6.3.1
$ cabal configure
... fails with a similar error to above ...
$ sh ./configure
$ cabal build
$ cabal copy
$ cabal register

I had to repeat the same pattern for the latest version of old-time, and it worked.

Another way that works is to use Stack.


Tuesday, December 06, 2016

Undefined Behaviour in C

Summary: I tripped over undefined behaviour in C. It's annoying.

I've recently been writing some C code to parse XML quickly. While working on that project, I inadvertently wrote some code which is undefined according to the C language standard. The code compiled and ran fine using Visual Studio, but under gcc (even at -O0) it corrupted memory, sometimes leading to a segfault, but usually just leading to a wrong answer. The code in question was (see full code at GitHub):

d->nodes.nodes[0].nodes = parse_content(d);

To give some context, d is a structure that contains various pieces of state - what the string to be parsed is, how much we have parsed, along with a pointer to the output nodes. The parse_content function parses the bit inside an XML tag, returning the indices in nodes which it used.

The complication comes from nodes not being a fixed size, but dynamically resized if the number of nodes exceeds the capacity. For big documents that means parse_content will reallocate d->nodes.nodes.

According to the C spec, the compiler can evaluate the LHS and RHS of an assignment in any order. Since gcc computes the location of d->nodes.nodes[0] before calling parse_content, it uses the address of the node from before reallocation. After reallocation the address will have changed, and the assignment will be made to the wrong location.

I spotted the bug by inserting printf statements, and in doing so, I had to rewrite the code to:

str content = parse_content(d);
d->nodes.nodes[0].nodes = content;

That fixes the issue, since now the evaluation order is strictly defined. As a simplified example of the same issue:

#include <stdlib.h>

char* array;

char f() {
    array = malloc(42);  /* replaces the global pointer as a side effect */
    return 'x';
}

void test() {
    array = malloc(0);
    /* Undefined: the compiler may compute &array[0] before or after the call to f() */
    array[0] = f();
}

Here the line array[0] = f() might assign to either the result of malloc(0) or malloc(42), at the compiler's discretion.

I manually checked if I had made any other such mistakes, and I couldn't find any. Naturally, I wanted to find a static checker that could detect such a mistake, so I tried a bunch of them. I wasn't very successful:

  • Visual Studio 2015 code analysis made me write assert after each malloc, but nothing further.
  • PVS Studio found nothing.
  • The Clang undefined behaviour sanitizer found nothing, and seemingly doesn't work on Windows.
  • The GCC undefined behaviour sanitizer found nothing, and seemingly doesn't work on Windows.
  • RV-Match hit a stack-overflow when running the program.

Wednesday, November 23, 2016

The Haskell.org Website Working Group (HWWG)

Haskell represents both a language and a user community - and moreover a fantastic community full of friends, fun, and deep technical debate. Unfortunately, in recent times the community has started to fracture, e.g. Cabal vs Stack, haskell.org vs haskell-lang.org. These divisions have gone beyond technical disagreements and at some points turned personal. The solution, shepherded by Simon Peyton Jones, and agreed to by both members of the haskell.org committee and the maintainers of haskell-lang.org, is to form the Haskell Website Working Group (HWWG). The charter of the group is at the bottom of this post.

The goal of the Haskell Website Working Group is to make sure the Haskell website caters to the needs of Haskell programmers, particularly beginners. In doing so we hope to either combine or differentiate haskell.org and haskell-lang.org, and give people clear recommendations of what "downloading Haskell" means. Concretely, we hope that either haskell-lang.org redirects to haskell.org, or that haskell-lang.org ends up being used for something very different from today.



The Haskell Website Working Group (HWWG)

Scope and goals

  • The HWWG is responsible for the design and content of the user-facing haskell.org web site, including tutorials, videos, news, resources, downloads, etc.

  • The HWWG is not responsible for:
    • The infrastructure of haskell.org
    • Toolchains, Hackage, compilers, etc
  • The HWWG focuses on serving users of Haskell, not suppliers of technology or libraries.
  • An explicit goal is to re-unite the haskell.org and haskell-lang.org web sites.

Expected mode of operation

  • HWWG is not responsible for actually doing everything! The web site is on GitHub. Anyone can make a pull request. The general expectation is that uncontroversial changes will be accepted and committed without much debate.
  • If there is disagreement about a proposed change, it's up to the HWWG to engage in (open) debate, and to seek input as widely as time allows, but finally to make a decision.

Membership

Initial membership comprises:

  • Neil Mitchell (chair)
  • Nicolas Wu
  • Andrew Cowie
  • Vincent Hanquez
  • Ryan Trinkle
  • Chris Done

It is expected the committee will change over time, but the mechanism has not yet been thought about.

Rules of engagement

  • Recognising that honestly-held judgements may differ, we will be scrupulously polite both in public and in private.
  • Recognising that Haskell has many users, and that different users have different needs and tastes, we want haskell.org to be inclusive rather than exclusive, providing a variety of alternative resources (toolchains, tutorials, books, etc) clearly signposted with their intended audiences.
  • Ultimately the haskell.org committee owns the haskell.org URL, but it delegates authority for the design and content of the web site to the HWWG. In extremis, if the haskell.org committee believes that the HWWG is mismanaging the web site, it can revoke that delegation.

Thursday, September 29, 2016

Full-time Haskell jobs in London, at Barclays

Update: I no longer work for Barclays and am no longer hiring.

Summary: I'm hiring 9 Haskell programmers. Email neil.d.mitchell AT barclays.com to apply.

I work for Barclays, in London, working on a brand new Haskell project. We're looking for nine additional Haskell programmers to come and join the team.

What we offer

A permanent job, writing Haskell, using all the tools you know and love – GHC/Cabal/Stack etc. In my first two weeks in the role I've already written parsers with attoparsec, both Haskell and HTML generators, and am currently generating a binding to C with lots of Storable/Ptr stuff. Since it's a new project you will have the chance to help shape it.

The project itself is to write a risk engine – something that lets the bank calculate the value of the trades it has made, and how things like changes in the market change their value. Risk engines are important to banks and include lots of varied tasks – talking to other systems, overall structuring, actual computation, streaming values, map/reduce.

We'll be following modern but lightweight development principles – including nightly builds, no check-ins to master which break the tests (enforced automatically) and good test coverage.

These positions have attractive salary levels.

What we require

We're looking for the best functional programmers out there, with a strong bias towards Haskell. We have a range of seniorities available to suit even the most experienced candidates. We don't have anything at the very junior end; instead we're looking for candidates that are already fluent and productive. That said, a number of very good Haskell programmers think of themselves as beginners even after many years, so if you're not sure, please get in touch.

We do not require any finance knowledge.

The role is in London, Canary Wharf, and physical presence in the office on a regular basis is required – permanent remote working is not an option.

How to apply

To apply, email neil.d.mitchell AT barclays.com with a copy of your CV. If you have any questions, email me.

The best way to assess technical ability is to look at code people have written. If you have any packages on Hackage or things on GitHub, please point me at the best projects. If your best code is not publicly available, please describe the Haskell projects you've been involved in.

Sunday, August 14, 2016

The Four Flaws of Haskell

Summary: Last year I made a list of four flaws with Haskell. Most have improved significantly over the last year.

No language/compiler ecosystem is without its flaws. A while ago I made a list of the flaws I thought might harm people using Haskell in an industrial setting. These are not flaws that impact beginners, or flaws that stop people from switching to Haskell, but those that might harm a big project. Naturally, everyone would come up with a different list, but here is mine.

Package Management: Installing a single consistent set of packages used across a large code base used to be difficult. Upgrading packages within that set was even worse. On Windows, anything that required a new network package was likely to end in tears. The MinGHC project solved the network issue. Stackage solved the consistent set of packages, and Stack made it even easier. I now consider Haskell package management a strength for large projects, not a risk.

IDE: The lack of an IDE certainly harms Haskell. There are a number of possibilities, but whenever I've tried them they have come up lacking. The fact that every Haskell programmer has an entrenched editor choice doesn't make it an easier task to solve. Fortunately, with Ghcid there is at least something near the minimum acceptable standard for everyone. At the same time various IDE projects have appeared, notably the Haskell IDE Engine and Intero. With Ghcid the lack of an IDE stops being a risk, and with the progress on other fronts I hope the Haskell IDE story continues to improve.

Space leaks: As Haskell programs get bigger, the chance of hitting a space leak increases, becoming an inevitability. While I am a big fan of laziness, space leaks are the big downside. Realising space leaks were on my flaws list, I started investigating methods for detecting space leaks, coming up with a simple detection method that works well. I've continued applying this method to other libraries and tools. I'll be giving a talk on space leaks at Haskell eXchange 2016. With these techniques space leaks don't go away, but they can be detected with ease and solved relatively simply - no longer a risk to Haskell projects.

Array/String libraries: When working with strings/arrays, the libraries that tend to get used are vector, bytestring, text and utf8-string. While each is individually a nice project, they don't work seamlessly together. The utf8-string package provides UTF8 semantics for bytestring, which provides pinned byte arrays. The text package provides UTF16 encoded unpinned Char arrays. The vector package provides mutable and immutable vectors which can be either pinned or unpinned. I think the ideal situation would be a type that was either pinned or unpinned based on size, where the string was just a UTF8 encoded array with a newtype wrapping. Fortunately the foundation library provides exactly that. I'm not brave enough to claim a 0.0.1 package released yesterday has derisked Haskell projects, but things are travelling in the right direction.

It has certainly been possible to use Haskell for large projects for some time, but there were some risks. With the improvements over the last year the remaining risks have decreased markedly. In contrast, the risks of using an untyped or impure language remain significant, making Haskell a great choice for new projects.

Thursday, August 04, 2016

Upcoming talk: Writing build systems with Shake, 16 Aug 2016, London

Summary: I'm giving a talk on Shake.

I'm delighted to announce that I'll be giving a talk/hack session on Shake as part of the relatively new "Haskell Hacking London" meetup.

Title: Writing build systems with Shake

Date: Tuesday, August 16, 2016. 6:30 PM

Location: Pusher Office, 28 Scrutton Street, London

Abstract: Shake is a general purpose library for expressing build systems - forms of computation, with caching, dependencies and more besides. Like all the best stuff in Haskell, Shake is generic, with details such as "files" written on top of the generic library. Of course, the real world doesn't just have "files", but specifically has "C files that need to be compiled with gcc". In this hacking session we'll look at how to write Shake rules, what existing functions people have already layered on top of Shake for compiling with specific compilers, and consider which rules are missing. Hopefully by the end we'll have a rule that people can use out-of-the-box for compiling C++ and Haskell.

To put it another way, it's all about layering up. Haskell is a programming language. Shake is a Haskell library for dependencies, minimal recomputation, parallelism etc. Shake also provides a layer on top (but inside the same library) for writing rules about files, and ways to run command line tools. Shake doesn't yet provide a layer that compiles C files, but it does provide the tools with which you can write your own. The aim of this talk/hack session is to figure out what the next layer should be, and write it. It is definitely an attempt to move into the SCons territory of build systems, which know how to build C etc. out of the box.
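For readers who haven't seen a Shake rule before, the kind of thing the session will build on looks roughly like this - a minimal sketch (assuming a main.c next to the build script), not code from the talk:

import Development.Shake
import Development.Shake.FilePath

main :: IO ()
main = shakeArgs shakeOptions $ do
    want ["main.o"]
    -- Each object file depends on the corresponding C source, and is built with gcc
    "*.o" %> \out -> do
        let src = out -<.> "c"
        need [src]
        cmd "gcc -c" [src] "-o" [out]

The hack session is about capturing rules like this one in a reusable layer, rather than repeating them in every build script.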

Monday, July 25, 2016

Maintainer required for Derive

Summary: I'm looking for a maintainer to take over Derive. Interested?

The Derive tool is a library and set of definitions for generating fragments of code in a formulaic way from a data type. It has a mechanism for guessing the pattern from a single example, plus a more manual way of writing out generators. It supports 35 generators out of the box, and is depended upon by 75 libraries.

The tool hasn't seen much love for some time, and I no longer use it myself. It requires somewhat regular maintenance to upgrade to new versions of GHC and haskell-src-exts. There are lots of directions the tool could be taken: more derivations, integration with the GHC Generic derivation system etc. There are a few generic pieces that could be broken off (translation between Template Haskell and haskell-src-exts, the guessing mechanism).

Anyone who is interested should comment on the GitHub ticket. In the absence of any volunteers I may continue to do the regular upgrade work, or may instead have it taken out of Stackage and stop upgrading it.

Monday, July 18, 2016

Why did Stack stop using Shake?

Summary: Stack originally used Shake. Now it doesn't. There are reasons for that.

The Stack tool originally used the Shake build system, as described on the page about Stack's origins. Recently Edward Yang asked why doesn't Stack still use Shake - a very interesting question. I've taken the information shared in that mailing list thread and written it up, complete with my comments and distortions/inferences.

Stack is all about building Haskell code, in ways that obey dependencies and perform minimal rebuilds. Already in Haskell the dependency story is somewhat muddied. GHC (as available through ghc --make) does advanced dependency tracking, including header includes and custom Template Haskell dependency directives. You can also run ghc in single-shot mode, compiling a file at a time, but the result is about 3x slower and GHC will still do some dependency tracking itself anyway. Layered on top of ghc --make is Cabal which is responsible for tracking dependencies with .cabal files, configured Cabal information and placing things into the GHC package database. Layered on top of that is Stack, which has multiple projects and needs to track information about which Stackage snapshot is active and shared build dependencies.

Shake is good at taking complex dependencies and hiding all the messy details. However, for Stack many of these messy details were the whole purpose of the project. When Michael Snoyman and Chris Done were originally writing Stack they didn't have much experience with Shake, and opted to go for simplicity and directly managing the pieces, which they viewed as less risky.

Now that Stack is written, and works nicely, the question becomes whether it is worth changing existing working code to make use of Shake. Interestingly, at the heart of Stack there is a "Shake-lite" - see Control.Concurrent.Execute. This piece could certainly be replaced by Shake, but what would the benefit be? Looking at it with my Shake implementer's hat on, there are a few things that spring to mind:


  • This existing code is O(n^2) in lots of places. For the size of Stack projects, compared to the time required to compile Haskell, that probably doesn't matter.

  • Shake persists the dependencies, but the Stack code does not seem to. Would that be useful? Or is the information already persisted elsewhere? Would Shake persisting the information make stack builds which had nothing to do go faster? (The answer is almost certainly yes.)

  • Since the code is only used on one project it probably isn't as well tested as Shake, which has a lot of tests. On the other hand, it has a lot less features, so a lot less scope for bugs.

  • The code makes a lot of assumptions about the information fed to it. Shake doesn't make such assumptions, and thus invalid input is less likely to fail silently.

  • Shake has a lot of advanced dependency forms such as resources. Stack currently blocks when simultaneous configures are tried, whereas Shake would schedule other tasks to run.

  • Shake has features such as profiling that are not worth creating for a single project, but that when bundled in the library can be a useful free feature.

In some ways Stack as it stands avoids a lot of the best selling points about Shake:

  • If you have lots of complex interdependencies, Shake lets you manage them nicely. That's not really the case for Stack, but is in large heterogeneous build systems, e.g. the GHC build system.

  • If you are writing things quickly, Shake lets you manage exceptions/retries/robustness quickly. For a project which has the effort invested that Stack does, that's less important, but for things like MinGHC (something Stack killed), it was critically important because no one cared enough to do all this nasty engineering.

  • If you are experimenting, Shake provides a lot of pieces (resources, parallelism, storage) that help explore the problem space without having to do lots of work at each iteration. That might mean Shake is more of a benefit at the start of a project than in a mature project.

If you are writing a version of Stack from scratch, I'd certainly recommend thinking about using Shake. I suspect it probably does make sense for Stack to switch to Shake eventually, to simplify ongoing maintenance, but there's no real hurry.

Tuesday, July 05, 2016

More space leaks: Alex/Happy edition

Summary: Alex and Happy had three space leaks, now fixed.

Using the techniques described in my previous blog post I checked happy and alex for space leaks. As expected, both had space leaks. Three were clear and unambiguous space leaks, two were more nuanced. In this post I'll describe all five, starting with the obvious ones.

1: Happy - non-strict accumulating fold

Happy contains the code:

indexInto :: Eq a => Int -> a -> [a] -> Maybe Int
indexInto _ _ []                 = Nothing
indexInto i x (y:ys) | x == y    = Just i
                     | otherwise = indexInto (i+1) x ys

This code finds the index of an element in a list, always being first called with an initial argument of 0. However, as it stands, the first argument is a classic space leak - it chews through the input list, building up an equally long chain of +1 applications, which are only forced later.

The fix is simple, change the final line to:

let j = i + 1 in j `seq` indexInto j x ys

Or (preferably) switch to using the space-leak free Data.List.elemIndex. Fixed in a pull request.
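For reference, the elemIndex formulation looks something like the following sketch (the offset argument is kept only to mirror the original signature; as noted above, the real call sites always start it at 0):

import Data.List (elemIndex)

indexInto :: Eq a => Int -> a -> [a] -> Maybe Int
indexInto i x ys = (i +) <$> elemIndex x ys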

2: Happy - sum using foldr

Happy also contained the code:

foldr (\(a,b) (c,d) -> (a+c,b+d)) (0,0) conflictList

The first issue is that the code is using foldr to produce a small atomic value, when foldl' would be a much better choice. Even after switching to foldl' we still have a space leak because foldl' only forces the outer-most value - namely just the pair, not the Int values inside. We want to force the elements inside the pair so are forced into the more painful construction:

foldl' (\(a,b) (c,d) ->
    let ac = a + c; bd = b + d
    in ac `seq` bd `seq` (ac,bd))
    (0,0) conflictList

Not as pleasant, but it does work. In some cases people may prefer to define the auxiliary:

let strict2 f !x !y = f x y
in foldl' (\(a,b) (c,d) -> strict2 (,) (a+c) (b+d)) (0,0) conflictList

Fixed in a pull request.

3: Alex - lazy state in a State Monad

Alex features the code:

N $ \s n _ -> (s, addEdge n, ())

Here N roughly corresponds to a state monad with 2 fields, s and n. In this code n is a Map, which operates strictly, but the n itself is not forced until the end. We solve the problem by forcing the value before returning the triple:

N $ \s n _ -> let n' = addEdge n in n' `seq` (s, n', ())

Fixed in a pull request.

4: Alex - array freeze

Alex calls the Data.Array.MArray.freeze function, to convert an STUArray (unboxed mutable array in the ST monad) into a UArray (unboxed immutable array). Unfortunately the freeze call in the array library uses an amount of stack proportional to the size of the array. Not necessarily a space leak, but not ideal either. Looking at the code, it's also very inefficient, constructing and deconstructing lots of intermediate data. Fortunately under normal optimisation a rewrite rule fires for this type to replace the call with one to freezeSTUArray, which is much faster and has bounded stack, but is not directly exported.

Usually I diagnose space leaks under -O0, on the basis that any space leak problems at -O0 may eventually cause problems anyway if an optimisation opportunity is lost. In this particular case I had to compile that module with -O1.

5: Happy - complex fold

The final issue occurs in a function fold_lookahead, which when given lists of triples does an mconcat on all values that match in the first two components. Using the extra library that can be written as:

map (\((a,b),cs) -> (a,b,mconcat cs)) .
groupSort .
map (\(a,b,c) -> ((a,b),c))

We first turn each triple into a pair whose first component holds the triple's first two elements, call groupSort, then mconcat each group. However, in Happy this construction is encoded as a foldr doing an insertion sort on the first component, followed by a linear scan on the second component, then individual mappend calls. The foldr construction uses lots of stack (more than 1Mb), and also uses an O(n^2) algorithm instead of O(n log n).

Alas, the algorithms are not identical - the resulting list is typically in a different order. I don't believe this difference matters, and the tests all pass, but it does make the change more dangerous than the others. Fixed in a pull request.

The result

Thanks to Simon Marlow for reviewing and merging all the changes. After these changes Happy and Alex on the sample files I tested them with use < 1Kb of stack. In practice the space leaks discovered here are unlikely to materially impact any real workflows, but they probably go a bit faster.

Saturday, June 11, 2016

js-jquery 3.0.0 now out

The js-jquery Haskell library bundles the minified jQuery Javascript code into a Haskell package, so it can be depended upon by Cabal packages and incorporated into generated HTML pages. It does so in a way that doesn't require each Haskell package to bundle its own extra data files, and in a way that meets the licensing requirements of distributions such as Debian.

I've just released version 3.0.0, following on from jQuery 3.0.0 a few days ago. This release breaks compatibility with IE6-8, so if that's important to you, insert an upper bound on the package.

Friday, May 20, 2016

Another space leak: QuickCheck edition

Summary: QuickCheck had a space leak in property, now fixed (in HEAD).

Using the techniques described in my previous blog post I found another space leak, this time in QuickCheck, which has now been fixed. Using QuickCheck we can choose to "label" certain inputs, for example:

$ quickCheck $ \p -> label (if p > 0 then "+ve" else "-ve") True
+++ OK, passed 100 tests:
54% -ve
46% +ve

Here we label numbers based on their value, and at the end QuickCheck tells us how many were in each set. As you might expect, the underlying QuickCheck implementation contains a Map String Int to record how many tests get each label.

Unfortunately, the implementation in QuickCheck-2.8.1 has a space leak, meaning that the memory usage is proportional to the number of tests run. We can provoke such a space leak with:

quickCheckWithResult stdArgs{maxSuccess=10000} $
    \(p :: Double) -> label "foo" True

When running with ghc --make Main.hs -rtsopts && Main +RTS -K1K we get the error:

Main: Stack space overflow: current size 33624 bytes.

Using -K1K we detect the point at which the space leak is evaluated - at the end of the program, when trying to print out the summary statistics. The approach taken by QuickCheck for label is to generate a separate Map String Int per run, then at each step merge these Map values together using unionWith (+). As such, there are two likely culprits for the space leak:

  • Perhaps the Map is not evaluated, so in memory we have unionWith (+) x1 $ unionWith (+) x2 $ unionWith (+) x3 $ ....
  • Perhaps the values inside the Map are not evaluated, so in memory we have Map {"foo" = 1 + 1 + 1 + ...}.

QuickCheck avoids the first space leak by keeping its intermediate state in a record type with a strict field for the Map. QuickCheck suffers from the second problem. As usual, actually fixing the space leak is easy - just switch from importing Data.Map to Data.Map.Strict. The Strict module ensures that the computations passed to unionWith are forced before it returns, and the memory usage remains constant, not linear in the number of tests.
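A standalone sketch of the difference (not QuickCheck's actual code - the names and the run count here are invented for illustration):

import Data.List (foldl')
import qualified Data.Map as Lazy
import qualified Data.Map.Strict as Strict

leaky :: Lazy.Map String Int
leaky = foldl' (Lazy.unionWith (+)) Lazy.empty
    [Lazy.singleton "foo" 1 | _ <- [1 .. 10000 :: Int]]
-- the value sits in memory as the thunk 1 + 1 + 1 + ...

fixed :: Strict.Map String Int
fixed = foldl' (Strict.unionWith (+)) Strict.empty
    [Strict.singleton "foo" 1 | _ <- [1 .. 10000 :: Int]]
-- the strict unionWith forces each addition as the maps are merged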

I detected this space leak because the Shake test suite runs with -K1K and when running one particular test on a Mac with GHC 8.0 in profiling mode it caused a stack overflow. I did not diagnose which of those factors was the ultimate cause (it may have even been the random seed at a particular point in time - only certain inputs call label).

Many space leaks are now easy to detect (using -K1K), moderate difficulty to debug (using the -xc technique or just by eye) and usually easy to fix.

Tuesday, April 19, 2016

New Shake with better wildcard patterns

Summary: The new version of Shake supports ** patterns for directory wildcards.

I've just released Shake 0.15.6. Don't be misled by the 0.0.1 increment of the release, it's got over 50 entries in the changelog since the last release. There are quite a few bug fixes, documentation improvements and optimisations.

One of the most user visible features is the new wildcard patterns. In the previous version Shake supported // for matching any number of directories and * for matching within a path component, so to match all C source files in src you could write:

src//*.c

In the new version of Shake you can also write:

src/**/*.c

The // patterns remain supported, but I intend to encourage use of ** in new code if these patterns don't end up having any unforeseen problems. The advantages of the patterns in the new version are:

  • The ** patterns seem to be the de facto standard nowadays, being used by rsync, Ant, Gradle, Jenkins etc.
  • People would often write "src" </> "//*.c", which gives the unexpected result of //*.c. With ** you aren't overloading directories at the same time so everything works out as expected.
  • ** patterns only match relative files, not absolute ones, which is what you usually want in a build system. If you want to match absolute files use */**.
  • The semantics of patterns were a bit confusing for things like /// - I've now given them precise semantics, but ** avoids this confusion.
  • I've optimised the pattern matching for both flavours, so there is more precomputation and less backtracking (in practice I don't think that makes any difference).
  • I've optimised directory traversal using a file pattern, so it doesn't list directories that can't possibly match, which gives a significant speedup.

For this release I've also improved the website at shakebuild.com with more documentation - hopefully it is useful.

Monday, April 11, 2016

GHCid 0.6 Released

Summary: I've released a new version of GHCid, which can interrupt running tests.

I've just released version 0.6.1 of GHCid. To a first approximation, ghcid opens ghci and runs :reload whenever your source code changes, formatting the output to fit a fixed height console. Unlike other Haskell development tools, ghcid is intended to be incredibly simple - it works when nothing else does. This new version features:

Much faster: Since version 0.5 GHCid passes -fno-code to ghci when it makes sense, which is about twice as fast.

Interruptible test commands: Since version 0.4 ghcid has supported a --test flag to pass a test command (e.g. :main) which is run whenever the code is warning free. As of version 0.6 that command will be interrupted if it needs to :reload, allowing long running tests and persistent "tests" - e.g. spawning web servers or GUIs. Thanks to Reid Draper for showing it was possible as part of his ordeal project and Luigy Leon for merging that with GHCid.

Stack integration: If you have a stack.yaml file and a .stack-work directory it will use stack commands to run your project. Thanks to the Stack Team, in particular Michael Sloan, for helping get through all the hoops and providing the necessary functionality in Stack.

More restart/reload flags: It's been possible for a while to pass --restart to restart ghci if certain files change (e.g. the .cabal file). Now there is a separate --reload flag to cause :reload instead of a full restart, and both flags can take directories instead of individual files.

Major relayering: For 0.6 I significantly refactored much of the GHCid code. There has always been an underlying Language.Haskell.Ghcid API, and GHCid was built on top. With the new version the underlying library has been given a significant refactoring, mostly so that interruptions are handled properly without race conditions and with a sane multithreading story. On top of that is a new Session layer, which provides a session abstraction - a ghci instance which tracks more state (e.g. which warnings have been emitted for already loaded files). Then the Ghcid module builds on top, with much less state management. By simplifying and clarifying the responsibility of each layer certain issues such as leaking old ghci processes and obscure race conditions disappeared.


I've been making use of many of these features in the Shake website generator, which I invoke with:

ghcid --test=":main debug" --reload=parts --reload=../docs

This project uses Stack, so relies on the new stack integration. It runs :main debug as the test suite, which generates the website whenever the code reloads. Furthermore, if any of the parts (template files) or docs (Markdown pages) change, the website regenerates. I can now edit the website, and saving automatically regenerates the web pages within a second.

Wednesday, April 06, 2016

Github Offline Issues with IssueSync

For a while I've been looking for something to download the GitHub issues for a project. I do a lot of development work on a train with no internet, so referring to the tickets offline is very useful. I've tried lots of tools, in a very wide variety of languages (Ruby, Python, Perl, Javascript, PHP) - but most of them don't seem to work - and the only one I did manage to get working only gave a curses UI.

Finally, I've found one that works - IssueSync. Installing it worked as described. Running it worked as described. I raised tickets for the author and they fixed them. I even sent a pull request and the author discussed and merged it. It downloads all your issues to Markdown files in an issues directory. I then "list" my issues using:

head -n1 -q issues/*.md | grep -v CLOSED

It's simple and works nicely.

Thursday, March 03, 2016

Compiling GHC on Windows

Summary: How to compile GHC on Windows using Stack and the new Shake-based GHC build system.

Regularly updated instructions are at https://github.com/snowleopard/hadrian/blob/master/doc/windows.md - use them.

Here is a list of instructions to compile GHC, from source, on Windows. I tested these instructions on a clean machine using the free Windows 10 VirtualBox image (I bumped the VM CPUs to 4, and RAM to 4096Mb).

The first step is to install Stack (I just accepted all the defaults), then open a command prompt and run:

stack setup
stack install happy alex
stack exec -- pacman -S gcc binutils git automake-wrapper tar make patch autoconf --noconfirm
stack exec -- git clone --recursive git://git.haskell.org/ghc.git
cd ghc
stack exec -- git clone git://github.com/snowleopard/hadrian
stack build --stack-yaml=hadrian/stack.yaml --only-dependencies
stack exec --stack-yaml=hadrian/stack.yaml -- hadrian/build.bat -j

The entire process (after the VM has downloaded) takes a bit less than an hour. These steps use the Stack supplied tools (MinGW, Git), and the new Shake-based build system. The hope is that by using the isolation Stack provides, combined with the portability improvements from writing the build system in Haskell, these instructions will work robustly on many Windows machines.

I have not tried these instructions on other platforms, but suspect that removing the pacman line might be sufficient to get it to work.

Update: Instructions simplified following improvements to the build system.

Wednesday, February 17, 2016

Selling Haskell in the pub

Summary: A Haskell sales pitch without any code.

I often find myself in a pub, without pen or paper, trying to persuade someone to try Haskell. In these situations I tend to use three arguments:

Low cost abstractions: In Haskell the cost of creating a helper function is low, because the syntax for functions is very concise (a single line) and the optimiser often removes all overhead of the function (by inlining it). The power of such helper functions is greatly enhanced by higher-order functions. In many languages each function must be top-level (within a file or class), but Haskell permits functions local to a small block, providing encapsulation and less necessity for good names. In most languages there is a much higher cost per function, and thus trivial functions are insufficiently valuable. By reducing the cost, Haskell encourages both less repetition and describing problems in a more abstract way, which is very helpful for taming program complexity.

Refactoring works: In Haskell, refactoring is easy, safe and common. Most projects involve writing a chunk of code, then continually changing it as the project evolves. In other languages most refactorings have complex side conditions that must be met. In Haskell, most refactorings are simple, and even refactoring tools can have complex refactorings mechanically proven correct. Any code which violates the expected side-conditions is considered "dangerous" and libraries are expected to provide robust abstractions. The static type checker ensures that most refactorings have been carried out correctly. It is common to change a fundamental type in the middle of millions of lines of code, quickly make the changes required and have a high degree of confidence that it still works. The ability to easily refactor means that code can evolve with the project, without accumulating technical debt.

Language polyglots: There are few programmers who know only Haskell. I know Haskell programmers who are also experts in Perl, PHP, C, Javascript, C++, Fortran, R, Algol - pretty much any language you care to name. In contrast, when my wife has attended other programming language meetups, many of the participants knew only that language. There are many reasons Haskell programmers know lots of languages, not least because Haskell has rarely been taught as a first language. I find it interesting that many people who are experts in both Haskell and other languages typically prefer Haskell.

In response to these arguments, if people are starting to get convinced, they usually ask:

  • Are there lots of libraries? Yes, 9000+. There are more R libraries for statistics, more Perl libraries for Bioinformatics etc - but most of the standard stuff is covered. Wrapping a C library with the foreign function interface (FFI) isn't too hard.
  • What's the performance like? It's usually somewhere between C and Python. If you carefully optimise you can get close to the performance of C. The profiling tools are reasonable. Performance is not usually a problem, but can be solved with C and FFI if it is.

Many of the above arguments are also supportive of other statically typed functional languages. I tend to find my pub interventions are usually aimed at Python/Java/C++/PHP programmers, where I'd consider it a win if they tried O'Caml or Scala instead.

Monday, February 01, 2016

New HLint version and API features

Summary: In the new version of HLint, errors are warnings and warnings are suggestions. I plan to remove the old API very soon.

I've just released a new version of HLint, the tool for suggesting hints to improve your Haskell code. This version comes with two big changes.

Firstly, hints have been reclassified in severity. What used to be errors are now warnings, and hints that used to be warnings are now suggestions. As people have mentioned in the past, nothing HLint suggested was really an error, and now HLint matches that.

Secondly, there is now an hlint API entry point in Language.Haskell.HLint3 which captures the pattern of running HLint with command-line arguments, but in process as a library. With that, I don't think there are any API patterns that are better captured by the Language.Haskell.HLint API. If no one contacts me with issues, I will be making the module Language.Haskell.HLint a copy of Language.Haskell.HLint3 in the next release.
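As a sketch of what that looks like in practice (the hlint entry point taking command-line-style arguments is as described above; that the result is a list we can count is my assumption - check the module documentation for the precise type):

import Language.Haskell.HLint3 (hlint)

main :: IO ()
main = do
    -- Run HLint over a source directory, as if these were command-line arguments
    ideas <- hlint ["src"]
    putStrLn $ show (length ideas) ++ " hints found"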

This release (like all minor releases) fixes a few bugs and adds a few new features. As a few random examples, HLint now warns on a redundant DefaultSignatures language extension, suggests fmap in preference to liftM and warns when otherwise is used as a pattern variable. There is a complete change log in the repo.

Tuesday, January 19, 2016

One Bag of Huel

I've been a fan of the idea of meal replacement products for a while - the Matrix gruel always seemed simple and appealing. One such product is Huel, which is available in the UK, seems fairly sensibly thought out, and isn't cadmium-yummy. When having lunch at home, I found myself eating stuff that was nutritionally garbage, inconvenient and time consuming to get and not very nice. Given the downsides, Huel seemed worth a try. Having gone through one bag, it seems to be working nicely, and I intend to continue this as my default lunch-at-home plan.

General random thoughts on Huel:

  • My wife pronounces Huel with a strong and long 'u', huuuuuuel, to imitate the sound of someone vomiting. Soylent is probably a better name.
  • I bought a first bag, and they threw in a free flask/shaker and t-shirt. I've been using the flask they supplied, and the t-shirt is very nice.
  • I like that it has all the vitamins and minerals required, so I'm not likely to be missing anything.
  • The taste is OK. Not horrid, perfectly drinkable, but not too nice, so unlikely to go overboard. They got the balance right.
  • Most of the flavouring suggestions seem like someone trolling. The default vanilla flavour works fine for me.
  • The taste/consistency depends a lot on exactly how much water goes in, so experiment a bit (I find 420ml and 3 scoops works for me).
  • By the end of the flask, it's significantly more dense, so I add a little bit of water with about 40ml to go.
  • If you add the powder before the water it's a bit of a disaster and you get clumped powder at the bottom.
  • Wash the flask as soon as you finish, or it sets quite hard.
  • I think I spend about 3 mins preparation time making the mixture and washing up after, which isn't bad.
  • The pouches come sealed at the top with a resealable strip that is initially unsealed. Cut the top of the strip, don't tear it, or you bump into the resealable strip. Before starting, you have to clean the resealable strip out with a knife (which gives a bad first impression, but otherwise is unproblematic).
  • The powder has a habit of escaping a bit when filling up the flask. If it gets on clothes, it stays there until you wash them, and a gentle brushing down has little effect. A little annoying, but not fatal.
  • It's sufficiently filling that I think I'm probably going under the number of calories I should be getting at lunch. I'm experimenting with trying to start my lunch Huel a bit earlier, and then have another Huel or snack in the afternoon - splitting lunch in two.

Since this is a post about a specific product, I should probably mention I have no relationship with the company other than having spent £45 on Huel.

Thursday, January 14, 2016

A simple Haskell function

Summary: An example of a small function I recently wrote - from type signature to tests.

When writing a build system there are lots of nasty corner cases to consider. One is that command line limits combined with lots of arguments sometimes require splitting a single command up into multiple commands, each of which is under some maximum length. In this post I'll describe a function that was required, my implementation, and how I tested it.

Type signature and documentation

Before I even got to the function, it already had a type signature and some Haddock documentation:

-- | @chunksOfSize size strings@ splits a given list of strings into chunks not
--   exceeding @size@ characters. If that is impossible, it uses singleton chunks.
chunksOfSize :: Int -> [String] -> [[String]]

As an example:

chunksOfSize 5 ["this","is","a","test"] == [["this"],["is","a"],["test"]]

Implementation

My implementation was:

chunksOfSize n = repeatedly $ \xs ->
    let i = length $ takeWhile (<= n) $ scanl1 (+) $ map length xs
    in splitAt (max 1 i) xs

First we use the repeatedly function from the extra library. This has the signature:

repeatedly :: ([a] -> (b, [a])) -> [a] -> [b]

Given a list of input, you supply a function that splits off an initial piece and returns the rest. One of the examples in the documentation is:

repeatedly (splitAt 3) xs  == chunksOf 3 xs
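A reference definition matching that documented behaviour (the implementation in the extra library may differ in detail) is:

repeatedly :: ([a] -> (b, [a])) -> [a] -> [b]
repeatedly _ [] = []
repeatedly f xs = b : repeatedly f rest
    where (b, rest) = f xs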

So we can see how repeatedly lets us focus on just the "next step" of this list, ignoring the recursion. For the function argument we have two tasks - first decide how many items to put in this chunk, then to split the chunks. Splitting the chunks is the easy bit, and can be written:

splitAt (max 1 i) xs

If we know the next i elements will be at or below the limit, then we can use splitAt to divide the elements. As a special case, if no elements would be allowed, we allow one, using max 1 to ensure we never pass 0 to splitAt (and thus enter an infinite loop). That leaves us with:

i = length $ takeWhile (<= n) $ scanl1 (+) $ map length xs

Reading from right to left, we reduce each element to its length, then use scanl1 to produce a running total - so each element represents the total length up to that point. We then use takeWhile (<= n) to keep grabbing elements while they are short enough, and finally length to convert back to something we can use with splitAt.
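Working through the example from the top of this post, with a limit of 5 and input ["this","is","a","test"], the pipeline evaluates as:

map length ["this","is","a","test"]  == [4,2,1,4]
scanl1 (+) [4,2,1,4]                 == [4,6,7,11]    -- running totals
takeWhile (<= 5) [4,6,7,11]          == [4]           -- only "this" fits in this chunk
-- so i == 1, and splitAt (max 1 1) yields (["this"], ["is","a","test"]) for the next round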

Tests

When testing, I tend to start with a few concrete examples then move on to QuickCheck properties. As an initial example we can do:

quickCheck $
    chunksOfSize 3 ["a","b","c","defg","hi","jk"] ==
    [["a","b","c"],["defg"],["hi"],["jk"]]

Here we are explicitly testing some of the corner cases - we want to make sure the full complement of 3 get into the first chunk (and we haven't got an off-by-one), we also test a singleton chunk of size 4. Now we move on to QuickCheck properties:

quickCheck $ \n xs ->
    let res = chunksOfSize n xs
    in concat res == xs &&
       all (\r -> length r == 1 || length (concat r) <= n) res

There are really two properties here - first, the chunks concat together to form the original. Secondly, each chunk is either under the limit or a singleton. These properties capture the requirements in the documentation.

A final property we can check is that it should never be possible to move the first piece from a chunk to the previous chunk. We can write such a property as:

all (> n) $ zipWith (+)
    (map (sum . map length) res)
    (drop 1 $ map (length . head) res)

This property isn't as important as the other invariants, and is somewhat tested in the example, so I didn't include it in the test suite.

Performance and alternatives

The complexity is O(n) in the number of Char values, which is as expected, since we have to count them all. Some observations about this point in the design space:

  • In a strict language this would be an O(n^2) implementation, since we would repeatedly length and scanl the remainder of the tail each time. As it is, we are calling length on the first element of each chunk twice, so there is minor constant overhead.
  • Usually in Haskell, instead of counting the number of elements and then doing splitAt we would prefer to use span - something like span ((<= n) . fst) .... While possible, it makes the special singleton case more difficult, and requires lots of tuples/contortions to associate each element with its rolling sum.
  • For a build system, the entire input will be evaluated before, and the entire output will be kept in memory afterwards. However, if we think about this program with lazy streaming inputs and outputs, it will buffer each element of the output list separately. As a result memory would be bounded by the maximum of the longest string and the Int argument to chunksOfSize.
  • It is possible to write a streaming version of this function, which returns each String as soon as it is consumed, with memory bounded by the longest string alone. Moreover, if the solution above was to use lazy naturals, it would actually come quite close to being streaming (albeit gaining a quadratic complexity term from the takeWhile (<= n)).
  • The type signature could be generalised to [a] instead of String, but I would suspect in this context it's more likely for String to be replaced by Text or ByteString, rather than to be used on [Bool]. As a result, sticking to String seems best.

Refactoring the previous version

The function already existed in the codebase I was working on, so below is the original implementation. This implementation does not handle the long singleton special case (it loops forever). We can refactor it to support the singleton case, which we do in several steps. The original version was:

chunksOfSize _    [] = []
chunksOfSize size strings = reverse chunk : chunksOfSize size rest
  where
    (chunk, rest) = go [] 0 strings
    go res _         []     = (res, [])
    go res chunkSize (s:ss) =
        if newSize > size then (res, s:ss) else go (s:res) newSize ss
      where
        newSize = chunkSize + length s

Refactoring to use repeatedly we get:

chunksOfSize size = repeatedly $ second reverse . go [] 0
  where
    go res _         []     = (res, [])
    go res chunkSize (s:ss) =
        if newSize > size then (res, s:ss) else go (s:res) newSize ss
      where
        newSize = chunkSize + length s

Changing go to avoid the accumulator we get:

chunksOfSize size = repeatedly $ go 0
  where
    go _         []     = ([], [])
    go chunkSize (s:ss) =
        if newSize > size then ([], s:ss) else first (s:) $ go newSize ss
      where
        newSize = chunkSize + length s

It is then reasonably easy to fix the singleton bug:

chunksOfSize size = repeatedly $ \(x:xs) -> first (x:) $ go (length x) xs
  where
    go _         []     = ([], [])
    go chunkSize (s:ss) =
        if newSize > size then ([], s:ss) else first (s:) $ go newSize ss
      where
        newSize = chunkSize + length s

Finally, it is slightly simpler to keep track of the number of characters still allowed, rather than the number of characters already produced:

chunksOfSize size = repeatedly $ \(x:xs) -> first (x:) $ go (size - length x) xs
  where
    go n (x:xs) | let n2 = n - length x, n2 >= 0 = first (x:) $ go n2 xs
    go n xs = ([], xs)

Now we have an alternative version that is maximally streaming, only applies length to each element once, and would work nicely in a strict language. I find the version at the top of this post more readable, but this version is a reasonable alternative.

Acknowledgements: Thanks to Andrey Mokhov for providing the repo, figuring out all the weird corner cases with ar, and distilling it down into a Haskell problem.