Hello Vietnam

Speedy numpy replacement for matlab accumarray

I regularly have to translate MATLAB code into Python. Most functions translate fairly well to numpy functions, but the accumarray recipe I had been using so far performed quite poorly. So I was looking for a more elegant solution. Unfortunately, there is not much around, and I was about to write something up to ask on Stack Overflow when I had the idea for this little snippet:

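import numpy as np

def accum_np(accmap, a, func=np.sum):
    # Boundaries: every position where accmap changes, plus start and end
    indices = np.where(np.ediff1d(accmap, to_begin=[1], to_end=[1]))[0]
    vals = np.zeros(len(indices) - 1)
    for i in range(len(indices) - 1):
        # Reduce each contiguous run of equal accmap values
        vals[i] = func(a[indices[i]:indices[i+1]])
    return vals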

Careful: this quick hack only works with contiguous accmaps like 222111333, not with interleaved ones like 1212323. Every change from one number to another is treated as the start of a new group, which is what avoids the slow sorting.
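To make the caveat concrete, here is a minimal sketch of the boundary detection on both kinds of accmap. The helper name `run_boundaries` is made up for illustration, and the `np.add.reduceat` shortcut at the end is an alternative for plain sums only, not part of the snippet above.

```python
import numpy as np

def run_boundaries(accmap):
    """Start index of every run of equal values, plus the final end index."""
    # A nonzero difference marks a change of value; to_begin/to_end force
    # a boundary at the very start and the very end.
    return np.flatnonzero(np.ediff1d(accmap, to_begin=[1], to_end=[1]))

# Contiguous accmap: three runs, so boundaries at 0, 3, 6 and 9.
contiguous = run_boundaries(np.array([2, 2, 2, 1, 1, 1, 3, 3, 3]))
# Interleaved accmap: every element starts a new run (boundaries 0..7).
interleaved = run_boundaries(np.array([1, 2, 1, 2, 3, 2, 3]))
print(contiguous)
print(interleaved)

# For plain sums over contiguous runs, np.add.reduceat avoids the Python loop.
a = np.arange(9.0)
sums = np.add.reduceat(a, contiguous[:-1])  # per-run sums: 3, 12, 21
print(sums)
```

With the interleaved accmap, the seven elements end up as seven separate groups, which is why such input would need sorting first.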

Benchmarking shows that it’s more than 18x faster than the previous solution:

accmap = np.repeat(np.arange(100000), 20)
a = np.random.randn(accmap.size)

timeit accum_py(accmap, a)
>>> 1 loops, best of 3: 16.7 s per loop

timeit accum_np(accmap, a)
>>> 1 loops, best of 3: 887 ms per loop

For completeness, here are the timings with Octave:

accmap = repmat(1:100000, 20, 1)(:);
a = randn([numel(accmap), 1]);
tic; accumarray(accmap, a); toc
>>> Elapsed time is 0.05152 seconds.

Which actually makes me think of using some bigger guns for the problem now.

After a few days of hacking around with scipy.weave.inline, I’m down to this:

timeit accum(accmap, a)
1 loops, best of 3: 27 ms per loop

This seems pretty reasonable now when compared with Octave.

The new implementation comes with fast implementations written in C for the most common functions (sum, prod, min, max, mean, std, …), and falls back to a pure numpy solution for everything less common. It comes with a complete test suite, but if you run into any issues with it, please let me know!
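The idea of a fast path plus a generic fallback can be sketched in pure numpy. This is only an illustration under assumptions, not the actual implementation: the name `accum_sketch` is made up, `np.bincount` stands in for the C fast path, only sum gets a fast path, group indices are assumed to be non-negative integers starting at 0 (unlike MATLAB's 1-based indices), and empty groups are filled with 0.

```python
import numpy as np

def accum_sketch(accmap, a, func=np.sum):
    """Hypothetical dispatcher: fast path for sum, generic fallback otherwise."""
    accmap = np.asarray(accmap)
    a = np.asarray(a, dtype=float)
    size = accmap.max() + 1
    if func is np.sum:
        # Fast path: bincount sums the weights per group index in compiled code.
        return np.bincount(accmap, weights=a, minlength=size)
    # Generic fallback: a stable sort makes groups contiguous, then func is
    # applied to each group in a Python loop.
    order = np.argsort(accmap, kind="stable")
    sorted_map = accmap[order]
    sorted_a = a[order]
    starts = np.flatnonzero(np.ediff1d(sorted_map, to_begin=[1]))
    out = np.zeros(size)
    for group, chunk in zip(sorted_map[starts], np.split(sorted_a, starts[1:])):
        out[group] = func(chunk)
    return out

accmap = np.array([1, 2, 1, 2, 3, 2, 3])
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
group_sums = accum_sketch(accmap, a)           # group 0 is empty and stays 0
group_maxs = accum_sketch(accmap, a, np.max)
print(group_sums)
print(group_maxs)
```

Unlike the quick hack, this sketch also handles interleaved accmaps, at the price of a sort in the fallback path.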


This blog post is quite outdated by now. In collaboration with @d1mansion, all of this developed further into a nifty little Python package called numpy-groupies, available on PyPI and on GitHub. For more info on this topic, usage details and benchmarks, see the project page on GitHub.

CV polishing for advanced users

Freedom of the press, quo vadis? As heise reports, Mr. Lindner does not seem to be a great fan of freedom of opinion and of the press, and thereby thoroughly caricatures the name of his own party. According to the report, the ever opinion-flexible hothead of the 3% lobby club first had his CV on Wikipedia polished, and is now taking legal action against someone calling a spade a spade. The first victim of this farce was a thoroughly readable article in Wirtschaftswoche, which has since been removed by its own editorial staff for lack of backbone. Anyone who misses the article as a symbolic expression of freedom of the press and of opinion is lucky in their misfortune: Archive.org forgets nothing!

Wikivoyage parser with heuristics

While travelling through Vietnam, I used Wikivoyage quite intensely as a travel guide, and so I started to contribute to some articles there myself. When editing an article, cleaning up and structuring long semi-formatted lists of hotels and restaurants was especially annoying, and given the semi-structured shape of the lists, it’s not straightforward to automate the formatting.

Being annoyed enough by the editing, I took it as a challenge and wrote a parser that uses a bunch of heuristic rules to classify the list entries, split them into chunks, apply formatting rules to the chunks, and merge everything back into a nicely formatted list entry. So some ugly unstructured listing like

* '''Birmingham Buddhist Centre''', 11 Park Rd, Moseley (''#1, #35 or #50 bus''), ''+44 121'' 449 5279 (''[mailto:info@birminghambuddhistcentre.org.uk info@birminghambuddhistcentre.org.uk]''), [http://www.birminghambuddhistcentre.org.uk/]. A centre run by the Friends of the Western Buddhist Order'' .

* '''Hotel Indah Manila''' 350 A J Villegas St. Tel: ''+63 2'' 5361188, 5362288. [http://www.hotelindah.com/] Rates start at ₱2000 for this modest 76-room hotel. Facilities include Café Indah and conference and function rooms. Airport and city transfers, tour assistance, and laundry service are available.

becomes nicely formatted into

* {{vCard| type=sight| subtype=religious| name=Birmingham Buddhist Centre| address=11 Park Rd, Moseley| directions=#1, #35 or #50 bus| phone=+44 121 449 5279| email=info@birminghambuddhistcentre.org.uk| url=http://www.birminghambuddhistcentre.org.uk/| description=A centre run by the Friends of the Western Buddhist Order.}}

* {{vCard| type=hotel| subtype=hotel| name=Hotel Indah Manila| address=350 A J Villegas St| phone=+63 2 5361188, 5362288| url=http://www.hotelindah.com/| price=Rates start at ₱2000 for this modest 76-room hotel| description=Facilities include Café Indah and conference and function rooms. Airport and city transfers, tour assistance, and laundry service are available.}}
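The classify, split, format and merge steps can be sketched with a few regular expressions. This is a heavily simplified illustration, not the actual parser: the rule list, the function name `to_vcard` and the patterns are all assumptions, the type is passed in by hand instead of being classified, and address and price are not separated, so they simply stay in the description.

```python
import re

# Hypothetical heuristics, tried in order: (vCard field, pattern with one group).
RULES = [
    ("name", re.compile(r"'''(.+?)'''")),
    ("url", re.compile(r"\[(https?://[^\s\]]+)[^\]]*\]")),
    ("phone", re.compile(r"Tel:\s*([+\d' ,]*\d)\.?")),
]

def to_vcard(listing, vcard_type):
    """Cut recognised chunks out of a listing and merge them into a vCard."""
    rest = listing.lstrip("* ")
    fields = {"type": vcard_type}
    for field, pattern in RULES:
        match = pattern.search(rest)
        if match:
            # Strip wiki italics markup from the captured chunk.
            fields[field] = match.group(1).replace("''", "")
            # Remove the chunk so that only unclassified text remains.
            rest = rest[:match.start()] + " " + rest[match.end():]
    # Whatever no rule claimed becomes the description (whitespace collapsed).
    fields["description"] = " ".join(rest.split())
    return "{{vCard| " + "| ".join(f"{k}={v}" for k, v in fields.items()) + "}}"

listing = ("* '''Hotel Indah Manila''' 350 A J Villegas St. "
           "Tel: ''+63 2'' 5361188, 5362288. [http://www.hotelindah.com/] "
           "Rates start at ₱2000 for this modest 76-room hotel.")
vcard = to_vcard(listing, "hotel")
print(vcard)
```

Keeping the rules in an ordered list means a new heuristic (say, for email addresses) is just one more tuple, which seems to fit the semi-structured nature of the listings.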

I wrote it as a library and gave it a web frontend that runs either via CGI or as a standalone version using bottle. After using Python intensely for several years, this is actually the first time I have used it instead of PHP to display web content, and I was a bit surprised by how straightforward it was. So give it a try and let me know what you think about it! The source code is available on GitHub.