Algoriddims

Here comes the science.

Well, here comes the math, really, in which Matrix-Chain-Order gets a little bit sharper around the edges.

When you multiply a long list of matrices together, there are a few ways of going about it (usually one, really) that are significantly faster than the others. Picking a slow one doesn’t mean you’re going to get the wrong answer; it’s just that, like much of life, going about it one way can be much, much faster than another.

If you are not interested in this sort of thing, please, come back tomorrow, where I will try hard to be entertaining. Right now, you’re getting abstract algebra.


If you don’t remember your matrix multiplication from high-school algebra, it works like this: if you’ve got two matrices A1 and A2, you can only multiply them together if the number
of columns in the first one is the same as the number of rows in the second one – that is, if your A1 is a 2×5 matrix and your A2 is 5×3, for example. Then, when you multiply
them together, the thing that comes out will have as many rows as the first matrix, and as many columns as the second; for our given example, the result will be a 2×3 matrix.

This is pretty straightforward stuff, if you’re just multiplying
two matrices. It’s tedious, you’ll note (assuming that you don’t
find this whole discussion tedious at this point anyway), because
to finish that multiplication, you’ll need to do five multiplications for
each of the six entries in the resulting 2×3 matrix. Generally, if you’ve
got a matrix that’s X by Y and one that’s Y by Z, their product will
be a matrix of X by Z size, and take X*Y*Z steps to complete, the
“cost” of the operation. Still pretty straightforward, but it gets
marginally more interesting when you’re multiplying a whole bunch
of them in a row. Let me give you a somewhat exaggerated example,
so you can see what I’m getting at here.
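
To make that concrete, here is a rough Python sketch of the schoolbook multiplication – purely illustrative, with made-up names – where the X*Y*Z cost falls straight out of the three nested loops:

    def multiply(A, B):
        """Schoolbook matrix multiplication: A is X-by-Y, B is Y-by-Z,
        and the result is X-by-Z, built with X*Y*Z multiplications."""
        X, Y, Z = len(A), len(B), len(B[0])
        assert len(A[0]) == Y, "columns of A must match rows of B"
        C = [[0] * Z for _ in range(X)]
        for i in range(X):          # each of the X rows of the result...
            for j in range(Z):      # ...crossed with each of the Z columns...
                for k in range(Y):  # ...needs Y multiplications
                    C[i][j] += A[i][k] * B[k][j]
        return C

For the 2×5 times 5×3 example, the innermost line runs 2*5*3 = 30 times, which is exactly the cost we just counted.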

Say you want to multiply three matrices together, A1 (a 2×100 matrix),
A2 (100×5) and A3 (5×1000). You have a
choice now, because you don’t have to do them in order. So, do you
do A1*A2 first, or A2*A3? If I multiply
A1 by A2 first, at a cost of 2x100x5 = 1000,
I then have to multiply the result by A3, costing me
2x5x1000 = 10,000, for a total cost of 11,000 operations. But, if
I do it the other way, doing A2 times A3 first
at a cost of 100x5x1000 = 500,000, and then multiply A1
by the result at a cost of 2x100x1000 = 200,000, I need to perform a
total of 700,000 operations to get the very same result. For these
numbers, the good way of doing the work takes approximately 1/64th the time of the bad way – 11,000 operations against 700,000. Onerous!
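
If you would rather make the computer do that arithmetic, a few throwaway lines (my scratch code, nothing official) will confirm it:

    def cost(x, y, z):
        # Multiplying an x-by-y matrix by a y-by-z matrix takes x*y*z steps.
        return x * y * z

    # A1 is 2x100, A2 is 100x5, A3 is 5x1000.
    print(cost(2, 100, 5) + cost(2, 5, 1000))       # (A1*A2)*A3 -> 11000
    print(cost(100, 5, 1000) + cost(2, 100, 1000))  # A1*(A2*A3) -> 700000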

That might not seem to you like a big deal, given how fast modern computers can be with problems this size, and often you’d be right – modern machines are blazingly fast, and will grind through half a million arithmetic operations horrifically quickly. But take my word for it when I say that matrices can get way bigger than that, and once you start multiplying a whole bunch of them in a row, the difference between the good way and the bad way can go from fractions of a second to minutes. And these are not minutes that this hardware was going to sit twiddling its thumbs, no sir; other muscular problems are queuing up outside while this one’s being worked over second after agonizing second. Sometimes making the effort to figure out the best way to do calculation-intensive stuff like this can be the difference between getting your answer in a fraction of a second and having the whole system get snowed under.

A digression, kplzthx.

Generally speaking, when the people who care about
algorithms talk about them, they’ll describe them in terms of
how long they’ll take to come back to you with an answer, for a
certain number of inputs. It’s called “Big-Oh” Notation, and it’s
not really a description of the time a process takes, but of how
that process “scales”. That is, how the amount of time it will
take increases with the size of its input. If I have a process
for sorting cards, say, and it takes about 10 seconds to sort 10
cards, but 40 seconds to sort 20 cards, and 90 seconds to sort 30
cards and so on, that’s called an O(n^2) algorithm –
roughly speaking, for n inputs, it will take n^2 units of time.

O(n^2) processes aren’t great, but in the grand scheme
of things they’re not bad; compare that to the similar-looking
O(2^n). 100^2 is only 10,000;
2^100 = 1267650600228229401496703205376, and doing that many of anything will cut into your beauty sleep something fierce.
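
If you want to watch those two curves pull apart with your own eyes, a tiny scrap of Python (mine, just for illustration) does the job:

    # How n^2 compares to 2^n as the input grows.
    for n in (10, 20, 50, 100):
        print(n, n ** 2, 2 ** n)
    # by n = 100 it is 10,000 versus that thirty-one-digit monster above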

Anyway, I can pretty much tell this is boring those stalwart few
of you who’ve read this far to death, so where I’m going with this is
two things: first, that the “figuring out the right order” step I
hinted at earlier is a well-known O(n^3) algorithm,
and second, that generally speaking, the more superscripty your big numbers
get, the worse trouble you’re in. n^3 really isn’t that bad, though, because the amount of time you’re going to save is pretty much always going to make doing it worthwhile.

Most of the time, I say! But not all the time! First off, this might not be new – but if it’s old news, well, I can’t find the paper that puts it forward despite my best googly, scifindery efforts, and I came up with it all by my very lonesome regardless. So don’t hate the player, or something. There is a specific degenerate case that I’d like to address here.

As discussed in CLRS, it is
trivial to construct an O(2^n) algorithm to determine the optimal parenthesization of a sequence of matrix multiplications. It is further observed that this would suck, though not in so many words. They then provide an O(n^3) algorithm as part of a dynamic programming exercise set before moving on to other things.
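
For the curious, the textbook dynamic-programming idea looks roughly like this – my own Python sketch of it, not code lifted from CLRS, and the names are mine:

    def matrix_chain_order(dims):
        """dims has n+1 entries, and matrix A_i is dims[i-1] x dims[i].
        Returns the minimum number of scalar multiplications and a table
        of best split points, in O(n^3) time."""
        n = len(dims) - 1
        cost = [[0] * (n + 1) for _ in range(n + 1)]    # cost[i][j]: cheapest A_i..A_j
        split = [[0] * (n + 1) for _ in range(n + 1)]
        for length in range(2, n + 1):                  # chains of 2, 3, ..., n matrices
            for i in range(1, n - length + 2):
                j = i + length - 1
                cost[i][j] = float("inf")
                for k in range(i, j):                   # try every spot for the last multiplication
                    q = cost[i][k] + cost[k + 1][j] + dims[i - 1] * dims[k] * dims[j]
                    if q < cost[i][j]:
                        cost[i][j], split[i][j] = q, k
        return cost[1][n], split

    # For the earlier example: matrix_chain_order([2, 100, 5, 1000])[0] == 11000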

I think that I’ve got an edge case here, though, that can be solved much faster than the algorithm they’ve provided, so let me tell you what I’ve got:

For a sequence of matrices A_1 … A_n,
where the dimensions of the matrices in this sequence can be expressed as x_0, x_1 … x_n (so that A_i is an x_(i-1) × x_i matrix), and such that (x_a <= x_b) for all (a < b) (a sequence of matrices whose dimensions never decrease, basically), it is always optimal to do them in order, from left to right.

I’m writing out a formal proof for this right now, but the gist of it is that by doing it in order, you carry the x_0 element, the smallest in your list, through every step.

So the O(n^3) algorithm in CLRS is not optimal in all cases – you should, in an O(n) first pass, determine whether your matrix dimensions are non-decreasing (or non-increasing, of course, in which case you just do the multiplication from back to front – it’s a symmetric proof), and if they are, boom, you’re done, you can skip the rest.
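
In code, that pre-pass might look something like this – again my own sketch, with invented names, falling back to the matrix_chain_order sketch above when the shortcut does not apply:

    def left_to_right_cost(dims):
        """Cost of (((A_1 * A_2) * A_3) ...) * A_n: every step carries
        dims[0], the smallest dimension, along with it."""
        return sum(dims[0] * dims[i - 1] * dims[i] for i in range(2, len(dims)))

    def evaluation_plan(dims):
        """O(n) check: if the dimensions never decrease, go left to right;
        if they never increase, go right to left; otherwise there is no
        shortcut and you run the O(n^3) dynamic program."""
        if all(a <= b for a, b in zip(dims, dims[1:])):
            return "left to right"
        if all(a >= b for a, b in zip(dims, dims[1:])):
            return "right to left"
        return None    # no luck; fall back to matrix_chain_order(dims)

Comparing left_to_right_cost against matrix_chain_order on a pile of random non-decreasing dimension lists makes a handy sanity check while the formal proof is still in the oven.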

Better yet, if you magically know ahead of time that you’re going to be multiplying matrices of this form, you can just skip the whole optimization step, and declare it optimized by pure, mathematically-backed fiat.

As mentioned above, I don’t see how this can possibly be new, or my own discovery, or whatever. But there it is; I found a way to do a thing faster than the guys in the book did, and I did it with my very own brain.

Go me.

11 Comments

  1. Anonymous
    Posted January 21, 2006 at 8:29 am | Permalink

    Interesting stuff, it’s been a while since I’ve been forced to think about Big-O notation.

  2. Joe Hache
    Posted January 21, 2006 at 8:48 am | Permalink

    And I forgot to fill in the info part .. silly me.

  3. Posted January 21, 2006 at 8:59 am | Permalink

    Have you thought about running this by a math/CS prof, and possibly publishing it? (Or would that not be considered a normal practice for this type of thing?)

  4. Mike Hoye
    Posted January 21, 2006 at 9:15 am | Permalink

    I really don’t know what the normal practice for this kind of thing is, but I’ll ask around.

  5. Posted January 21, 2006 at 9:21 am | Permalink

    Mike, this is more than a little nifty. I’m working up a few examples right now, and may be posting them later; if I do, I’ll link back over.

    Also, for mathematical markup, you may want to check out ASCIIMathML, which is probably the least frivolous use of the language that I’ve seen.

    Joe, are you the same Joe Hache who was once president of the CCSS? If so, I’ve got a couple of questions for you.

    Also, what the heck are you people doing up at this hour on a Saturday? You’re all nuts.

  6. Anonymous
    Posted January 21, 2006 at 12:00 pm | Permalink
    "... it is always optimal to do them in order, from right to left."
    
    I think you mean left to right:
    
    Suppose dimensions of a matrix A_i (i.e., Dim(A_i)) are always i × (i+1) where i belongs to {2,3,4}
    
    The cost of calculating A_2*A_3 would be 2*3*4 = 24 operations
    (since we've defined Dim(A_2) = 2x3, Dim(A_3) = 3x4)
    
    Likewise, calculating (A_2*A_3)*A_4 costs 2*3*4 + 2*4*5
    = 4*(2*3 + 2*5)
    = 4*(6 + 10)
    = 4*16 = 64 operations
    
    But A_2*(A_3*A_4) costs 3*4*5 + 2*3*5
    = 3*5*(4 + 2)
    = 15*6 = 90 operations
    
    So, in this case, where we have strictly increasing dimensions for each matrix, multiplying from left to right costs less.
    
    The proof of whether that is true in the general case, I'll leave as an exercise to the reader ;)
    
  7. Joe Hache
    Posted January 21, 2006 at 12:29 pm | Permalink

    Yes, it’s me. Email me if you’d like, but if it’s CCSS-related I haven’t been involved with them for a couple years.

  8. Mike Hoye
    Posted January 21, 2006 at 12:50 pm | Permalink

    Thank you, anonymous commenter – I’ve made the correction.

  9. Sean Ross
    Posted January 21, 2006 at 1:10 pm | Permalink

    Like Joe, I forgot to fill in the info part … anonymous commenter, c’est moi.
    If you feel like going over your proof, let me know.

  10. Mike Hoye
    Posted January 21, 2006 at 1:54 pm | Permalink

    Thank you, Sean Ross. And, I’d love to, and happy new year!

  11. Oli
    Posted January 21, 2006 at 4:18 pm | Permalink

    I’ve noticed with textbooks and examples from profs that most of the algorithms they provide are intended for the ‘general’ case. However, it’s optimizations like this which set our algorithms apart from the stock versions that people lookup online and plug into their programs, so hey, congrats on tweaking this one!

    If I may, I’ll run through a similar thing I’ve found for d-CNF clauses.. (conjunctive normal form clauses having ‘d’ elements in each disjunction)
    From class, “It’s not always possible to generate a boolean assignment for a d-CNF that yields a true value for the overall clause. However, it is often of interest to generate an assignment that yields a high number of true terms in the clause.”

    Their solution is that randomly generating the boolean assignment will have a (2^d – 1)/(2^d) probability of being true anyway, yielding ‘n’ times that for the expected number of true terms in an n-term clause.

    My solution is to count up the number of terms in which each boolean assignment occurs, and assign the necessary value to the individual variable that sets most of them true, then repeat this step for the remaining false clauses until you’ve made assignments for all ‘d’ variables. I’m not sure how to prove its correctness, but it seems like a sound argument for maximizing terms… thoughts?
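
A rough Python sketch of the greedy Oli seems to be describing – just one reading of it, with invented names, not his actual code:

    def greedy_assignment(clauses, num_vars):
        """Greedy heuristic for satisfying as many clauses as possible.
        Clauses are lists of literals: +v means variable v, -v its negation,
        with variables numbered 1..num_vars. Repeatedly pick the literal that
        appears in the most still-unsatisfied clauses and fix its variable so
        that literal comes out true."""
        assignment = {}
        unsatisfied = list(clauses)
        while unsatisfied and len(assignment) < num_vars:
            counts = {}
            for clause in unsatisfied:
                for lit in clause:
                    if abs(lit) not in assignment:
                        counts[lit] = counts.get(lit, 0) + 1
            if not counts:
                break                                  # nothing left that can be helped
            best = max(counts, key=counts.get)         # most popular remaining literal
            assignment[abs(best)] = best > 0
            unsatisfied = [c for c in unsatisfied
                           if not any(abs(l) in assignment and
                                      (l > 0) == assignment[abs(l)] for l in c)]
        for v in range(1, num_vars + 1):               # untouched variables get a default
            assignment.setdefault(v, True)
        return assignment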