Converting .mov files to .avi in Microsoft Windows

Over the christmas, I got a present of a Nikon Coolpix AW100. It's a compact camera and so it's limited in flexibility and image quality compared to my DSLR, however it has some unusual features that I'm enjoying experimenting with. It's waterproof (!), does geotagging (example here) and shoots video at 1080p, and can also shoot video at 240 frames per second (at low resolution).

Video captured on this device is stored as .mov files, which is a bummer as this is an Apple Quicktime format and although there are plenty of other tools which support playback of these files, there's very little support for editing this type of file outside of Quicktime. And I hate Quicktime. And Apple. Converting these files to the more Windows friendly .avi format would allow the use of VirtualDub which is a really fast and reliable, heavily optimised video editing tool written in C++.

There are two types of reasons for wanting editing a short video clip: A) to trim bad parts of the shot, to grade colour or apply filters, to patch several trimmed clips together to make a scene or B) to compress the clip for storage or publication (e.g. host on your website or upload to Vimeo or YouTube). I'm interested in doing both of these things, and I spent quite a while today figuring out how to allow this under Windows, and before I forget how it all works, I thought I should describe it here.

Download and "install" MPlayer

First, you need to get your hands on the MPlayer and MEncoder tools. They come packaged together here. Get the Windows builds, and download the version compiled for your computer's CPU (e.g. I got the version compiled for Intel Pentium 4 or better). Extract the folder from the archive you've dowloaded and move it to a sensible place on your hard-drive. I renamed the folder to "MPlayer" and moved it to "C:\MPlayer", which will crop up later in these instructions.

Create a batch script for converting .mov to x264 .avi

.avi files are much more widely accepted as input for video editing software, and x264 is a good, modern video compressor (or codec), similar to that used in high-definition BluRay videos.

Using notepad, create an empty text file called MPConvertx264.bat in your MPlayer directory (e.g. C:\MPlayer). The extension .bat indicates that this is a Windows batch file which lets you automate actions in Windows, like a shell script in Linux. Copy the text below into it, save and close the file. If your MPlayer files are in a folder other than C:\MPlayer, you will need to adjust the text to reflect that.


Title %~n1 -- pass 1
"C:\MPlayer\mencoder.exe" %1 -oac pcm -ovc x264 -x264encopts pass=1:turbo:bitrate=6000:frameref=1:analyse=all:me=umh:subme=4:trellis=1:bframes=1:subq=4:mixed_refs:weight_b:no_fast_pskip:direct_pred=auto:mixed_refs:nr=200 -vf harddup -oac pcm -o NUL

Title %~n1 -- pass 2
"C:\MPlayer\mencoder.exe" %1 -oac pcm -ovc x264 -x264encopts pass=2:bitrate=6000:frameref=4:analyse=all:me=umh:subme=7:trellis=2:bframes=1:subq=7:mixed_refs:weight_b:no_fast_pskip:direct_pred=auto:mixed_refs:nr=200 -vf harddup -oac pcm -o "%~1_x264.avi" %1
\\@pause
del *.log

If you drag one of your 1080p .mov files and drop it on the .bat file you've just created in the MPlayer folder, it will automatically convert it to a compressed .avi file.

Create a batch script for converting .mov to HuffyUV .avi

If you want to continue to edit the clip, for instance trim it and append it to other clips to make a short movie, then it doesn't really make much sense to take a file that your camera has already compressed (with a slight loss in image quality) and recompress it (incurring another slight loss in image quality) to allow editing in Windows, since after editing you will surely compress the video again (incurring yet another loss in image quality). To avoid all of these so-called lossy steps, we can immediately convert to a lossless type of .avi: for instance using the HuffyUV codec instead of the x264 codec. The disadvantage of the HuffyUV codec is that the files are very large, but that's probably ok since they will only be temporary files which you will trim and edit before compressing your final masterpiece with a more efficient (but lossy) codec such as x264. Once the final video is complete (after editing with VirtualDub for instance), you can delete the temporary, intermediate HuffyUV files to free up space.

To convert .mov files to HuffyUV .avi files, just create an empty text file called MPLossless.bat in the MPlayer folder, as before and add the text below to it:


Title %~n1 -- Lossless encoding
"C:\MPlayer\mencoder.exe" %1 -oac pcm -ovc lavc -lavcopts vcodec=ffvhuff:vstrict=-1:vhq:psnr -noskip -o "%~1_huff.avi" %1
\\@pause

Again, dragging and dropping a .mov file onto this .bat file will perform the appropriate conversion.

Windows tricks to make conversion easier

We can allow instant conversion of files by right-clicking on them by adding these files to the "Send To" menu. Simply copy both files to:

C:\Users\[USER NAME]\AppData\Roaming\Microsoft\Windows\SendTo

replacing [USER NAME] with the appropriate Windows user name.

We can also write some batch files to carry out batch conversion of many .mov files in series, freeing you to do other things while your computer is busy converting your files.

As above, simply create a .bat file (e.g. Makex264.bat) containing the following text and place it into a folder containing .mov files, in order to convert each one to an x264 .avi:

for /f %%1 IN ('dir /b *.MOV') do call C:\MPlayer\MPConvertx264.bat %%1

Similarly, create a .bat file (e.g. MakeLossless.bat) with the following text for lossless conversion of all .mov files:

for /f %%1 IN ('dir /b *.MOV') do call C:\MPlayer\MPLossless.bat %%1

Simply double-click on either .bat file to convert all clips in the folder.

Credits

I patched these instructions together, adjusting handy example scripts from these three sources.

Anne Doyle

Fields of Athenry 10k - 2011

DSCN0023

For the third time, I ran the same 10k race in Athenry this year (St Stephen's day 2011) and as in previous years, the results are available online. For the second time I wore my GPS watch & heart-rate monitor and data captured by that device can be browsed and downloaded here. Kevin Doherty shot a video of the race. As for the previous events, I was again totally impressed at how well organised the whole thing was. I really enjoyed it, I felt great afterwards and the event is a credit to the town.

I forgot to stop the timer at the end, but my finishing time was about 45:44, which I am extremely happy with. I'd been injured since June, only running 25 min once a week (plus squash), I'm a little overweight and had thoroughly enjoyed all the christmas trimmings the day before. I had expected to come in at around 48 or 49 mins.

During the first couple of kilometers I was finding it hard to keep a constant pace, I kept tearing off after people overtaking me and having to stop myself. I knew that I wasn't in great shape and that I'd really struggle for the last half of the run if I went flat out from the start. So I picked out one of my fellow runners who looked like she knew what she was doing (Edel Tighe) and tried to just keep up. I chose well as it turned out, she kept going like a metronome for the whole race. It helped a lot to have a target to focus on, especially for the last 3km when I was really struggling.

As I did last year, I have again tried to make my result look better by unleashing the power of statistics, again without much success. The race was very windy this year, and I wondered if this would have a significant effect on the finishing times. Median finishing time was indeed 1.03 mins slower in Dec 2011 than in Jan 2011 (p-value 0.049, Wilcoxon test). Interestingly this increased time to finish seems to be largely caused by men running slower, there is no significant difference between median women's finishing time in January and that in December. I can't think of a clear explanation for that.

Below is a plot summarising finishing time distributions (probability density estimate) for runners in all categories for the same race run at the end of 2008 and at the beginning and end of 2011. The small, right-hand peak seems to have diminished this year and the main peak fattened, which might indicate that people are training harder on average. Vertical dashed lines are my times in those races. The variability in my finishing times is reassuringly small compared to the population spread.

Finishing Time Distributions

And below are two boxplots summarising finishing times by category (age and sex) for Dec 2011. The horizontal dashed line in each plot is my finishing time in that race.

Boxplots: comparing sexes

Boxplots: comparing ages

The finish line...
Athenry Arch

Michael D and The Mountain

Michael D Higgins will be the 9th President of Ireland.


He recorded his poem The Mountain with The Stunning in 1993. The Stunning were a band from the west of Ireland that I used to listen to a lot when I was a young lad. I really like this track, and you can listen to it here and buy it here.

Short walks in Northumberland

Took advantage of the surprisingly warm weather in NE England this week and did some short day trips out to Northumberland. Went on two walks, outlined below, in nice, quiet places with loads of scope for further exploration by foot or mountain bike.

People living in Wooler have access to Wooler common on their back door, would be a great place to be able to do your daily run. View from Kings Fort is great. Found many mushrooms. After taking samples home for identification (two were possibly Russula anthracina and Cortinarius multiformis) concluded none were edible. Identification was not easy. (OS map OL16, about 4 km):


View Wooler walk in a larger map

Saw lots of wildlife during a walk near Biddlestone, starting out from Elilaw farm: red kite, red squirrel, weasel, many partridge, grouse & pheasant. Also watched the army blow something up across the valley from us while having a break, and saw a farmer run over his dog with a tractor (ouch!). Dog was ok, after much yelping. Quite boggy at the top (must be very wet in winter). Saw a few mushrooms growing in old cowpats (mostly Shaggy Inkcap I think). OS map OL16, about 10.3 km.


View Elilaw Walk in a larger map

Testing optimisation algorithms: Part IV - Objective function and optimisation in Python

In part I of this series of posts, I presented the mathematics behind circle packing. In part II I demonstrated an implementation of this problem in R, together with an R function for visualising candidate solutions. In part III I tested some optimisation algorithms from various R packages.

In this post I will present an implementation of this problem in Python (code below), a first attempt at optimisation with the same L-BFGS-B algorithm (local optimisation subject to boundary constraints, this time from the scipy.optimize package) presented in part III and some functions for visualising solutions with Scalable Vector Grapics (.svg files).

I generally prefer writing code in Python to R. Python is a more "proper" programming language, is more flexible and extensible than R and has more beautiful syntax and is therefore more readable (which is particularly important when you write code as messily as I do).

R does have some advantages. Its package system (for extending and documenting R functionality) is extremely easy to use and efficient. Data frames in R are versatile, spreadsheet-like objects allowing you to mix different types of data (e.g. text, integers, real numbers) and label data with strings (names of rows and columns). There is no equivalent in Python or other programming languages. R handles plotting, in a flexible, dependable way (requires an external package in Python).

Theoretically, Python should be faster than R, but while writing the code below, I was surprised to find that accessing elements of NumPy arrays using square bracket [] notation was approximately 3 times slower than the equivalent vector operations in R. Fortunately I had read this post about numpy.take (cheers Darren!) which speeded up access by a factor of 3, rendering objective function evaluation speeds approximately equal in both R and Python. I am still quite surprised that they are not significantly faster in Python.

In part II I presented the drawCircles R function to visualise circle packing solutions. I am quite happy with this, particularly because it produces vector graphic output (in the form of a .pdf file). However, it's not perfect. For one thing, I had to convert the .pdf to a .png file to display the image online.

For the Python equivalent, I no longer have access to native plotting functions, but can trivially use Python to build up text files to draw graphics. I have been a fan of manually writing PostScript files for a long time (I am a proud owner of a copy of Adobe's Blue Book), but for this example I chose to try out svg. I don't yet know if svg is as powerful as PostScript, but it is more straightforward for creating images of arbitrary size and aspect ratio with svg (no longer constrained by physical page sizes). SVG is a vector graphics format which can be converted losslessly to PostScript or pdf. SVG is a native web format, so I can display output images online directly like this.

Unfortunately Drupal (the CMS I use to run this blog) doesn't yet allow embedding of .svg images. I haven't been able to get John Flower's SVG module to work (I probably need to upgrade my installation of Drupal). As far as I can tell, WordPress suffers from the same difficulties with svg as Drupal. Something to watch out for when choosing a tool for posting online... I would really rather have something which serves up standard .html. Here is the raster image version of the original:

144 sets of 27 circles packed into squares

Finally, here's the code. It uses the NumPy package to allow building and manipulating of arrays and the numpy.take trick mentioned above to speed up accessing elements. It also uses a function closure to define the objective function for comparison with the R equivalent from part II.


  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
import random, numpy

def genGuess(N,W,H):
    '''Generate sensible initial guess vector (random circle coordinates and radii)'''
    z=numpy.zeros(3*N)
    for i in xrange(0,N):
        z[i*3]=random.random()*W
        z[i*3+1]=random.random()*H
        z[i*3+2]=0.001*min(W,H)+random.random()*(0.009*min(W,H))
    return(z)


def getPairs(N):
    '''Generate all (2-way) unique pairings of N objects'''
    result=[]
    for i in xrange(0,N):
        for j in xrange(0,i):
            result.append([i,j])
    return(numpy.array(result))

def createObj(N,W,H):
    '''Function closure which initialises and creates a function for calculating the packing density of N circles in an W*H rectangle.  Circle coords and dimensions are represented by a single list z'''
    cpairs=getPairs(N)
    A=cpairs[:,0]
    B=cpairs[:,1]
    ind=range(0,3*N)
    a,b,c=ind[0:3*N:3],ind[1:3*N:3],ind[2:3*N:3]
    def newObj(z):
        '''Calculate packing density of N circles in W*H rectangle (N,W,H defined on function initialisation).  Circle coords and dimensions are represented by a single list z'''
        z=numpy.array(z,dtype=numpy.float)
        # Split z into x,y,r triplets: z[0],z[1],z[2] -> x1,y1,r1
        #x,y,r=z[a],z[b],z[c]
        x,y,r=z.take(a,axis=0),z.take(b,axis=0),z.take(c,axis=0)

        # Some linear inequality constraints to be satisfied
        c1=x+r-W # <0	
        c2=y+r-H # <0
        c3=r-x # <= 0
        c4=r-y # <=0
        # Many nonlinear inequality constraints
        #c5=r[A]+r[B]-numpy.sqrt((x[A]-x[B])**2+(y[A]-y[B])**2) # <=0
        c5=r.take(A,axis=0)+r.take(B,axis=0)-numpy.sqrt(numpy.power(x.take(A,axis=0)-x.take(B,axis=0),2)+numpy.power(y.take(A,axis=0)-y.take(B,axis=0),2)) # <=0

        c1=c1[c1>0]
        c2=c2[c2>0]
        c3=c3[c3>0]
        c4=c4[c4>0]
        c5=c5[c5>0]
        constraints=numpy.concatenate((c1,c2,c3,c4,c5))

        # Actual objective function (fraction of area covered)
        res=numpy.pi*numpy.sum(numpy.power(r,2))/(W*H)-numpy.sum(numpy.power(constraints,2))
        return(-1*res)
    return(newObj)

def vecToCircles(z):
    N=len(z)/3
    '''Convert z vector to an array of circle centres and radii'''
    ind=[x for x in xrange(0,3*N)]
    # Split z into x,y,r triplets: z[0],z[1],z[2] -> x1,y1,r1
    x,y,r=numpy.array([z[i] for i in ind[0:3*N:3]]),numpy.array([z[i] for i in ind[1:3*N:3]]),numpy.array([z[i] for i in ind[2:3*N:3]])
    clist=numpy.empty((N,3),numpy.float)
    clist[:,0],clist[:,1],clist[:,2]=x,y,r
    return(clist)

def drawCircle(x,y,r,s=2,fillcol="none",strokecol="black"):
    '''Draw a circle with centre x,y radius r, line thickness s, fill colour fillcol and line colour strokecol.'''
    return('''<circle cx="%s" cy="%s" r="%s" stroke-width="%s" fill="%s" stroke="%s"/>\n'''%(x,y,r-4*s/10,s,fillcol,strokecol))

def drawRect(x,y,w,h,s=2,fillcol="none",strokecol="black"):
    '''Draw a rectangle with top left corner x,y, width w, height h, line thickness s, fill colour fillcol, line colour strokecol, exploded slightly so inner edge is rectangle specified by 1st 4 params.'''
    return('''<rect x="%s" y="%s" width="%s" height="%s" stroke-width="%s" fill="%s" stroke="%s"/>\n'''%(x-s/2,y-s/2,w+s,h+s,s,fillcol,strokecol))

def svgHeader(w,h):
    '''Write valid svg header specifying bounding box w wide and h high'''
    string='''<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 %s %s" version="1.1">\n'''%(w,h)
    return(string)

def svgFooter():
    '''Write svg footer'''
    return('</svg>')

# Demo script
if __name__ == "__main__":
    import time
    from scipy import optimize
    
    N=27            # Number of circles
    w,h=10,10       # Native rectangle dimensions
    rows,cols=12,12 # Graphical array size
    celldim=10      # Final graphics scaling factor
    margin=1        # Gap between plots (native scale).  Should be >= 2* line
    line=0.1        # Line thickness for graphics (native scale)
    # Generate a suitable objective function
    ObjFun=createObj(N,w,h)
    maxr=min(w,h)/2.0
    outf=open("BubblesOptArray100.svg","w") # Prepare to draw circles in .svg file
    outf.write(svgHeader(cols*w*(celldim+margin)+margin*celldim,rows*h*(celldim+margin)+margin*celldim))
    
    b=[(0,w),(0,h),(0.01*maxr,maxr)]*N
    for x in xrange(0,rows*cols):
        arrx=x%cols
        arry=x//rows
        print((arrx,arry))
        start=time.time()
        # Randomly generate a sensible starting guess
        z=genGuess(N,w,h)
        # Gradient based fmin_l_bfgs_b from scipy.optimize
        opt=optimize.fmin_l_bfgs_b(ObjFun,z,bounds=b,m=12,maxfun=500000,approx_grad=True)
        print("Optimisation time: "+str(time.time()-start))
        res=opt[0]
        print("Initial guess objective function: "+str(ObjFun(z)))
        print("Final, optimised objective function: "+str(ObjFun(res)))
        print(opt[2]['warnflag'])
        print(opt[2]['task'])
        print(opt[2]['funcalls'])
        # Draw resulting circles
        clist=vecToCircles(res*celldim)
        # Offset circles for array
        clist[:,0]=clist[:,0]+margin*celldim+arrx*(celldim+margin)*w
        clist[:,1]=clist[:,1]+margin*celldim+arry*(celldim+margin)*h
        for j in xrange(0,N):
            outf.write(drawCircle(clist[j,0],clist[j,1],clist[j,2],s=line*celldim))
        outf.write(drawRect(margin*celldim+arrx*(celldim+margin)*w,margin*celldim+arry*(celldim+margin)*h,w*celldim,h*celldim,s=line*celldim))
    outf.write(svgFooter())
    outf.close()

Testing optimisation algorithms: Part III - Global optimisation in R

In part I of this series of posts, I presented the maths behind posing circle packing as a high-dimensional, non-linear, constrained optimisation problem. I am using this problem as an example to try out various optimisation algorithms. In part II I demonstrated an implementation of this problem in R, together with an R function for visualising candidate solutions.

In this post, I will describe some findings from attempts to optimise the circle packing objective function with R's default, general-purpose optimisation package: optim and two other packages for performing global optimisation using genetic algorithms with derivative-based intermediate searches.

I have updated the code from the last post slightly (added two more small functions). You can download the updated version here: CirclePacking.R. I've also written a second script demonstrating the use of the functions in CirclePacking.R in combination with the 3 optimisation packages mentioned above: TestingOptimisation.R (duplicated at the bottom of this post). Running this code generates plots like the one below, visualising solutions, together with some statistics about running time distributions and packing density distributions for ensembles of solutions.

L-BFGS-BN4_200
64 solutions for packing 200 circles into a square with the optim package.

The optim package includes an implementation of the L-BFGS-B algorithm, which is an efficient, derivative-based local optimiser allowing lower and upper bounds to be placed on parameters.

The rgenoud package uses genetic algorithms and derivatives to solve global optimisation problems. I used it successfully to fit a mixture model to pixel intensity distributions as part of the segmentation step in the image analysis tool: Colonyzer. It also has some interesting functionality for distributing computing load across multiple CPUs, by using a snow cluster which I wanted to investigate.

The DEoptim package also uses genetic algorithms and derivatives to solve global optimisation problems. I have previously found that this algorithm was faster than rgenoud when automatically fitting a logistic model to yeast growth curves in Quantitative Fitness Analysis.

Results

My findings are problem-specific, and depend on the number of circles allowed (which you can vary by modifying the code below) but so far, they are as follows:

1) For all N I've tried (up to 200), the L-BFGS-B algorithm was fastest. It achieved the same packing densities, but in shorter times.

2) Using the rgenoud cluster functionality did not speed up the genoud algorithm, or improve the packing densities of its solutions.

3) DEoptim crashes with mysterious errors for N>200 (more than 600 parameters) on a 64-bit machine with 12 Gb RAM. However, this is probably many more parameters than are required for most practical purposes.

4) rgenoud was much faster than DEoptim in this case (achieved the same or higher packing densities more quickly).

It's difficult to draw solid conclusions from this analysis, as I now suspect that for the circle packing problem, there are many local minima with largely similar packing densities and so it may be difficult or impossible to distinguish between local optima and a global optimum (if it exists). This would explain result 1) and possibly result 2). Result 1) was particularly surprising since I expected this high-dimensional problem to be very vulnerable to getting stuck in local minima, and so I expected global optimisers to be more efficient. As mentioned above, result 4) is reversed for at least one other optimisation problem.

I'm happy with the solutions (see image above), they have very pleasing organic patterns.


  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
library(DEoptim)
library(rgenoud)
library(snow)
source("CirclePacking.R")

# Number of circles and dimensions of bounding rectangle 
N=100; W=10; H=10
root="N64_100"
arrnum=64
arrdim=ceiling(sqrt(arrnum))

# Create objective function
obj=createObj(N,W,H)

# Lowest acceptable values for dimensions (x,y) is zero
# Minimum radius (to prevent circles being optimised out of existance)
low=rep(c(0,0,0.01*min(W,H)),N)
# x can go as high as W, y as high as H
# Radius of biggest possible circle contained in box is the smaller of W/2 and H/2
up=rep(c(W,H,min(W,H)/2),N)

# Generate initial population of guesses
NumPart=10*3*N
pop=matrix(0,nrow=NumPart,ncol=3*N)
for(i in 1:NumPart) pop[i,]=genGuess(N,W,H)

# L-BFGS-B test
results=c()
times=c()
pdf(paste("L-BFGS-B",root,".pdf",sep=""))
op<-par(mfrow=c(arrdim,arrdim))
for(i in 1:arrnum){
	print(i)
	z=genGuess(N,W,H); #plotCircles(z,W,H,numbers=TRUE) # Starting guess
	tim=system.time({
		out=optim(par=z,fn=obj,method="L-BFGS-B",lower=low,upper=up,control=list(maxit=300)) # Optimise
		z=as.numeric(out$par);  results=c(results,plotCircles(z,W,H,numbers=FALSE))# Result
	})
	times=c(times,as.numeric(tim[1]))
}
par(op)
op<-par(plt=c(0.125,0.95,0.13,0.9))
plot(times,results,pch=16)
hist(times)
hist(results)
par(op)
dev.off()

# parallel genoud Differential optimisation test
doms=matrix(c(low,up),nrow=3*N,ncol=2,byrow=FALSE) # Format bounds
results=c()
times=c()
cl <- makeCluster(12, type = "SOCK") # Initiate cluster
# Unfortunately, snow clusters don't seem to respect local R variables, so we pass them as global variables to each node, like this...
clusterExport(cl,list("N","W","H"))
pdf(paste("Parallelrgenoud",root,".pdf",sep=""))
op<-par(mfrow=c(arrdim,arrdim))
for(i in 1:arrnum){
	print(i)
	tim=system.time({
		z=genGuess(N,W,H); #plotCircles(z,W,H,numbers=TRUE) # Starting guess
		out=genoud(obj,3*N,starting.values=z,boundary.enforcement=2,Domains=doms, cluster=cl, max.generations=10,print.level=0,control=list(maxit=300))# Optimise
		z=out$par; results=c(results,plotCircles(z,W,H,numbers=FALSE))# Result
	})
	times=c(times,as.numeric(tim[1]))
}
stopCluster(cl)
par(op)
op<-par(plt=c(0.125,0.95,0.13,0.9))
plot(times,results,pch=16)
hist(times)
hist(results)
par(op)
dev.off()

# genoud Differential optimisation test
doms=matrix(c(low,up),nrow=3*N,ncol=2,byrow=FALSE) # Format bounds
results=c()
times=c()
pdf(paste("rgenoud",root,".pdf",sep=""))
op<-par(mfrow=c(arrdim,arrdim))
for(i in 1:arrnum){
	print(i)
	tim=system.time({
		z=genGuess(N,W,H); #plotCircles(z,W,H,numbers=TRUE) # Starting guess
		out=genoud(obj,3*N,starting.values=z,boundary.enforcement=2,Domains=doms, max.generations=10,print.level=0,control=list(maxit=300))# Optimise
		z=out$par; results=c(results,plotCircles(z,W,H,numbers=FALSE))# Result
	})
	times=c(times,as.numeric(tim[1]))
}
par(op)
op<-par(plt=c(0.125,0.95,0.13,0.9))
plot(times,results,pch=16)
hist(times)
hist(results)
par(op)
dev.off()

# DEoptim Differential optimisation test
results=c()
times=c()
pdf(paste("DEoptim",root,".pdf",sep=""))
op<-par(mfrow=c(arrdim,arrdim))
for(i in 1:arrnum){
	print(i)
	pop=matrix(0,nrow=NumPart,ncol=3*N)
	for(i in 1:NumPart) pop[i,]=genGuess(N,W,H); # Starting guesses
	tim=system.time({
		out=DEoptim(obj, low, up, DEoptim.control(trace=0,itermax=2000,NP=NumPart,initialpop=pop)) # Optimise
		z=as.numeric(out$optim$bestmem);  results=c(results,plotCircles(z,W,H,numbers=FALSE))# Result
	})
	times=c(times,as.numeric(tim[1]))
}
par(op)
op<-par(plt=c(0.125,0.95,0.13,0.9))
plot(times,results,pch=16)
hist(times)
hist(results)
par(op)
dev.off()

Testing optimisation algorithms: Part II - Implementing objective function in R

In Part I I presented the reasons I'm interested in optimisation, and gave a mathematical statement of the example constrained optimisation problem I will solve. In this post, I am going to present two practical implementation algorithms, coded in R. I'll also present an R function for visualising the resultant circles.

The code can be found at the end of this post, it's fairly well commented, so just some general comments first:

Most optimisation functions take a single vector of parameter values as an argument. This is a bit unwieldy for our problem, since we have N triplets of x,y and r. Here I arbitrarily define a parameter vector z, of length 3*N and split it up (consistently) into vectors for x, y and r.

R has a convenient combinatorics facility, so I use that to generate a unique list of all possible circle pairs to optimise the calculations required for constraint g5.

I'm using function closures here to avoid defining global variables. In R, a function closure is a function which returns another function but with its own environment for storing variables etc., which can be initialised within the outer function before returning the inner function. For example, I have created a function createObj which initialises the number of circles N, and the rectangle dimensions (W and H), generates a unique list of circle pairs (cpairs) and creates a function newObj which only depends on z, but nevertheless has access to private copies of N, W, H and clist.

Here, constraints are embedded into the objective function. A numerical value is associated with constraint violation and is subtracted from the objective function we are trying to maximise. There are two alternative versions of the objective function closures in the code below: createObj is "soft", the constraint violation values are the sum of the square of the magnitude of violations, createHardObj is "hard", any single violated constraint, no matter how small, returns in an area of -Infinity. In the first case, a derivative based algorithm will always be able to figure out a path towards satisfying constraints if it finds itself in a region where they are violated, but small violations may be tolerated in the final solution. In the latter case, all constraints must be completely obeyed, but on the other hand, there are no hints about how to improve from an invalid set of parameters.

The drawCircles function can plot circles to screen, or in vector graphics format (to a .pdf for example). Here are some outputs generated with this function (initial guess on left, optimised result on right):

CirclePackingCircles

Finally, the optimisation routine presented here just uses the L-BFGS-B method which is a pretty good, bounded, derivative-based optimiser, which is made available within the optim function which is standard in R installations (it's part of the core). It's fairly slow for more than 100 circles, and the results are not perfect, as circles are often forced to the edge of the box with negligble radii. Nevertheless, it's a good start.


  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
###########
# Optimum circle packing in R
###########

# Function closure, sets up list of possible combinations
# and returns objective function
createObj<-function(N,W,H){

print("Generating combinations...")
# Generate all possible circle pairings
cpairs=data.frame(t(combn(1:N,2)))
colnames(cpairs)=c("A","B")
print("Combinations complete.")

return({
	newObj<-function(z){
		# This vector is split up into triplets of x,y,r 
		# z[1],z[2],z[3] -> x1,y1,r1
		x=z[1:N*3-2]
		y=z[1:N*3-1]
		r=z[1:N*3-0]

		res=0
		# Some linear inequality constraints to be satisfied
		c1=x+r-W # <0	
		c2=y+r-H # <0
		c3=r-x # <= 0
		c4=r-y # <=0
		# Many nonlinear inequality constraints
		c5=r[cpairs$A]+r[cpairs$B]-sqrt((x[cpairs$A]-x[cpairs$B])^2+(y[cpairs$A]-y[cpairs$B])^2) # <=0

		c1=c1[c1>0]
		c2=c2[c2>0]
		c3=c3[c3>0]
		c4=c4[c4>0]
		c5=c5[c5>0]
		constraints=c(c1,c2,c3,c4,c5)
		if(length(constraints)>0){
			constraints=constraints^2
			res=res-sum(constraints)
		}
		
		# Actual objective function (fraction of area covered)
		res=res+(pi*sum(r^2))/(W*H)
	
		return(-1*res)
	}
})
}

# Function closure, sets up list of possible combinations
# and returns objective function
createHardObj<-function(N,W,H){

print("Generating combinations...")
# Generate all possible circle pairings
cpairs=data.frame(t(combn(1:N,2)))
colnames(cpairs)=c("A","B")
print("Combinations complete.")

return({
	newHardObj<-function(z){
		# This vector is split up into triplets of x,y,r 
		# z[1],z[2],z[3] -> x1,y1,r1
		x=z[1:N*3-2]
		y=z[1:N*3-1]
		r=z[1:N*3-0]

		res=0
		# Some nonlinear inequality constraints to be satisfied
		if(FALSE%in%(x+r<W)) res = Inf
		if(FALSE%in%(y+r<H)) res = Inf
		if(FALSE%in%(x-r>=0)) res = Inf
		if(FALSE%in%(y-r>=0)) res = Inf
		# Many nonlinear inequality constraints
		if(FALSE%in%(((x[cpairs$A]-x[cpairs$B])^2+(y[cpairs$A]-y[cpairs$B])^2)>=(r[cpairs$A]+r[cpairs$B])^2)) res = Inf
	
		# Actual objective function (fraction of area covered)
		res=res+(pi*sum(r^2))/(W*H)
	
		return(-1*res)
	}
	})
}

# Draw circles (centres and radii speficied in df)
# within a bounding rectangle (width * height)
drawCircles<-function(df,width,height,colour="black",pdfName=NULL){
	if(is.null(pdfName)){
		graphics.off()
		dev.new(width=7, height=7*height/width)
	}else{
		pdf(pdfName,width=7,height=7*height/width)
	}
	maxdim=max(width,height)
	width=width/maxdim
	height=height/maxdim
	df=df/maxdim
	# Change to coordinate system based on centre of rectangle
	df$x=df$x-width/2
	df$y=df$y-height/2
	# Draws circles on a square
	par(plt=c(0,1,0,1))
	plot(NULL,xlim=c(-width/2,width/2),ylim=c(-height/2,height/2),axes=FALSE,xaxt='n',yaxt='n',ann=FALSE)
	symbols(df$x,df$y,circles=df$r,fg=colour,bg=NA,add=TRUE,inches=FALSE) 
	rect(-width/2,-height/2,width/2,height/2)
	if(!is.null(pdfName)) dev.off()
}

# Number of circles and dimensions of bounding rectangle 
N=13; W=10; H=10

# Create objective function
obj=createObj(N,W,H)

# Lowest acceptable values for dimensions or radii (x,y,r) is zero
low=rep(0,3*N)
# x can go as high as W, y as high as H
# The biggest allowable circle radius is the smaller of W and H
up=rep(c(W,H,min(W,H)),N)
# Initial guess (random circle coordinates and radii)
z=rep(0,3*N)
z[1:N*3-2]=runif(N,0,W)
z[1:N*3-1]=runif(N,0,H)
z[1:N*3-0]=runif(N,0,0.1*min(W,H))

out=optim(par=z,fn=obj,method="L-BFGS-B",lower=low,upper=up,control=list(maxit=2000))
cat("Fractional area:",-1*out$value)

z=as.numeric(out$par)
x=z[1:N*3-2]; y=z[1:N*3-1]; r=abs(z[1:N*3-0])
dfA=data.frame(x=x,y=y,r=r)
drawCircles(dfA,W,H)

Testing optimisation algorithms: Part I - A circle packing objective function

Many mathematical modelling problems I face can be boiled down to constrained, multi-dimensional, global optimisation problems. Typically I have some model, with several parameters and some noisy data. I describe the relationship between the goodness-of-fit of a model to some data and the parameter values in the model as a mathematical function (an objective function) which I can then maximise within feasible parameter ranges. See my last post for an example of fitting a sine wave model to some made-up data.

The more parameters there are in the model, the harder the optimisation problem (particularly if there are few data!). Computational optimisation can take up a lot of CPU time, but maybe that's ok? CPUs get faster all the time, so maybe we can just wait for our next computer and the problem will go away naturally? Actually, over the past few years it has become apparent that individual CPUs are not really getting any faster, but instead we are seeing increased performance from computers and other devices through increased number of cores (effective CPUs) per device rather than increased speed per CPU. This shift in computational architecture represents a significant general programming challenge: how can we split up our problems to be solved by several independent CPUs and merge the resulting pieces efficiently?

This general problem is relevant for optimisation too. I recently built a new desktop computer. It has 6 cores, and with hyperthreading is capable of emulating 12 cores. How can I use all of these CPUs effectively to solve problems? To explore this, over the last few days, I've been looking at some optimisation libraries (so far in Python and R) in the hope of finding a satisfactory answer to this question. I will continue investigating and outline the results at a later date, but for now I want to present the optimisation problem that I am using for testing.

Packing circles into a rectangle

Consider N circles fully enclosed by a rectangle of width W and height H. The problem is to maximise the area enclosed by the circles while ensuring they do not overlap each other, or cross the box boundaries. Let each circle i have centre coordinates (xi,yi) and radius ri. We can write down the objective function (the fractional area enclosed by the circles, or packing density), which depends on xi,yi and ri, as:

subject to (xi,ri) being at least ri away from the rectangle edges (4*N linear inequality constraints):

and subject to no overlap between any circles (N*(1-N/2) non-linear inequality constraints):

The idea here is to maximise f, by choosing xi, yi and ri while obeying the constraints g1,...,g5. We can also consider a simpler variant where instead of choosing xi, yi and ri, we only choose xi and yi, and "inflate" each circle with a deterministic algorithm to satisfy constraint g5. This variant has only linear constraints and only 2*N parameters (instead of 3*N), but has an algorithmic objective function f.

These examples are useful, since a global optimum solution (where N is also allowed to vary) is known, giving us some idea of achievable packing densities. Also, it is trivial to increase the number of parameters to make the problem arbitrarily more difficult (simply increase the number of circles N).

For now, here are some examples of some (sub-optimally) packed circles:

N=10, f=0.6622
C

N=50, f=0.7517
C1

N=100, f=0.7563
C2

Oscillations

Last week a colleague asked for advice on how to estimate the frequency and amplitude of an oscillating signal. My initial response was to suggest reading the data into R, using the spectrum function to estimate the spectral density of the time series, taking the maximum as a good initial guess for the frequency of oscillation and using that initial guess to fit a sinusoid model to the data by least squares. I suggested trying to fit the rather gruesome "monthly deaths from bronchitis, emphysema and asthma in the UK, 1974–1979" dataset ldeaths which is embedded in R, and referred to as an example in the spectrum documentation.

On reflection, I think that this was a terrible example to give, since the ldeaths time-series is nice and regular. One thing I've learned from 10 years of working with biological data is that they are almost never regular or well-behaved. Therefore analysing this dataset is likely to be misleading. Many statistical tools assume regular datasets.

The alternative, more realistic example R workflow I now recommend (including the generation of a realistic, synthetic, example dataset) can be seen at the bottom of this post. Here is an example output:

Oscillating dataset and fitted models

While writing this, I remembered that way back in 2000 I ran a website which described solving this very problem as an example of using the Newton-Rhapson method for multi-dimensional optimisation. The tools I used then are different, but the two main principles are the same: 1) fit a model to the data for a robust estimate of frequency & amplitude and 2) take care to generate good initial guesses for parameter values if you want a sensible fit.

If we wanted to test for the significance of differences between frequency estimates generated from different datasets, without multiple experimental repeats, the problem becomes more interesting. In that case I would recommend using JAGS and rjags to infer distributed parameter values for the same model (with an added error model).


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
############################
# Generate best estimate for frequency and amplitude given irregular timeseries
# displaying oscillations by fitting a sinusoidal model.
############################

### Define sinusoid model
mod=function(A,f,c,p,t) A*sin(2*pi*f*t+p)+c

### Use model to generate a synthetic irregular time-series (with noise)
### For a real problem, replace this section with actual data!
# True paramter values
Atrue=5; ftrue=1/30; ctrue=20; ptrue=0
# Random simulation of irregular observation schedule
expt=sort(sample(0:100,50,replace=FALSE))
# Generate realistic dataset (irregular observation times and observations with error)
df=data.frame(time=expt,obs=mod(Atrue,ftrue,ctrue,ptrue,expt)+rnorm(length(expt),mean=1,sd=2))

### Estimate frequency of data by simulating a regular time-series
# Construct interpolating function for observed data
intf=approxfun(df$time,df$obs)
# Simulate a regular measurement frequency
mfreq=2
# Create regularly spaced time-series
regtimes=min(df$time)+mfreq*(0:((max(df$time)-min(df$time))/mfreq))
regobs=intf(regtimes)
# Generate and plot frequency spectrum for regular time-series
spec=spectrum(regobs,plot=TRUE)
# Extract the fundamental frequency
fguess=spec$freq[which.max(spec$spec)]
# Convert units from measurent frequency original units
fguess=fguess/mfreq

### Estimate other paramters
# Estimate amplitude as half range of observations
Aguess=(max(df$obs)-min(df$obs))/2
# Estimate offset as mean of observations
cguess=mean(df$obs)
# Estimate the phase as 
pguess=asin((df$obs[1]-cguess)/Aguess)-2*pi*fguess*df$time[1]

### Fit model to data
# Objective function (sum of squared difference between model and observations)
obj<-function(z){
	A=z[1]; f=z[2]; c=z[3]; p=z[4];
	return(sum((mod(A,f,c,p,df$time)-df$obs)**2))
}
# Optimisation
# Approximate sampling frequency (to avoid Nyquist limit)
sfreqest=length(df$time)/max(df$time)
lowerlim=c(0,0,-Inf,0)
upperlim=c(Inf,2*sfreqest,Inf,2*pi)
params=optim(c(Aguess,fguess,cguess,pguess),obj,lower=lowerlim,upper=upperlim,method="L-BFGS-B",control=list(maxit=10000))$par
Aopt=params[1]; fopt=params[2]; copt=params[3]; popt=params[4]
Apercent=100*(Atrue-Aopt)/Atrue; fpercent=100*(ftrue-fopt)/ftrue; cpercent=100*(ctrue-copt)/ctrue; ppercent=100*(ptrue-popt)/(2*pi)
report=paste("% Diffs: A:",signif(Apercent,3),"f:",signif(fpercent,3),"c:",signif(cpercent,3),"p:",signif(ppercent,3))

# Inspect goodness of fit
#pdf("Oscillations.pdf",width=8,height=8)
	plot(NULL,,xlim=c(min(df$time),max(df$time)),ylim=c(0.9*min(df$obs),1.2*max(df$obs)),type="n",xlab="time",ylab="value",main=report,cex.lab=2,cex.axis=2)
	curve(mod(Aguess,fguess,cguess,pguess,x),min(df$time),max(df$time),add=TRUE,col="red",lwd=3,,n=1001)
	curve(mod(Aopt,fopt,copt,popt,x),min(df$time),max(df$time),add=TRUE,col="blue",lwd=3,n=1001)
	curve(mod(Atrue,ftrue,ctrue,ptrue,x),min(df$time),max(df$time),add=TRUE,col="green",lwd=3,n=1001)
	points(df$time,df$obs,type="b",lwd=3)
	legend("topright",c("Truth","Guess","Optimised","Observed"),lwd=3,col=c("green","red","blue","black"))
#dev.off()
# Generate a small report
res=data.frame(A=c(Atrue,Aguess,Aopt),f=c(ftrue,fguess,fopt),c=c(ctrue,cguess,copt),p=c(ptrue,pguess,popt))
rownames(res)=c("Truth","Guess","Optimised")
print(res)

Parsing and analysis of GPX files in R

As I mentioned in a previous post I've been working on ways to parse the .gpx XML files which are generated by my Garmin Forerunner so that I can analyse the captured data more carefully. The online data visualisation tool provided by Garmin is good, but there are a few flaws. It's not always possible to scale data nicely, it's not possible to compare between datasets, and a few other things. Reading the raw data into R would allow all kinds of analysis.

R is a very powerful, flexible, open-source statistical programming language which is really good for parsing and analysing datasets.

I've written some R code to parse .gpx files, analyze the data and generate .pdf reports. You just put the script into a directory, together with the .gpx file to be analyzed (which you can access by clicking the export button on the page for any of these activities for example), add the filename to the top of the script, and run it. A .pdf report like this one is generated.

The reports contain several types of plot, including the following four types:

Elevation plot
Elevation varying along route followed. Colour indicates elevation, x and y coordinates represent distance from the starting point in a northerly and easterly direction respectively.

Heart rate plot
Heart rate varying along route followed. Colour indicates heart rate, x and y coordinates represent distance from the starting point in a northerly and easterly direction respectively.

Pace plot
Pace varying along route followed. Colour indicates pace, x and y coordinates represent distance from the starting point in a northerly and easterly direction respectively.

Four different performance statistics plots
Elevation, heart rate, speed and pace varying with distance covered along route.

The R code is effective but untidy. I had intended to write it nicely into S3 classes or even an R package, as I think that others might be interested in using it, but I have run out of steam a bit. If anyone is interested in it, let me know as this would give me some motivation!

Here are some other reports I've generated. The first four runs are along essentially the same route and are comparable.

Scanning Plant Tissue

I have an Epson V700 Photo Scanner, designed for making relatively high resolution scans of film transparencies (technical specs claim 6400dpi, but scanning options include up to 12800dpi). Usefully, it allows scanning of reflective media at the same high resolution, which means I can capture images (of objects flat enough to go on the A4 glass bed) at almost microscopic levels of detail. It can also capture colour in these images at 48 bit depth (16 bits per RGB channel instead of the usual 8 bits), which allows lossless colour optimisation before converting to a 24 bit image for final output.

I've just moved house, and there are a range of strange plants in the garden. I've scanned some tissue samples from them in the hope of capturing scientific resolution images (below). Not quite able to see stomata on the leaves, but on the flower petals you can make out individual cells if you view the images at their original size (click through images below, or here for example). Nevertheless, the magnification is impressive, there are many features visible which you cannot make out with the naked eye, or even after macro photography. Also, I like the colours!

GreenLeaf2
YellowPurple
RedLeaf
Purple
GreenLeaf

Quantitative Fitness Analysis and the Telomere Cap

TheEndOfTheLine

Our article describing a new, high-quality, high-throughput method for carrying out screens using arrays of micro-organisms is published today in PLoS Genetics. I'm very pleased to be one of three joint first authors of this article!

Quantitative Fitness Analysis Shows That NMD Proteins and Many Other Protein Complexes Suppress or Enhance Distinct Telomere Cap Defects

We present Quantitative Fitness Analysis (QFA), which is a robot-driven technology for capturing high-resolution growth curves on a genome-wide scale, allowing sensitive differentiation between growth phenotypes of arrays of micro-organisms and inference of genetic interaction strengths. In this paper we examine the response of brewers yeast to two distinct telomere disruptions, in combination with independent deletions of all 4300 non-essential genes in the yeast genome. By doing this we greatly increased our understanding of telomere biology, which is particularly relevant to ageing and cancer.

We are very interested to hear from anybody who might like to carry out QFA screens relevant to their own particular area of biological research, and have recently set up a high-throughput screening service at Newcastle University:

http://research.ncl.ac.uk/bioht/

Real datasets with R - 10k finish times

In my last post, I demonstrated my GPS running traces. One of those traces (this one) was from a 10km road race I ran over christmas: the Fields of Athenry 10k which is organised every year by the Athenry Athletic Club. When I was growing up, I lived next door to Paul McNamara who is some kind of running machine, and he's involved with organising this race every year. It is a fantastic event, a real credit to the town, and gets better every year.

I've run the race this year in January 2011, (45.01 (135th)), it was postponed from St Steven's day 2010) and in December 2008 (44.26 (150th)). To complement the data I captured during the race, there are two more interesting datasets which are the finishing times for participants for these two years (available for 2008 here and for Jan 2011 here).

This year, I ran more slowly than I would have liked. I have a bucket of excuses (couldn't train in weeks leading up to race because of bastard snow and ice in Newcastle, got flu in weeks before race, over-trained just before race in attempt to compensate) but what I thought I would like to do is demonstrate statistically that my time wasn't as bad as all that. While attempting to show that everyone ran slower this year than in 2008 (they didn't) I wrote some R code to do the analysis, which I thought might be useful for demonstrating one of the strengths of R: reading in data.

Similar to Python, R has some functions for painlessly reading in text files. This makes idle mucking about with data much more likely to happen, since it doesn't involve building complicated functions to parse files first. These two finish times datasets are good examples of real data, they are formatted in non-standard ways, they have gaps, empty records, empty columns, timestamp formatting to deal with. The code got a bit too complicated for a simple demonstration of how to deal with data in R, but it is well commented and it is a good example of data handling, you can have a look at it here. The datasets to make the code run are also here and here.

Density Plots: detailed look at finishing time distributions

This is a density plot of the finishing times for all participants for both
years I raced. My times are marked with vertical dashed lines.

Finishing Time Distributions

I had hoped to demonstrate that everyone was slower in 2011, however, comparing these distributions using the Mann-Whitney (or Wilcoxon) test yields no evidence for a significant difference in medians (p-value 0.097). This test is more appropriate than a t-test because the data are clearly non-normal (asymmetric and bimodal). Blindly using a t-test regardless gives the same result, but with a larger p-value (0.49).

Instead of inspecting these differences visually by comparing the distributions above, we can generate some summary statistics for the two datasets:


# Summary Statistics December 2008
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.67 46.12 51.48 54.61 59.08 120.00

# Summary Statistics January 2011
Min. 1st Qu. Median Mean 3rd Qu. Max.
31.20 44.48 51.05 55.26 58.78 195.60

In 2008 my time was around the 19.5% centile and in 2011, around the 27.5% centile. Too depressing to figure out of this difference is significant (I suspect it is).

Besides trying to justify my slower time, there are other interesting comparisons which can be drawn using these data. The following plot shows both the male and female finishing time distributions for both years.

Comparing Finishing Time Distributions By Sex

Here are the summary statistics for each of these categories:

# Male December 2008
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.67 44.09 48.60 50.13 53.97 120.00
# Female December 2008
Min. 1st Qu. Median Mean 3rd Qu. Max.
36.88 51.08 57.75 61.75 67.96 102.70
# Male January 2011
Min. 1st Qu. Median Mean 3rd Qu. Max.
31.20 42.16 46.87 48.42 52.70 103.30
# Female January 2011
Min. 1st Qu. Median Mean 3rd Qu. Max.
35.87 51.50 58.67 65.41 67.34 195.60

Again we can test for significant differences with the Mann-Whitney test, and find that in 2008 the median finishing time for men was significantly faster than that for women (between 8.2 and 11.16 mins faster). In 2011 this was again significant and between 6.7 and 10.3 mins faster. More surprisingly we can show that the median finishing time for men in 2011 was between 0.5 and 2.7 mins faster than that for 2008, and was significant. Similarly for women it was between 0.35 and 5 mins faster.

Boxplots: visually comparing many distributions

There are still further levels of detail in this dataset, we can break down each sex by age category and look at distributions within each category, for instance comparing junior men (MJ <18) with senior men (MS 18-40) and men between 60 and 65 (M60). We can even begin to speculate whether we can see evidence for (and even quantify) age-related physical deterioration.

Comparing Finishing Time Boxplots By Sex

I've plotted my finish time as a horizontal dashed line in these boxplots. You can see that plenty of senior women, men over 60 and junior men beat me.

We can plot the same data for both sexes side by side, matched by age:

Comparing Finishing Time Boxplots By Age

I was surprised at how well the M60 and M45 categories fared compared to their younger competitors. The M60 category in particular would be my father's category if he were to race! Several of those guys beat me well. This is not really how ageing is supposed to work. I was also surprised at how closely matched the WJ and MJ categories were.

The equivalent of these last two plots for 2011 can be found here and here. The surprising observations are consistent across the two years.

I am now interested to know why does the M60 category perform so well? Is it because people who are still running at age 60 have been running all their lives and have self-selected themselves to be good runners? My colleague Peter Banks believes that older runners have less competition for their time (children grown up, possibly retired) and can train more effectively. Also, 10k running requires quite long-term training to build up the necessary stamina to get a good time. Maybe there's hope for me to improve yet (hopefully before I'm 60).

More data

One of the best things about my job is that I get to play with all kinds of interesting biological data. Especially interesting is being involved in gathering data to develop mathematical/computational models of biological processes. In my particular area of research, the processes I try to model and understand are related to how humans age. These data are expensive and difficult to obtain, requiring laboratory facilities, expensive robotic equipment, powerful microscopes, laser dissection tools, access to rare tissue samples for example.

My background and principal interest is in computational and mathematical modelling. Although interesting, complex biological problems are definitely the most satisfying applications for these tools, I sometimes enjoy applying these tools to other problems for fun. For example, much of the video and digital image work I've done over the past few years has involved taking other kinds of data (image data from scanners, DSLRs, telecines, DVDs and DV cameras, audio data as MIDI instructions or sampled waveforms) and analysing and manipulating these mathematically. These data are in some sense rich (there are a lot of pixels/frames/samples -> a lot of numbers -> a lot of data) and are easy and cheap for me to capture, which makes them enjoyable to work with.

Lately I have also become interested in biological data that are easy to capture for fun. Over the past year or so, I have gathered some interesting datasets about myself. I've been thinking about writing a massive blog post describing them all, but I am struggling to find the time, so I think I'll just start with one, and hopefully add to this later.

I've written before about my return to running, but recently, I've been given a GPS training device, a Garmin Forerunner 405CX (much cheaper to buy from Amazon). This device records GPS coordinates and heart rate simultaneously, roughly once per second, for up to several hours, and if you wear it while training, you can use it to plot your route, calculate how your speed changes with time, and how your heart rate reacts to speed, terrain and durational exertion.

These data are presented in a nice website hosted by Garmin, which you can have a look at here. They can also be downloaded as .xml files for more detailed analysis. Hopefully this will come later once I've gathered more data. For now, I will go for a run!

11th International Conference on Systems Biology, Edinburgh 2010

I've just returned from the 2010 International Systems Biology Conference in Edinburgh. It was an academic meeting, consisting of 4 full days of scientific presentations, many in parallel. It was attended by about 1000 scientists from around the world. Some of the talks were live-blogged here.

The definition of systems biology is a bit nebulous and controversial, but I have some definite ideas on this subject. Here is the definitive definition:

Systems biology is scientific research whose goal is an holistic, mechanistic understanding of biological systems. This goal should be achieved through multi-disciplinary research using an iterative cycle of experimental and theoretical research. The most natural framework for such research is the development of mechanistic, mathematical, dynamic simulation models and the testing of scientific hypotheses embedded in those models by rigourous comparison with (preferably a lot of) data.

My view of systems biology is strongly influenced by my background in systems engineering, which is a more developed and clearly defined area of scientific endeavour. Systems engineering shares many common features with systems biology, it's typically interdisciplinary (largely because engineers are mathematicians, chemists, biologists, accountants, physicists, computer scientists all in one person), it involves simulation modelling, it involves iteration between predictive modelling (for design initially), experimentation (the building, operation and testing of machines or chemical plants), model refinement based on the capture of data specifically relevant to the model and model use in predictive mode (process optimisation). There are two major qualitative differences between systems engineering and systems biology. 1) Physicochemical observations required for parameterising and validating engineering models are much easier to make or estimate than equivalent biological observations. 2) Engineering models are built with very specific goals in mind (e.g. to maximise speed, production rate, performance or safety of a machine or plant) which can all be reduced to attempts to make as much money as possible.

The first difference makes systems biology more challenging and interesting than systems engineering. However the second difference is the most striking and informative, particularly in light of the current state of systems biology as presented at the ICSB meeting. This current state is a bit different to my ideal of how systems biology should work as outlined above. During the meeting I was struck by several examples of people gathering data and performing analysis which had no underlying mechanistic modelling framework. Eric Werner seemed to want complete virtual organisms, capable of exact reproduction of living systems, but I think that this approach would be very inefficient, and that useful models are not complete, but rather focussed on a particular biological problem or question. There's nothing wrong with high-throughput hypothesis-free research, I do some of this kind of work myself, however it's not systems biology. An interesting example of this was Mike Tyers' complaint that the Tyson-Novak model of the cell cycle cannot explain the variation observed by Costanzo et al. in their 2010 Science paper. The comparison is simply unfair. The Tyson-Novak model is elegant, relatively simple and focussed on a specific aspect of yeast biology: the cell cycle. It increases our understanding by highlighting the important aspects of a complex system that make it work the way it does. The Costanzo data is a high-throughput dataset, captured in a completely hypothesis-free fashion and is affected by all aspects of yeast biology.

I have been interested in systems biology and attending systems biology conferences for about 10 years now, and I think that as time goes on the systems biology community is straying from mechanism- and model-based research towards high-throughput data capture and analysis. This kind of work can be exciting too, but I think that it's unhelpful to classify it as systems biology. High-throughput data analysis can provide an excellent starting point for the systems biology cycle but I would prefer it was classified as genomics and bioinformatics. I think that the meeting would have been better had the organisers selected more ruthlessly for mechanistic modelling talks. It would have been a shorter and smaller meeting and perhaps even not as high profile, but I think it could have been more useful and exciting.

Despite my negative rantings, there were many excellent talks that I really enjoyed only a few of which I had seen before. In no particular order these included:

Michael Stumpf's talk on using Approximate Bayesian Computing to infer parameter distributions (and model robustness) for a model describing the movement of macrophages towards wound sites in zebra fish. Testing whether macrophage walks are random. Observed constrained "drunken" walk, constrained in a line, but randomly moving forwards or backwards. Different amounts of information for different model parameters, some parameters are "sloppy". Signal transduction detail largely irrelevant.

Thomas Pollard's very pro-dynamic-modelling, pro-mechanism talk about actin motility. "You can't understand complicated biology without mathematical models". "Passionate about using time as a variable in experiments". "Without quantitative rate parameter measurements you are lost". Described careful calibration of GFP measurements to protein number. Estimated reaction and diffusion rates. Demonstrated that assuming mass-action kinetics in a well-stirred vessel not appropriate sometimes, need a spatial dimension. Described careful selection of GFP tag to ensure no effect on growth rates.

Luis Serrano's talk presenting the huge amount of diverse data captured describing one small bacterium. Spent a lot of time demonstrating that mRNA and protein levels are not correlated, but this is well known at this stage.

Michaela de Clare and Annette Alcasabas presented focussed combination of modelling and experimental work on a switchable tetraploid yeast system where each of 4 copies of relevant genes can be switched on and off at will, giving haploid, diploid, triploid and tetraploid yeast, a sort of more controlled method of overexpression, together with modelling work predicting mutant growth rates and comparisons of predicted and observed rates.

Douglas Murray presented a continuous yeast culture system, allowing the culture to remain out of stationary phase, but for cell concentrations to remain constant. These cells begin to oscillate in oxygen uptake, growth and transcription with a period of about 40 mins. He uses his observations to validate the Manchester yeast metabolism model.

Franziska Matthaeus described work on the tumble-run decision in bacteria. Exponentially distributed run lengths gives a random walk but power-law distributed run lengths can give a Levy walk which is qualitatively different and explores space more rapidly. Can construct power-law distribution from the sum of exponential distributions with different parameters. Showed that Gaussian parameter distribution gives power-law run-length distribution therefore better space-coverage, therefore example of noise giving selective advantage. Showed great 2D simulations of bacteria motion in response to nutrient gradient (random walk slower but more adaptive and precise than Levy walk) and discussed mechanisms underlying run-length distributions.

Darren Wilkinson presented work demonstrating the importance of modelling the measurement process when using GFP tags to infer stochastic model parameters. Measuring protein levels directly is best, but not practical. Modelling connection between GFP observations is the way to do things. The alternative (common) way to proceed is to use GFP as a surrogate for protein, with perhaps a fixed proportionality confusing the two. This gives poor results since you are lumping measurement error and intrinsic model stochasticity together, when they are in fact independent. The relation between GFP and protein levels was a pretty major theme throughout the meeting.

Austrian-Italian holiday

KTGlasses

Katy and I went on a long trip through Germany, Austria and Italy this year. We spent a day in Munich, and wandered around the Englischer Garten which is huge and features a surprising array of water sports right in the centre of a major city. We saw lots of people swimming between bridges in the fast-flowing (and cold) Eisbach river, on a hot day, which looked like great fun.

InThePongau

From there we went to Sankt Johann im Pongau in Austria. Highlights there included climbing from town (600m) to Hahnbaumalm (1122m), climbing from town (600m) to Ober-Alpendorf (~850m), down to Alpendorf, getting a cable-car from there to Obergassalm (1520m) and climbing to Sonntagskogel (1849m). No snow of course! It was about 30C. We walked all the way back down. The Jagersteig detour from Kreistenalm was very nice with black squirrels, wild rasberries and loads of mushrooms.

Another day we drove along the Grossglockner High Alpine Road, very impressive, but a lot of the views were obscured by heavy cloud. We did see the Pasterze Glacier for a few minutes, and lots of marmots. We had a day in Salzburg, there was hot sunshine and torrential rain.

NoCycling

From there, we went to Bolzano, just over the border in Italy, home of Otzi the iceman. The Otzi exhibition is really interesting. Bolzano is a gorgeous city built in a steep valley with a large river flowing through it. There are several churches, two of which are particularly impressive, nice parks, amazing array of paths, bridges and cycle-lanes, but best of all, there are 3 extremely cheap and regular cable cars going from the city centre (267m, was 33C while we were there) up to the mountains (which were as cool as 10C at the highest). Bolzano is the only city I've ever been to where you have such complete control over your ambient temperature, and when you have as much hair as I do, this really is a wonderful thing. We had two good day trips, one from Bolzano (267m) to Oberbozen (1220m) by cable car. From there we took a narrow-guage train to Kolbenstein (1156m), and walked through the forest to Pemmern (1538m), where we took another cablecar to Schwarzseespitze (2070m) and walked to Rittner Horn (2260m), where there were spectacular views and panoramawegs... Our second trip from Bolzano (267m) was to Bar Eidelweiss (1108m) on the extremely vertical and scenic Seilbahn Kolhern (cable-car). We then wandered through the woods to Titschen (1616m) and accidentally stumbled across the Rotenweg (near Rotenstein (1506m)) which gave a stunning view over the Bolzano valley. Our holiday was from 20th August - 4th September. It seemed a good time to go, there was loads of sunshine, we were in the mountains for mushroom season (saw many, many varieties of mushroom).

Boots

Finally we went to Fiesole on a hill overlooking Florence for the wedding of my friend Helena, which was undescribably great.

Islamophobia

There was a big English Defence League (EDL) march in Newcastle today. The EDL seems to be a new and pretty offensive anti-Islam movement, populated by British National Party supporters, with a similar ethos to the National Front. I would have liked to have attended the counter-protest by Unite Against Fascism, but I only heard about all this yesterday and didn't get my act together in time. I did go into town this afternoon, but by the time I got there, there was nothing left except for hundreds of police and an empty stage under Grey's Monument. By all accounts, the EDL supporters were outnumbered by UAF supporters, which makes me glad. I worry a little bit about English nationalism in Newcastle. I wonder why the white front door of a flat on Nuns Moor Road, surrounded by Asian businesses, was daubed with red gloss paint a couple of weeks ago, making a huge St. George's flag. I wonder why I had to take an EDL sticker off the door of the squash courts I use last week. Anyway, having missed the marches today, I was reading about them online, and came across this interesting video:

http://www.guardian.co.uk/uk/video/2010/may/28/english-defence-league-un...

To my dismay, I know the EDL supporter interviewed 7mins 36 seconds in. He works for my landlord, who I believe is a Muslim, in the chip shop beneath my flat!

Colonyzer: automated quantification of micro-organism growth characteristics on solid agar

Our article describing a new, sensitive image analysis tool for quantifying the density of micro-organism cultures growing on solid agar plates has been accepted for publication in BMC Bioinformatics, you can access it here.

Colonyzer is largely written in Python using the Python Imaging Library, but also uses the rgenoud genetic optimisation package in R (via RPy).

Colonyzer has a website here and if you want to use or develop the code, it is freely available from its sourceforge page.

DOI: 10.1186/1471-2105-11-287.

Python for Scientists

I will be running a half-day workshop on Python programming for Medical School postgraduate students on Tuesday.

I've written some notes up as a website describing how to install Python on Newcastle University Windows Desktops, how to execute Python scripts, how to install Python packages, a brief introduction to programming, and some examples.

http://www.staff.ncl.ac.uk/conor.lawless/ScientificPython/

I'm trying to make the examples in particular relevant for biological researchers.

Amie knits gorillas

Ageing cured!

Ageing has been solved. We are all out of a job.

Quantitative assessment of markers for cell senescence

Our paper describing the use of a simple, well-parameterised cell population model to compare the usefulness of various cellular markers as indicators of cellular senescence is now available in Experimental Gerontology. We used CaliBayes (published here) to give distributed estimates of the increase in senescent fraction with time in passaged human fibroblasts.

DOI: 10.1016/j.exger.2010.01.018.

Daedalus of DREADCO.

A few weeks ago, I had the very great pleasure of visiting the home of Prof. David Jones to take some photographs of him and some of his creations as promotional material for a documentary about him by Adrin Neatrour. Prof. Jones is probably most famous for writing the Daedalus columns in New Scientist and Nature, but is also well known for building unridable bicycles, investigating Napoleon's death, and building puzzlingly perpetual motion machines.

ProfessorJonesMirror4

Perpetual

DavidProjector

FiveFrames

ChemicalGardenInspection

He is a fascinating, mischevious and inspirational scientist and engineer with a house packed to the brim with amazing tools and gadgets. Prof. Jones showed me a lathe where he had recently carved a set of billiard balls with centres of gravity which were not located at their centres to give to a relative as a christmas present!

Adrin's documentary is really enjoyable, I recommend it to anyone, and it will shown, for free at the Tyneside Cinema at 3pm on Tuesday 16th March 2010 as part of the Newcastle Science Festival 2010.

The documentary is distributed by Galloping Films.

CloseEncounters

I stumbled across the Python winsound module today. It is a standard library on Windows machines and it has a function for controlling the PC speaker. Harf.

Hear it play the music from Close Encounters of a Third Kind from a list of frequencies and durations:

import winsound
from winsound import *
music=((784,1),(880,1),(698,1),(349,1),(523,3))
for x in music:
    Beep(x[0],550*x[1])

A list of frequencies and durations is a simple, computational form of musical notation.

CaliBayes and BASIS: integrated tools for the calibration, simulation and storage of biological models.

Our new article published today in Briefings in Bioinformatics, describing CaliBayes (Bayesian inference for parameter distributions in SBML models, including discrete stochastic models) and BASIS (a tool for simulating cohorts of stochastic simulations from an SBML model and storing the results). Both of these systems are accessible as webservices and there are various convenient Python, Java and R client libraries to make consuming these webservices extremely straightforward. Both systems make significant computing power at Newcastle University publicly available.

Free access to .pdf

The DOI entry for the article is now working:

10.1093/bib/bbp072

Zoom on a Random Walk

After reading Darren's post about the Mandelbrot set I was browsing through the Wikipedia entry on the Mandelbrot set which has a few amazing animations of really deep zooms on the fractal bounded by the set.

The good/bad thing about fractals is that as you magnify them, the pattern and resulting image that you see is in some way similar to patterns you saw at lower magnifications. So the zooms are in some sense self-similar.

It occurred to me that viewing random walks at different magnifications should have a similar property. You should not be able to tell what magnification you are looking at, since the average trajectory should be random, regardless of the scale of the trajectory. As long as you are away from the scale at which you can observe any discrete steps making up the walk, that is.

So, we should be able to make semi-infinite zooms of random walks, in which all magnifications look believably random and as if they were generated from the same process but without any repeating or recurring patterns. I think the random walks are quite pleasing to look at. It is a bit tricky to create this zoom though, since I can't think of any cunning way to step from one magnification to another by "filling in" the gaps in the walk that have just come into view. So I have to create a massive random walk image, and progressively resize and crop it to achieve the zoom. Also, the PyGame code I posted a while ago is too slow for generating a massive random walk. However, there's no need to see the walk being generated in this case, so we can just colour pixels black and anti-alias during image resize (using the Python Imaging Library) instead of doing it in real-time.

Deep zooming is cool anyway as you can see in the amazing Powers of Ten video by Charles and Ray Eames.

So, here's a zoom on a random walk that I made (approximately 100 times zoom, made on a 32-bit machine with 3.5Gb RAM):

Zoom on a Random Walk

The frames were generated with the Python code below. If you have a more powerful computer than me to hand, you can generate a deeper zoom! Just increase the width and height parameters until you run out of memory or patience. I patched the output images together and created a .gif animation using the amazing VirtualDub.

import PIL,random,math,os
from PIL import Image,ImageDraw
 
# Define screen size and initialize
size=width,height=3500,3500
plot=Image.new("L",(width,height),'white')
pix=plot.load()
 
# Output Frame size
wf,hf=250,250
# Zoom parameters
zmax,zno=1.0,700
 
# Start at centre of image
x,y=width/2,height/2
pix[x,y]=0
 
# Prepare to find the bounding box for the walk
xmin,ymin,xmax,ymax=x,y,x,y
pcount=0
 
# Start walking until we hit the edge
while 1<=x<width-1 and 1<=y<height-1:
        # Generate new steps
        xnew=x+random.randint(-1,1)
        ynew=y+random.randint(-1,1)
        x,y=xnew,ynew
        pix[x,y]=0
        pcount+=1
        if x<xmin:xmin=x
        if x>xmax:xmax=x
        if y<ymin:ymin=y
        if y>ymax:ymax=y
 
# Crop to the walk      
plot=plot.crop((xmin,ymin,xmax,ymax))
 
# Rotate if necessary to make image wider than tall
if xmax-xmin<ymax-ymin:
        plot=plot.rotate(90)
(w,h)=plot.size
 
# Set minium zoom to just capture entire walk
zmin=float(wf)/float(w)
 
# For geometric zooming:
# zoom(z) = a*pow(phi,b*z)
# insist that zoom(0)=zmin, zoom(zno)=zmax
phi=10.0
a=zmin
b=math.log(zmax/zmin,phi)/zno
 
for z in xrange(0,zno+1):
        zoom=a*pow(phi,b*z)
        print "FRAME: ",z, zoom
        blank=Image.new("L",(wf,hf),'white')
        if z>0 and z%50==0:
                # Throw away pixels to achieve deeper zoom with set memory
                cz=float(hf)/float(hnew)
                xa=max(0,int(round(float(w)/2.0-cz*float(w)/2.0)))
                ya=max(0,int(round(float(h)/2.0-cz*float(h)/2.0)))
                xb=min(w,int(round(float(w)/2.0+cz*float(w)/2.0)))
                yb=min(h,int(round(float(h)/2.0+cz*float(h)/2.0)))
                plot=plot.crop((xa,ya,xb,yb))
                (w,h)=plot.size
        wnew=int(round(zoom*float(w)))
        hnew=int(round(zoom*float(h)))
        # Resize the frame
        small=plot.resize((wnew+1,hnew+1),Image.ANTIALIAS)
        # This crop is to trim grey lines which appear
        small=small.crop((1,1,wnew+1,hnew+1))
        xmin=int(round(float(wnew)/2.0-float(wf)/2.0))
        ymin=int(round(float(hnew)/2.0-float(hf)/2.0))
        xmax=int(round(float(wnew)/2.0+float(wf)/2.0))
        ymax=int(round(float(hnew)/2.0+float(hf)/2.0))
        if wnew<wf or hnew<hf:
                frame=Image.new("L",(wf,hf),'white')
                px=int(round(float(wf-wnew)/2.0))
                py=int(round(float(hf-hnew)/2.0))
                frame.paste(small,(px,py))
        else:
                frame=small.crop((xmin,ymin,xmax,ymax))
        fname="FrameL%05d.png"%z
        frame.save(fname)
 
print "Zoom: ",1.0/zmin
print "Steps: ",pcount

Random Walks

I went to a nice talk today by Nicolas Le Novere about some current developments in SBML, a standard language for describing biological models to computers.

During his talk he mentioned some of his work on modelling neurons and flashed up an image representing a molecule diffusing into a synapse. I think it was undergoing Brownian motion, which is basically a random walk in space (looked like 2-dimensional space to me).

In May I wrote some Python code using PyGame which simulates random walks, generating still images of them, but also displaying an animation of their generation (see below). It generates anti-aliased lines, so it still looks smooth even when taking tiny step sizes. Simply run it, and it will show you how it is choosing the random steps in the walk, and finally it will produce a .png image which looks like this:

RandomWalk

You can change the resolution, step-sizes etc. yourself. I am not sure if the steps should be normally distributed or uniformly distributed to be true Brownian motion, but I can't see any difference in the patterns generated by eye. Anyway, it's a nice simple example of using PyGame for something non-game related.

import pygame, sys, numpy
from pygame.locals import *

# Define screen size
size=width,height=1000,1000
# It's much faster if we don't show the walk developing...
showwalk=True

# Initialise screen, set size
pygame.init()
screen = pygame.display.set_mode(size)
pygame.display.set_caption('A Random Walk')

# Fill background with white
background = pygame.Surface(size)
background = background.convert()
background.fill((255, 255, 255))

# Start at centre of image
x,y=width/2,height/2

def Exit():
        # Blit everything to the screen
        screen.blit(background, (0, 0))
        pygame.display.flip()
        # Save image
        fname="RandWalk"+str(numpy.random.randint(0,99999))+".png"
        pygame.image.save(background,fname)
        # Quit gracefully
        pygame.quit()
        sys.exit(0)

# Start walking until we hit the edge
while 0<=x<width and 0<=y<height:
        # Generate new steps
        xnew=x+numpy.random.uniform(-1.0,1.0)
        ynew=y+numpy.random.uniform(-1.0,1.0)
        # Let's try to draw an anti-aliased line
        pygame.draw.aaline(background, (0,0,0), (x,y), (xnew,ynew), True)
        x,y=xnew,ynew
        # Blit everything to the screen if we want to see development
        if showwalk:
                screen.blit(background, (0, 0))
                pygame.display.flip()
        # Exit gracefully if required by pressing q
        for event in pygame.event.get():
                if event.type == KEYDOWN and event.key==K_q:
                        Exit()
Exit()

Rasterizing Archimedes Spirals

Imagine you want a list of coordinates, following a spiral, on a discrete, regular grid. I can think of several examples where this could be useful for digital image construction, where the regular grid corresponds to pixel coordinates (integer only). It's easy to code up a square spiral like Ulam's spiral, since pixels are naturally arrayed on a rectangular grid, but using square spirals in image construction leaves obvious square corners. A circular spiral can be smoother and better.

Let's look at the Archimedes spiral in polar coordinates:

If we choose a = 0 and select b so that:

we can write the equations describing the spiral in Cartesian coordinates, in terms of , the distance between spiral loops:

To get an approximation of this smooth curve on a regular grid, we can think about trying to move along it in jumps of constant length , where the jump size is achieved by selecting a change in angle . Computing the distance between one point and the next we can write down this expression:

We can rearrange this expression into something simpler:

It would be nice to solve this for , then we would know how big an angle we need to jump to move exactly at a given time. Remember that as the radius gets bigger, the angle required to achieve a given displacement decreases. Unfortunately, I can't solve it analytically, due to the presence of the cosine term, however we can approximate cos() with the Taylor Series expansion up to 3rd order, giving us this expression:

This can easily be solved for small values of . If we define the perpendicular distance between pixel centres as 1, we know that the maximum distance between any two adjacent pixels is 1/. Therefore if we choose our parameters like this:

we would expect to move along the spiral, moving close to every pixel at least once. If we continue to step along the spiral like this indefinitely, and round every coordinate to the nearest integer value, we should hit every pixel in Archimedes spiral order!

This diagram shows the result of applying this algorithm:

Rasterizing Archimedes Spiral

The blue curve is a continuous Archimedes spiral, the numbered red points are the pixel approximations of the continuous curve and the red lines demonstrate the actual path taken between unique, discrete approximation points.

Below is a demonstration of what this sort of thing can look like. In this case I sorted the pixels of an original image by intensity and re-drew them along the approximate pixellated Archimedes spiral path:

Archimedes Spiral

The original image is here:

YellowCandleHolder

I've written some python code to generate these rasterized pixel spirals. You specify an image size (rectangle) and the coordinates of the spiral focus, and it generates a unique list of points lying inside that rectangle, in spiral order (as in the diagram above). It's quite fast, it can generate enough points to fill a 6 megapixel image in a couple of minutes.

from math import *
 
def uniquer(seq, idfun=None): # http://code.activestate.com/recipes/52560/
    # Generates a unique list of elements from seq
    # maintaining their original order
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result
 
def x(theta,alpha):
    return int(round(alpha*theta*cos(theta)/(2*pi)))
 
def y(theta,alpha):
    return int(round(alpha*theta*sin(theta)/(2*pi)))
 
def delta(alpha, theta, omega):
    return (2*pi*omega)/sqrt(pow(alpha,2)*(1+pow(theta,2)))
 
def dist(a,b):
    return sqrt(pow(a[0]-b[0],2)+pow(a[1]-b[1],2))
 
def rasterspiral(TL,BR,F):
    # Distance between spiral loops
    alphas=1/sqrt(2)
    # Distance along spiral at each step
    omegas=1/sqrt(2)
    # Starting theta
    thetas=0
 
    # Define top left pixel and bottom right pixel coordinates
    (tlx,tly)=TL
    (brx,bry)=BR
    # Spiral focus coordinates (must lie within rectangle above)
    (xf,yf)=F
 
    # Find which corner is farthest from the spiral focus
    corners=[[tlx,tly],[tlx,bry],[brx,bry],[brx,tlx]]
    distances=[dist([xf,yf],i) for i in corners]
    farthestind=max(xrange(len(distances)), key=distances.__getitem__)
    farthest=corners[farthestind]
 
    # Initialise the spiral at F
    xc,yc=xf,yf
    results=[(xc,yc)]
 
    # Keep spiralling until we've capture the farthest corner
    while [xc,yc]!=farthest:
        # Update the step in angle based on current angle etc.
        deltanew=delta(alphas,thetas,omegas)
        thetas=thetas+deltanew
        # Calculate the next coordinates
        xc,yc=xf+x(thetas,alphas),yf+y(thetas,alphas)
        # If these coordinates lie within the defined rectangle
        # then store them
        if tlx<=xc<=brx and tly<=yc<=bry:
            results.append((xc,yc))
 
    # Now eliminate duplicate copies of points on the spiral
    results=uniquer(results)
    return results
 
# Example of use:
test=rasterspiral((0,0),(50,100),(25,50))

Elevation

bikehike is a UK site on which you can record run/hike/cycle routes online. Routes can be input by either clicking on a google map (or indeed using google routefinder) or by uploading a set of GPS coordinates as generated by a GPS-aware training watch for example. This is pretty much how the gmap-pedometer site works, BUT this is far more awesome since as well as using the full range of google map tools (satellite view, nice interface for zooming around, postcode search etc.) it also incorporates information from the ordnance survey, such as elevation data! Also OS maps tend to have many more nameplaces on them than google maps.

To see it in action, here is the route I run on Wednesdays. If you check the Show Elevation Data box, you can see the route elevation, and if you click on the Satellite button, you can see a browsable overhead photograph of the route. This is far cooler than the gmap-pedometer routes I had saved previously.

Now all I need is one of those GPS watches...

Syndicate content