Note the contrast:
In every math book I’ve ever seen, the log(⋯) function is the basee natural logarithm. The same goes for every nonspreadsheet computer language I’ve ever heard of.  In contrast, alas, excel decided to implement log(⋯) as the base10 common logarithm. Other spreadsheet programs have followed suit. 
This includes imperative languages such as C++ and functional languages such as lisp.  FWIW, a spreadsheet is considered more of a declarative language, as opposed to imperative or functional. 
This is a huge trap for the unwary. Constructive suggestion:
When writing in a nonspreadsheet computer language, do not write log(⋯). Write ln(⋯) instead. If necessary, define your own ln function, e.g. #define ln(x) log(x).  When writing spreadsheet code, do not write log(⋯). Write log10(⋯) instead. That’s another name for the same builtin function. 
Note the contrast:
In every math book I’ve ever seen, the floor(⋯) function rounds toward minus infinity (not toward zero). The same goes for every nonspreadsheet computer language I’ve ever heard of including imperative languages such as C++ and functional languages such as lisp.  In contrast, alas, excel decided to implement floor(⋯) as rounding toward zero. Other spreadsheet programs have followed suit. 
This is another huge trap for the unwary. I don’t know of any good way to write code that is nonmisleading. See also section 1.1.3.
The int(⋯) function has always been a mess.
This topic is named in honor of the proverbial long straight fence where there are N lintels held up by N+1 posts.
This is relevant in many different computing situations, especially when an interval is divided into N subintervals (not necessarily all the same size) demarcated by N+1 endpoints. In particular, this is relevant to running sums, simple numerical integrals, and simple numerical derivatives. Here are some suggestions:
As a corollary, this means that each lintel is named after the post at the left end of the lintel, i.e. the lessernumbered of the two posts that hold that lintel.
Format all the lintelvalues as subscripts, and all the postvalues as superscripts. This makes it easier to visualize the idea that the lintels fall between the posts. Alas the contrast between subscripts and superscripts is not as striking as one might have hoped, but it’s better than nothing.
Beware that the ancient Romans mostly did not play by these rules. For example, in music, there are only 7 notes per diatonic octave; the eighth note is logically the beginning of the next octave. So the name octave, which suggests eight, is inconsistent with modern notions of how to count.
Similarly, there is no year zero. The year 1 AD immediately follows the year 1 BC, even though this is inconsistent with modern notions of how to construct a number line.
Beware that counting is different for a fence that forms a closed loop. Then you have N lintels supported by only N posts. For example, a circle can be divided into four quadrants, demarcated by four points.
The foregoing applies to arrangements that are effectively onedimensional. In two or more dimensions, e.g. dividing a sphere into octants, things get much more complicated.
Let’s assume you have column vectors, not row vectors. Put the formula in the top cell of the result. Replace V$1 by the top of one of your vectors (two places), and W$1 by the top of the other (two places). Be sure to spell it with a $ sign in front of the rownumber. Then fill down into the remaining cells of the result.
=offset(V$1,mod(1+row(A1)row(A$1),3),0)*offset(W$1,mod(2+row(A1)row(A$1),3),0) offset(V$1,mod(2+row(A1)row(A$1),3),0)*offset(W$1,mod(1+row(A1)row(A$1),3),0)
Even though the formula mentions cell A1, we do not care what (if anything) is there.
In typical spreadsheets, curly braces are used to denote an array constant, for example {1,2,3} is a horizontal array. Similarly, {5;6;7} is a vertical array constant. Continuing down this road, {11,12,13;21,22,23;31,32,33} is a 3×3 matrix constant. The general rule is that commas are used to separate values on any given row, while semicolons are used to separate rows.
Here are some example of how this can be used:
=sum(row($A$1:$A$10)) (not necessarily entered as an array formula) To generate <nval> values starting from <start>: =row(offset($A:$A,start1,0,nval,1))
Additional applications of this idiom can be found below.
Hamming weight =sum(bitand(1,C9/2^(row($A$1:$A$64)1))) Hamming distance =sum(bitand(1,bitxor(C9,C10)/2^(row($A$1:$A$64)1))) parity =bitand(1,sum(bitand(1,C9/2^(row($A$1:$A$64)1))))
String initial match =sum(if(left(C6,row(offset($A:$A,0,0,len(C6),1)))=left(C7,row(offset($A:$A,0,0,len(C6),1))),1,0))
=sum(normdist(K10,AA$11:BD$40,K$4,0)*K$2)
Explanation: The normdist() function ordinarily returns a scalar. If however you pass an array expression (constant or otherwise) as one of its arguments, it returns an array. This is an ephemeral array, not visible on the spreadsheet anywhere. In this example, the sum() collapses the ephemeral array, producing a scalar that fits in a single cell. Remember to enter the expression using <ctrlshiftenter>.
Hint: If you ever enter something as a scalar when you should have entered it as an array, don’t panic. The formula is still there, and can easily be promoted to arrayformula status. Highlight the area that is to be filled by the array, hit the <F2> key to move the focus to the formula bar, and then hit <ctrlshiftenter>.
As we have seen in the examples above, you can do arithmetic involving array constants, for instance by highlighting cells A1:C1, then typing in =10^{1,2,3} and hitting <ctrlshiftenter>. This will put the values 10, 100, 1000 into the three cells.
Obviously you can compute the maximum of these three cells using the expression max(A1:C1), and store this result in a single cell such as A3.
Logic suggests that you can do the same thing on the fly, by entering the combined expression max(10^{1,2,3}) into a single cell such as B3.
However, beware: Due to a bug in current versions of the gnumeric spreadsheet program, and possibly other programs as well, if you try in the normal way, you will get a numericallywrong answer. See reference 3.
As a workaround for this bug, you can get the right answer by entering the formula =max(10^{1,2,3}) as an array formula using <ctrlshiftenter>. There is no good reason why <ctrlshiftenter> should be needed in this situation, because we are entering an ordinary scalarvalued expression, equivalent to the expression in the previous paragraph (which did not require <ctrlshiftenter>).
Suppose you have a region that is 4 rows high and 2 columns wide, and you want to reverse each row and/or each column. Here is one way to do it:
=offset($E$10,4row(A1),2column(A1)) [rows and columns] =offset($E$10,0,2column(A1)) [just columns] =offset($E$10,4row(A1),0) [just rows]
where $e$10 is the upper left corner of the region.
This formula is meant to be filleddown and/or filledacross to cover the destination region; it is not meant to be entered as an array formula.
It would be supernice to be able to express the reverse as an array expression, but I have not found a way to do this. The index() and offest() functions do not appear to tolerate array expressions in their second or third arguments.
You can reverse rows and/or columns using the index() function or the offset() function. For example, if you want a copy of the data in columns I through K, with the columns in reverse order, you can use the following. Note that index() starts counting at 1, whereas offset() starts counting at 0. Also, index requires an array as its first argument, while for present purposes offset requires a single cell as its first argument (since we want each call to offset to yield a single cell).
To reverse columns:
=index($I2:$K2,1,1+column($K2)column(I2))
or equivalently
=offset($I2,0,column($K2)column(I2))
Fill down if there are multiple rows of interest in each column.
Similarly, to reverse rows:
=index(E$6:E$26,1+row(E$26)row(E6),1)
or equivalently
=offset(E$6,row(E$26)row(E6),0)
This turns out to be important in connection with convolutions.
Another important application arises in connection with a band plot such as in figure 1. Ideally it would be possible to make such a plot using the "fill to next series" feature of the spreadsheet program, but in older versions of gnumeric that has bugs. It is better to use the "fill to self" feature. This requires tracing the top of each band in the normal order, then tracing the bottom of each band in reverse order.
Here is an example of a convolution:
npts: 8 =sumproduct(offset($B$10,max(0,row(A1)$C$7),0,min(row(A1),2*$C$7row(A1))),offset($D$10,max(0,$C$7row(A1)),0,min(row(A1),2*$C$7row(A1)))) =offset($C$10,$C$7row(A1),0) label 0 0 0 0 0 1 1 1 1 0 2 1 1 1 1 3 1 1 1 2 4 1 1 1 3 5 1 1 1 4 6 1 1 1 5 7 0 0 0 6 8 5 9 4 10 3 11 2 12 1 13 0 14 0
where cell $c$7 contains the number of points in each vector (npts: 8) and the first vector begins at $b$10 and the second begins at $c$10. The reverse of the second vector begins at $d$10.
Suppose you have a row of data: 1, 2, 3, et cetera. You would like to plot each data item twice. Usually the best way to do this is to put the duplicates in adjacent rows as discussed in section 1.19.
However, sometimes you might want to keep everything in a single row, and just have twice as many cells in the row, so that the row reads 1, 1, 2, 2, 3, 3, et cetera. This situation can easily arise if you want to draw a stairstep function. The vertical risers require you to duplicate the abscissas, and the horizontal treads require you to duplicate the ordinates.
The index function can do this for you:
=index($I11:$W11,1,1+(column(I11)column($I11))/2)
Note that stairsteps arise if you are plotting the cumulative probability for a discrete distribution, such as in figure 2.
The column(M7) function returns the address of the given column; in this example, “M” is column number 13. If however you want the column letter, aka the column name formatted as a letter, you have to work harder. Here’s the recommended expression:
=substitute(address(1,column(M7),4),"1","")
Explanation: The row number (“7”) here is irrelevant. The column(M7) function returns 13, ignoring the row number. The address(...,4) function returns an address, namely the text string “M1”. The “4” here tells it to return a relative address (i.e. without any $ signs). Lastly, we get rid of the “1”.
Here’s one rationale for doing something like this: Suppose you want to format a message that refers to the column in which cell M7 was found. Next, suppose you cutandpaste cell M7 to a new location such as N8. The result of the recommended expression will change accordingly; the new result will be “N”. In other words, the recommended expression is translation invariant – whereas a simple string constant (in this case “M“) would not be.
In some cases there are simpler ways of achieving translation invariance, as discussed in section 1.11.
Use the match(needle,haystack,0) to return a relative row number. A result of 1 corresponds to the first row in the haystack range. In particular, argmax is match(max(haystack),haystack,0) and argmin is match(min(haystack),haystack,0).
The result can then be used in the index(range,relRow,relCol) function to fetch a value. Equivalently, it can be used in the offset function, provided you subtract 1.
To sort numbers in ascending order:
=small($D$4:$D$99,1+row(D4)row(D$4)) <E>
where <E> means you can enter it as a plain old scalar formula (not array formula).
Step 1: To get an array of indices, apply the following to the sort key, to encode the key and the corresponding row number. Suppose the input data starts at D$4 and we want the indices to start at A$4:
=round(percentrank(D$4:D$99,D4,20)*(count(D$4:D$99)1))+(row(A4)row(A$4))/1000 <E>
Format that with three decimal places.
See section 1.14 for a discussion of the percentrank function. Let N denote the number of valid elements. Multiplying the prank by N−1 (and rounding) gives the rank as an integer (with N possible values from zero to N−1 inclusive). We use the integer rank as our sort key. The rest of the expression uses the fraction part to encode the index, to remember where this element was before sorting.
Step 2: Sort the indices. Let’s put the sorted results in column B:
=small(A$4:A$99,1+row(B4)row(B$4)) or =large(A$4:A$99,1+row(B4)row(B$4))
Then extract the row number, which serves as an index:
=round(mod(B4,1)*1000)
Or combine the two previous steps, to calculate the index directly:
=round(mod(small(A$4:A$99,1+row(B4)row(B$4)),1)*1000) =round(mod(large(A$4:A$99,1+row(B4)row(B$4)),1)*1000)
Then you can use offset(...) to fetch all the sorted field(s) you want, not just the sort key field.
=offset(x!D$2,$B2,0) <E>
Fill that across (to pick up all fields) and down (to pick up all records) as required.
You can check for ties by looking at the ascending and descending pranks; they should add up to unity:
=percentrank(D$2:D$6,H2,20) =percentrank(iferror(D$2:D$6,"x"),H2,20)
This makes sense insofar as it is 1/(N−1) less than the prank of the nextlarger element of the haystack, i.e. R_{next} = R_{low} + M/(N−1)
To find the number of valid numeric elements in a region, ignoring strings, empty cells and #ERROR cells, use the following, entered as an array formula:
=count(B15:C17) or equivalently =sum(iferror(B15:C17B15:C17+1,0)) (array formula)
If you want something that does not ignore #ERROR cells, try something like this:
=rank(max(B15:C17),B15:C17,1)+sum(if(B15:C17=max(B15:C17),1,0))1
If there are one or more errors anywhere in the range, the result will be an error. The type of the result error will match one of the input errors.
Sometimes you want to select a subset of the original data, and put it into a new region without gaps. This is called a query. You might want to run multiple queries against the same original data.
Suppose the original data is on page "x", in cells x!e6:j37. That’s a total of N=32 rows. Columns A,B,C, and D are not needed on this page. Optionally, they can be used for some unrelated purpose.
On the next page, use column A to keep a running sum of how many times the selection criterion is met. There has to be a zero in cell A5.
=A5+1e06+(left(x!I6,2)="**")
where the stuff in parentheses is the criterion; it evaluates to 1 whenever the criterion is met, and 0 otherwise. Note the 1e6 which is necessary to make sure the elements in this column are strictly sorted, as required by vlookup(). This is not an array formula. Fill down so there are N rows of this.
As another example, suppose you just want to select rows that contain valid numerical data in column I:
=A5+1e06+(count(x!I6))
Then in the next column, set up sequence numbers, starting with 1 in cell B5. In gnumeric, this column is not needed, but if you want the spreadsheet to be XL compatible it is needed.
Then in the next column, find where each successive match occurs. This returns a 0based index into the original data.
=1+vlookup(row()row(B$6)+0.5,A$5:B$37,2,1)
The second argument to vlookup has the column to be searched and the return value. The next argument says that the return value comes from the second column, which in this case is the array of sequence numbers.
Then in the next column, be defensive about running off the end of the original data region. You can hide most of the rows with "xxx" in them, although it might be smart to leave one of them, to demonstrate that you are not inadvertently hiding any results.
=if($B6<rows(x!$D$6:$D$37),$B6,"xxx")
Finally, the offset() function can be used to grab the desired subset of the data:
=if(count($D6),offset(x!E$6,$D6,0),"")
That can be filled down and across as much as you like.
When plotting, in the simplest case, the ordinate of the plot is a onedimensional spreadsheet region, by which I mean a region that is either N rows by 1 column, or 1 row by N columns. (Ditto for the abscissa, if any.)
However, it is perfectly possible for the ordinate (and/or abscissa) to be a rectangular region, with M rows and N columns. Each row is read out before advancing to the next row. For example, the region A1:C10 contains 30 elements, in the following order A1, B1, C1, A2, ⋯ B10, C10.
It is reasonably straightforward to plot a set of arbitrary line segments (disconnected from each other). Similarly it is reasonably straightforward to plot a set of arbitrary triangles or other polygons.
The basic idea is to plot them as an ordinary “XY lines” plot. To separate one item from the next, put a nonnumeric value in the list of abscissas and/or ordinates. A blank will do, it is more readable if you use a string such as “break” or “X”.
This idea combines nicely with the rectangular regions discussed in section 1.17. To plot N line segments, use three columns of N rows for the abscissas (tail.x, tip.x, and “break”) plus three columns for the ordinates (tail.y, tip.y, and “break”).
Similarly, to plot N quadrilaterals, use six columns of N rows; the first four define the corners, the fifth returns to the starting point to close the last side, and the sixth is the “break”. An example is shown in figure 3.
If you don’t need to close the last side, you can get by with only five columns. For an example, see figure 7 in section 2.1.
For a discrete distribution, such as we get from rolling a die, cumulative probability is a stairstep function. Suppose there are N steps. Depending on details of the desired appearance, this will correspond to either N treads and N+1 risers, or perhaps N+1 treads and N risers. Either way, there is a nice representation using N+1 rows of four columns: two columns for the abscissas, and another two columns for the ordinates. Each row represents one tread: It has two unequal abscissas (the left edge and the right edge of the tread) and two equal ordinates (since the tread is horizontal).
For an example, see section 2.1.
In order to understand what gets plotted in front of what, it helps to know some terminology. However, the terminology for graphical objects is maddening. You might imagine that words like graph, chart, and plot should mean pretty much the same thing ... but in this context they don’t. I write these words in small caps to indicate that they are codewords.
Gnumeric doesn’t even use its own terminology consistently. The popup for customizing a graph and its contents is entitled «Customize Chart» whether or not the graph contains zero, one, or many charts.
Here is a subset of what is going on, even when you try to do something simple. Suppose we are making an XY plot of some data:
In contrast, the size and position of the plots is controlled by properties of the parent chart – not by the axes, and not by the plots themselves. Don’t ask me to explain this.
Now, you might imagine that the drawing routines would recurse through the objects pretty much as shown above. In particular, you might imagine that gnumeric would finish drawing one series before going on to the next. However, that’s not how it actually works. The innermost loops are done “inside out” as explained below. This is completely undocumented, indeed contrary to the documentation.
The actual behavior can be outlined in pseudocode, as follows:
foreach plot in chart.children { foreach drawable in [line, symbol] { foreach series in plot.children { draw(series, drawable) } } }
In English, that means that within each plot, the lines (for all the series) are drawn in order, and then the symbols (for all the series) are drawn in order. All of the lines wind up behind all of the symbols. The stacking of the lines (relative to other lines) is controlled by the order of the series as they appear in the user interface, and similarly for the stacking of the symbols (relative to other symbols). However, within a plot, there is no way to draw a line on top of a symbol ... even if you have one series that uses only lines and another that uses only symbols. No amount of reordering the series within a plot will change this.
This ordering may not be what you want. (An example arises when plotting arrowheads on top of target symbols, as discussed in section 1.21.) In such a case, the only way to make progress is to use more than one plot, and order the plots appropriately. You can achieve quite fine control over the “stacking” – i.e. the order of plotting – by putting each series in its own plot. Use one series per plot, i.e. one plot per series.
For even finer control of the plot order, don’t rely on a single series to draw both lines and symbols. Instead, split it into two series looking at the same data. One series displays the lines, while the other series displays the symbols.
This business of creating more than one plot would be a lot more practical if it were possible to move a series from one plot to another, but evidently that is not possible. The buttons that move a series up and down in plot order get stuck at the boundary between plots.
While we’re on the subject: What would help most of all is the ability to copyandpaste a series ... and also copyandpaste an entire plot (along with all its series). This includes being able to copy a series from one plot and paste it into another. The ability to copyandpaste (not just cutandpaste) has innumerable applications. It is very common to want a new curve that is just like an old curve, with minor modifications.
It is important to distinguish between data space and screen space.
There are some things such as arrowheads that clearly need to be plotted in screen space, not data space. You want the arrowhead shape to be preserved, even if it is rotated by some odd angle. The idea of rotation makes sense in screen space, but violates basic requirements of dimensional analysis in data space, where there is not (in general) any natural metric, i.e. not any natural notion of distance or angle.
In figure 4, the extent of the plot in the x direction is more than fivefold larger than the extent in the y direction. The arrowheads would be grossly distorted if we tried to draw them in data space without appropriate correction factors.
Although typical spreadsheet apps do not provide easy access to screen space, we can work out the required transformations. The first step is to figure out the metric. We know the metric in screen space, and we use this to construct a conversion metric for data space. Let’s denote data space by (x, y) and screen space by (u, v).
In data space, the metric has the trivial, obvious form:

In data space, we assume the conversion metric tensor can be written
in the form:

Equivalently, this idea can be expressed in terms of the element
of arc length:

Equivalently, the conversion can be expressed in terms of the element
of arc length:

It must be emphasized that the arc length ds is the same in both equation 3 and equation 4. It is the real physical arc length on the screen.
You can determine the coefficients a and b as follows: Lay out a rectangle on the chart. Let the edges of the rectangle be (Δu, Δv) in screen space, and (Δx, Δy) in data space. Then the needed coefficients are:
 (5) 
The conversion metric defines a notion of length and angle in data space. Given a line segment in data space, we can construct a tangent unit vector with the same orientation ... where “unit” refers to its length in screen space. We can also define a transverse unit vector, perpendicular to the line segment ... where “perpendicular” refers to its angle in screen space.
It is then straightforward to construct arrowheads and other objects in terms of the tangent unit vector and the transverse unit vector.
If you change the size of the chart or rescale the axes within a chart, you will have to recompute the a and b coefficients in accordance with equation 5. This is a pain in the neck if the axis bounds are being computed automatically.
Before you start plotting arrowheads, be sure to read section 1.20. You may want to use the oneplotperseries construction for your vectors and arrowheads.
The spreadsheet code to do this is cited in reference 4. In this example, the plotting area is square, and the edgelength of the plotting area is taken as the unit of distance in screen space. The size of the arrowheads is 1/20th of this distance. This is of course only an example; any other distance could be used in accordance with equation 5.
There are several ways to see the formula aka expression in a cell (as opposed to the value obtained by evaluating the expression).
I use this a lot. It is useful in pedagogical situations, where you want people to be able to see the formula and the value at the same time. It is also useful as a way to put formulas and/or addresses into a .csv file, which (as the name suggests) nominally only includes values. There are lots of good reasons why you might want formulas and/or addresses in .csv files.
Sometimes it is useful to turn of the “Automatic Recalculation” feature. Examples include
You can manually cause a new calculation at any time by hitting the <F9> key.
The default format that gnumeric uses for storing spreadsheet documents is structured as XML. Often there is gzip compression on top of that, but uncompressing it is supereasy.
(I have no idea what format excel uses.)
It is fairly easy to figure out what the XML means, and then to munge it using a text editor, or using tools such as grep, awk, perl, et cetera.
For example, suppose you want to have a plot with 35 curves on it. Configuring all of those using the pointandclick interface is unpleasant the first time, and even more unpleasant if you want to make a small change and then propagate it to all the clones. Therefore I wrote a little perl program that take one curve (aka Series) as a template and then makes N clones.
The spreadsheet in reference 6 calculates the probabilities for a pair of dice. Look at the “histo2efficient” worksheet.
Figure 7: Two Dice : Probability Histogram + Horizonal Lines  Figure 8: Two Dice : Probability Histogram + Dots  
Note the following:
Therefore it is better to use the approach seen in figure 7 and figure 8. They plotted the histogram as a bunch of quadrilaterals, using the techniques mentioned in section 1.18.
For a continuous probability distribution, the right approach is to sort the data. Then the index (suitably normalized) is the cumulative probability, as a function of the data value. That is to say, there is a riser at every data point, and a tread between each two data points. See section 2.2 and reference 7.
For discrete data, with relatively few categories, sorting is inefficient and unnecessarily complicated.
Figure 9: Gaussian Cumulative Probability Distribution  Figure 10: Gaussian Probability Density Distribution  
Figure 11: Gaussian Cumulative Probability Distribution  Figure 12: Gaussian Probability Density Distribution  
Figure 13: Gaussian Cumulative Probability Distribution  Figure 14: Gaussian Probability Density Distribution  
See reference 7.
For plotting the cumulative probability distributions, the following statement comes in handy:
=if(J8<1,small(offset($E$8,0,0,G$4),1+row($E8)row($E$8)),9e+99)
As for the probabilitydensity histogram:
The support of each bin is a halfopen interval of the form [L, R) such that the bin contains all events such that x is greater than or equal to L and strictly less than R. For continuous data, the chance that any data point falls exactly on a bin boundary is zero, but for discrete data this can become an issue, hence the careful definition of the binboundaries.
If the binwidth gets too small and there are not enough data points, the histogram becomes noisy. At some point a noisy histogram becomes hard to interpret; among other things, the idea of halfwidth at halfmaximum becomes useless.
The spreadsheet function combin(n, k) returns the binomial coefficient “n choose k”. For example:
 (6) 
 (7) 
Here is a statement that produces a random integer according to a binomial distribution. In this example, there are 60 trials. (The number of trials is built into the statement.) The first argument to the binomdist function is the outcome. Here we use an array expression, using the rangeofnumbers idiom mentioned in section 1.4. The second argument is the number of trials. The third argument is the probability of success per trial, which we assume is stored in cell $D$3. We use the sum function to count how many times the probability of a given outcome was less than the threshold, where the threshold is some random number stored in cell G7.
=sum(if(binomdist(row($A$1:$A$61)1,60,$D$3,1)<G7,1,0))
Suppose we use that formula 200 times, using a different random threshold each time. The results are shown in the following figures. See reference 8 for details.