In much of my work, a data summary looks like a spreadsheet, with column headers like StudentID, Cohort, Year, GPA, Retained, and other potentially interesting data points. Hadley Wickham calls this arrangement tidy data: the cases are rows and the attributes of each case are columns. When the attributes are all scalars (an ID number, for example, is not a scalar, because it isn't meaningful to take half an ID or add two of them together), we can abstract the tidy data set as a matrix.
Typically there are a lot more cases than descriptive elements, so the matrix is long and relatively narrow, with say \(N\) rows and \(p\) columns. As a tiny example, consider this made-up matrix of grade averages and credits earned for three students.
$$ X = \begin{bmatrix}
3.25 & 12 \\
3.4 & 15 \\
3.0 & 12
\end{bmatrix} $$
We often want to do a regression analysis or compute a correlation matrix on data like this, which involves a kind of matrix self-multiplication in both cases. Understanding what's going on with that process helps to understand those types of analysis.
"Squaring" a Matrix
The question I want to explore is: what does it mean to square that matrix? Squaring means multiplying something by itself, and for real numbers the result is always zero or positive. But matrices can only be multiplied by themselves if they are square, because the number of columns in the left matrix of \(AB\) must equal the number of rows in the right matrix. When I first encountered matrix multiplication as a freshman at SIUC, I thought it was a dumb way to treat an array of numbers; why not just multiply them element-wise or something? I had a habit back then of assuming I was a lot wiser than I actually was (as it turns out). But I memorized the method as a -1 arrangement, where each row of the left matrix (the - is the row) gets "dotted" with each column of the right matrix (the 1 is the column). A dot product just matches up the elements in pairs, multiplies each pair, and sums the results.
So we can't just multiply \(XX\) with the example above, because the dimensions don't work. We'd be applying a row with two elements to a column with three. But we can always make this work if we turn the left copy of \(X\) on its side first, which is called the transpose. Algebraically we write it as \(X^tX\), and it looks like this:
$$ X^tX = \begin{bmatrix}
3.25 & 3.4 & 3.0 \\
12 & 15 & 12
\end{bmatrix}
\begin{bmatrix}
3.25 & 12 \\
3.4 & 15 \\
3.0 & 12
\end{bmatrix} $$
Now we can happily apply the -1 formula of rows-times-columns, doing a three-term dot product for each entry, to end up with a new matrix that has two rows and two columns, rounded to:
$$ X^tX = \begin{bmatrix}
31.1 & 126 \\
126 & 513
\end{bmatrix} $$
The top left element is \(3.25^2 + 3.4^2 + 3.0^2\), since it's the first column of \(X\) dotted with itself. This is a kind of "square" of the original \(X\) matrix (there are competing definitions for squaring a matrix when its row and column dimensions are the same). We might expect \(X^tX\) to behave like a squared number in some ways, but it's not immediately clear what those ways are.
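As a quick numerical sanity check of that arithmetic, here is a short numpy sketch (numpy isn't part of the post itself; the array literal is just the toy matrix from above):

```python
import numpy as np

# toy data: rows are students, columns are GPA and credits earned
X = np.array([[3.25, 12.0],
              [3.40, 15.0],
              [3.00, 12.0]])

XtX = X.T @ X              # the "square" of X
print(XtX)                 # [[ 31.1225 126.    ]
                           #  [126.     513.    ]]

# the top-left element is the first column of X dotted with itself
print(X[:, 0] @ X[:, 0])   # 31.1225 = 3.25^2 + 3.4^2 + 3.0^2
```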
Properties of \(X^tX\)
We can convert the new square matrix \(X^tX\) into a single number by using a vector \(y\) (i.e. a matrix with one column) to pre- and post-multiply like this:
$$y^tX^tXy$$
For this to work, \(y\) must be length \(p\), and it helps to think of the multiplication in two groups, like this:
$$(y^tX^t)(Xy) = (Xy)^t(Xy)$$
where the second version uses the definition of transpose to rewrite the expression so that we can see it's the same thing in parentheses, namely \(Xy\). Since \(X\) has \(N\) rows and \(p\) columns, and \(y\) has one column and \(p\) rows, the result of multiplying the two is a column matrix with \(N\) elements. Its transpose cousin is a row matrix with the same \(N\) elements. When we dot-product these, each element of \(Xy\) gets multiplied by itself and then summed. Therefore \((Xy)^t(Xy) \ge 0\), and the only way the sum can be zero is if all the elements of \(Xy\) are zero.
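Here is a small numpy sketch of that identity; the vector `y` below is an arbitrary made-up choice, not part of the original example:

```python
import numpy as np

X = np.array([[3.25, 12.0],
              [3.40, 15.0],
              [3.00, 12.0]])
y = np.array([2.0, -0.5])        # any length-p vector works here

quad_form = y @ (X.T @ X) @ y    # y^t X^t X y
Xy = X @ y
sum_of_squares = Xy @ Xy         # (Xy)^t (Xy): a sum of squared elements

print(quad_form, sum_of_squares) # the same number, and never negative
```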
The range of results we can get from a matrix multiplication like \(Xy\) is well-explored territory in linear algebra. If the columns of \(X\) are sufficiently different, in a technical sense, then it's impossible to add up a nontrivial combination of them to get the zero vector. In that case, we say that \(X\) has rank \(p\) (the number of columns). I will assume that is the case here.
If the rank of \(X\) is less than the number of columns in a regression problem, say, it means that one of the explanatory variables is redundant and can be eliminated.
We have sketched a proof that if \(X\) has rank \(p\), then the scalar \((Xy)^t(Xy)\) must be strictly positive, as long as \(y\) isn't the zero vector itself. This property of \(X^tX\) is called positive definiteness.
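To make the rank condition concrete, here's a small numpy illustration (the rank-deficient matrix `X_bad` is made up for contrast; it is not part of the example above):

```python
import numpy as np

X = np.array([[3.25, 12.0],
              [3.40, 15.0],
              [3.00, 12.0]])
print(np.linalg.matrix_rank(X))   # 2: full column rank, so Xy = 0 only if y = 0

# contrast: the second column is twice the first, so the columns are redundant
X_bad = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])
y = np.array([2.0, -1.0])         # a nonzero y that combines the columns to zero
print(X_bad @ y)                  # [0. 0. 0.], so (X_bad y)^t (X_bad y) = 0
```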
Block Multiplication
The multiplication \((Xy)^t(Xy)\) gives us the result, but it's unsatisfying. It doesn't tell us why the result will be positive, only that it is so. For more intuition, we can unpack matrix multiplication. The -1 formula I learned in Introduction to Linear Algebra can be written out as a sum, when we multiply matrices \(A\) and \(B\) to get \(C\):
$$ c_{ij} = \sum_{k = 1}^{N} a_{ik} b_{kj} $$
where \(c_{ij} \) is the element of \(C\) at the \(i\)th row and \(j\)th column (matrix indices are always row-column by convention). For this to work, the rows of \(A\) and the columns of \(B\) must both have length \(N\), so that the index \(k\) runs over matching elements.
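Written as code, the sum formula is just a set of nested loops; this is a sketch for illustration, not how you'd multiply matrices in practice:

```python
import numpy as np

def matmul_by_sums(A, B):
    """Multiply A (m x N) by B (N x n) one dot product at a time."""
    m, N = A.shape
    N2, n = B.shape
    assert N == N2, "inner dimensions must match"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(N))
    return C

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
print(np.allclose(matmul_by_sums(A, B), A @ B))   # True
```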
When we multiply \(X^tX\) using the sum formula we get
$$ (X^tX)_{ij} = \sum_{k = 1}^{N} x_{ki} x_{kj} $$
where the main difference is that the row and column indices are exchanged on the first factor because of the transposition.
The sum formula shows that the elements are dot products, but it otherwise doesn't bring clarity. Instead of diving into sums, we can abstract the process at an intermediate level. Think of \(X\) as a row of columns, \(X = [c_1, \dots , c_p]\), so that
$$ X^tX = \begin{bmatrix}
c^t_1 \\
\vdots \\
c^t_p
\end{bmatrix} \begin{bmatrix}
c_1 & \dots & c_p
\end{bmatrix} = \begin{bmatrix}
c^t_1 c_1 & \dots & c^t_1 c_p \\
\vdots & & \vdots\\
c^t_p c_1 & \dots & c^t_p c_p
\end{bmatrix}$$
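A quick numerical check of this column-wise view (a sketch; `G` is just a throwaway name for the matrix of column dot products):

```python
import numpy as np

X = np.array([[3.25, 12.0],
              [3.40, 15.0],
              [3.00, 12.0]])
p = X.shape[1]

# build X^t X entry by entry from dot products of columns of X
G = np.array([[X[:, i] @ X[:, j] for j in range(p)] for i in range(p)])
print(np.allclose(G, X.T @ X))   # True
```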
Now it's clear that each element comes from a dot product of columns of the original \(X\), but this doesn't shed any light on the positive-definite property. Let's try it the other direction, breaking \(X\) into rows instead of columns, i.e.
$$ X = \begin{bmatrix}
r_1 \\
\vdots \\
r_N
\end{bmatrix} $$
Now we have
$$ X^tX = \begin{bmatrix}
r^t_1 & \dots & r^t_N
\end{bmatrix} \begin{bmatrix}
r_1 \\
\vdots \\
r_N
\end{bmatrix} = r^t_1 r_1 + \dots + r^t_N r_N $$
Each of the terms on the right is an outer product, where every element of \(r_i\) gets multiplied by every element of \(r_i\), producing a \(p \times p\) matrix.
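The row-wise view is just as easy to check numerically; here is a numpy sketch that sums the outer products of the rows:

```python
import numpy as np

X = np.array([[3.25, 12.0],
              [3.40, 15.0],
              [3.00, 12.0]])

# each np.outer(r, r) is a p x p matrix; their sum over the N rows is X^t X
S = sum(np.outer(r, r) for r in X)
print(np.allclose(S, X.T @ X))   # True
```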
Example
Using the example above, the first row is [3.25 12], and its outer product with itself is
$$ \begin{bmatrix} 3.25 \\ 12 \end{bmatrix} \begin{bmatrix} 3.25 & 12 \end{bmatrix} = \begin{bmatrix}
3.25^2 & (3.25)(12) \\
(12)(3.25) & 12^2
\end{bmatrix} $$
and these are the same terms we get by squaring the binomial \( (3.25 + 12)(3.25 + 12) = (3.25 + 12)^2\). So if we sum all those terms, we must always get a non-negative number. This property is true for each of the \(p \times p\) matrices in the sum \(r^t_1 r_1 + \dots + r^t_N r_N\).
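The numbers are easy to verify with a one-row numpy sketch:

```python
import numpy as np

r = np.array([3.25, 12.0])
M = np.outer(r, r)               # the outer product shown above
print(M.sum(), (3.25 + 12)**2)   # both 232.5625: the binomial square
```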
Consequences
Since each matrix in the outer-product version of the multiplication \( X^tX = r^t_1 r_1 + \dots + r^t_N r_N \) comprises the terms of a multinomial square, like \( (3.25 + 12)(3.25 + 12) = (3.25 + 12)^2\) in the example, we can conclude:
The sum of the matrix elements of each \( r^t_i r_i \) equals the square of the sum of that row's entries, so it is non-negative, and it can only be zero if the row's entries sum to zero. Therefore the sum of the elements of \( X^tX \) is non-negative, and positive unless every row of \( X \) sums to zero.
The sum of elements of a matrix \( A \) can be written as a matrix multiply with
$$ \sum_{i,j} a_{ij} = 1^t A 1 $$
where \( 1 \) is a vector of all 1s of suitable dimension. When we pre- and post-multiply \( X^tX \) by an arbitrary vector \( y \) to get \( (Xy)^t(Xy) \), there are two distinct operations inside the multiplication \( Xy \). First, the columns of \( X \) are scaled by the elements of \( y \): the first column of \( X \) gets multiplied by the first element of \( y \), and so on. Then the scaled columns are summed across each row to obtain the new matrix (vector) element for that row. This is just a description of the dot product, but thinking of it as two operations is helpful. Imagine that the function \( S_y \) is the scaling part of the operation just described. Then
$$ Xy = S_y(X) 1 $$
That is, we scale by the elements of \( y \) then multiply by the column of 1s to do the addition, breaking up the usual matrix multiplication into two parts. Then we have
$$ (Xy)^t(Xy) = (S_y(X) 1)^t (S_y(X) 1) = 1^t S_y(X) ^t S_y(X) 1 $$
This arrangement lets us observe that \( S_y(X)^t S_y(X) \) is just another matrix of the same "squared" form as \( X^tX \), so by the argument above the sum of its elements is non-negative, and zero only if every row of \( S_y(X) \) sums to zero. But the row sums of \( S_y(X) \) are exactly the elements of \( Xy \), and we're assuming \( X \) has full rank, so \( Xy \) can't be the zero vector unless \( y \) is all zeros. Multiplying on both sides by the vectors of 1s adds up the elements, which therefore gives a positive number.
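Here is a numpy sketch of that decomposition, where the array `S_y` plays the role of the scaling step \( S_y(X) \), `ones` is the vector of 1s, and `y` is again an arbitrary made-up choice:

```python
import numpy as np

X = np.array([[3.25, 12.0],
              [3.40, 15.0],
              [3.00, 12.0]])
y = np.array([2.0, -0.5])
ones = np.ones(X.shape[1])

S_y = X * y   # scale each column of X by the matching element of y

# Xy = S_y(X) 1: scaling, then summing across each row
print(np.allclose(S_y @ ones, X @ y))   # True

# 1^t S_y(X)^t S_y(X) 1 equals the quadratic form (Xy)^t (Xy)
lhs = ones @ (S_y.T @ S_y) @ ones
rhs = (X @ y) @ (X @ y)
print(np.isclose(lhs, rhs))             # True
```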
Conclusions
For a "squared" matrix of the form \( X^tX \), the property of positive definiteness stems from the fact that the elements of an outer product of a vector with itself must sum to a non-negative number, since those elements are the same ones we get from squaring a multi-nomial. The consequence is that the sum of matrix elements is positive, and hence the "quadratic form" \( (Xy)^t(Xy) \) is a positive scalar as long as \( y \) isn't zero and \( X \) has full rank.
A couple of extensions of the outer-product sum idea that I won't detail here:
- it's easy to prove that \( \mathrm{trace}(AB) = \mathrm{trace}(BA) \)
- when adding new rows to a data set, we can efficiently compute the change to the covariance matrix or regression coefficients
Update: Here's a nice treatment of matrix factorizations using diagrams.