Friday, September 10, 2021

trace(AB) = trace(BA)

 Last time I showed how we can think of matrix multiplication as (1) a matrix of inner products, or (2) a sum of outer products, and used the latter to unpack the "positive definite" part of the product \(X^tX\). At the end I mentioned that the outer-product version of matrix multiplication makes it easy to show that  trace \(AB\) = trace \( BA \). 

In the late 1980s I took a final exam that asked for the proof, and totally blanked. I mucked around with the sums for a while, and then time ran out. A little embarrassing, but also an indication that I hadn't really grasped the concepts. One can get quite a ways in math by just pushing symbols around, without developing intuition. 

Some Sums

The matrix multiplication \(AB = C\) is usually defined as an array of inner products, so the element in the \(i\)th row and \(j\)th column of the product is
$$ c_{ij} = \sum_{k=1}^q a_{ik} b_{kj} $$
where \(q\) is the number of columns of \(A\), which must be the same as the number of rows of \(B\) or else the multiplication isn't defined. 

The trace function just adds up the diagonal of a matrix (usually assumed to be square, although that isn't strictly necessary). So in the product \(AB = C\), we only care about the diagonal as far as the trace is concerned. That subset of elements of \(AB = C\) is 
$$ c_{ii} = \sum_{k=1}^q a_{ik} b_{ki}. $$
If the matrices swap positions \(BA = D\), then the diagonal stripe elements are
$$ d_{ii} = \sum_{k=1}^p b_{ik} a_{ki}, $$
where  \(p\) is the number of columns of \(B\), which is the same as the number of rows of \(A\). 

The trace is the sum of these diagonal elements. Note that \(AB\) is \(p \times p\) while \(BA\) is \(q \times q\), so the two products generally have different sizes, but both are square. If trace \(AB\) = trace \( BA \), then we must have
$$\sum_{i=1}^p \sum_{k=1}^q a_{ik} b_{ki} = \sum_{i=1}^q \sum_{k=1}^p b_{ik} a_{ki}.$$
To see that this is true, we just need to swap the names of the counting indices \(i\) and \(k\) on one side of the equation. These names are arbitrary, and afterwards the two expressions are identical. 
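Just to be sure, here is a quick numerical check (a minimal NumPy sketch; the sizes 3 and 5 and the random seed are arbitrary) that the two double sums agree for rectangular matrices:

import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 5                       # A is p x q, B is q x p
A = rng.normal(size=(p, q))
B = rng.normal(size=(q, p))

# trace(AB): sum over the p diagonal elements of AB
trace_AB = sum(A[i, k] * B[k, i] for i in range(p) for k in range(q))
# trace(BA): sum over the q diagonal elements of BA
trace_BA = sum(B[i, k] * A[k, i] for i in range(q) for k in range(p))

print(np.isclose(trace_AB, np.trace(A @ B)))   # True
print(np.isclose(trace_BA, np.trace(B @ A)))   # True
print(np.isclose(trace_AB, trace_BA))          # True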

This index-swapping argument, however, is not very enlightening to me. The problem is that working with inner products forces us to get into the weeds, mucking around with sums. I don't want to do that.

Outer Products

Let's abstract the matrices to think of them as organizations of vectors (one-dimensional matrices). By convention, these are column matrices, so the outer product form is
$$ AB = \begin{bmatrix} a_1 & \dots & a_q \end{bmatrix} \begin{bmatrix} b_1^t \\ \vdots \\ b_q^t \end{bmatrix} = a_1 b_1^t  + \dots + a_q b_q^t $$
 
where \(A\) is expressed as columns and \(B\) as rows (hence the transpose on each vector). Each term on the right is a \(p \times p \) matrix resulting from the outer product of a column of \(A\) and a row of \(B\). The trace takes the diagonal elements of these square matrices, and since the trace of a sum is the sum of the traces, we get a sum of terms, with the first one (the first column of \(A\) times the first row of \(B\)) expressed as
$$\text{tr} (a_1 b_1^t) = \text{tr} \begin{bmatrix} a_{11} b_{11} & \dots & a_{11} b_{1p} \\ \vdots & & \vdots \\
a_{1p} b_{11} & \dots & a_{1p} b_{1p} \end{bmatrix} = a_{11} b_{11} + a_{12} b_{12} + \dots + a_{1p} b_{1p} =
a_1^t b_1.$$
I'm abusing the notation for convenience. The one-subscript variables are column matrices, and the two-subscript versions are individual elements. The equations above illuminate the key property of the trace function: it converts outer products to inner products: \( \text{tr}(a_1 b_1^t  + \dots + a_q b_q^t) = a_1^t b_1  + \dots + a_q^t b_q \). 
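Here's a small numerical illustration of that conversion (a sketch with arbitrary random matrices): the trace of a single outer product is the matching inner product, and adding up the inner products over all the column/row pairs recovers tr \(AB\).

import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 5
A = rng.normal(size=(p, q))       # columns a_1, ..., a_q, each of length p
B = rng.normal(size=(q, p))       # rows b_1^t, ..., b_q^t, each of length p

# the trace turns an outer product into an inner product
a1, b1 = A[:, 0], B[0, :]
print(np.isclose(np.trace(np.outer(a1, b1)), a1 @ b1))    # True

# summing the inner products over all column/row pairs gives tr(AB)
inner_sum = sum(A[:, k] @ B[k, :] for k in range(q))
print(np.isclose(inner_sum, np.trace(A @ B)))             # True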
 
Commuting the original matrix product gives the inner product form directly.
$$\text{tr} \left( \begin{bmatrix} b_1^t \\ \vdots \\ b_q^t \end{bmatrix} \begin{bmatrix} a_1 & \dots & a_q \end{bmatrix} \right) = b_1^t a_1 + \dots + b_q^t a_q $$
 
Since for real vectors the inner product is symmetric, i.e. \( x^ty = y^tx \), the sums for \( \text{tr} AB \) and \( \text{tr} BA \) are equal.
 

An Application

The trace has interesting properties, such as being the sum of the eigenvalues of a square matrix, but its main character is that it converts outer products to inner products. Or the reverse, which can be a useful analytical trick. The proof below comes from page 95 of Linear Models in Statistics by Alvin C. Rencher.
 
A clever trick in this proof is "expanding" the scalar result of an inner product into a whole matrix, which is legitimate because the trace of that matrix equals the original scalar. Then the properties of the matrix are used to reduce the expression to the given form. It was difficult for me to understand what was going on here before developing the intuition about the trace's inner/outer product conversion.
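A standard example of this kind of argument, sketched from memory rather than copied from Rencher, is the expected value of a quadratic form. If \(y\) is a random vector with mean \(\mu\) and covariance \(\Sigma\), and \(A\) is a constant matrix, then

$$\begin{aligned}
E(y^tAy) &= E\left[\text{tr}(y^tAy)\right] && \text{(a scalar is its own trace)} \\
&= E\left[\text{tr}(Ayy^t)\right] && \text{(because tr}\,AB = \text{tr}\,BA\text{)} \\
&= \text{tr}\left[A\,E(yy^t)\right] && \text{(trace and expectation are linear)} \\
&= \text{tr}\left[A(\Sigma + \mu\mu^t)\right] \\
&= \text{tr}(A\Sigma) + \mu^tA\mu.
\end{aligned}$$

The scalar \(y^tAy\) gets expanded into the matrix \(Ayy^t\) by way of the trace, the expectation slides inside because the trace is linear, and the identity \(E(yy^t) = \Sigma + \mu\mu^t\) collapses everything back down to a trace plus an ordinary quadratic term.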


Wednesday, September 08, 2021

Positive Definite Matrices

In much of my work, a data summary looks like a spreadsheet, with column headers like StudentID, Cohort, Year, GPA, Retained, and other potentially interesting data points. Hadley Wickham calls this arrangement tidy data, where the cases are in rows and the attributes of each case are columns. When these are all scalars (e.g. an ID number is not a scalar, because it isn't meaningful to take half the ID or add two of them) we can abstract this tidy data set as a matrix. 

Typically there are a lot more cases than descriptive elements, so the matrix is long and relatively narrow, with say \(N\) rows and \(p\) columns. As a tiny example, consider this made-up matrix of grade averages and credits earned for three students.

$$ X = \begin{bmatrix}
3.25 & 12 \\
3.4 & 15 \\
3.0 & 12
\end{bmatrix} $$

We often want to do a regression analysis or compute a correlation matrix on data like this, which involves a kind of matrix self-multiplication in both cases. Understanding what's going on with that process helps to understand those types of analysis.

"Squaring" a Matrix

The question I want to explore is what it means to square that matrix. Squaring means to multiply something by itself, and for real numbers the result is always zero or positive. But for matrices, we can only multiply them by themselves if they are square, because the number of columns in the left matrix of \(AB\) must be the same as the number of rows in the right matrix. When I first encountered matrix multiplication as a freshman at SIUC, I thought what a dumb way this was to treat an array of numbers; why not just multiply them element-wise or something? I had a habit back then of assuming I was a lot wiser than I actually was (as it turns out). But I memorized the method as a -| arrangement, where each row of the left matrix (the - is the row) gets "dotted" with each column of the right matrix (the | is the column). A dot product just matches up each element, multiplies the pairs, and sums. 

So we can't just multiply \(XX\) with the example above, because the dimensions don't work. We'd be applying a row with two elements to a column with three. But we can always make this work if we turn the left version of X on its side first, which is called the transpose. Algebraically we write it as \(X^tX\), and it looks like this:

$$ X^tX = \begin{bmatrix}
3.25 & 3.4 & 3.0  \\
12 & 15 & 12
\end{bmatrix}
\begin{bmatrix}
3.25 & 12 \\
3.4 & 15 \\
3.0 & 12
\end{bmatrix} $$

Now we can happily apply the -| formula of rows-times-columns and do three multiplications for each entry, to end up with a new matrix that has two rows and two columns, rounded to:

$$ X^tX = \begin{bmatrix}
31.1 & 126 \\
126 & 513
\end{bmatrix} $$

The top left element is \(3.25^2 + 3.4^2 + 3.0^2\), since it's the first column of \(X\) dotted with itself. This is a kind of "square" of the original \(X\) matrix (there are competing definitions for squaring a matrix when its row and column dimensions are the same). We might expect that \(X^tX\) would behave like a squared number in some ways, but it's not immediately clear what those ways are.
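Here's that arithmetic checked in code (a minimal NumPy sketch of the example):

import numpy as np

X = np.array([[3.25, 12.0],
              [3.40, 15.0],
              [3.00, 12.0]])

XtX = X.T @ X
print(np.round(XtX, 1))              # [[ 31.1 126. ]
                                     #  [126.  513. ]]

# the top-left entry really is the first column dotted with itself
print(3.25**2 + 3.40**2 + 3.00**2)   # 31.1225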

Properties of \(X^tX\)

We can convert the new square matrix \(X^tX\) into a single number by using a vector \(y\) (i.e. a matrix with one column) to pre- and post-multiply like this:

$$y^tX^tXy$$

For this to work, \(y\) must be length \(p\), and it helps to think of the multiplication in two groups, like this:

$$(y^tX^t)(Xy) = (Xy)^t(Xy)$$

where the second version uses the definition of transpose to rewrite the expression so that we can see it's the same thing in parentheses, namely \(Xy\). Since \(X\) has \(N\) rows and \(p\) columns, and \(y\) has one column and \(p\) rows, the result of multiplying these two is a column matrix with \(N\) elements. Its transpose cousin will be a row matrix with \(N\) elements, otherwise identical. When we dot-product these, each element of \(Xy\) gets multiplied by itself and then summed. Therefore \((Xy)^t(Xy) \ge 0\), and the only way the sum can be zero is if all the elements of \(Xy\) are zero. 

The range of results we can get from a matrix multiplication like \(Xy\) is a well-explored subject in linear algebra. If the columns of \(X\) are sufficiently different, in the technical sense of being linearly independent, then no combination of them with weights that aren't all zero can add up to the zero vector. In that case, we'd say that \(X\) has rank \(p\) (the number of columns). I will assume that is the case here. 

If the rank of \(X\) is less than the number of columns in a regression problem, say, it means that one of the explanatory variables is redundant and can be eliminated. 

We have sketched a proof that if \(X\) has rank \(p\), then when we multiply out \((Xy)^t(Xy)\) to get a scalar, that number must be strictly positive, as long as \(y\) isn't the zero vector itself. This property of \(X^tX\) is called positive definiteness.
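A numerical sketch of this claim, using the example matrix from above (its two columns aren't multiples of each other, so its rank is 2) and a few arbitrary random vectors:

import numpy as np

X = np.array([[3.25, 12.0],
              [3.40, 15.0],
              [3.00, 12.0]])
XtX = X.T @ X
print(np.linalg.matrix_rank(X))       # 2, i.e. full column rank

rng = np.random.default_rng(2)
for _ in range(5):
    y = rng.normal(size=2)            # an arbitrary (almost surely nonzero) vector
    quad = y @ XtX @ y                # y^t X^t X y
    print(quad > 0, np.isclose(quad, (X @ y) @ (X @ y)))   # True True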

Block Multiplication

The multiplication \((Xy)^t(Xy)\) gives us the result, but it's unsatisfying. It doesn't tell us why the result will be positive, only that it is so. For more intuition, we can unpack matrix multiplication. The -| formula I learned in Introduction to Linear Algebra can be written out as a sum, when we multiply matrices \(A\) and \(B\) to get \(C\):

$$ c_{ij} = \sum_{k = 1}^{N} a_{ik} b_{kj} $$

where \(c_{ij} \) is the element of \(C\) at the \(i\)th row and \(j\)th column (matrix indices are always row-column by convention). For this to work, the rows of \(A\) and the columns of \(B\) must both have length \(N\); that is, \(A\) has \(N\) columns and \(B\) has \(N\) rows.
 
When we multiply \(X^tX\) using the sum formula we get

$$ (X^tX)_{ij} = \sum_{k = 1}^{N} x_{ki} x_{kj} $$

where the main difference is that the row and column indices are exchanged on the first factor because of the transposition. 
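In code, that element-by-element formula looks like this (a sketch that just re-derives each entry of \(X^tX\) as a dot product of two columns of the example matrix):

import numpy as np

X = np.array([[3.25, 12.0],
              [3.40, 15.0],
              [3.00, 12.0]])
N, p = X.shape

# (X^t X)_{ij} = sum over k of x_{ki} x_{kj}
by_hand = np.array([[sum(X[k, i] * X[k, j] for k in range(N))
                     for j in range(p)]
                    for i in range(p)])
print(np.allclose(by_hand, X.T @ X))   # True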
 
The sum formula shows that the elements are dot products, but it otherwise doesn't bring clarity. Instead of diving into sums, we can abstract the process at an intermediate level. Think of \(X\) as a row of columns, \(X = [c_1, \dots , c_p]\), so that

$$ X^tX = \begin{bmatrix}
c^t_1 \\
\vdots \\
c^t_p
\end{bmatrix}  \begin{bmatrix}
c_1 & \dots & c_p
\end{bmatrix} = \begin{bmatrix}
c^t_1 c_1 & \dots & c^t_1 c_p \\
\vdots & & \vdots\\
c^t_p c_1 & \dots & c^t_p c_p
\end{bmatrix}$$

 Now it's clear that each element comes from a dot product of columns of the original \(X\), but this doesn't shed any light on the positive-definite property. Let's try it the other direction, breaking \(X\) into rows instead of columns, i.e. 

$$ X  = \begin{bmatrix}
r_1 \\
\vdots \\
r_N
\end{bmatrix} $$

Now we have 

 $$ X^tX = \begin{bmatrix}
r^t_1 & \dots & r^t_N
\end{bmatrix} \begin{bmatrix}
r_1 \\
\vdots \\
r_N
\end{bmatrix} = r^t_1 r_1 + \dots + r^t_N r_N $$

Each of the terms on the right is an outer product, in which each element of \(r_i\) gets multiplied by every element of \(r_i\) (including itself) to fill out a \(p \times p\) matrix. 
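Both decompositions are easy to verify numerically (a sketch using the example matrix): the column view gives \(X^tX\) as a \(p \times p\) table of dot products, and the row view gives it as a sum of \(N\) outer products, each \(p \times p\).

import numpy as np

X = np.array([[3.25, 12.0],
              [3.40, 15.0],
              [3.00, 12.0]])
N, p = X.shape

# column view: entry (i, j) is the dot product of columns i and j
gram = np.array([[X[:, i] @ X[:, j] for j in range(p)] for i in range(p)])

# row view: a sum of N outer products, each p x p
outer_sum = sum(np.outer(X[k, :], X[k, :]) for k in range(N))

print(np.allclose(gram, X.T @ X))       # True
print(np.allclose(outer_sum, X.T @ X))  # True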

Example

Using the example above, the first row is [3.25 12], and its outer product is
 
$$ \begin{bmatrix} 3.25 \\ 12 \end{bmatrix} \begin{bmatrix} 3.25 & 12 \end{bmatrix} = \begin{bmatrix}
3.25^2 & (3.25)(12) \\
(12)(3.25) & 12^2
\end{bmatrix} $$
 
and these are the same terms we get by squaring the binomial \( (3.25 + 12)(3.25 + 12) = (3.25 + 12)^2\). So if we sum all those terms, we must always get a non-negative number. This property holds for each of the \(p \times p\) matrices in the sum \(r^t_1 r_1 + \dots + r^t_N r_N\). 
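The numbers check out (a tiny sketch): the four elements of that outer product add up to the same value as \((3.25 + 12)^2\).

import numpy as np

r = np.array([3.25, 12.0])        # the first row of X
outer = np.outer(r, r)
print(outer.sum())                # 232.5625
print((3.25 + 12.0) ** 2)         # 232.5625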

Consequences

Since each matrix in the outer-product version of matrix multiplication \( X^tX = r^t_1 r_1 + \dots + r^t_N r_N \) comprises the terms of a multinomial square like \( (3.25 + 12)(3.25 + 12) = (3.25 + 12)^2\) in the example, we can conclude:

The sum of the matrix elements of each \( r^t_i r_i \) equals the square of the sum of the elements of \( r_i \), so it is non-negative, and it can only be zero when that row sum is zero. Therefore the sum of the elements of \( X^tX \) is non-negative, and it is zero only if every row of \( X \) sums to zero. 

 The sum of elements of a matrix \( A \) can be written as a matrix multiply with 

$$ \sum_{i,j} a_{ij} = 1^t A 1 $$

where \( 1 \) is a vector of all 1s of suitable dimension. When we pre- and post-multiply \( X^tX \) by an arbitrary vector \( y \) to get \( (Xy)^t(Xy) \), there are two distinct operations in the multiplication \( X y \). First, the columns of \( X \) are scaled by the elements of \( y \), so the first column of \( X \) gets multiplied by the first element of \( y \) and so on. Then the scaled columns are summed across each row to obtain the new matrix (vector) element for that row. This is just a description of the dot product, but thinking of it as two operations is helpful. Imagine that the function \(S_y \) is the scaling part of the operation just described. Then 

$$ Xy = S_y(X) 1 $$

That is, we scale by the elements of \( y \) then multiply by the column of 1s to do the addition, breaking up the usual matrix multiplication into two parts. Then we have

$$ (Xy)^t(Xy) = (S_y(X) 1)^t (S_y(X) 1) = 1^t S_y(X) ^t S_y(X) 1 $$

This arrangement lets us observe that \( S_y(X)^t S_y(X) \) is just another matrix of the same "transpose times itself" form, so the sum of its elements is the total of the squared row sums of \( S_y(X) \). Those row sums are exactly the elements of \( Xy \), and recall that we're assuming \( X \) has full rank, so \( Xy \) can't be the zero vector unless \( y \) is all zeros. Multiplying by the vectors of 1s adds up the elements of \( S_y(X)^t S_y(X) \), and that sum is \( (Xy)^t(Xy) \), which we just argued must be positive. 
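Here's the whole decomposition in code (a sketch; the helper name scale_columns is mine, standing in for \(S_y\), and the test vector \(y\) is arbitrary):

import numpy as np

X = np.array([[3.25, 12.0],
              [3.40, 15.0],
              [3.00, 12.0]])
y = np.array([2.0, -0.5])             # an arbitrary nonzero test vector

def scale_columns(X, y):
    """S_y(X): scale column j of X by element j of y."""
    return X * y                      # NumPy broadcasting scales each column

S = scale_columns(X, y)
ones_p = np.ones(X.shape[1])          # the vector of 1s that does the summing

# scaling then summing across rows is the same as the ordinary product Xy
print(np.allclose(X @ y, S @ ones_p))                        # True

# the quadratic form (Xy)^t(Xy) is the sum of all elements of S^t S
M = S.T @ S
print(np.isclose((X @ y) @ (X @ y), ones_p @ M @ ones_p))    # True
print(ones_p @ M @ ones_p > 0)                               # True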

Conclusions

For a "squared" matrix of the form \( X^tX \), the property of positive definiteness stems from the fact that the elements of an outer product of a vector with itself sum to a non-negative number, since those elements are the same ones we get from squaring a multinomial. The consequence is that the sum of the elements of \( S_y(X)^t S_y(X) \) is positive, and hence the "quadratic form" \( (Xy)^t(Xy) \) is a positive scalar as long as \( y \) isn't zero and \( X \) has full rank.
 
A couple of extensions of the outer-product sum idea that I won't detail here:
  • it's easy to prove trace(AB) = trace(BA) 
  • when adding new rows to a data set, we can efficiently compute the change to the covariance matrix or regression coefficients 

 Update: Here's a nice treatment of matrix factorizations using diagrams.