2.3.6 Indexing Data
To begin, we create a two-dimensional numpy array.
In [56]: A = np.array(np.arange(16)).reshape((4, 4))
A
Out[56]: array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
Typing A[1, 2] retrieves the element corresponding to the second row and third column. (As usual, Python indexes from 0.)
In [57]: A[1, 2]
Out[57]: 6
The first number after the open-bracket symbol [ refers to the row, and the second number refers to the column.
Indexing Rows, Columns, and Submatrices
To select multiple rows at a time, we can pass in a list specifying our selection. For instance, [1, 3] will retrieve the second and fourth rows:
In [58]: A[[1, 3]]
Out[58]: array([[ 4, 5, 6, 7],
[12, 13, 14, 15]])
To select the first and third columns, we pass in [0, 2] as the second argument in the square brackets. In this case we need to supply the first argument : which selects all rows.
In [59]: A[:, [0, 2]]
Out[59]: array([[ 0, 2],
[ 4, 6],
[ 8, 10],
[12, 14]])
Now, suppose that we want to select the submatrix made up of the second and fourth rows as well as the first and third columns. This is where indexing gets slightly tricky. It is natural to try to use lists to retrieve the rows and columns:
In [60]: A[[1, 3], [0, 2]]
Out[60]: array([ 4, 14])
Oops — what happened? We got a one-dimensional array of length two identical to
In [61]: np.array([A[1, 0], A[3, 2]])
Out[61]: array([ 4, 14])
Similarly, the following code fails to extract the submatrix comprised of the second and fourth rows and the first, third, and fourth columns:
In [62]: A[[1, 3], [0, 2, 3]]
IndexError: shape mismatch: indexing arrays could not be broadcast
together with shapes (2,) (3,)
We can see what has gone wrong here. When supplied with two indexing lists, the numpy interpretation is that these provide pairs of i, j indices for a series of entries. That is why the pair of lists must have the same length. However, that was not our intent, since we are looking for a submatrix.
One easy way to do this is as follows. We first create a submatrix by subsetting the rows of A, and then on the fly we make a further submatrix by subsetting its columns.
In [63]: A[[1, 3]][:, [0, 2]]
Out[63]: array([[ 4, 6],
[12, 14]])
There are more efficient ways of achieving the same result. The convenience function np.ix_() allows us to extract a submatrix using lists, by creating an intermediate mesh object.
In [64]: idx = np.ix_([1, 3], [0, 2, 3])
A[idx]
Out[64]: array([[ 4, 6, 7],
[12, 14, 15]])
Alternatively, we can subset matrices efficiently using slices. The slice 1:4:2 captures the second and fourth items of a sequence, while the slice 0:3:2 captures the first and third items (the third element in a slice sequence is the step size).
In [65]: A[1:4:2, 0:3:2]
Out[65]: array([[ 4, 6],
[12, 14]])
Why are we able to retrieve a submatrix directly using slices but not using lists? Its because they are different Python types, and are treated differently by numpy. Slices can be used to extract objects from arbitrary sequences, such as strings, lists, and tuples, while the use of lists for indexing is more limited.
Boolean Indexing
In numpy, a Boolean is a type that equals either True or False (also represented as 1 and 0, respectively). The next line creates a vector of 0’s, represented as Booleans, of length equal to the first dimension of A.
In [66]: keep_rows = np.zeros(A.shape[0], bool)
keep_rows
Out[66]: array([False, False, False, False])
We now set two of the elements to True.
In [67]: keep_rows[[1, 3]] = True
keep_rows
Out[67]: array([False, True, False, True])
Note that the elements of keep_rows, when viewed as integers, are the same as the values of np.array([0, 1, 0, 1]). Below, we use == to verify their equality. When applied to two arrays, the == operation is applied elementwise.
In [68]: np.all(keep_rows == np.array([0, 1, 0, 1]))
Out[68]: True
(Here, the function np.all() has checked whether all entries of an array are True. A similar function, np.any(), can be used to check whether any entries of an array are True.)
However, even though np.array([0, 1, 0, 1]) and keep_rows are equal according to ==, they index different sets of rows! The former retrieves the first, second, first, and second rows of A.
In [69]: A[np.array([0, 1, 0, 1])]
Out[69]: array([[0, 1, 2, 3],
[4, 5, 6, 7],
[0, 1, 2, 3],
[4, 5, 6, 7]])
By contrast, keep_rows retrieves only the second and fourth rows of A — i.e. the rows for which the Boolean equals True.
In [70]: A[keep_rows]
Out[70]: array([[ 4, 5, 6, 7],
[12, 13, 14, 15]])
This example shows that Booleans and integers are treated differently by numpy.
We again make use of the np.ix_() function to create a mesh containing the second and fourth rows, and the first, third, and fourth columns. This time, we apply the function to Booleans, rather than lists.
In [71]: keep_cols = np.zeros(A.shape[1], bool)
keep_cols[[0, 2, 3]] = True
idx_bool = np.ix_(keep_rows, keep_cols)
A[idx_bool]
Out[71]: array([[ 4, 6, 7],
[12, 14, 15]])
We can also mix a list with an array of Booleans in the arguments to np.ix_():
In [72]: idx_mixed = np.ix_([1, 3], keep_cols)
A[idx_mixed]
Out[72]: array([[ 4, 6, 7],
[12, 14, 15]])
For more details on indexing in numpy, readers are referred to the numpy tutorial mentioned earlier.