NumPy Arrays
1. Introduction #
NumPy stands for Numerical Python. It is a Python library that supports large multi-dimensional arrays and offers mathematical functions designed to operate on them. But why not use normal Python data types, such as lists or tuples? The main reason is that NumPy arrays are much more efficient at storing and manipulating large amounts of data. Why are they more efficient? Mainly because NumPy arrays are fixed-type arrays: every item in an array has the same type, in contrast to lists, which can hold items of different types at the same time. This fixed-type condition makes NumPy arrays less flexible but more efficient, placing them at the foundation of the data science ecosystem in Python.
To start using NumPy you need to add the following code.
import numpy as np
By convention, the numpy module is imported under the alias np, so your code is less verbose.
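As a small illustration of the fixed-type idea from the introduction, the sketch below contrasts a list holding mixed types with an array that keeps a single type. The variable names mixed and numbers are made up for this example.

```python
import numpy as np

# A Python list can hold items of different types at the same time.
mixed = [1, 2.5, "three"]
print([type(item).__name__ for item in mixed])

# A NumPy array stores one single type for all of its items.
numbers = np.array([1, 2, 3])
print(numbers.dtype.kind)  # 'i' stands for a signed integer type

# Assigning a float to an integer array does not change the array's type:
# the value is converted to an integer and the fractional part is dropped.
numbers[0] = 7.9
print(numbers)
```

The exact integer width (for example int64) may depend on the platform, which is why the sketch only inspects the kind of the type.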
Create NumPy Arrays #
You can create NumPy arrays from Python lists or from scratch.
NumPy Array from List#
To create a NumPy array from a Python list, you first need the list itself. For example, in the following cell, we create a Python list that contains the prime numbers less than 30.
# Create Python list
primes: list = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
print(type(primes))
primes
<class 'list'>
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
We notice that all elements within the primes list are integers. Therefore, we can create a NumPy array with integer values from this list.
# Create NumPy array from Python list
np_primes: np.ndarray = np.array(primes)
print(type(np_primes))
np_primes
<class 'numpy.ndarray'>
array([ 2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
If your list contains different types, NumPy will attempt to upcast them to a common type. For instance, the following list has both integers and floating-point numbers, so the integers are upcast to floats.
scores: list = [10, 9.6, 4.3, 8, 7.1, 6.1, 8]
np_scores: np.ndarray = np.array(scores)
np_scores
array([10. , 9.6, 4.3, 8. , 7.1, 6.1, 8. ])
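To see upcasting at work, one option is to inspect the dtype that NumPy picks for a few mixed inputs. This is only a sketch; the variable names are illustrative, and np.result_type is used here simply to ask NumPy which common type two types would be upcast to.

```python
import numpy as np

# Integers mixed with floats are upcast to floats.
ints_and_floats = np.array([1, 2.5, 3])
print(ints_and_floats.dtype)

# Booleans mixed with integers are upcast to integers (True becomes 1).
bools_and_ints = np.array([True, 2, 3])
print(bools_and_ints)

# Ask NumPy directly for the common type of two types.
print(np.result_type(np.int64, np.float64))
```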
You can go further and specify the data type of the items in the array by using the dtype keyword parameter. NumPy has a set of predefined types such as bool_ (for Booleans), int_ (for integers), float64 (for floating-point numbers), and str_ (for strings). When you build an array from existing data, np.array() infers the type from the values; functions that create arrays from scratch, such as zeros() and ones(), default to float64. (Older code often uses the float_ alias for float64; that alias was removed in NumPy 2.0.) The whole list of types can be found in the NumPy documentation on data types.
In the following cell, we take the list of integers scale, and we convert it into a NumPy array with floating-point numbers.
scale: list = [1, 2, 3, 4, 5]
np_scale: np.ndarray = np.array(scale, dtype=np.float64)
np_scale
array([1., 2., 3., 4., 5.])
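The type of an existing array can also be changed after creation. The sketch below uses the astype() method, which is not covered in the text above but is part of NumPy; it returns a new array with the requested type. The variable names are illustrative.

```python
import numpy as np

# Floats converted to integers lose their fractional part (no rounding).
as_ints = np.array([1.7, 2.2, 9.9]).astype(np.int64)
print(as_ints)

# Integers converted to Booleans: 0 becomes False, anything else True.
as_bools = np.array([0, 1, 2]).astype(np.bool_)
print(as_bools)
```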
NumPy Array from Scratch#
When dealing with large arrays it is better to create them from scratch instead of using Python lists. To do so, you can use diverse built-in functions provided by NumPy.
For instance, you can use the zeros() function, which creates an array with n zeros. In the following cell, we create an array with 8 zeros. Notice that by default the numbers are created as floats. If you want to change the type, you can also use the dtype keyword.
# Create an array with 8 zeros
zeros: np.ndarray = np.zeros(8)
zeros
array([0., 0., 0., 0., 0., 0., 0., 0.])
The ones() function works like zeros(); the only difference is that it fills the array with 1s instead of 0s.
ones: np.ndarray = np.ones(8, dtype='int')
ones
array([1, 1, 1, 1, 1, 1, 1, 1])
You can also pick a specific number to fill your array using the full() function. In the following cell, the first argument is the number of items in the array, and the second argument is the value copied into each position of the array.
pi_numbers: np.ndarray = np.full(8, 3.1418)
pi_numbers
array([3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418])
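The functions above differ in how they choose the data type. As a rough sketch (variable names are illustrative): zeros() defaults to floats unless dtype is given, while full() takes its type from the fill value.

```python
import numpy as np

# zeros() creates floats by default; dtype overrides that.
float_zeros = np.zeros(4)
int_zeros = np.zeros(4, dtype=np.int64)
print(float_zeros.dtype, int_zeros.dtype)

# full() infers the type from the fill value.
sevens = np.full(3, 7)     # integer fill value -> integer array
halves = np.full(3, 0.5)   # float fill value -> float array
print(sevens.dtype.kind, halves.dtype.kind)
```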
Multi-dimensional Arrays #
NumPy also supports the creation of multi-dimensional arrays (such as matrices). The previous example can be altered so that we define the number of rows and columns of the array. Passing a shape with two dimensions, rows and columns, produces a two-dimensional array.
For instance, in the following cell, we again use the full() function to create pi_matrix, which consists of 4 rows and 8 columns. Instead of passing an integer as the first argument, we pass a tuple whose first item is the number of rows (i.e. 4) and whose second item is the number of columns (i.e. 8).
pi_matrix: np.ndarray = np.full((4, 8), 3.1418)
pi_matrix
array([[3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418],
[3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418],
[3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418],
[3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418, 3.1418]])
You can also create an array out of regular Python nested lists: inner lists become rows in the matrix.
nested_lists: list = [list(range(i, i + 3)) for i in [1, 4, 7]]
print(nested_lists)
np.array(nested_lists)
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
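A quick way to confirm how many dimensions a creation call produced is to look at the resulting shape. The following sketch (with made-up variable names) shows that a shape tuple with two entries gives a two-dimensional array, even when one of the entries is 1.

```python
import numpy as np

grid = np.zeros((2, 3))                     # 2 rows, 3 columns
table = np.array([[1, 2, 3], [4, 5, 6]])    # inner lists become rows
single_row = np.ones((1, 4))                # still two-dimensional

print(grid.shape, table.shape, single_row.shape)
print(grid.ndim, table.ndim, single_row.ndim)
```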
Some other useful functions for creating arrays include:
arange(): creates a linear sequence. It is similar to Python's built-in range() function.
# Creates a linear sequence from 0 up to (but not including) 10, in steps of 2.
np.arange(0, 10, 2)
array([0, 2, 4, 6, 8])
linspace(): creates an array of evenly spaced numbers.
# Creates an array of 5 items, evenly spaced between 0 and 1.
np.linspace(0, 1, 5)
array([0. , 0.25, 0.5 , 0.75, 1. ])
random.random(): creates an array with random numbers in the interval [0, 1), that is, 0 included and 1 excluded.
# Creates a matrix (3 rows and 4 columns) with random numbers between 0 and 1.
np.random.random((3, 4))
array([[0.01376755, 0.41037085, 0.89454677, 0.81621551],
[0.19421344, 0.62002607, 0.48237312, 0.01212608],
[0.33344347, 0.89549543, 0.69766052, 0.45191442]])
random.randint(): creates an array with random integers in a given interval.
# Creates a 4x4 matrix with random integers from 0 to 99 (the upper bound 100 is excluded)
np.random.randint(0, 100, (4, 4))
array([[99, 47, 57, 50],
[68, 25, 38, 2],
[60, 32, 25, 34],
[82, 30, 60, 66]])
random.normal(): creates an array of random numbers drawn from a normal distribution. You can set the mean and the standard deviation values.
# Creates a normally distributed 4x4 matrix with random numbers.
# The mean is set to 0 and the standard deviation to 1.
np.random.normal(0, 1, (4, 4))
array([[ 0.26637009, -0.69498313, 0.51127947, 2.17342232],
[ 1.25257033, -0.5211245 , -1.20132996, 0.76883407],
[ 0.36167525, 0.5211331 , -0.25855829, 0.3010557 ],
[ 0.43305217, -1.85090513, -1.04801338, -2.03473037]])
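Two details of the functions listed above are easy to miss: arange() excludes its stop value while linspace() includes it by default, and random outputs change on every run. The sketch below illustrates both. It uses np.random.default_rng, NumPy's newer generator interface, as an alternative to the np.random functions shown above; the seed value 42 and the variable names are arbitrary choices for this example.

```python
import numpy as np

# arange() stops before 10; linspace() ends exactly at 10.
stepped = np.arange(0, 10, 2)
spaced = np.linspace(0, 10, 6)
print(stepped)
print(spaced)

# Two generators created with the same seed produce the same numbers,
# which makes experiments reproducible.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)
sample_a = rng_a.integers(0, 100, size=(4, 4))  # upper bound excluded
sample_b = rng_b.integers(0, 100, size=(4, 4))
print(np.array_equal(sample_a, sample_b))

# The generator also offers uniform and normal samples.
uniform = rng_a.random((3, 4))        # values in [0, 1)
normal = rng_a.normal(0, 1, (4, 4))   # mean 0, standard deviation 1
print(uniform.shape, normal.shape)
```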
Array Attributes #
Every NumPy array has the following set of attributes:
Attribute | Description |
---|---|
ndim | Number of dimensions |
shape | Size of each dimension |
size | Number of elements in the array |
dtype | Data type of the array |
itemsize | Size in bytes of each array element |
nbytes | Size in bytes of the array |
Next, we create three different arrays (one-dimensional, two-dimensional, and three-dimensional) to see the output of these attributes.
# One-dimensional array
one_dim: np.ndarray = np.random.randint(10, size=6)
one_dim
array([2, 0, 2, 9, 2, 9])
# Two-dimensional array
two_dim: np.ndarray = np.random.randint(10, size=(2, 6))
two_dim
array([[9, 8, 1, 5, 7, 7],
[5, 9, 6, 9, 8, 2]])
# Three-dimensional array
three_dim: np.ndarray = np.random.randint(10, size=(2, 3, 6))
three_dim
array([[[2, 2, 0, 6, 1, 2],
[6, 4, 3, 9, 9, 5],
[3, 7, 9, 2, 0, 2]],
[[1, 4, 2, 7, 0, 4],
[4, 2, 2, 3, 3, 6],
[2, 0, 0, 7, 2, 8]]])
Let us now explore the ndim, shape, and size attributes.
print(f'One-dimensional array: ndim={one_dim.ndim}, shape={one_dim.shape}, size={one_dim.size}')
print(f'Two-dimensional array: ndim={two_dim.ndim}, shape={two_dim.shape}, size={two_dim.size}')
print(f'Three-dimensional array: ndim={three_dim.ndim}, shape={three_dim.shape}, size={three_dim.size}')
One-dimensional array: ndim=1, shape=(6,), size=6
Two-dimensional array: ndim=2, shape=(2, 6), size=12
Three-dimensional array: ndim=3, shape=(2, 3, 6), size=36
What about the dtype?
print(f'One-dimensional array: dtype={one_dim.dtype}')
print(f'Two-dimensional array: dtype={two_dim.dtype}')
print(f'Three-dimensional array: dtype={three_dim.dtype}')
One-dimensional array: dtype=int64
Two-dimensional array: dtype=int64
Three-dimensional array: dtype=int64
The int64 data type refers to a signed integer (it can be either positive or negative) that uses 64 bits for its representation.
Finally, let us look into the itemsize and nbytes attributes. Notice that nbytes is equal to \(\text{itemsize} \times \text{size}\).
print(f'One-dimensional array: itemsize={one_dim.itemsize} bytes, nbytes={one_dim.nbytes} bytes')
print(f'Two-dimensional array: itemsize={two_dim.itemsize} bytes, nbytes={two_dim.nbytes} bytes')
print(f'Three-dimensional array: itemsize={three_dim.itemsize} bytes, nbytes={three_dim.nbytes} bytes')
One-dimensional array: itemsize=8 bytes, nbytes=48 bytes
Two-dimensional array: itemsize=8 bytes, nbytes=96 bytes
Three-dimensional array: itemsize=8 bytes, nbytes=288 bytes
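The relationships between these attributes can be checked directly. The sketch below uses an array of zeros with an explicit int64 type so the numbers do not depend on random values or on the platform; the names cube and small are illustrative.

```python
import math

import numpy as np

cube = np.zeros((2, 3, 6), dtype=np.int64)

# size is the product of the sizes of all dimensions.
print(cube.size == math.prod(cube.shape))

# nbytes is itemsize multiplied by size.
print(cube.nbytes == cube.itemsize * cube.size)

# A smaller data type reduces the memory used by the same number of items.
small = cube.astype(np.int8)
print(small.itemsize, small.nbytes)
```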
Indexing #
Indexing in NumPy arrays is very similar to Python list indexing. In fact, when retrieving a single item from a one-dimensional array, it works the same way. To see this, let us redefine the primes list, now as a NumPy array.
primes: np.ndarray = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
primes
array([ 2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
If we want to access the first element in the array, we can use the index 0 between square brackets.
primes[0]
2
To access the last item in the array, we use the length of the array minus one as the index. Using the length itself would raise an out-of-bounds IndexError.
primes[len(primes) - 1]
29
We can also access items backwards by using negative indices. For instance, to access the last item in the array, we can use the -1 index.
primes[-1]
29
When dealing with multi-dimensional arrays, you need to specify a comma-separated sequence of indices. To see how it works, let us create the two-dimensional array integers.
integers: np.ndarray = np.array([list(range(i, i + 3)) for i in [1, 4, 7]])
integers
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
If we want to access the item located in the first row and second column of the matrix (i.e. 2), we use the [0, 1] index (remember that we start counting from 0).
integers[0, 1]
2
To access the element placed at the middle of the integers matrix (i.e. 5), we can find the middle point of each dimension as follows. (Notice that the following code will only work for a matrix whose dimension sizes are both odd!)
rows, columns = integers.shape
middle_rows: int = rows // 2
middle_columns: int = columns // 2
print(f'Middle index rows: {middle_rows}, Middle index columns: {middle_columns}')
integers[middle_rows, middle_columns]
Middle index rows: 1, Middle index columns: 1
5
Finally, to access the last item in the two-dimensional array, that is, the element in the bottom-right corner (i.e. 9), we can take the size of each dimension and subtract 1 from it. Then we use these new values as the sequence of indices.
rows, columns = integers.shape
last_rows: int = rows - 1
last_columns: int = columns - 1
print(f'Last index rows: {last_rows}, Last index columns: {last_columns}')
integers[last_rows, last_columns]
Last index rows: 2, Last index columns: 2
9
NumPy arrays are mutable, meaning that you can directly modify their values. In the following cell, we modify the first element of the matrix integers.
integers[0, 0] = 100
integers
array([[100, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]])
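Negative indices also work in each dimension of a multi-dimensional array, which can be a shorter route to the bottom-right corner than computing sizes by hand. The sketch below also shows what happens when an index falls outside the array. It rebuilds the original matrix for the example, and the out_of_bounds flag is just an illustrative helper.

```python
import numpy as np

integers = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Negative indices count from the end in every dimension.
print(integers[-1, -1])  # last row, last column
print(integers[-1, 0])   # last row, first column

# An index beyond the size of a dimension raises an IndexError.
out_of_bounds = False
try:
    integers[3, 0]
except IndexError as error:
    out_of_bounds = True
    print(f"IndexError: {error}")
```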
Slicing #
Array slicing also works like the regular slicing of Python lists. That is, we use square brackets ([]) and the colon symbol (:) to choose the items we are interested in. The regular slicing notation looks as follows:
x[start:stop]
However, there is one extra attribute you can define when slicing (this also applies to other Python sequences such as lists and strings): the step, which selects only every \(n^{th}\) item in the sequence, where \(n\) is the value you assign to it. Notice that the last colon and the step are always optional in Python.
x[start:stop:step]
If you do not specify a value for any of these three attributes, they will get their default value. All default values are shown in the following table.
Attribute | Default value |
---|---|
start | 0 |
stop | Size of dimension |
step | 1 |
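The defaults in the table can be verified by comparing a slice that omits every attribute with one that spells them out. This is a sketch with illustrative names; note that the listed defaults apply to a positive step, and with a negative step the slice runs from the end of the array towards the beginning.

```python
import numpy as np

x = np.arange(10)

# Omitting all three attributes is the same as writing the defaults.
all_defaults = x[::]
explicit = x[0:len(x):1]
print(np.array_equal(all_defaults, explicit))

# With a negative step the array is traversed backwards.
reversed_x = x[::-1]
print(reversed_x)
```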
One-dimensional Array#
Now, let us go back to our primes array and slice it in different ways.
primes: np.ndarray = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
primes
array([ 2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
In the following cell, we only consider the first five items of the primes array. Notice that the element at index 5 (the sixth item) is not included, because the stop index is exclusive.
first_five: np.ndarray = primes[:5]
first_five
array([ 2, 3, 5, 7, 11])
Computing the middle point manually can be error-prone. We can, therefore, compute the index and store it in an integer variable (here named middle), and use it to define where the second half of the array starts. In the following cell, we consider all items placed in the second half of the array.
middle: int = len(primes) // 2
last_half: np.ndarray = primes[middle:]
last_half
array([13, 17, 19, 23, 29])
In the following cell, we specify both the start and stop attributes to take the two middle values of the array.
two_middle: np.ndarray = primes[4:6]
two_middle
array([11, 13])
Finally, we can start using the step attribute. If we set it to 2 without specifying any other value for the start and stop attributes, we will get every second element of the array, starting with the first one.
every_two: np.ndarray = primes[::2]
every_two
array([ 2, 5, 11, 17, 23])
We can go further and take every second element only in the second half of the array.
middle: int = len(primes) // 2
every_two_half: np.ndarray = primes[middle::2]
every_two_half
array([13, 19, 29])
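Negative indices can be used inside slices as well, which avoids computing positions from the length of the array. The following sketch shows a few variations on the primes array; the variable names are illustrative. Unlike single-item indexing, a slice whose bounds exceed the array is simply cut off at the end.

```python
import numpy as np

primes = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])

last_three = primes[-3:]      # from the third-to-last item to the end
without_ends = primes[1:-1]   # drop the first and the last item
beyond = primes[8:100]        # stop is larger than the array: no error

print(last_three)
print(without_ends)
print(beyond)
```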
Multi-dimensional Arrays#
Slicing works similarly when dealing with multi-dimensional arrays. You only need to separate the slicing criteria for each dimension with a comma.
x[start_dim1:stop_dim1:step_dim1, start_dim2:stop_dim2:step_dim2, ...]
Let us have a look at the integers matrix.
integers: np.ndarray = np.array([list(range(i, i + 3)) for i in [1, 4, 7]])
integers
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Pick first two rows and first two columns
integers[:2, :2]
array([[1, 2],
[4, 5]])
# Pick last two rows and last two columns
integers[1:, 1:]
array([[5, 6],
[8, 9]])
# Pick all rows and column in position 2 (last column)
integers[:, 2]
array([3, 6, 9])
# Pick first row and all columns
integers[0, :]
array([1, 2, 3])
# Pick first row and all columns (another way)
integers[0]
array([1, 2, 3])
# Pick all rows and every second column
integers[:, ::2]
array([[1, 3],
[4, 6],
[7, 9]])
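One subtlety in the examples above is that using a single index for a dimension removes that dimension from the result, whereas using a slice keeps it. The sketch below selects the last column both ways; the names column_1d and column_2d are made up for this example.

```python
import numpy as np

integers = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# A single index drops the dimension: the result is one-dimensional.
column_1d = integers[:, 2]
print(column_1d.shape)

# A slice of length one keeps the dimension: the result is a 3x1 matrix.
column_2d = integers[:, 2:3]
print(column_2d.shape)
```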
Copies of Arrays#
A difference of foremost importance between NumPy arrays and Python lists is that array slices return views instead of copies of the data. This means that if you modify the array slices you are actually modifying the underlying data. This property is very useful especially when dealing with large datasets: we can focus on the data we need to modify and there is no need to handle extra copies.
sub_integers: np.ndarray = integers[:2, :2]
sub_integers
array([[1, 2],
[4, 5]])
# Change the first value of the view
sub_integers[0, 0] = 100
sub_integers
array([[100, 2],
[ 4, 5]])
# the integers matrix has been also updated
integers
array([[100, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]])
However, views come at a cost: if you just wanted to experiment with your data without intending to modify it, you can end up with a polluted dataset. If you prefer to work with a copy, you need to be explicit about it. Luckily, NumPy provides the copy() method, which helps in this regard.
integers: np.ndarray = np.array([list(range(i, i + 3)) for i in [1, 4, 7]])
integers
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Create a copy instead
sub_integers: np.ndarray = integers[:2, :2].copy()
sub_integers
array([[1, 2],
[4, 5]])
# Update first item of the copy
sub_integers[0, 0] = 100
sub_integers
array([[100, 2],
[ 4, 5]])
# The integers matrix is intact
integers
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
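When it is unclear whether a variable holds a view or a copy, NumPy can be asked directly. The sketch below uses np.shares_memory and the base attribute, neither of which appears in the text above; both are part of NumPy. The names view and independent are illustrative.

```python
import numpy as np

integers = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

view = integers[:2, :2]                # slicing returns a view
independent = integers[:2, :2].copy()  # explicit copy

# A view shares memory with the original array; a copy does not.
print(np.shares_memory(integers, view))
print(np.shares_memory(integers, independent))

# The base attribute points to the array a view was taken from.
print(view.base is integers)
print(independent.base is None)
```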
Masking#
We can use Boolean arrays to mask (or filter) values in the array. Suppose that we would like to get all values from the integers array that are greater than or equal to 5. Let us first create the Boolean array and see what it looks like.
integers >= 5
array([[False, False, False],
[False, True, True],
[ True, True, True]])
You can see that NumPy checks the Boolean expression against all values in the array, then it creates a new NumPy array with Boolean values stating if the condition is satisfied by the item in a given position.
Now, if we want to mask the integers array and keep the subset of data points that are greater than or equal to 5, we use the Boolean array as the index.
integers[integers >= 5]
array([5, 6, 7, 8, 9])
Notice that NumPy returns a one-dimensional array with all the values that meet the condition.
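Masks can also be combined and used for assignment. The sketch below is one possible extension of the example: it joins conditions with the element-wise operators & (and) and | (or), each condition wrapped in parentheses, and then uses a mask to overwrite the selected values in a copy. The variable names are illustrative.

```python
import numpy as np

integers = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Values that are at least 5 AND even.
even_and_large = integers[(integers >= 5) & (integers % 2 == 0)]
print(even_and_large)

# Values that are below 2 OR above 8.
small_or_large = integers[(integers < 2) | (integers > 8)]
print(small_or_large)

# A mask on the left-hand side assigns to the selected positions only.
# We work on a copy so the original matrix stays intact.
capped = integers.copy()
capped[capped >= 5] = 0
print(capped)
```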