Slicing in Python: A Comprehensive Guide

Python offers a variety of data containers, the most popular being the list. From a higher level, it's a sequence of objects kept one after another in the container. Therefore, we need a mechanism to access a particular element of such a container – this mechanism is called indexing. It works like in the following code block, and it works the very same way for both lists and tuples:
>>> objects = [
... 'Zen',
... float('nan'),
... 25,
... None,
... ValueError,
... ValueError(),
... ]
>>> objects[0]
'Zen'
>>> objects[1]
nan
>>> objects[-1]
ValueError()
>>> objects[-2]
<class 'ValueError'>
Indexing enables one to access an individual element of a container – often, however, we need to access more elements than just one. A mechanism to achieve this is called slicing.
Slicing is one of the most powerful and convenient features in Python, enabling one to access and manipulate portions of sequences – lists, tuples, strings, arrays and dataframes, but also custom ones. In Data Science, slicing is among the critical tools you need to master, as without it, you won't be able to achieve many tasks that require manipulating data containers.
In the data science context, however, there's a slight problem related to slicing – while it works in the same way for lists and tuples, it doesn't for NumPy arrays and Pandas dataframes. You also need to be aware that if a custom class implements indexing and slicing, this doesn't have to work the same way as for lists/tuples or other data containers implementing indexing and slicing.
So, what follows is the basic knowledge on slicing that every data scientist should possess:
- How slicing works for lists and tuples (the basic Python data containers)
- How it works for NumPy arrays, both one- and multi-dimensional ones.
- How it works for Pandas dataframes.
- How to read and grasp how slicing works in a custom class.
- How to implement a custom class that implements slicing.
For an advanced data scientist, this doesn't have to be enough. However, with this basic set of information, you're ready to start working on objects that implement slicing. If you need more advanced techniques, you'll be ready to learn and use them.
So, today, you'll learn the basic set of tools that are critical to use slicing for the most important data containers in data science. In particular, we'll explore how it works with lists and tuples, NumPy arrays, and Pandas dataframes. Note, however, that we'll focus on slicing rather than indexing.
We'll also cover how to implement slicing in custom classes, providing you with the tools to extend this functionality in your own projects. To this end, we'll implement a NamedList class, a list-like data container that enables the user to index and slice it like a regular list, but also with the help of element labels.
Basics of slicing
Let's start with the basic knowledge of how slicing works in Python. This will be general knowledge that you'll be able to use for slicing different types of data containers.
Ranging, which is basic slicing, is a technique to extract a subset of elements from a sequence based on their indices. The full syntax for slicing is sequence[start:stop:step], but you don't need to use all the elements every time:
- start is the starting index of the slice. For positive values of step, omitting it is equivalent to using the index 0, meaning starting from the beginning of the container. For negative values of step, omitting start is equivalent to using the index -1, meaning starting from the end of the container.
- stop indicates where slicing stops. The element at stop itself is never included, so for a positive step the last element that can be included is the one at stop-1. For positive values of step, omitting stop is equivalent to using the length of the sequence (one past the last index), meaning ending the slice at the end of the container. For negative values of step, omitting stop means that the slice runs through the first element of the container, which is included. Note that this is not the same as writing 0 explicitly, which would exclude the element at index 0.
- step indicates the interval between consecutive indices of the slice. Omitting it means using the default of 1, meaning the slice will include every element between the start and stop indices. A negative step value reverses the direction of the slice (hence stop needs to be smaller than start).
Python uses the following convention for slicing (including ranging): take the first element of a range/slice, but exclude the last one. Remember this, especially if you come from a language that uses a different convention, like R. Unlike Python, R indexing starts from 1, not 0, and its ranges include the last element.
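The defaults described above can be checked with the indices() method of the built-in slice class, which resolves omitted values for a sequence of a given length. The sketch below is only an illustration; the variable names and the length of 10 are assumptions made for the example.

```python
# Resolve omitted start/stop/step values for a sequence of length 10.
n = 10

forward = slice(None, None, None).indices(n)   # corresponds to [:]
backward = slice(None, None, -1).indices(n)    # corresponds to [::-1]
clipped = slice(7, 20).indices(n)              # corresponds to [7:20]
from_end = slice(-3, None).indices(n)          # corresponds to [-3:]

print(forward)   # (0, 10, 1)
print(backward)  # (9, -1, -1)
print(clipped)   # (7, 10, 1)
print(from_end)  # (7, 10, 1)

# The resolved triple can be passed to range() to list the selected indices.
selected = list(range(*backward))
print(selected)  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```

Note that the -1 returned as the stop value for the reversed slice is a bound meant for range(), meaning "one step before index 0". It is not a negative index counted from the end of the sequence.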
Let's illustrate various index-based slicing techniques.
Basic slicing
>>> my_list = list(range(10))
>>> my_list[:3]
[0, 1, 2]
>>> my_list[7:]
[7, 8, 9]
>>> my_list[2:5]
[2, 3, 4]
>>> my_list[2:3]
[2]
Note the difference between my_list[2], which gives 2, and my_list[2:3], which gives a one-element list, [2].
Negative indexing
>>> my_list[-3:]
[7, 8, 9]
>>> my_list[-3:-1]
[7, 8]
Slicing with a step
>>> my_list[1:8:2]
[1, 3, 5, 7]
Negative values of step mean reversing the order:
>>> my_list[::-1] # reverse the sequence
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
>>> my_list[::-2] # take every second element and reverse the sequence
[9, 7, 5, 3, 1]
Non-existing indices in a slice
Note that whenever you use a non-existing index while indexing, Python will throw an error:
>>> my_list = list(range(10))
>>> my_list[20]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
However, you can use non-existing elements in a range/slice:
>>> my_list[7:20]
[7, 8, 9]
>>> my_list[700:999]
[]
So, to summarize:
- Slicing lets you retrieve a subset of elements of a data sequence.
- start represents the first index to include in the slice.
- stop represents the stopping index, which is itself excluded; for a positive step, the last element to include is the one at index stop-1.
- step allows for skipping elements and/or reversing the sequence.
- Negative indices enable slicing relative to the end of the sequence.
- You can use non-existent indices in ranges/slices.
Using slice objects
When you use sequence[start:stop:step], Python calls sequence.__getitem__(slice(start, stop, step)). As you see, the range is converted into a so-called slice object. You can do it yourself, meaning that instead of sequence[start:stop:step] you can do this: sequence[slice(start, stop, step)].
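To see what the square brackets actually pass to __getitem__, one option is a tiny helper class that returns its key unchanged. The Echo class below is hypothetical and exists only for this illustration.

```python
class Echo:
    """Hypothetical helper: returns whatever key the brackets receive."""

    def __getitem__(self, key):
        return key


echo = Echo()

print(echo[3])         # 3
print(echo[1:8:2])     # slice(1, 8, 2)
print(echo[:4])        # slice(None, 4, None)
print(echo[1:3, ::2])  # (slice(1, 3, None), slice(None, None, 2))
```

The last line shows that comma-separated ranges arrive as a tuple of slice objects, which is the mechanism multidimensional containers rely on.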
A slice object offers a different slicing technique. It enables you to keep the information on how to slice a data sequence, something you cannot do using a range. You create slice objects using the built-in slice class, which takes start, stop, and step parameters. Their meaning is the same as that of the corresponding elements we used above for ranges. slice objects can be particularly useful when the slicing parameters need to be reused or passed around in your code.
>>> s = slice(9, 2, -2)
>>> my_list[s]
[9, 7, 5, 3]
>>> my_list[s] == my_list[9:2:-2]
True
Like with ranging, you can use non-existing indexes in slices:
>>> my_list = list(range(10))
>>> my_list[slice(7, 20)]
[7, 8, 9]
>>> my_list[slice(700, 999)]
[]
Remember that [:n] is represented by slice(None, n). Whenever we can omit a particular element in a range, we need to use None for this element in the corresponding slice object:
>>> my_list[slice(None, 4)]
[0, 1, 2, 3]
You may wonder, then, whether you should use slice objects instead of ranges. Do this, but only when you need to reuse these objects. This is because creating a slice object explicitly takes extra time, so if you're going to use it only once, it's faster to slice a sequence using a range.
Let's conduct a simple benchmark, using the built-in timeit module:
from timeit import repeat

setup = 'sequence = range(10)'
code_range = 'sequence[9: 2: -2]'
code_slice = 'sequence[slice(9, 2, -2)]'
N = 10_000_000  # executions per measurement
R = 10          # number of measurements

time_range = repeat(code_range, setup=setup, number=N, repeat=R)
time_slice = repeat(code_slice, setup=setup, number=N, repeat=R)

def mean_time(x):
    return round(sum(x) / len(x), 3)

def min_time(x):
    return round(min(x), 3)

print(
    'Range:',
    f'{min_time(time_range) = }',
    f'{mean_time(time_range) = }',
    'Slice:',
    f'{min_time(time_slice) = }',
    f'{mean_time(time_slice) = }',
    sep='\n'
)
These are the results on my machine (Windows 10, 32 GB RAM, 4 physical and 8 logical cores, Python 3.11):
Range:
min_time(time_range) = 1.832
mean_time(time_range) = 2.059
Slice:
min_time(time_slice) = 2.436
mean_time(time_slice) = 2.533
As we can see, creating a slice object and using it is slower than using the corresponding range. Let's see what happens, however, when we use an already-created slice object. In the above code, let's change the three lines defining setup, code_range and code_slice:
setup = 'sequence = range(10); s = slice(9, 2, -2)'
code_range = 'sequence[9: 2: -2]'
code_slice = 'sequence[s]'
In this configuration, using a slice object is slightly faster than using a range:
Range:
min_time(time_range) = 1.79
mean_time(time_range) = 1.907
Slice:
min_time(time_slice) = 1.555
mean_time(time_slice) = 1.63
Thus, remember to use slice objects only when you need to keep and/or reuse them.
When might you need to keep one? Imagine you need to extract substrings from strings that follow a standard format. You know that these substrings are always located at the same positions, so you can create one slice per substring. This will be particularly efficient if you need to extract these substrings from many strings.
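A minimal sketch of this idea follows. The record layout (a 10-character date, a 5-character level and a free-text message, separated by single spaces) and all names are assumptions invented for the illustration.

```python
# Hypothetical fixed-width layout: date (10 chars), space,
# level (5 chars, space-padded), space, then the message.
DATE = slice(0, 10)
LEVEL = slice(11, 16)
MESSAGE = slice(17, None)

records = [
    "2024-01-15 ERROR Disk quota exceeded",
    "2024-01-16 INFO  Backup finished",
]

# The same three slice objects are reused for every record.
parsed = [(r[DATE], r[LEVEL].strip(), r[MESSAGE]) for r in records]

for row in parsed:
    print(row)
```

Giving each slice a name documents the format in one place, so a change in the layout requires editing only the slice definitions.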
Combining slice objects
You cannot directly add two or more slice objects in the sense of combining their ranges into a single slice object. The slice objects themselves do not support addition or concatenation because they represent specific slices of sequences rather than sequences themselves.
If you simply want to take all the elements of a sequence selected by two or more slices, apply each slice to the sequence and combine the resulting sequences. However, if you don't want to repeat the same elements (in case the slices partially or fully overlap), this won't work. We need a custom function, and here it is:
def apply_slices(sequence, *slices):
    all_indices = range(len(sequence))
    indices = []
    for s in slices:
        # indices selected by this slice, in the slice's own order
        current_indices = list(all_indices[s])
        for idx in current_indices:
            # keep only the first occurrence of each index
            if idx not in indices:
                indices.append(idx)
    return [sequence[i] for i in indices]
Let's consider a couple of examples. Let's start with a very simple one:
>>> my_list = list(range(20))
>>> slice1 = slice(1, 5) # [1, 2, 3, 4]
>>> slice2 = slice(3, 7) # [3, 4, 5, 6]
>>> apply_slices(my_list, slice1, slice2)
[1, 2, 3, 4, 5, 6]
We see this works in the basic configuration. Let's consider two more advanced examples:
>>> my_list = list(range(20))
>>> slice1 = slice(2, 15, 2) # [2, 4, 6, 8, 10, 12, 14]
>>> slice2 = slice(15, 2, -3) # [15, 12, 9, 6, 3]
>>> apply_slices(my_list, slice1, slice2)
[2, 4, 6, 8, 10, 12, 14, 15, 9, 3]
>>> my_list = list(range(20))
>>> slice1 = slice(0, 10, 2) # [0, 2, 4, 6, 8]
>>> slice2 = slice(5, 15, 3) # [5, 8, 11, 14]
>>> slice3 = slice(20, 10, -2) # [19, 17, 15, 13, 11]
>>> apply_slices(my_list, slice1, slice2, slice3)
[0, 2, 4, 6, 8, 5, 11, 14, 19, 17, 15, 13]
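For long sequences, the membership test against a list in the function above can become slow, since each check scans the list. One possible variant, sketched here under the hypothetical name apply_slices_unique, relies on the fact that dictionary keys preserve insertion order and offer fast lookups.

```python
def apply_slices_unique(sequence, *slices):
    """Order-preserving union of the elements selected by each slice."""
    all_indices = range(len(sequence))
    seen = {}  # keys keep insertion order; existing keys are not moved
    for s in slices:
        seen.update(dict.fromkeys(all_indices[s]))
    return [sequence[i] for i in seen]


my_list = list(range(20))
result = apply_slices_unique(
    my_list,
    slice(0, 10, 2),
    slice(5, 15, 3),
    slice(20, 10, -2),
)
print(result)  # [0, 2, 4, 6, 8, 5, 11, 14, 19, 17, 15, 13]
```

The result matches that of the original function; only the bookkeeping of already-seen indices differs.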
Slicing lists, tuples and strings
Now that we've covered the basics of slicing, let's see how ranging and slicing work in the three most common built-in Python sequence types: lists, tuples, and strings.
Lists
Lists constitute the most frequently used data sequence in Python. They are mutable, meaning that you can change their content after they are created – we'll see how to do this with slicing.
Retrieving slices
>>> my_list = list(range(10))
>>> my_list[2:5]
[2, 3, 4]
>>> my_list[:3]
[0, 1, 2]
>>> my_list[7:]
[7, 8, 9]
>>> my_list[::2]
[0, 2, 4, 6, 8]
>>> my_list[::-1] # reversing the list
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
Below are the same operations using the corresponding slice objects:
>>> s = slice(2, 5)
>>> my_list[s]
[2, 3, 4]
>>> s = slice(None, 3)
>>> my_list[s]
[0, 1, 2]
>>> s = slice(7, None)
>>> my_list[s]
[7, 8, 9]
>>> s = slice(None, None, 2)
>>> my_list[s]
[0, 2, 4, 6, 8]
>>> s = slice(None, None, -1)
>>> my_list[s]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
When you retrieve a slice of a list, Python creates a new list, which is a shallow copy of the sliced part of the original list:
>>> my_list = list(range(10))
>>> my_slice = my_list[2:5]
>>> my_slice
[2, 3, 4]
Modifying the slice doesn't affect the original list:
>>> my_slice[0] = 10
>>> my_slice
[10, 3, 4]
>>> my_list
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The same way, modifying the original list doesn't affect the slice:
>>> my_list[2] = 20
>>> my_list
[0, 1, 20, 3, 4, 5, 6, 7, 8, 9]
>>> my_slice
[10, 3, 4]
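The copy created by slicing is shallow: the new list is independent, but its elements are the same objects as in the original. With immutable elements such as integers this makes no difference, but with nested mutable elements it does. The following sketch, with made-up data, illustrates the point.

```python
nested = [[1, 2], [3, 4], [5, 6]]
part = nested[:2]      # new outer list, same inner lists

part.append([7, 8])    # changes only the new outer list
part[0].append(99)     # mutates an inner list shared with `nested`

print(nested)  # [[1, 2, 99], [3, 4], [5, 6]]
print(part)    # [[1, 2, 99], [3, 4], [7, 8]]
```

If fully independent nested data are needed, copy.deepcopy() from the standard library is one option.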
Assigning to a list's slice
Lists are mutable, so you can assign new values to a slice of a list:
>>> my_list[2:5] = ['a', 'b', 'c']
>>> my_list
[0, 1, 'a', 'b', 'c', 5, 6, 7, 8, 9]
>>> my_list[:2] = ['x', 'y']
>>> my_list
['x', 'y', 'a', 'b', 'c', 5, 6, 7, 8, 9]
>>> my_list[5:] = [10, 11, 12]
>>> my_list
['x', 'y', 'a', 'b', 'c', 10, 11, 12]
>>> my_list[::2] = [1, 2, 3, 4]
>>> my_list
[1, 'y', 2, 'b', 3, 10, 4, 12]
You can do the same operations using slice objects:
>>> my_list = list(range(10))
>>> s = slice(2, 5)
>>> my_list[s] = ['a', 'b', 'c']
>>> my_list
[0, 1, 'a', 'b', 'c', 5, 6, 7, 8, 9]
>>> s = slice(None, 2)
>>> my_list[s] = ['x', 'y']
>>> my_list
['x', 'y', 'a', 'b', 'c', 5, 6, 7, 8, 9]
>>> s = slice(5, None)
>>> my_list[s] = [10, 11, 12]
>>> my_list
['x', 'y', 'a', 'b', 'c', 10, 11, 12]
>>> s = slice(None, None, 2)
>>> my_list[s] = [1, 2, 3, 4]
>>> my_list
[1, 'y', 2, 'b', 3, 10, 4, 12]
Interestingly, when the step is 1 (or omitted), the length of the new values does not need to match the length of the slice being replaced, but the new values must be an iterable. (With any other step, the lengths must match exactly.) This is how replacing a slice with a shorter sequence works:
>>> my_list = list(range(10))
>>> len(my_list)
10
>>> s = slice(4, None)
>>> my_list[s] = [1]
>>> my_list
[0, 1, 2, 3, 1]
>>> len(my_list)
5
And this is what you'll get when replacing a slice with a longer sequence:
>>> my_list = list(range(10))
>>> len(my_list)
10
>>> s = slice(4, 7)
>>> my_list[s] = [20, 30, 40, 50, 60]
>>> my_list
[0, 1, 2, 3, 20, 30, 40, 50, 60, 7, 8, 9]
>>> len(my_list)
12
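The following sketch gathers a few related behaviors of slice assignment on lists: replacing with a different length, inserting through an empty slice, deleting a slice, and the length check applied when the step is not 1. The variable names are chosen only for this illustration.

```python
numbers = list(range(10))

# Step of 1: the replacement may have a different length.
numbers[2:5] = ['a']
after_replace = list(numbers)   # [0, 1, 'a', 5, 6, 7, 8, 9]

# An empty slice inserts elements without removing any.
numbers[1:1] = ['x', 'y']
after_insert = list(numbers)    # [0, 'x', 'y', 1, 'a', 5, 6, 7, 8, 9]

# del removes the sliced elements.
del numbers[:3]
after_delete = list(numbers)    # [1, 'a', 5, 6, 7, 8, 9]

# With a step other than 1, the lengths must match exactly.
try:
    numbers[::2] = [0, 0]       # the slice selects 4 positions, not 2
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```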
Tuples
Unlike lists, tuples are immutable: once they are created, you cannot change their content. Thus, you can slice tuples in order to create new tuples but you cannot modify their content. As was the case with lists, slicing a tuple leads to the creation of a new tuple.
Retrieving slices
>>> my_tuple = tuple(range(10))
>>> my_tuple[2:5]
(2, 3, 4)
>>> my_tuple[:3]
(0, 1, 2)
>>> my_tuple[7:]
(7, 8, 9)
>>> my_tuple[::2]
(0, 2, 4, 6, 8)
>>> my_tuple[::-1] # reversing the tuple
(9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
Here are the same operations using slice objects:
>>> s = slice(2, 5)
>>> my_tuple[s]
(2, 3, 4)
>>> s = slice(None, 3)
>>> my_tuple[s]
(0, 1, 2)
>>> s = slice(7, None)
>>> my_tuple[s]
(7, 8, 9)
>>> s = slice(None, None, 2)
>>> my_tuple[s]
(0, 2, 4, 6, 8)
>>> s = slice(None, None, -1)
>>> my_tuple[s]
(9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
As already mentioned, due to tuples' immutability, you cannot assign new values to a slice of a tuple. Attempting to do so will result in a TypeError:
>>> my_tuple[4:5] = [100]
Traceback (most recent call last):
...
my_tuple[4:5] = [100]
~~~~~~~~^^^^^
TypeError: 'tuple' object does not support item assignment
Strings
Strings are immutable sequences of characters. Therefore, like tuples, they cannot be modified after creation. Slicing a string creates a new string.
Retrieving slices
>>> my_string = 'abcdefghij'
>>> my_string[2:5]
'cde'
>>> my_string[:3]
'abc'
>>> my_string[7:]
'hij'
>>> my_string[::2]
'acegi'
>>> my_string[::-1] # reversing the string
'jihgfedcba'
Here are the same operations using slice objects:
>>> s = slice(2, 5)
>>> my_string[s]
'cde'
>>> s = slice(None, 3)
>>> my_string[s]
'abc'
>>> s = slice(7, None)
>>> my_string[s]
'hij'
>>> s = slice(None, None, 2)
>>> my_string[s]
'acegi'
>>> s = slice(None, None, -1)
>>> my_string[s]
'jihgfedcba'
As with tuples, you cannot assign new values to a slice of a string. This will result in a TypeError.
Slicing NumPy arrays
NumPy arrays are a powerful data structure used in data science for numerical computing in Python. You can create one-dimensional and multidimensional arrays. At its core, slicing NumPy arrays is similar to slicing Python lists, but in the case of multidimensional arrays, it can become quite complex.
Slicing one-dimensional NumPy arrays is similar to slicing Python lists, so let's use the same examples as before.
>>> import numpy as np
>>> arr = np.array(range(10))
>>> arr[2:5]
array([2, 3, 4])
>>> arr[:3]
array([0, 1, 2])
>>> arr[7:]
array([7, 8, 9])
>>> arr[::2]
array([0, 2, 4, 6, 8])
>>> arr[::-1] # reversing the array
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
You can use slice objects, too:
# Using slice objects
>>> s = slice(2, 5)
>>> arr[s]
array([2, 3, 4])
>>> s = slice(None, 3)
>>> arr[s]
array([0, 1, 2])
>>> s = slice(7, None)
>>> arr[s]
array([7, 8, 9])
>>> s = slice(None, None, 2)
>>> arr[s]
array([0, 2, 4, 6, 8])
>>> s = slice(None, None, -1)
>>> arr[s]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
In the case of multidimensional NumPy arrays, you can slice each dimension by specifying slices separated by commas. Let's start with two-dimensional arrays:
>>> arr2d = np.array(
... [[0, 1, 2, 3],
... [4, 5, 6, 7],
... [8, 9, 10, 11],
... [12, 13, 14, 15]]
... )
Let's start with slicing rows:
>>> arr2d[1:3]
array([[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Now columns:
>>> arr2d[:, 1:3]
array([[ 1, 2],
[ 5, 6],
[ 9, 10],
[13, 14]])
Now let's slice both rows and columns at the same time, using ranges and slice objects:
>>> arr2d[1:3, 1:3]
array([[ 5, 6],
[ 9, 10]])
>>> row_slice = slice(1, 3)
>>> col_slice = slice(1, 3)
>>> arr2d[row_slice, col_slice]
array([[ 5, 6],
[ 9, 10]])
Slicing a three-dimensional array will of course be a little more complex:
>>> arr3d = np.array([[[ 0, 1, 2], [ 3, 4, 5]],
...                   [[ 6, 7, 8], [ 9, 10, 11]],
...                   [[12, 13, 14], [15, 16, 17]]])
# Slicing along the first axis
>>> arr3d[1:]
array([[[ 6, 7, 8],
[ 9, 10, 11]],
[[12, 13, 14],
[15, 16, 17]]])
# Slicing along the second axis
>>> arr3d[:, 1:]
array([[[ 3, 4, 5]],
[[ 9, 10, 11]],
[[15, 16, 17]]])
# Slicing along the third axis
>>> arr3d[:, :, 1:]
array([[[ 1, 2],
[ 4, 5]],
[[ 7, 8],
[10, 11]],
[[13, 14],
[16, 17]]])
# Using slice objects for three dimensions
>>> slice1 = slice(1, None)
>>> slice2 = slice(None, 1)
>>> slice3 = slice(1, 3)
>>> arr3d[slice1, slice2, slice3]
array([[[ 7, 8]],
[[13, 14]]])
As you can see, you can slice arrays along any dimension and combine slices to access subarrays (subsets of arrays). Note that for multidimensional arrays, using slice objects makes the slices reusable, but it also makes the code more readable.
An important difference between how NumPy arrays and lists work is that slicing a NumPy array creates a view of the original data, not a new, independent array. Note:
>>> arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> reversed_arr = arr[s]
>>> reversed_arr
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
>>> reversed_arr[2:4] = 99
>>> reversed_arr
array([ 9, 8, 99, 99, 5, 4, 3, 2, 1, 0])
>>> arr
array([ 0, 1, 2, 3, 4, 5, 99, 99, 8, 9])
As you can see, reversed_arr is a view, so changing its elements (here, we did so for its slice) means changing the corresponding elements of the original array, arr.
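When an independent array is needed, the slice can be copied explicitly with the copy() method, and np.shares_memory() can be used to check whether two arrays may share data. The sketch below uses illustrative variable names.

```python
import numpy as np

arr = np.arange(10)
view = arr[2:5]                 # basic slicing returns a view
independent = arr[2:5].copy()   # explicit copy with its own data

view[0] = -1         # also changes arr[2]
independent[1] = -2  # leaves arr untouched

print(np.shares_memory(arr, view))         # True
print(np.shares_memory(arr, independent))  # False
print(arr)
```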
The code in the above examples looks far more complex than slicing one-dimensional lists. We could, however, create multi-dimensional lists, and then slicing would become complex, too. Hence, in the case of multidimensional NumPy arrays, the complexity of the code results mainly from the complexity of the array (mainly its dimension) and less from the complexity of slicing itself.
Slicing Pandas dataframes
Pandas is a powerful library for data manipulation and analysis in Python. It's arguably the most frequently used library in Python for data science, and it's difficult to imagine a data scientist working in Python who is not familiar with Pandas.
The most important data structure in Pandas is the dataframe (pd.DataFrame). It represents tabular data in a readable and easy-to-grasp form.
Dataframes allow for flexible indexing and slicing operations. As we did for NumPy arrays, we'll consider slicing both rows and columns of Pandas dataframes. We'll also explore slicing using the .loc and .iloc indexers, the former being a label-based indexer and the latter being a positional indexer.
Let's start by creating a sample dataframe we will work with:
>>> import pandas as pd
>>> data = {
... 'A': range(10),
... 'B': range(10, 20),
... 'C': range(20, 30),
... 'D': range(30, 40)
... }
>>> labels = [
... 'a', 'b', 'c', 'd', 'e',
... 'f', 'g', 'h', 'i', 'j'
... ]
>>> df = pd.DataFrame(data, index=labels)
>>> df
A B C D
a 0 10 20 30
b 1 11 21 31
c 2 12 22 32
d 3 13 23 33
e 4 14 24 34
f 5 15 25 35
g 6 16 26 36
h 7 17 27 37
i 8 18 28 38
j 9 19 29 39
Slicing rows
When indexing and slicing Pandas dataframes, you can use two types of indexers:
- the iloc indexer, which allows for integer-location-based indexing
- the loc indexer, which allows for label-based indexing
The former always takes integers and represents a positional index. In other words, it denotes the position of a row in the dataframe, starting from row 0. The latter works differently: you can name each row using a number or a string, and then use this name, called a label, to access a particular row.
Positional slicing using iloc
You'll see that this type of slicing works in quite a similar way to standard slicing in Python. You can slice rows by providing a range of index positions:
>>> df.iloc[2:5]
A B C D
c 2 12 22 32
d 3 13 23 33
e 4 14 24 34
But you can also use a slice object:
>>> row_slice = slice(2, 5)
>>> df.iloc[row_slice]
A B C D
c 2 12 22 32
d 3 13 23 33
e 4 14 24 34
Label slicing using loc
Label-based slicing is something we haven't discussed above. It uses row labels instead of row positions. This approach may look a little less natural than slicing based on positional index, and it works in almost the same way: it's enough to provide the label of a starting index, the label of a stopping index, and a step. There are two differences: you need to use the indices' labels instead of their positions, and, unlike positional slicing, loc includes the stop label in the result.
You can use a range:
>>> df.loc['c':'e']
A B C D
c 2 12 22 32
d 3 13 23 33
e 4 14 24 34
and a slice object:
>>> row_slice = slice('c', 'e')
>>> df.loc[row_slice]
A B C D
c 2 12 22 32
d 3 13 23 33
e 4 14 24 34
As you can see, the two approaches work in a similar way, except that a loc slice includes its stop label ('e' above) while an iloc slice excludes its stop position. From my experience it follows that the main issue you may encounter while using loc is forgetting that you're supposed to use labels instead of integer positions. Once you grasp the difference, you'll likely find both ways equally significant and useful, just in different scenarios.
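The difference in how the two indexers treat the stop value matters most when the labels are integers, because the same numbers can then select different rows. The dataframes in the sketch below are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': range(10)}, index=list('abcdefghij'))

by_position = df.iloc[2:5]   # positions 2, 3, 4; position 5 is excluded
by_label = df.loc['c':'e']   # labels 'c', 'd', 'e'; label 'e' is included

print(by_position.equals(by_label))  # True

# With integer labels, the same numbers select different rows.
numbered = pd.DataFrame({'A': range(5)}, index=[1, 2, 3, 4, 5])
print(list(numbered.iloc[1:3].index))  # [2, 3]
print(list(numbered.loc[1:3].index))   # [1, 2, 3]
```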
Slicing columns
Slicing columns can be done in a similar manner, also using both iloc and loc.
Positional slicing using iloc
You can slice columns by index position, using both a range:
>>> df.iloc[:, 1:3]
B C
a 10 20
b 11 21
c 12 22
d 13 23
e 14 24
f 15 25
g 16 26
h 17 27
i 18 28
j 19 29
and a slice object:
>>> slice_cols = slice(1, 3)
>>> df.iloc[:, slice_cols]
B C
a 10 20
b 11 21
c 12 22
d 13 23
e 14 24
f 15 25
g 16 26
h 17 27
i 18 28
j 19 29
Label slicing using loc
For columns, a more intuitive and definitely more frequent approach is to use labels – that is, column names. This slicing works in a very similar way as that for rows:
# Slicing columns by column names
>>> df.loc[:, 'B':'C']
B C
a 10 20
b 11 21
c 12 22
d 13 23
e 14 24
f 15 25
g 16 26
h 17 27
i 18 28
j 19 29
You can of course also use a slice object:
>>> slice_cols = slice('B', 'C')
>>> df.loc[:, slice_cols]
B C
a 10 20
b 11 21
c 12 22
d 13 23
e 14 24
f 15 25
g 16 26
h 17 27
i 18 28
j 19 29
Slicing both rows and columns
You can combine row and column slicing to retrieve a specific subset of the original dataframe. You can use either the iloc or the loc indexer, but there's also a way to combine them. I'll show this at the end.
Positional slicing using iloc
Slicing both rows and columns by index position:
>>> df.iloc[2:5, 1:3]
B C
c 12 22
d 13 23
e 14 24
and using slice objects:
>>> row_slice = slice(2, 5)
>>> col_slice = slice(1, 3)
>>> df.iloc[row_slice, col_slice]
B C
c 12 22
d 13 23
e 14 24
Label slicing using loc
Let's slice both rows and columns by using ranges of labels:
>>> df.loc['c':'e', 'B':'C']
B C
c 12 22
d 13 23
e 14 24
and slice objects:
>>> row_slice = slice('c', 'e')
>>> col_slice = slice('B', 'C')
>>> df.loc[row_slice, col_slice]
B C
c 12 22
d 13 23
e 14 24
Joining a range and a slice object
There's no problem with joining a range for columns (rows) with a slice object for rows (columns):
>>> df.loc['c':'e', col_slice]
B C
c 12 22
d 13 23
e 14 24
>>> df.loc[row_slice, 'B':'C']
B C
c 12 22
d 13 23
e 14 24
Joining loc, iloc, ranges and slice objects
If you need to join loc and iloc (for instance, when you need to provide positional indices for rows and names for columns), you cannot do it in one use of either loc or iloc. But you can chain them in whatever configuration you need. Consider the following examples:
>>> df.iloc[2:5].loc[:, 'B':'C']
B C
c 12 22
d 13 23
e 14 24
>>> df.iloc[slice(2, 5)].loc[:, slice('B', 'C')]
B C
c 12 22
d 13 23
e 14 24
>>> df.loc[slice('c', 'e')].loc[:, 'B':'C']
B C
c 12 22
d 13 23
e 14 24
>>> df.loc['c':'e'].iloc[:, 1:3]
B C
c 12 22
d 13 23
e 14 24
Slicing MultiIndex dataframes
Pandas supports MultiIndex for both rows and columns. Simply put, you can treat a MultiIndex dataframe as a dataframe with multiple levels of indexing, so with a hierarchical structure.
Slicing such dataframes is a little more complex, which is no wonder, as the dataframes themselves have a more complex structure.
First, let's create a dataframe with a MultiIndex to use in our examples:
>>> import pandas as pd
>>> import numpy as np
>>> arrays = [
... np.array(['bar', 'bar', 'baz',
... 'baz', 'foo', 'foo',
... 'qux', 'qux']),
... np.array(['one', 'two', 'one',
... 'two', 'one', 'two',
... 'one', 'two'])
... ]
>>> df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
>>> df # doctest: +SKIP
0 1 2 3
bar one -0.577668 -1.139747 0.496755 -0.446453
two -0.605031 1.680358 0.883295 1.708469
baz one 1.451471 1.550376 0.696019 0.448952
two -0.426963 0.832027 1.228606 -0.951508
foo one -0.153662 0.119935 -1.082593 0.574666
two 1.469598 1.481410 -0.355338 -0.420283
qux one -1.538192 -0.099278 -0.379863 0.998112
two 0.087206 1.406255 -0.520470 0.703503
(We had to skip the doctest because in every run of the code, the dataframe will contain different pseudorandom numbers.)
The df dataframe has two levels of indexing for the rows: the first level contains four indices ('bar', 'baz', 'foo', and 'qux'), and the second level contains two indices: 'one' and 'two'. You can use slicing for each level and for both of them.
Here's how you can select rows by the first level of the index:
>>> df.loc['bar'] # doctest: +SKIP
0 1 2 3
one -0.577668 -1.139747 0.496755 -0.446453
two -0.605031 1.680358 0.883295 1.708469
In this example, df.loc['bar'] selects all rows where the first level of the index is 'bar'. The same way, you can slice the dataframe:
>>> df.loc['bar':'baz'] # doctest: +SKIP
0 1 2 3
bar one -0.577668 -1.139747 0.496755 -0.446453
two -0.605031 1.680358 0.883295 1.708469
baz one 1.451471 1.550376 0.696019 0.448952
two -0.426963 0.832027 1.228606 -0.951508
You'll get the same effect when using df.loc[slice('bar', 'baz')].
Here comes a slightly more advanced thing: in order to index by the second index level without specifying the first level, you need to use the .xs() method:
>>> df.xs('one', level=1) # doctest: +SKIP
0 1 2 3
bar -0.577668 -1.139747 0.496755 -0.446453
baz 1.451471 1.550376 0.696019 0.448952
foo -0.153662 0.119935 -1.082593 0.574666
qux -1.538192 -0.099278 -0.379863 0.998112
>>> df.xs('two', level=1) # doctest: +SKIP
0 1 2 3
bar -0.605031 1.680358 0.883295 1.708469
baz -0.426963 0.832027 1.228606 -0.951508
foo 1.469598 1.481410 -0.355338 -0.420283
qux 0.087206 1.406255 -0.520470 0.703503
Note that a regular range won't work here:
>>> df.xs('one':'two', level=1)
  File "<stdin>", line 1
df.xs('one':'two', level=1)
^
SyntaxError: invalid syntax
A slice object, however, will:
>>> df.xs(slice('one', 'two'), level=1)# doctest: +SKIP
0 1 2 3
bar -0.577668 -1.139747 0.496755 -0.446453
bar -0.605031 1.680358 0.883295 1.708469
baz 1.451471 1.550376 0.696019 0.448952
baz -0.426963 0.832027 1.228606 -0.951508
foo -0.153662 0.119935 -1.082593 0.574666
foo 1.469598 1.481410 -0.355338 -0.420283
qux -1.538192 -0.099278 -0.379863 0.998112
qux 0.087206 1.406255 -0.520470 0.703503
Here, we got the whole dataframe, as we have only two second-level labels, but I'm sure you get the point. Note that the resulting dataframe doesn't have a MultiIndex, so it can be difficult to process: by default, .xs() drops the level it selects on (its drop_level parameter defaults to True), which in most cases is rather unfortunate.
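To keep both index levels when selecting on the second one, two options are sketched below: passing drop_level=False to .xs(), or using .loc with pd.IndexSlice. The example uses deterministic data instead of random numbers, and the variable names are assumptions made for the illustration. The .loc variant assumes a sorted MultiIndex, which from_product yields here.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
)
df = pd.DataFrame(np.arange(32).reshape(8, 4), index=index)

dropped = df.xs('one', level=1)                 # second level is dropped
kept = df.xs('one', level=1, drop_level=False)  # MultiIndex is retained
via_loc = df.loc[pd.IndexSlice[:, 'one'], :]    # also keeps both levels

print(dropped.index.nlevels)  # 1
print(kept.index.nlevels)     # 2
print(via_loc)
```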
You can also select a single row by providing labels for both levels of the index:
>>> df.loc[('bar', 'one')] # doctest: +SKIP
0 -0.577668
1 -1.139747
2 0.496755
3 -0.446453
Name: (bar, one), dtype: float64
This selects the specific row where the first level of the index is 'bar' and the second level is 'one'.
Summary
As you can see, Pandas dataframes support powerful and flexible slicing operations using iloc for position-based indexing and loc for label-based indexing. You can slice rows, columns, or both to retrieve subsets of dataframes. You can use ranges and slice objects, but remember that the latter will mean slower code if you use a particular slice only once. In one situation, however, that is, with the .xs() method, regular ranging doesn't work and you need to use a slice object.
Since we're talking about Pandas dataframes, you should remember an important limitation of slicing columns. Whenever you use a range or a slice for columns, you assume that:
- in the case of positional slicing, these particular columns are always located in the same positions in the dataframe, and
- in the case of label-based slicing, the columns have these particular names indeed, and they are positioned in the dataframe in the same order as the one when you defined the slice.
In data science, these can be dangerous assumptions. Thus, whenever possible, you should prefer providing a sequence of column names instead of their slices. That way, it doesn't really matter whether the dataframe has this particular structure: what matters is that it has these particular columns, and it doesn't matter where they're located. Look (df here is the ten-row dataframe with columns A to D created at the beginning of this section):
>>> df_reordered = df.filter(['B', 'C', 'A', 'D'])
>>> cols = ['B', 'C', 'D']
>>> df_reordered.loc[:, cols]
B C D
a 10 20 30
b 11 21 31
c 12 22 32
d 13 23 33
e 14 24 34
f 15 25 35
g 16 26 36
h 17 27 37
i 18 28 38
j 19 29 39
However, when you use a range (or a slice object), you can be surprised:
>>> df_reordered.loc[:, 'B':'D']
B C A D
a 10 20 0 30
b 11 21 1 31
c 12 22 2 32
d 13 23 3 33
e 14 24 4 34
f 15 25 5 35
g 16 26 6 36
h 17 27 7 37
i 18 28 8 38
j 19 29 9 39
This teaches us an important lesson for data science: Whenever possible, avoid assuming a specific structure for the data. Instead, use an approach that will provide (or do) what you need without making such assumptions.
Thus, whenever possible, avoid assuming a specific structure of a dataframe. Instead, provide a sequence of the required rows or columns.
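One way to make this habit explicit is a small helper that selects columns by name and fails early when one is missing. This is only a sketch: the select_columns function and the two small dataframes are hypothetical names introduced for illustration, not part of Pandas or of the examples above.

```python
import pandas as pd


def select_columns(df: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    """Return the required columns, in the requested order.

    Hypothetical helper: it raises a KeyError listing the missing
    columns instead of relying on the dataframe's column order.
    """
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f'Missing columns: {missing}')
    return df.loc[:, required]


# Two dataframes with the same columns in a different order.
df_abcd = pd.DataFrame(
    {'A': [0, 1], 'B': [10, 11], 'C': [20, 21], 'D': [30, 31]}
)
df_bcad = df_abcd[['B', 'C', 'A', 'D']]

cols = ['B', 'C', 'D']
from_abcd = list(select_columns(df_abcd, cols).columns)
from_bcad = list(select_columns(df_bcad, cols).columns)
sliced_bcad = list(df_bcad.loc[:, 'B':'D'].columns)

print(from_abcd)    # ['B', 'C', 'D']
print(from_bcad)    # ['B', 'C', 'D']
print(sliced_bcad)  # ['B', 'C', 'A', 'D']
```

The name-based selection returns the same columns for both dataframes, while the label slice picks up column A as soon as the order changes.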
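One more thing. As was the case with NumPy arrays, a slice of a Pandas dataframe can be a view, not a new dataframe. Whether it is depends on the Pandas version and settings: the output below comes from a setup without Copy-on-Write, while with Copy-on-Write enabled (the default from Pandas 3.0), modifying the slice leaves the original dataframe untouched: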
>>> import pandas as pd
>>> df = pd.DataFrame(
... {
... 'A': range(3),
... 'B': range(3, 6),
... 'C': range(6, 9),
... 'D': range(9, 12)
... },
... index=['a', 'b', 'c']
... )
>>> df
A B C D
a 0 3 6 9
b 1 4 7 10
c 2 5 8 11
>>> df_slice = df.loc['a':'c', 'B':'D']
>>> df_slice
B C D
a 3 6 9
b 4 7 10
c 5 8 11
>>> df_slice.loc[:, 'B'] = 99
>>> df_slice
B C D
a 99 6 9
b 99 7 10
c 99 8 11
>>> df
A B C D
a 0 99 6 9
b 1 99 7 10
c 2 99 8 11
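If you need a slice that is independent of the original dataframe regardless of the Pandas version, one option is to copy it explicitly with the .copy() method. The sketch below rebuilds the same small dataframe; df_copy is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'A': range(3),
        'B': range(3, 6),
        'C': range(6, 9),
        'D': range(9, 12),
    },
    index=['a', 'b', 'c'],
)

# .copy() returns a new dataframe with its own data,
# so later changes to the slice cannot reach df.
df_copy = df.loc['a':'c', 'B':'D'].copy()
df_copy.loc[:, 'B'] = 99

print(df_copy['B'].tolist())  # [99, 99, 99]
print(df['B'].tolist())       # [3, 4, 5]
```

The explicit copy costs extra memory, so it makes most sense when you really intend to modify the slice.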
Slicing in custom sequences
You can implement a slicing mechanism for custom sequence-like objects (data containers). In order to do so, you need to implement in your class the special methods required for indexing. That way, such a class will support slicing in a way that integrates with Python's built-in slicing syntax.
These methods are the __getitem__ and __setitem__ methods. The former should allow for handling slice objects. If your sequence is mutable, the __setitem__ method should handle them as well.
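Before moving to a fuller example, here is a minimal sketch of what these two methods receive. The MiniSequence class is a hypothetical name used only for illustration; it simply wraps a list and delegates to it.

```python
class MiniSequence:
    """Hypothetical wrapper around a list, for illustration only."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # seq[1:4] arrives here as slice(1, 4, None).
            return MiniSequence(self._items[index])
        # Anything else is treated as a single position.
        return self._items[index]

    def __setitem__(self, index, value):
        # Lists accept both integers and slice objects,
        # so the assignment can be delegated as is.
        self._items[index] = value

    def __repr__(self):
        return f'MiniSequence({self._items})'


seq = MiniSequence([10, 20, 30, 40, 50])
print(seq[1:4])  # MiniSequence([20, 30, 40])
print(seq[0])    # 10

seq[1:3] = [21, 31]
print(seq)       # MiniSequence([10, 21, 31, 40, 50])
```

The range syntax and an explicit slice object are interchangeable here: seq[1:4] and seq[slice(1, 4)] both reach __getitem__ with the same slice object.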
Creating a sliceable custom sequence
Let's implement a labeled list. To make its name consistent with named tuples, we'll call it NamedList. The point is to enable the user to index and slice its instances using not only positional indices, just like a regular list, but also labels. You'll see that implementing this mechanism will add some complexity to the class.
This is an example implementation of such a class:
from typing import Any, Union

IndexType = Union[int, slice, str]


class IncorrectLabelsError(Exception):
    """Exception raised for incorrect labels in NamedList."""


class NamedList:
    """A labeled list.

    This custom sequence class supports both positional
    and label-based indexing.
    """

    def __init__(self, values: list[Any], labels: list[str]):
        if len(values) != len(labels):
            raise IncorrectLabelsError(
                'Values and labels must be of the same length.'
            )
        self.values = values
        self.labels = labels
        self.label_to_index = {
            label: idx
            for idx, label in enumerate(labels)
        }

    def _convert_label_slice_to_index_slice(
        self,
        label_slice: slice
    ) -> slice:
        """Convert a label-based slice to an index-based slice."""
        start_label = label_slice.start
        stop_label = label_slice.stop
        step = label_slice.step
        start_idx = (
            self.label_to_index[start_label]
            if start_label is not None else None
        )
        # The stop label is included in the slice, hence the + 1.
        stop_idx = (
            self.label_to_index[stop_label] + 1
            if stop_label is not None else None
        )
        return slice(start_idx, stop_idx, step)

    def __getitem__(
        self,
        index: Union[IndexType, list[str]]
    ) -> Union[Any, 'NamedList']:
        if isinstance(index, int):
            return self.values[index]
        elif isinstance(index, slice):
            if self._is_start_or_stop_str(index):
                index = self._convert_label_slice_to_index_slice(index)
            return NamedList(self.values[index], self.labels[index])
        elif isinstance(index, str):
            idx = self.label_to_index[index]
            return self.values[idx]
        elif isinstance(index, list):
            if all(isinstance(i, str) for i in index):
                idxs = [self.label_to_index[i] for i in index]
                return NamedList([self.values[i] for i in idxs], index)
            else:
                raise TypeError(
                    'All elements of the list with labels '
                    'must be strings.'
                )
        else:
            raise TypeError('Invalid argument type.')

    def __setitem__(self, index: IndexType, value: Any) -> None:
        if isinstance(index, int):
            self.values[index] = value
        elif isinstance(index, slice):
            if self._is_start_or_stop_str(index):
                index = self._convert_label_slice_to_index_slice(index)
            if isinstance(value, NamedList):
                self.values[index] = value.values
                self.labels[index] = value.labels
                # The labels may have changed, so refresh the mapping.
                self.label_to_index = {
                    label: idx
                    for idx, label in enumerate(self.labels)
                }
            elif isinstance(value, list):
                # A plain list must have as many items as the slice;
                # otherwise values and labels would go out of sync.
                self.values[index] = value
            else:
                raise TypeError(
                    'Assigned value must be a list or NamedList.'
                )
        elif isinstance(index, str):
            idx = self.label_to_index[index]
            self.values[idx] = value
        else:
            raise TypeError('Invalid argument type.')

    @staticmethod
    def _is_start_or_stop_str(index: slice) -> bool:
        start_is_str = isinstance(index.start, str)
        stop_is_str = isinstance(index.stop, str)
        return start_is_str or stop_is_str

    def __repr__(self) -> str:
        return (
            f'NamedList(values={self.values},'
            f' labels={self.labels})'
        )
Let's use the class:
>>> values = [10, 20, 30, 40, 50]
>>> labels = ['a', 'b', 'c', 'd', 'e']
>>> nl = NamedList(values, labels)
>>> nl['c']
30
>>> nl[['b', 'e']]
NamedList(values=[20, 50], labels=['b', 'e'])
Now, let's see how it works with slices:
>>> nl[2:4]
NamedList(values=[30, 40], labels=['c', 'd'])
>>> nl[slice('a', 'c')]
NamedList(values=[10, 20, 30], labels=['a', 'b', 'c'])
And this is how setting works, including slices:
>>> nl['b'] = 25
>>> nl[1] = 15
>>> nl[1:3] = NamedList([35, 45], ['b', 'c'])
>>> nl
NamedList(values=[10, 35, 45, 40, 50], labels=['a', 'b', 'c', 'd', 'e'])
Note that creating a slice of a NamedList instance creates a new object, not a view:
>>> values = [10, 20, 30, 40, 50]
>>> labels = ['a', 'b', 'c', 'd', 'e']
>>> nl = NamedList(values, labels)
>>> nl_slice = nl[slice('a', 'd')]
>>> nl_slice
NamedList(values=[10, 20, 30, 40], labels=['a', 'b', 'c', 'd'])
>>> nl_slice['b'] = 99
>>> nl_slice
NamedList(values=[10, 99, 30, 40], labels=['a', 'b', 'c', 'd'])
>>> nl
NamedList(values=[10, 20, 30, 40, 50], labels=['a', 'b', 'c', 'd', 'e'])
As you see, updating a slice didn't affect the original NamedList. Thus, in this context, our named lists behave in the same way as lists and tuples – a slice of an instance creates a new object – and not like NumPy arrays and Pandas dataframes, whose slices can be views rather than new objects.
Let's analyze the code. The class supports both positional and label-based indexing and slicing. This functionality is useful when you need to work with sequences with elements that have meaningful labels.
Note that we implemented a custom exception class, IncorrectLabelsError, raised when the labels provided by the user don't match the values.
The .__init__() method initializes the NamedList with both values and labels. It checks if the values and labels lists have the same length; if not, it raises an IncorrectLabelsError. The method also creates a dictionary label_to_index, needed to map labels to their corresponding indices.
In order to convert label slices to positional-index slices, we implemented a ._convert_label_slice_to_index_slice() method. It maps the start and stop labels to their respective indices and creates a new slice object using these indices. It adds 1 to the stop index, so the stop label is included in the result, as the nl[slice('a', 'c')] example above shows. Without this method, the class wouldn't work with label slices – though it would work with index slices. Note that we made this method private, as the user isn't supposed to use it – it's intended for internal use only.
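The conversion can also be sketched as a standalone function, which makes the inclusive stop easier to see in isolation. The label_slice_to_index_slice function below is a hypothetical rewrite for illustration, not the method from the class itself.

```python
from typing import Optional


def label_slice_to_index_slice(
    labels: list[str],
    label_slice: slice
) -> slice:
    """Translate a label-based slice into a positional one.

    The stop label is included in the resulting slice.
    """
    positions = {label: idx for idx, label in enumerate(labels)}
    start = label_slice.start
    stop = label_slice.stop
    start_idx: Optional[int] = (
        positions[start] if start is not None else None
    )
    # Adding 1 makes the stop label part of the result.
    stop_idx: Optional[int] = (
        positions[stop] + 1 if stop is not None else None
    )
    return slice(start_idx, stop_idx, label_slice.step)


labels = ['a', 'b', 'c', 'd', 'e']
values = [10, 20, 30, 40, 50]

index_slice = label_slice_to_index_slice(labels, slice('b', 'd'))
print(index_slice)          # slice(1, 4, None)
print(values[index_slice])  # [20, 30, 40]

open_start = label_slice_to_index_slice(labels, slice(None, 'c'))
print(values[open_start])   # [10, 20, 30]
```

A missing start or stop stays None, so open-ended label slices behave like open-ended positional ones.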
The .__getitem__() method handles the retrieval of elements using various types of indices:
- Integer index: The method returns the value at the specified position.
- Slice: The method converts label-based slices to index-based slices if needed. If the slice is already positional, it directly slices the values and labels lists. Eventually, the method returns a new NamedList with the sliced values and labels.
- String label: The method uses the label to find the corresponding index, returning the value at that index.
- List of labels: The method converts the labels to indices, returning a new NamedList with the corresponding values and labels.
- If the type of the provided index is not supported, the method raises a TypeError.
Now, let's have a look at the .__setitem__() method, which handles assignment of values using various types of indices and slices:
- Integer index: The method sets the value at the specified position.
- Slice: The method converts label-based slices to index-based slices if needed. It supports assigning values from another NamedList or a standard list.
- String label: The method uses a string label to find the corresponding index, setting the value at that index.
- If the type of the provided index or the type of the assigned value is not supported, the method raises a TypeError.
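One detail worth keeping in mind when implementing slice assignment: with Python lists, the replacement doesn't have to be as long as the slice it replaces, so the container can grow or shrink. Any mapping from labels to positions, such as label_to_index, is then outdated and has to be rebuilt. The sketch below uses two plain lists rather than the class itself.

```python
values = [10, 20, 30, 40, 50]
labels = ['a', 'b', 'c', 'd', 'e']

# Replace two items with three: both lists grow by one element.
values[1:3] = [21, 22, 23]
labels[1:3] = ['b1', 'b2', 'b3']

# A mapping built before the assignment would still point
# 'd' to position 3, so it is rebuilt from the current labels.
label_to_index = {label: idx for idx, label in enumerate(labels)}

print(values)               # [10, 21, 22, 23, 40, 50]
print(labels)               # ['a', 'b1', 'b2', 'b3', 'd', 'e']
print(label_to_index['d'])  # 4
```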
Note the private static method called ._is_start_or_stop_str(). It's a helper method that avoids repeating the same code in the .__getitem__() and .__setitem__() methods. Thanks to it, the code of these methods is shorter and cleaner. The method has a self-explanatory and informative name, which makes the code readable. We could avoid using it by using the following line:
if isinstance(index.start, str) or isinstance(index.stop, str):
I didn't do this because lines that long (76 characters, counting the code and the line's indent) wouldn't look good in this article. I could split them:
if (
isinstance(index.start, str)
or isinstance(index.stop, str)
):
but I decided that a private static method would make this cleaner. Normally, however, don't be afraid of such long lines, as they fit within the strictest Python line length limit of 79 characters.
Finally, we implemented the .__repr__() method to provide a string representation of an instance.
NOTE: The NamedList class extends Python's slicing and indexing to support both positional and label-based access to data. The implemented slicing functionality could be particularly useful in data science when data elements have meaningful labels, enabling more intuitive and readable code. However, don't use this class! It's definitely not optimized, and thus it'd be far too slow. I showed it here only to present how we can implement a slicing mechanism for a custom data container.
Conclusion
In this article, we have explored slicing techniques available in Python for various data structures, from basic lists and tuples to more complex NumPy arrays and Pandas dataframes. We've seen how powerful and versatile slicing can be, allowing for efficient and intuitive data manipulation.
We focused specifically on slicing, which involves selecting continuous subsets of elements using the start:stop:step syntax or slice objects. Slicing is not indexing, which includes methods like Boolean indexing and fancy indexing of NumPy arrays and Pandas dataframes and series. Don't confuse these two techniques, as they work differently and serve different purposes.
When you can use a range instead of a slice object, do it: it'll be faster and often more readable (though not always, as we've seen with three-dimensional NumPy arrays). Use slice objects when you need to keep a slice and reuse it. Interestingly, we found a situation in which a range didn't work though the corresponding slice object did: the pd.DataFrame.xs() method, which we used for slicing the second level of a Pandas dataframe with a two-level MultiIndex – or, more generally, for slicing a specific level of a Pandas dataframe with MultiIndex.
Importantly, we've discussed the limitations of slicing. In data science, we should always strive to implement as robust code as possible, and sometimes slicing is not robust and can lead to unexpected behavior. This is because slicing assumes that an array or a dataframe contains columns or rows in a particular order, and changing this order can change the output of slicing. Thus, whenever possible, it's better to use direct column names instead of their slice, unless you're certain that such a slice will always be correct.
The same scenario for rows is much less dangerous, however, as we're seldom as attached to particular rows as we are to columns. It can, however, happen when rows have meaningful labels – for instance, country names. You can use the following rule: if after transposing a dataframe, you get a meaningful dataframe, row labels are as important as column labels. In such a case, think twice before slicing such rows.
We also discussed how to implement slicing in a custom data container, the NamedList class. We implemented a mechanism that extends typical Python slicing to support both positional and label-based indexing. However, please note that this class is not optimized for performance and is intended purely for educational purposes. I hope this example helped you grasp how to proceed when you need to implement slicing in your custom data containers.
Thanks for reading, and happy slicing!