Strings 1#

What is a String?#

We have seen and used most of the language constructs Python offers.

The next step in programming is how to represent data.

We have seen so far a number of basic data types: integers, floats, and booleans.

However, we will need more powerful data types to describe and manipulate data. For instance, if we want to calculate the average of a list of integers, the basic data types are not sufficient.

We will start with strings. A string is a (non-basic) data type (str) in Python.

A string represents a sequence of characters.

You can think of a sequence as a list of elements; we will see lists later on.

One of the operations that we can perform on strings is selecting one of the characters, via indexing.

bike : str = 'gazelle'
letter = bike[1]
letter
'a'

The second statement selects character number 1 from bike and assigns it to letter.

The expression in square brackets is called an index.

The index indicates which character in the sequence you want (hence the name).

The index 1 does not yield the first letter of gazelle, but the second.

The first letter of a string is obtained by index 0.

letter = bike[0]
letter
'g'

So g is the 0th letter of 'gazelle', a is the 1st letter, z is the 2nd letter, and so on.

The following table presents the index of each letter in the string 'gazelle'.

index   0  1  2  3  4  5  6
letter  g  a  z  e  l  l  e
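The same table can be produced by the program itself. The following is a minimal sketch; the name index_table is only illustrative, and the number 7 is simply the number of characters in 'gazelle'.

```python
bike: str = 'gazelle'

# Pair every valid index (0 up to and including 6) with its character.
index_table: list = []
for i in range(7):
    index_table.append((i, bike[i]))

for i, letter in index_table:
    print(i, letter)
```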

As an index you can use an expression that contains variables and operators:

i : int = 0
letter = bike[i]
print(letter)
i += 1
letter = bike[i]
print(letter)
i += 1
letter = bike[i]
print(letter)
g
a
z

The value of the index must be an integer.

letter = bike[1.5]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 letter = bike[1.5]

TypeError: string indices must be integers

Length of a String#

There are more operations that we can apply on a sequence, such as calculating the length of a sequence (or string).

len is a built-in function to obtain the length of a sequence, and thus of a string.

len(bike)
len('gazelle')

Because the first letter is accessed via index 0, the last letter is accessed via index len - 1, and not via len as you might have expected.

length : int = len(bike)
bike[length]      # IndexError: string index out of range
bike[length - 1]  # 'e', the last letter
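To see both cases in one run without the program stopping, the failing access can be wrapped in try/except. This is only a sketch; the names last_letter and out_of_range are illustrative and do not appear elsewhere in this text.

```python
bike: str = 'gazelle'
length: int = len(bike)

# The valid indices run from 0 up to and including length - 1.
last_letter: str = bike[length - 1]
print(last_letter)

# Indexing with length itself falls just outside the string.
try:
    bike[length]
    out_of_range = False
except IndexError as error:
    out_of_range = True
    print('IndexError:', error)
```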

Traversal with a for Loop#

Programs often involve processing a string by reading its characters one by one.

Often they start at the beginning, select each character, do something to the selected character, and continue until the end of the string.

This pattern of processing is called a traversal of a sequence (or string).

One way to write a traversal is with a while loop. Note that we need an explicit iterator in order to access all elements of the sequence (or characters of the string).

The next cell contains a correct implementation for iterating over a string with a while loop.

index : int = 0
while index < len(bike):
    letter = bike[index]
    print(letter)
    index += 1

However, using an explicit iterator often introduces serious mistakes in programs. For example, programmers start at the wrong index (1 instead of 0) or terminate too early or too late.

So-called out-of-bounds errors and off-by-one errors are a well-known source of serious security vulnerabilities.

The code below contains two such errors: it starts at index 1, so the first character is skipped, and the loop condition allows index to reach len(bike), which raises an IndexError.

index : int = 1
while index <= len(bike):
    letter = bike[index]
    print(letter)
    index += 1
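One way to observe the effect of both mistakes is to record what the faulty loop actually reads. The sketch below keeps the faulty loop unchanged and only adds bookkeeping; the names visited and crashed are illustrative.

```python
bike: str = 'gazelle'

visited: list = []       # characters the faulty loop manages to read
crashed: bool = False

index: int = 1                       # mistake 1: should start at 0
try:
    while index <= len(bike):        # mistake 2: should be index < len(bike)
        visited.append(bike[index])
        index += 1
except IndexError:
    # Reached when index equals len(bike).
    crashed = True

print(visited)   # the first letter 'g' is missing
print(crashed)
```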

Another (and safer) way of writing a traversal is using a for loop.

A for loop has an implicit iterator, in the cell below represented by the variable letter.

for letter in "sparta":
    print(letter)

Each time the loop is executed, the next character in the string is assigned to the iterator (variable) letter.

The loop continues until no characters are left.

The following example shows how to use concatenation (string addition) and a for loop to generate a list of names in alphabetical order.

prefixes : str = 'JKLMNOPQ'
suffix : str = 'ack'

for letter in prefixes:
    print(letter + suffix)
code = 'Esxzcxxhyezrzy wyzyaxsz zyxbxoxrxxn yinz zyLxexxeuzwaryxdexyzn'

# Remove this line and add your code here

String Slices#

We are now able to select individual characters of a string and to iterate over all characters of a string, but sometimes we want just a part (segment) of a string.

A segment of a string is called a slice.

A slice is obtained by giving a range of indices.

In the next cell, we show how we can obtain the segments Data and Science from the given string.

ds_str : str = 'Data Science'

ln : int = len(ds_str)
print(ln)
data : str = ds_str[0:4]
science : str = ds_str[5:ln]

print(data)
print(len(data))
print(science)
print(len(science))

The operator [n:m] returns the part of the string from the n-th character to the m-th character, including the first but excluding the last.
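A consequence of including index n but excluding index m is that, for indices within the string, the slice contains exactly m - n characters. A small sketch to check this; the concrete values of n and m are chosen only for illustration.

```python
ds_str: str = 'Data Science'

n: int = 5
m: int = 12
segment: str = ds_str[n:m]

print(segment)
print(len(segment) == m - n)   # the slice holds m - n characters
print(ds_str[n])               # character at index n is included
print(ds_str[m - 1])           # last included character is at index m - 1
```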

If you omit the first index (before the colon), the slice starts at the beginning of the string. If you omit the second index, the slice goes to the end of the string.

Take care with the start and end indices of a slice; they are a frequent source of errors.

ds_str : str = 'Data Science'

data : str = ds_str[:4]
science : str = ds_str[5:]

print(data)
print(science)

If the first index is greater than or equal to the second, the result is an empty string, represented by two quotation marks:

ds : str = 'Data Science'

data : str = ds[4:4]

data

Strings are Immutable#

What is meant by immutable?

The [] operator cannot be used on the left-hand side of an assignment.

In other words, it is not possible to change an existing string: strings are immutable.

greeting : str = 'Hello Data Scientist'
greeting[0] = 'h'

If you want to change a string you must create a new string.

In the cell below, we create a new string new_greeting by concatenating the letter h with the slice consisting of all characters of the original string except the first character.

The original string is not changed.

print(greeting)
new_greeting : str = 'h' + greeting[1:]
print(new_greeting)
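Both observations can be checked together: the attempted assignment fails with a TypeError, and building a new string leaves the original untouched. A minimal sketch, in which the name assignment_failed is illustrative:

```python
greeting: str = 'Hello Data Scientist'

try:
    greeting[0] = 'h'            # strings do not support item assignment
    assignment_failed = False
except TypeError as error:
    assignment_failed = True
    print('TypeError:', error)

# Build a new string instead of changing the existing one.
new_greeting: str = 'h' + greeting[1:]

print(greeting)       # still the original text
print(new_greeting)
```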

Searching#

Finding a specific element in a long list can be tedious: in principle, you have to inspect the elements one by one until you find the one you are looking for.

The next cell shows a few lines of code that mimic this search by looking for a specific letter in a word.

def find(word : str, letter : str) -> int:
    """looks at which position the letter appears first, 
    if the letter does not appear in the string -1 is returned"""
    index : int = 0
    
    while index < len(word):
        if word[index] == letter:
            return index
        index += 1
        
    return -1

find('data science', 'a')

The function find is in fact the inverse of the [] operator.

Instead of taking an index and extracting the corresponding character, it takes a character and finds the index where that character appears.

If the character is not found, the function returns -1.

If word[index] == letter, the function breaks out of the loop and returns immediately.

If the character does not appear in the string, the program exits the loop normally and returns -1.

This pattern of computation —traversing a sequence and returning when we find what we are looking for— is called a search.

Is it possible to write this function more compactly using a for loop?

def find(word : str, letter : str) -> int:
    """looks at which position the letter appears first, 
    if the letter does not appear in the string -1 is returned"""
    index : int = 0
    
    for char in word:
        if char == letter:
            break
        index += 1
        
    if index >= len(word):
        return -1
    else:
        return index

find('data science', 'z')
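The for loop version above still maintains its own index. One possible shorter variant, shown here only as a sketch, uses the built-in function enumerate, which has not been introduced in this text: it yields each character together with its index. The name find_compact is hypothetical and is chosen so that it does not replace find.

```python
def find_compact(word: str, letter: str) -> int:
    """return the first index at which letter appears in word, or -1"""
    # enumerate pairs every character with its position in word.
    for index, char in enumerate(word):
        if char == letter:
            return index
    return -1


print(find_compact('data science', 'a'))
print(find_compact('data science', 'z'))
```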
info = 'Ticket reference: 9090873982'

# Remove this line and add your code here

Looping and Counting#

The following program counts the number of times the letter e appears in a string.

word : str = 'gazelle'
count : int = 0

for letter in word:
    if letter == 'e':
        count += 1
        
print(count)

This program demonstrates another pattern of computation called a counter.

The variable count is initialized to 0 and then incremented each time the letter is encountered.

When the loop exits, count contains the result —the total number of e’s in the word gazelle.
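The counter pattern can be wrapped in a function so that it works for any word and any letter. This is a sketch under the assumption that a function is wanted; the name count_letter is illustrative. The built-in string method count gives the same answer and is used here only as a cross-check.

```python
def count_letter(word: str, letter: str) -> int:
    """return how many times letter appears in word"""
    count: int = 0
    for char in word:
        if char == letter:
            count += 1
    return count


print(count_letter('gazelle', 'e'))
print('gazelle'.count('e'))   # cross-check with the string method count
```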


The in Operator#

The word in is a boolean operator that takes two strings and returns True if the first appears as a substring in the second.

'zel' in 'gazelle'
'par' in 'gazelle'

For example, the following function prints all the letters from word1 that also appear in word2.

def in_both(word1 : str, word2 : str):
    """prints letters that appear in both words"""
    for letter in word1:
        if letter in word2:
            print(letter)

in_both('trek', 'gazelle')
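If the common letters are needed for further processing rather than only printed, a variant can collect them in a new string. The function below is a sketch with a hypothetical name, common_letters; note that a letter occurring twice in word1 is also collected twice.

```python
def common_letters(word1: str, word2: str) -> str:
    """return the letters of word1 that also appear in word2"""
    result: str = ''
    for letter in word1:
        if letter in word2:
            result += letter   # concatenation creates a new string each time
    return result


print(common_letters('trek', 'gazelle'))
print(common_letters('gazelle', 'trek'))
```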

String Comparison#

An important operation on strings is checking whether strings are equal or not. If you have to search for a certain word in a text or dictionary you will need such an operation.

Python offers a number of relational operators that work on strings, for instance to check whether two strings are equal.

word : str = input('> ')
if word == 'apple':
    print('Hmmm, an apple!')

Other relational operations are useful for putting words in alphabetical order.

word : str = input('> ')
if word < 'apple':
    print('Your word, ' + word + ', comes before apple!')
elif word > 'apple':
    print('Your word, ' + word + ', comes after apple!')
else:
    print('Hmmm, an apple!')

Python does not treat uppercase and lowercase letters the same as people do.

All the uppercase letters come before all the lowercase letters, so:

Your word, Pineapple, comes before apple.

A common way to address this problem is to convert strings to a standard format, such as all lowercase, before doing string comparison.
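As a sketch of that approach, the string method lower can be applied to both words before they are compared. The helper name comes_before is hypothetical.

```python
def comes_before(word1: str, word2: str) -> bool:
    """compare two words alphabetically, ignoring upper and lower case"""
    return word1.lower() < word2.lower()


# Plain comparison: the uppercase 'P' is ordered before the lowercase 'a'.
print('Pineapple' < 'apple')

# Comparison after converting both words to lowercase.
print(comes_before('Pineapple', 'apple'))
```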

word1 : str = 'purple'
word2 : str = 'green'
word3 : str = 'red'

# Remove this line and add your code here

String Methods#

Strings are an example of a Python object.

For now, an object is equivalent to a value.

However, it has more information than a normal value. An object contains data and a set of methods.

Methods are functions that are built into the object.

Unlike normal functions, methods are called with a slightly different syntax.

The Python function dir lists the names of all attributes of an object, including its methods. Let’s see the methods that an object of type string has.

text : str = 'Data Science'
dir(text)

As you can see, Python provides a whole collection of useful methods on strings.

Calling a method is similar to calling a function. The only difference is that you first write the name of the variable and then the name of the method, separated by a dot: something like var.method().

For instance, instead of the function syntax upper(word), we use the method syntax word.upper().

word : str = 'gazelle'
new_word : str = word.upper()
new_word

This form of dot notation specifies the name of the method, upper, and the name of the string to apply the method to, word.

The empty parentheses indicate that this method takes no arguments.

A method call is called an invocation; in this case, we would say that we are invoking the method upper on word.

As it turns out, there is a string method named find that is remarkably similar to the function we wrote.

word : str = 'gazelle'
index : int = word.find('z')
index

Actually, the find method is more general than our function; it can find substrings, not just characters.

Furthermore, the method can also be invoked directly on a string literal.

index : int = 'sparta'.find('par')
index

Besides the substring to look for, the find method can take one or two optional arguments.

The first optional argument is the index where the search in the string object should start.

The second optional argument is the index where the search in the string object should stop.

name : str = 'bob'
name.find('b', 1, 2)

This search fails because b does not appear in the index range from 1 to 2, not including 2.

Searching up to, but not including, the second index makes find consistent with the slice operator.

name[1:2].find('b')
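The two forms agree on whether the substring is found, but not necessarily on the index they report: find with a start argument reports a position in the original string, whereas slicing first reports a position within the slice. A short sketch of the difference; the variable names are illustrative.

```python
name: str = 'bob'

not_found: int = name.find('b', 1, 2)   # searches only index 1, which holds 'o'
in_original: int = name.find('b', 1)    # index counted in 'bob'
in_slice: int = name[1:].find('b')      # index counted in the slice 'ob'

print(not_found)
print(in_original)
print(in_slice)
```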

Format Operator#

With the format operator % we can build a string by replacing parts of it with data stored in variables.

Remember that when % is used with integers it is known as the modulus operator. When it is applied to strings we call it the format operator.

The first operand should always be a string containing format sequences. The second operand is one or more values. If you have more than one value, they should be stored in a tuple (we will talk about this data type later).

Format sequences are markers such as '%d' to format an integer, '%g' to format a float, and '%s' to format a string.

days : int = 365
'A year has %d' % days
who : str = 'Tom'
budget : float = 1.99
days : int = 365

'%s says that he is allowed to spend %g euros every single day of the %d days of the year.' % (who, budget, days)

You get an error if you do not supply all the values needed to format the string.

day : str = 'Monday'
hour : int = 5
place : str = 'the park'

'See you on %s at %d in %s' % (day, hour)  # TypeError: not enough arguments for format string

Or when you use a wrong format sequence.

day : str = 'Monday'
hour : int = 5
place : str = 'the park'

'See you on %d at %d in %s' % (day, hour, place)  # TypeError: %d cannot format the string 'Monday'
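Both mistakes raise a TypeError. The sketch below wraps the format operator in a small helper, hypothetically named try_format, so that the two failing cases and the correct one can be run side by side.

```python
def try_format(template: str, values: tuple) -> str:
    """apply the format operator and report a TypeError as text"""
    try:
        return template % values
    except TypeError as error:
        return 'TypeError: ' + str(error)


day: str = 'Monday'
hour: int = 5
place: str = 'the park'

too_few: str = try_format('See you on %s at %d in %s', (day, hour))
wrong_type: str = try_format('See you on %d at %d in %s', (day, hour, place))
correct: str = try_format('See you on %s at %d in %s', (day, hour, place))

print(too_few)
print(wrong_type)
print(correct)
```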

For more information on the format operator, see https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting.

A more powerful alternative is the string format method, which you can read about at https://docs.python.org/3/library/stdtypes.html#str.format.
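For comparison, here is a minimal sketch of the same kind of message built with the format method. Each pair of braces is replaced by the next argument, in order. The shortened sentence and the variable names with_method and with_operator are only for illustration.

```python
who: str = 'Tom'
budget: float = 1.99
days: int = 365

# str.format fills each {} with the next argument.
with_method: str = '{} may spend {} euros on each of the {} days.'.format(who, budget, days)

# The same sentence built with the format operator.
with_operator: str = '%s may spend %g euros on each of the %d days.' % (who, budget, days)

print(with_method)
print(with_method == with_operator)
```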


Reading Word Lists#

File manipulation will be discussed in one of the next lectures, but for the examples below we already need a list of words, stored in the file words.txt.

This file is in plain text, so you can open it with a text editor, but you can also read it from Python.

The built-in function open takes the name of the file as a parameter and returns a file object you can use to read the content of the file.

words = open('words.txt')

The file object provides several methods for reading, including readline, which reads characters from the file until it gets to a newline and returns the result as a string.

words.readline()

The \n at the end of the result is the newline character, a whitespace character that separates this word from the next.

The file object keeps track of where it is in the file, so if you call readline again, you get the next word.

words.readline()

The next word is “aah”.

The fact that every word we read via readline is terminated by a whitespace character is inconvenient.

We can get rid of these whitespace characters with the string method strip.

line : str = words.readline()
word : str = line.strip()
word

You can also use a file object as part of a for loop.

The following program reads words.txt and prints each word, one per line.

words = open('words.txt')
for word in words:
    s_word : str = word.strip()
    print(s_word)
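The examples in this section never close the file they open. One common alternative, not used elsewhere in this text, is the with statement, which closes the file automatically when the block ends. The sketch below is self-contained: it first writes a small temporary file, hypothetically named sample_words.txt, whose three words are made up to stand in for words.txt.

```python
import os
import tempfile

stripped_words: list = []

with tempfile.TemporaryDirectory() as folder:
    # Stand-in for words.txt with assumed sample content.
    path: str = os.path.join(folder, 'sample_words.txt')
    with open(path, 'w') as out:
        out.write('aa\naah\naahed\n')

    # The file is closed automatically at the end of the with block.
    with open(path) as words:
        for line in words:
            stripped_words.append(line.strip())

print(stripped_words)
```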

Simple Examples#

There are a number of exercises/examples in the book.

The first is to print all words with a length greater than 20.

A straightforward solution is to use a for loop to iterate over the list of words. Of course, we first need to open the file with the words.

In the body of the for loop there is a test on the length of the words, after stripping the whitespace.

words = open('words.txt')

for word in words:
    stripped_word : str = word.strip()
    if len(stripped_word) > 20:
        print(stripped_word)

The second example is on counting the number of words without an 'e' character.

Again we use a for loop, but now we need to keep a counter for counting the number of words without an 'e' letter.

words = open('words.txt')

no_e_count : int = 0

for word in words:
    
    if 'e' not in word:
        no_e_count += 1

print(no_e_count)

The third example is a variation on the second one: we count not only the number of words without an 'e' character, but also all words in the file.

Of course, we could introduce a separate loop for counting the total number of words, but because we have to iterate over the file anyway, one for loop is sufficient.

We need to keep one counter for the number of words without an 'e' letter and another counter for all words.

words = open('words.txt')

count : int = 0
no_e_count : int = 0

for word in words:

    count += 1
    if 'e' not in word:
        no_e_count += 1

print(count)
print(no_e_count)

print("Percentage is:", int((no_e_count/count)*100))
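The same computation can be tried without words.txt by running it on a small list of lines. This is only a sketch: the function name no_e_statistics and the four sample lines are assumptions made for illustration.

```python
def no_e_statistics(lines: list) -> tuple:
    """return the total number of lines and the number without an 'e'"""
    count: int = 0
    no_e_count: int = 0
    for line in lines:
        count += 1
        if 'e' not in line:
            no_e_count += 1
    return count, no_e_count


# Made-up sample; each entry ends with a newline, as lines read from a file do.
sample: list = ['aa\n', 'aah\n', 'aahed\n', 'gazelle\n']

count, no_e_count = no_e_statistics(sample)
percentage: int = int((no_e_count / count) * 100)

print(count)
print(no_e_count)
print("Percentage is:", percentage)
```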

The fourth example shows how a simple check can determine whether forbidden letters are used in a word.

def avoids(word : str, forbidden : str) -> bool:
    """check whether none of the forbidden letters appear in the word"""
    for letter in forbidden:
        if letter in word:
            return False
    return True

avoids("foo", "abcf")
forbidden_letters = input('> ')

words = open('words.txt')

count = 0
for word in words:
    if avoids(word, forbidden_letters):
        count += 1

print(count)
def uses_only(word : str, only : str) -> bool:
    """check whether all letters in word appear in only"""
    for letter in word:
        if letter not in only:
            return False
    return True

uses_only("foo", "fol")
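When uses_only is applied to lines read from a file, the trailing newline matters: the character \n is part of the line but normally not part of only, so an unstripped line is rejected. A sketch of the difference, repeating the function so that the example is self-contained; the sample line is made up.

```python
def uses_only(word: str, only: str) -> bool:
    """check whether all letters in word appear in only"""
    for letter in word:
        if letter not in only:
            return False
    return True


line: str = 'fool\n'   # a line as it would be read from a file

unstripped_result: bool = uses_only(line, 'fol')         # '\n' is not in 'fol'
stripped_result: bool = uses_only(line.strip(), 'fol')

print(unstripped_result)
print(stripped_result)
```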