Learn more Python

Lessons 1-6

Closely following Automate the boring stuff with Python

Lesson 1

Lesson 2

Lesson 3

Lesson 4

Lesson 5

Lesson 6

About us

Kathrin Schuler

Financial Software Developer at Bloomberg


Natalie Keating

Financial Software Developer at Bloomberg


Tucker Vento

System Reliability Engineer at Bloomberg


Learn more Python

Lesson 1

Closely following Automate the boring stuff with Python

Continue where we left off...

Thank you!!

Learn more Python

Lesson 2

Closely following Automate the boring stuff with Python

Let's review strings


                >>> "This is a string"
                'This is a string'
                >>> 'This is also a string'
                'This is also a string'
                >>> "This string will not end'
                SyntaxError: EOL while scanning string literal
                
If you need to use single quotes within the string, it should be enclosed in double quotes, and vice versa. What if you need to use both?

Escape Characters

Escape characters allow us to use characters that are otherwise impossible to put into a string. An escape character consists of a backslash (\) followed by the character you want to add to the string.

                        >>> 'Say hi to Bob\'s mother.'
                        "Say hi to Bob's mother."
                        >>> 'Say hi to Bob\'s "mother."'
                        'Say hi to Bob\'s "mother."'
                        

Escape Characters

Escape character    Prints as
\'                  Single quote
\"                  Double quote
\t                  Tab
\n                  Newline (line break)
\\                  Backslash (\)
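
A small sketch of the escape characters from the table in action. The message text is a made-up example, not taken from the lesson.

```python
# Each escape sequence counts as a single character in the string.
message = 'Name:\tAlice\nQuote:\t"It\'s fine."\nPath:\tC:\\temp'
print(message)

# len() confirms that '\t' is one character, not two.
tab_length = len('\t')
print(tab_length)
```

Printing the message shows three lines, with a tab after each label and a single backslash in the path.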

Raw Strings

If you place an r before the opening quotation mark of a string, it becomes a raw string. Python then ignores escape characters and prints any backslashes that appear in the string.

                        >>> print('That is Alice\'s cat.')
                        That is Alice's cat.
                        >>> print(r'That is Alice\'s cat.')
                        That is Alice\'s cat.
                        

Triple Quotes

The \n escape character is helpful for formatting long strings, but is an artifact of older programming languages. Python provides multiline strings, which are surrounded by either three single quotes or three double quotes.
Try writing the following code:

                        print('''Dear Alice,

                        Eve's cat has been arrested for catnapping, cat burglary, and extortion.

                        Sincerely,
                        Bob''')

Multiline Comments

Remember that we use the # character to mark the beginning of a comment for the remainder of a line.
A multiline string is often used for comments that span multiple lines. The following is valid and acceptable Python code:

                        def spam():
                            """This is a multiline comment to document
                            the purpose of the spam() function.
                            
                            It's to print spam."""
                            print('Hello!')

Indexing in Lists

Recall that the items in a list can be referred to using 'indices' or 'indexes'.

                    >>> my_list = ['Hello', 'world', '!']
                    >>> my_list[0]
                    'Hello'
                    >>> my_list[1:]
                    ['world', '!']

Indexing in Strings

Strings are 'indexed' in the same manner as lists.
That means you can access the individual characters in the string by referring to their position within the string.
Let's consider the string "Hello world!":

                        '  H  e  l  l  o     w  o  r  l  d  !  '
                        #  0  1  2  3  4  5  6  7  8  9  10 11

                        >>> my_str = 'Hello world!'
                        >>> my_str[0]
                        'H'
                        >>> my_str[6]
                        'w'
                        >>> my_str[0:6]
                        'Hello '
                        >>> world = my_str[6:]
                        >>> world + my_str
                        'world!Hello world!'

Check if a value exists

Just like in a list, you can check for the existence of a value within a string using the in and not in keywords. This will evaluate to a boolean value; either True or False. The check is literal, for the exact string, and case-sensitive.

                        >>> 'Hello' in 'Hello World'
                        True
                        >>> 'Hello' in 'Hello'
                        True
                        >>> 'HELLO' in 'Hello World'
                        False
                        >>> '' in 'spam'
                        True
                        >>> 'cats' not in 'cats and dogs'
                        False

INTRODUCE OBJECTS AND THEIR METHODS

Case Manipulation

Python provides helpful methods on string objects to convert between cases. upper() and lower() each return a new string with every letter converted to the desired case. This can be very helpful when trying to make case-insensitive comparisons.

                        >>> spam = 'Hello world!'
                        >>> spam = spam.upper()
                        >>> spam
                        'HELLO WORLD!'
                        >>> spam = spam.lower()
                        >>> spam
                        'hello world!'

Case Manipulation

The upper() and lower() methods are especially useful when trying to make case-insensitive comparisons.

                        feeling = input('How are you? ')
                        if feeling.lower() == 'great':
                            print('I feel great too.')
                        else:
                            print('I hope the rest of your day is good.')

Case Verification

If you do want to be case-sensitive, there are also the methods isupper() and islower(). These return a boolean value of True or False to indicate whether the entire string is uppercase or lowercase, respectively.

                        >>> spam = 'Hello world!'
                        >>> spam.islower()
                        False
                        >>> spam.isupper()
                        False
                        >>> 'HELLO'.isupper()
                        True
                        >>> 'abc12345'.islower()
                        True
                        >>> '12345'.islower()
                        False
                        >>> '12345'.isupper()
                        False

isX Methods

Along with isupper() and islower(), there are several other string-based methods that begin with the word is. These methods all return a boolean value that describes the nature of the string.

isX method     Returns True if the string only contains...
isalpha()      letters
isalnum()      letters and numbers
isdecimal()    numbers
isspace()      spaces, tabs, and new-lines
istitle()      words that begin with an uppercase letter, followed only by lowercase letters

isX Practice

Verify the following in your interpreter:

                        >>> 'hello'.isalpha()
                        True
                        >>> 'hello123'.isalpha()
                        False
                        >>> 'hello123'.isalnum()
                        True
                        >>> 'hello'.isalnum()
                        True
                        >>> '123'.isdecimal()
                        True
                        >>> '    '.isspace()
                        True
                        >>> 'This Is Title Case'.istitle()
                        True
                        >>> 'This Is Title Case 123'.istitle()
                        True
                        >>> 'This Is not Title Case'.istitle()
                        False
                        >>> 'This Is NOT Title Case Either'.istitle()
                        False

Time for Practice!

Can you write a program that asks users for their age and a password, and keeps prompting until each one is provided in the correct format? The age must contain only numbers, and the password may contain only letters and numbers.

                        "Enter your age:"
                        >>> one hundred
                        "Please enter a number for your age."
                        "Enter your age:"
                        >>> 100
                        "Select a new password (letters and numbers only):"
                        >>> password!
                        "Passwords can only have letters and numbers."
                        "Select a new password (letters and numbers only):"
                        >>> password1

startswith() and endswith()

The startswith() and endswith() methods return True only if the string value that they are called on begins (or ends) with the string passed to the method.

                        >>> 'Hello world!'.startswith('Hello')
                        True
                        >>> 'Hello world!'.endswith('world!')
                        True
                        >>> 'abc123'.startswith('abcdef')
                        False
                        >>> 'abc123'.endswith('12')
                        False
                        >>> 'Hello world!'.startswith('Hello world!')
                        True
                        >>> 'Hello world!'.endswith('Hello world!')
                        True

Joining Strings

While we can concatenate strings using addition (+) this can become tedious for large groupings of strings, or if you do not know what the strings will look like in advance.
For this purpose, we can use the join() method. The join() method is called on a string, gets passed a list of strings, and returns a string. The returned string is the concatenation of each string that was in the passed-in list. Let's practice!

                        >>> ', '.join(['cats', 'rats', 'bats'])
                        'cats, rats, bats'
                        >>> ' '.join(['My', 'name', 'is', 'Simon'])
                        'My name is Simon'
                        >>> 'ABC'.join(['My', 'name', 'is', 'Simon'])
                        'MyABCnameABCisABCSimon'
Notice that the string that we call join() on is inserted between each string of the list that we passed in. This lets us build readable sentences, strings that are easy to parse, and even strictly formatted files such as CSV.
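
As a rough sketch of the CSV idea, join() can build comma-separated lines from lists of strings. The rows below are made-up sample data, and this simple approach assumes no field contains a comma or a quote.

```python
# Hypothetical rows of data; every value must already be a string for join().
rows = [
    ['name', 'animal', 'age'],
    ['Zophie', 'cat', '7'],
    ['Pooka', 'cat', '5'],
]

# Join the fields of each row with commas...
csv_lines = []
for row in rows:
    csv_lines.append(','.join(row))

# ...then join the finished lines with newline characters.
csv_text = '\n'.join(csv_lines)
print(csv_text)
```

For real CSV files, Python's built-in csv module handles quoting and commas inside fields.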

Splitting Strings

Remember that join() is called on a string value and is passed a list. The split() method does the opposite: it is called on a string value and returns a list of strings.

                        >>> 'My name is Simon'.split()
                        ['My', 'name', 'is', 'Simon']
By default, the string is split wherever whitespace characters (space, tab, newline) are found. These whitespace characters are removed from the list that is returned. You can pass something called a delimiter string to the split() method to specify a different string/character to split on.

                        >>> 'MyABCnameABCisABCSimon'.split('ABC')
                        ['My', 'name', 'is', 'Simon']
                        >>> 'My name is Simon'.split('m')
                        ['My na', 'e is Si', 'on']

Thank you!!

Learn more Python

Lesson 3

Closely following Automate the boring stuff with Python

Pattern Matching with Regular Expressions

What is a Regular Expression?

You may be familiar with searching for text by pressing CTRL-F and typing in the words you’re looking for. Regular expressions go one step further: They allow you to specify a pattern of text to search for.

Examples are:
  • Phone numbers: In the US, a phone number is an optional 3-digit area code and a hyphen, followed by 3 digits, a hyphen, and 4 more digits (415-555-6789)
  • Email addresses: You know it will be some characters, followed by the @ sign, followed by characters, a dot and some ending (john_smith@gmail.com)


Knowing regular expressions can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps.

Notepad++ and other (decent) text editors support searching and replacing with regular expressions.
We will start by writing a program to find text patterns without using regular expressions and then see how to use regular expressions to make the code much less bloated.

Detecting a phone number

... without Regex! You know the pattern: three numbers, a hyphen, three numbers, a hyphen, and four numbers.

                        def isPhoneNumber(text):
                            if len(text) != 12:
                                return False
                            for i in range(0, 3):
                                if not text[i].isdecimal():
                                    return False
                            if text[3] != '-':
                                return False
                            for i in range(4, 7):
                                if not text[i].isdecimal():
                                    return False
                            if text[7] != '-':
                                return False
                            for i in range(8, 12):
                                if not text[i].isdecimal():
                                    return False
                            return True

                     
We can call this function now with a valid phone number, and an invalid one:

                       print('415-555-4242 is a phone number:')
                       print(isPhoneNumber('415-555-4242'))
                       print('Moshi moshi is a phone number:')
                       print(isPhoneNumber('Moshi moshi'))

                       415-555-4242 is a phone number:
                       True
                       Moshi moshi is a phone number:
                       False
Imagine how much more code you would need to add to find all the phone numbers in a text.
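
To give a feel for that extra code, here is one possible sketch that checks every 12-character chunk of a longer text. isPhoneNumber() is a condensed version of the function above with the same checks; findPhoneNumbers() and the sample message are hypothetical names and data made up for this example.

```python
def isPhoneNumber(text):
    """Return True if text looks exactly like 415-555-4242."""
    if len(text) != 12:
        return False
    for i, char in enumerate(text):
        if i in (3, 7):
            # Positions 3 and 7 must be hyphens.
            if char != '-':
                return False
        elif not char.isdecimal():
            # Every other position must be a digit.
            return False
    return True


def findPhoneNumbers(message):
    """Check every 12-character chunk of message (hypothetical helper)."""
    found = []
    for i in range(len(message) - 11):
        chunk = message[i:i + 12]
        if isPhoneNumber(chunk):
            found.append(chunk)
    return found


message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
numbers = findPhoneNumbers(message)
print(numbers)
```

This works, but it only finds one exact format and already needs two functions and two loops.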

The functionality of this code is very limited...

What if phone numbers in the text are formatted like (415) 555-1234 or 415.555.1234?
isPhoneNumber() would return False even though it is a phone number!
...and adding more code to cover all those cases would make the function much less readable

Regular Expressions

in short RegEx

Regular expressions are descriptions for a pattern of text. For example, a \d in a regex stands for a digit character—that is, any single numeral 0 to 9. The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text the previous isPhoneNumber() function did: a string of three numbers, a hyphen, three more numbers, another hyphen, and four numbers.

RegExes can be much more powerful:

For example we can add {3} after the \d to repeat it 3 times. This leads us to a slightly shorter regex that also matches the phone number format:

\d{3}-\d{3}-\d{4}

Regex in Python


                    import re #This is the regex module
                
Always remember to import this module. Otherwise you'll get error messages that re is not defined.
First, we need to create a regex object. This is done as follows:

                     # The r prefix makes this a raw string, so \d reaches the regex module unchanged
                     phoneNumberRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
                
Then we can search for the first match using the regex object:

                     # search returns the FIRST match.
                     >>> text = "Susan's number is 415-555-3344 and mine is 123-333-4455"
                     >>> matches = phoneNumberRegex.search(text)
                     >>> phone_number = matches.group()
                     >>> phone_number
                       '415-555-3344'
                
Alternatively we can find all matches:

                     # findall returns ALL matches as a list
                     >>> all_matches = phoneNumberRegex.findall(text)
                     >>> all_matches
                       ['415-555-3344', '123-333-4455']
                

Review

  1. Import the regex module with import re
  2. Create a Regex object with the re.compile() function. (Remember to use a raw string.)
  3. Pass the string you want to search into the Regex object’s search()/findall() method. This returns a Match object for search() and a list for findall()
  4. If you used search(), call the Match object’s group() method to return a string of the actual matched text.
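
The steps above can be sketched as one short script. It reuses the example text from the earlier slide; the check for None is an addition, since search() returns None when nothing matches and calling group() on None would fail.

```python
import re  # Step 1: import the regex module

# Step 2: compile the pattern; the r prefix makes it a raw string.
phoneNumberRegex = re.compile(r'\d{3}-\d{3}-\d{4}')

text = "Susan's number is 415-555-3344 and mine is 123-333-4455"

# Step 3: search() returns a Match object, or None if nothing matched.
match = phoneNumberRegex.search(text)

# Step 4: group() returns the matched text.
if match is not None:
    first_number = match.group()
else:
    first_number = None
print(first_number)

# findall() skips the Match object and returns a list of strings.
all_numbers = phoneNumberRegex.findall(text)
print(all_numbers)
```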


You can also test your regex here. Make sure you select Python!

More Patterns

Groups

Say you want to separate the area code from the rest of the phone number. Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). Then you can use the group() match object method to grab the matching text from just one group.

                     >>> text = "Susan's phone number is 415-555-3344 and mine is 123-333-4455"
                     >>> phoneNumberRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
                     >>> matches = phoneNumberRegex.search(text)
                     >>> matches.group(1)  # Index 1 returns the first group of parentheses
                       '415'
                     >>> matches.group(2)  # Second group of parentheses
                       '555-3344'
                     >>> matches.group(0)  # Passing 0 or nothing returns the whole matched string
                       '415-555-3344'
                
Groups also work with the findall() method, which then returns a list of tuples with one item per group:

                     >>> text = "Susan's phone number is 415-555-3344 and mine is 123-333-4455"
                     >>> phoneNumberRegex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
                     >>> matches = phoneNumberRegex.findall(text)
                     >>> matches
                       [('415', '555-3344'), ('123', '333-4455')]
                     >>> matches[0][1]
                       '555-3344'
                
Notice that the code above does not return the full phone number. If you are using groups and you still want the full match to be returned you need to wrap everything into a group:

                     >>> text = "Susan's phone number is 415-555-3344 and mine is 123-333-4455"
                     >>> phoneNumberRegex = re.compile(r'((\d{3})-(\d{3}-\d{4}))')
                     >>> matches = phoneNumberRegex.findall(text)
                     >>> matches
                       [('415-555-3344', '415', '555-3344'), ('123-333-4455', '123', '333-4455')]
                

Pipe (the or matching)

Suppose you want to match any one of several alternatives at the same place, for example Batman, Batmobile, Batwoman and Batcopter. For that we can use the | (pipe) character:

                     >>> text = "Batman and Batwoman were driving a Batmobile"
                     >>> batRegex = re.compile(r'Bat(man|woman|mobile|copter)')
                     >>> matches = batRegex.search(text)
                     >>> matches.group()
                       'Batman'
                
To use findall() instead, remember to wrap the whole pattern in a group if you want the full match:

                     >>> text = "Batman and Batwoman were driving a Batmobile"
                     >>> batRegex = re.compile(r'(Bat(man|woman|mobile|copter))')
                     >>> matches = batRegex.findall(text)
                     >>> matches
                       [('Batman', 'man'), ('Batwoman', 'woman'), ('Batmobile', 'mobile')]
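
Groups and the pipe can be combined to handle the two phone number formats that isPhoneNumber() could not cover together. The following is only a sketch: the pattern and the sample text are made up for this example, and it covers just the (415) 555-1234 and 415-555-1234 styles.

```python
import re

# Outer group: the full number. Inner group: either "(415) " or "415-".
# \( and \) are escaped because plain parentheses would create a group.
phoneRegex = re.compile(r'((\(\d{3}\) |\d{3}-)\d{3}-\d{4})')

text = 'Call (415) 555-1234 or 415-555-9999.'

# findall() returns one tuple per match: (full match, area code part).
matches = phoneRegex.findall(text)
print(matches)

# Keep only the full match from each tuple.
full_numbers = []
for match in matches:
    full_numbers.append(match[0])
print(full_numbers)
```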
                

Thank you!!

Learn more Python

Lesson 4

Closely following Automate the boring stuff with Python

Solving your problems

Thank you!!

Learn more Python

Lesson 5

Closely following Automate the boring stuff with Python

Webscraping

Quick Introduction to HTML

Hypertext Markup Language (HTML) is the format that web pages are written in. You can find some beginner tutorials here:

http://htmldog.com/guides/html/beginner/
https://www.w3schools.com/whatis/whatis_html.asp
https://developer.mozilla.org/en-US/learn/html/

HTML Elements

Every HTML web page is built from building blocks, the so-called "HTML elements". An element is opened with a tag such as <p> and closed with the matching </p>. Elements can enclose text, and they can carry attributes that define the element further.

                          <p>This is a paragraph</p>
                          <h1>This is a heading</h1>
                          <h2>This is a heading too</h2>
                          <a href="...">This is a link</a>

How to see the source HTML of a webpage

You’ll need to look at the HTML source of the web page that your program will work with. To do this, right-click (or CTRL-click on OS X) any web page in your web browser, and select View Source or View page source to see the HTML text of the page.
Another option is to use the developer tools (F12 in Chrome) and inspect the elements.

Your first Webscraper

1. Import your dependencies

requests is a Python library to get the content of a URL (amongst other things). We will use this to get our HTML file. Beautiful Soup is a Python library for extracting data from HTML and XML files.

                          import sys # basic system library in python, used to exit the script prematurely
                          import requests
                          from bs4 import BeautifulSoup
                          from datetime import datetime  # We will use this to print the time of the scraping
                        

2. Set the URL and proxies

A proxy server is a gateway between you and the internet. All requests go through the proxy to the internet, and the responses come back to you the same way. If you are in a corporate network you may need a proxy to access the internet. Ask your instructors if you need one! In this example we will try to get Apple's stock price at the moment of scraping from marketwatch.com.

                          # Proxy dict of the form:
                          # proxy_dict = {
                          #     'http': 'http://myproxy:12345',
                          #     'https': 'https://myproxy:12345'
                          # }
                          proxy_dict = {} # Set proxies here if needed!

                          # The URL you want to scrape
                          url = "https://www.marketwatch.com/investing/stock/aapl"
                        

3. Get the HTML content

In order to parse the web page we need to get all of its content first. We will be using the requests module for this.

                          # Get the content of the page using the requests library
                          resp = requests.get(url, stream=True, proxies=proxy_dict)
                          if resp.status_code != 200: # 200 means success
                              print("ERROR! Could not get data from URL. Response: {}".format(resp))
                              sys.exit(1) # exit prematurely because we could not get data
                          else:
                              raw_html = resp.content
                        
Full list of status codes here (Funny bonus: read about status 418).

4. Inspect your data

Now you can inspect raw_html to make sure the request worked.

                        raw_html
                    

5. Create an HTML Parser

We'll use BeautifulSoup to parse our HTML file:

                        soup = BeautifulSoup(raw_html, 'html.parser')
                    

6. Parse the elements

Now we will need to know which element/tag we want to use. We can use the page inspect tool for this. From the inspection we know the element we want to use looks like this:

                        <bg-quote field="Last" ...>165.32</bg-quote>
                    
Different elements can have different attributes (for example id, class, etc.). Based on these, we can select elements using BeautifulSoup.

Selector                              Matches
soup.select('div')                    All elements named <div>
soup.select('#author')                The element with an 'id' attribute = 'author'
soup.select('.notice')                All elements with a 'class' attribute = 'notice'
soup.select('div span')               All elements named <span> within a <div> element
soup.select('input[name]')            All elements named <input> that have a 'name' attribute (with any value)
soup.select('input[name=Kathrin]')    All elements named <input> that have a 'name' attribute = 'Kathrin'

The various selector patterns can be combined to make sophisticated matches. For example, soup.select('p #author') will match any element that has an id attribute of author, as long as it is also inside a <p> element.
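
To try the selectors without a live web page, here is a small sketch that parses a hand-written HTML string. The HTML snippet and the variable names are made up for illustration only.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet, only for illustration.
html = """
<div>
  <p>Written by <span id="author">Kathrin</span></p>
  <span class="notice">Prices are delayed.</span>
  <input name="Kathrin">
  <input>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() always returns a list, even when only one element matches.
spans_in_div = soup.select('div span')      # both <span> elements
author = soup.select('p #author')           # id "author" inside a <p>
named_inputs = soup.select('input[name]')   # only the <input> with a name
notices = soup.select('.notice')            # class "notice"

print(len(spans_in_div))
print(author[0].get_text())
print(len(named_inputs))
print(notices[0].get_text())
```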
In our case we want the <bg-quote> element whose field attribute is 'Last'. We can select it as follows:

                        element_with_price = soup.select('bg-quote[field=Last]')
                    

When you inspect element_with_price you will notice that this returns a list. This makes sense because our select could match more than one element! In our case the select is detailed enough to only return one, therefore we can just access the first element of the list instead:

                        element_with_price = soup.select('bg-quote[field=Last]')[0]
                    

7. Retrieve the price value

Sometimes you will want to scrape the value of an attribute, and sometimes the text between the tags:

                        

                        <p field="some_field">some text</p>

In the example above you might be interested in either the field ("some_field") or the text (some text) between the <p> tags.

BeautifulSoup offers us the functionality to get exactly that:

                        text = element.get_text()  # Will return 'some text'
                    
To get an attribute value we would use get() instead of get_text():

                        field_name = element.get('field')  # Will return 'some_field'
In our case we are interested in the text between the tags (The actual price!).

                        price_text = element_with_price.get_text()
                    
Attention: This will always return a string. If you want the actual numerical value you will need to convert this to a float!
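
A minimal sketch of that conversion. The price string here is a made-up value with a thousands separator, which real pages may or may not use.

```python
# Hypothetical scraped text; real pages may format prices differently.
price_text = '1,165.32'

# float() cannot parse thousands separators, so remove the commas first.
price = float(price_text.replace(',', ''))

# Now arithmetic works, for example the value of 10 shares.
value_of_10_shares = price * 10
print(price, value_of_10_shares)
```

If the text contains anything else that is not part of a number, such as a currency symbol, float() raises a ValueError.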

8. Finalize

Now you can do some more processing with your data or just simply print it out

                        print("The Apple Share price at {} was {}$".format(datetime.now(), price_text))
                    

Thank you!!

Learn more Python

Lesson 6

Closely following Automate the boring stuff with Python

Data processing with pandas

  • pandas is Python's data analysis library. It makes working with and manipulating datasets in Python easier.
  • pandas provides two main data structures: Series and DataFrame

Some imports

                        
                            import pandas as pd
                            import numpy as np
                            import matplotlib.pyplot as plt
                        
                    
  • numpy is a powerful scientific computing library
  • pandas is built on top of numpy and the two are often used together, so it is common to import both
  • matplotlib.pyplot is for plotting data
  • The import abbreviations are conventions

Series

Used for one-dimensional data (e.g. a time series or a column of data). Try it out!
                        
                            import random
                            data = random.sample(range(10), 10)
                            ts = pd.Series(data)
                        
                    
  • Here, we generate the numbers 0 to 9 in random order as dummy data
  • Notice that pandas indexes the data for you with the integers 0 to N-1

Indexing

Try it out!
                        
                            ts[6]
                            print(type(ts[6]))
                            ts[3:8]
                            print(type(ts[3:8]))
                            ts[ts>7]
                        

                    
Add a value to the Series as follows.
                        
                            ts[10]=4
                        
                    

Masks

Pandas data structures can be indexed using a Boolean array. This is known as creating a mask.
                        
                            mask = ts > 7
                            ts[mask]
                        
                    
In one line
                        
                            ts[ts > 7]
                            ts[(ts > 3) & (ts < 7)]
                            ts[~(ts==6)]
                        
                    
  • & = and
  • | = or
  • ~ = not
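
The three operators can be tried on a small Series with fixed values, so the results are predictable. This is only a sketch; the numbers are made up.

```python
import pandas as pd

# Fixed values instead of random ones, so the results are predictable.
ts = pd.Series([1, 5, 8, 3, 9, 6])

both = ts[(ts > 3) & (ts < 7)]      # greater than 3 AND less than 7
either = ts[(ts < 3) | (ts > 8)]    # less than 3 OR greater than 8
not_six = ts[~(ts == 6)]            # everything that is NOT 6

print(both.tolist())
print(either.tolist())
print(not_six.tolist())
```

The parentheses around each comparison are needed because & and | bind more tightly than > and <.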

DatetimeIndex

We don't need to use numbers as the index. We could use a list of strings, for example. But what if the data represents a time series? Then we can use dates as the index.
                        
                            from datetime import datetime

                            start = datetime(2018, 1, 1)
                            end = datetime(2018, 1, 10)
                            date_range = pd.date_range(start, end)  # one entry per day, both ends included
                            date_range
                            ts = pd.Series(data, index=date_range)
                        
                    

Plotting

Try it out!
                        
                            ts.plot()
                        
                    

DataFrame

Tabular data structure comprised of rows and columns (e.g. spreadsheet, database table)
                        
                            data = {'red': random.sample(range(10), 10),
                                    'blue': random.sample(range(10), 10),
                                    'green': random.sample(range(10), 10)}
                            df = pd.DataFrame(data)
                        
                    
The keys of the dictionary have become the column names of the DataFrame. As with a Series, the index defaults to 0...N, but it can be set by passing in an "index=" argument.
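
A short sketch of the index= argument, using fixed made-up values and made-up row labels.

```python
import pandas as pd

# Fixed, made-up values so the example is reproducible.
data = {'red': [1, 2, 3],
        'blue': [4, 5, 6]}

# Without index=, the rows are labelled 0, 1, 2.
default_df = pd.DataFrame(data)

# With index=, the rows get the labels we pass in (one label per row).
labelled_df = pd.DataFrame(data, index=['a', 'b', 'c'])

print(default_df.index.tolist())
print(labelled_df.index.tolist())
print(labelled_df.columns.tolist())
```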

Indexing by column

We index by the column first, and then the row.
                        
                            value = df['blue'][5]
                        
                    
We can extract a whole column. A column of a pandas DataFrame is a pandas Series.
                        
                            column = df['red']
                        
                    
We can select multiple columns by using a list of the relevant column names. This will return a DataFrame.
                        
                            multiple_columns = df[['red', 'blue']]
                        
                    

Indexing by row

Select rows using .loc[], which gets rows by index label. There is also .iloc[], which gets rows by integer position. A single row is returned as a pandas Series.
                        
                            df.loc[5]
                        
                    
Select multiple rows by passing a list to loc
                        
                            df.loc[[3,6,9]]
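
With the default 0...N index, .loc[] and .iloc[] return the same rows, so the difference is easy to miss. The sketch below uses a made-up DataFrame whose index labels deliberately differ from the row positions.

```python
import pandas as pd

# Row labels that are deliberately not 0, 1, 2 (made-up data).
df = pd.DataFrame({'red': [10, 20, 30]}, index=[5, 6, 7])

by_label = df.loc[5]        # the row whose index label is 5
by_position = df.iloc[0]    # the first row, whatever its label is
last_row = df.iloc[2]       # the third row, which has label 7

print(by_label['red'], by_position['red'], last_row['red'])
```

Here df.iloc[5] would raise an IndexError, because the DataFrame only has three rows.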
                        
                    

Importing data

                        
                            df = pd.read_csv('datasets/dataset1.csv')
                        
                    
Inspect the data....
                        
                            df.columns
                            df.shape
                            df.head(10)
                            df.sample(10)
                        
                    
  • The first row has become the names of the columns in the dataframe
  • pandas has generated an index

Filter data

Filter columns
                        
                            df[['Ticker', 'MarketCap']]
                            df.loc[:, ['Ticker','MarketCap']]
                            df.iloc[:, [0,3]]
                            df.filter(['Ticker','MarketCap'], axis=1)
                        
                    
Filter rows
                        
                            df.loc[[345,400], :]
                            df.iloc[[345,400], :]
                            df.filter(items=[345, 400], axis=0)
                        
                    

Calculations

sum
                        
                            df['MarketCap'].sum()
                        
                    
mean
                        
                            df['TotalReturnYTD'].mean()
                        
                    
count
                        
                            df['TotalReturnYTD'].count()
                        
                    
number of unique values
                        
                            df['CountryOfDomicile'].nunique()
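
Masks and calculations can be combined: filter the rows first, then calculate on one column. The sketch below uses a tiny made-up table that borrows the column names from the slides; the tickers and values are invented and are not from dataset1.csv.

```python
import pandas as pd

# A tiny made-up table using the column names from the slides.
df = pd.DataFrame({
    'Ticker': ['AAA', 'BBB', 'CCC', 'DDD'],
    'CountryOfDomicile': ['SPAIN', 'FINLAND', 'SPAIN', 'BRITAIN'],
    'MarketCap': [10.0, 20.0, 30.0, 40.0],
})

# Build a mask, keep the matching rows, then calculate on one column.
is_spain = df['CountryOfDomicile'] == 'SPAIN'
spain = df[is_spain]

spain_total = spain['MarketCap'].sum()
spain_count = spain['Ticker'].count()
n_countries = df['CountryOfDomicile'].nunique()
print(spain_total, spain_count, n_countries)
```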
                        
                    

Time for practice

  • What's the total market cap for BRITAIN companies?
  • What's the average return YTD for FINLAND companies?
  • How many companies have a return YTD greater than 1?
  • How many companies have a market cap over £100 BLN?
  • How many companies in SPAIN?
  • How many companies have a return YTD between -10% and +10%?
  • Create a new dataframe called "gb" with British companies and their market caps.

Thank you!!
