Closely following Automate the boring stuff with Python
Financial Software Developer at Bloomberg
Financial Software Developer at Bloomberg
System Reliability Engineer at Bloomberg
>>> "This is a string"
'This is a string'
>>> 'This is also a string'
'This is also a string'
>>> "This string will not end'
SyntaxError: EOL while scanning string literal
>>> 'Say hi to Bob\'s mother.'
"Say hi to Bob's mother."
>>> 'Say hi to Bob\'s "mother."'
'Say hi to Bob\'s "mother."'
Escape character | Prints as |
---|---|
\' | Single quote |
\" | Double quote |
\t | Tab |
\n | Newline (line break) |
\\ | Backslash (\) |
>>> print('That is Alice\'s cat.')
That is Alice's cat.
>>> print(r'That is Alice\'s cat.')
That is Alice\'s cat.
print('''Dear Alice,
Eve's cat has been arrested for catnapping, cat burglary, and extortion.
Sincerely,
Bob''')
def spam():
    """This is a multiline comment to document
    the purpose of the spam() function.
    It's to print spam."""
    print('Hello!')
>>> my_list = ['Hello', 'world', '!']
>>> my_list[0]
'Hello'
>>> my_list[1:]
['world', '!']
'   H   e   l   l   o       w   o   r   l   d   !   '
#   0   1   2   3   4   5   6   7   8   9   10  11
>>> my_str = 'Hello world!'
>>> my_str[0]
'H'
>>> my_str[6]
'w'
>>> my_str[0:6]
'Hello '
>>> world = my_str[6:]
>>> world + my_str
'world!Hello world!'
>>> 'Hello' in 'Hello World'
True
>>> 'Hello' in 'Hello'
True
>>> 'HELLO' in 'Hello World'
False
>>> '' in 'spam'
True
>>> 'cats' not in 'cats and dogs'
False
>>> spam = 'Hello world!'
>>> spam = spam.upper()
>>> spam
'HELLO WORLD!'
>>> spam = spam.lower()
>>> spam
'hello world!'
feeling = input('How are you? ')
if feeling.lower() == 'great':
    print('I feel great too.')
else:
    print('I hope the rest of your day is good.')
>>> spam = 'Hello world!'
>>> spam.islower()
False
>>> spam.isupper()
False
>>> 'HELLO'.isupper()
True
>>> 'abc12345'.islower()
True
>>> '12345'.islower()
False
>>> '12345'.isupper()
False
isX method | Returns True if the string only contains... |
---|---|
isalpha() | letters, and is not blank |
isalnum() | letters and numbers, and is not blank |
isdecimal() | numeric characters, and is not blank |
isspace() | spaces, tabs, and newlines, and is not blank |
istitle() | words that begin with an uppercase letter, followed only by lowercase letters |
>>> 'hello'.isalpha()
True
>>> 'hello123'.isalpha()
False
>>> 'hello123'.isalnum()
True
>>> 'hello'.isalnum()
True
>>> '123'.isdecimal()
True
>>> ' '.isspace()
True
>>> 'This Is Title Case'.istitle()
True
>>> 'This Is Title Case 123'.istitle()
True
>>> 'This Is not Title Case'.istitle()
False
>>> 'This Is NOT Title Case Either'.istitle()
False
"Enter your age:"
>>> one hundred
"Please enter a number for your age."
"Enter your age:"
>>> 100
"Select a new password (letters and numbers only):"
>>> password!
"Passwords can only have letters and numbers."
"Select a new password (letters and numbers only):"
>>> password1
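The sample session above could come from a small validation loop built on isdecimal() and isalnum(). This is a minimal sketch, assuming the prompts shown above. The helper names is_valid_age, is_valid_password and ask_for_age are hypothetical, and the interactive loop is only defined, not called, so the checks can be tried without typing any input:

```python
def is_valid_age(text):
    """Return True if text consists only of decimal digits."""
    return text.isdecimal()


def is_valid_password(text):
    """Return True if text consists only of letters and numbers."""
    return text.isalnum()


def ask_for_age():
    """Keep asking until the user types a number, then return it as a string."""
    while True:
        print('Enter your age:')
        age = input()
        if is_valid_age(age):
            return age
        print('Please enter a number for your age.')


print(is_valid_age('one hundred'))     # False
print(is_valid_age('100'))             # True
print(is_valid_password('password!'))  # False
print(is_valid_password('password1'))  # True
```

Calling ask_for_age() starts the interactive version. A password prompt would follow the same pattern with is_valid_password().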
>>> 'Hello world!'.startswith('Hello')
True
>>> 'Hello world!'.endswith('world!')
True
>>> 'abc123'.startswith('abcdef')
False
>>> 'abc123'.endswith('12')
False
>>> 'Hello world!'.startswith('Hello world!')
True
>>> 'Hello world!'.endswith('Hello world!')
True
>>> ', '.join(['cats', 'rats', 'bats'])
'cats, rats, bats'
>>> ' '.join(['My', 'name', 'is', 'Simon'])
'My name is Simon'
>>> 'ABC'.join(['My', 'name', 'is', 'Simon'])
'MyABCnameABCisABCSimon'
>>> 'My name is Simon'.split()
['My', 'name', 'is', 'Simon']
>>> 'MyABCnameABCisABCSimon'.split('ABC')
['My', 'name', 'is', 'Simon']
>>> 'My name is Simon'.split('m')
['My na', 'e is Si', 'on']
Knowing regular expressions can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps.
def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True
print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))
415-555-4242 is a phone number:
True
Moshi moshi is a phone number:
False
The functionality of this code is very limited: for a phone number written in any other format, isPhoneNumber() would return False even though it is a phone number!
Regular expressions, in short RegEx
\d in a regex stands for a digit character, that is, any single numeral 0 to 9.
The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text the previous isPhoneNumber() function did: a string of three numbers, a hyphen, three more numbers, another hyphen, and four numbers.
RegExes can be much more powerful. For example, we can add {3} after the \d to repeat it 3 times.
This leads us to a slightly shorter regex that also matches the phone number format:
\d{3}-\d{3}-\d{4}
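As a rough illustration, the 17-line isPhoneNumber() function can be replaced by this one pattern. This is a sketch rather than the original authors' code: the names PHONE_PATTERN and is_phone_number are hypothetical, and fullmatch() is used on the assumption that the whole string must be a phone number, as in isPhoneNumber().

```python
import re

# Raw string so the backslashes reach the regex engine unchanged.
PHONE_PATTERN = re.compile(r'\d{3}-\d{3}-\d{4}')


def is_phone_number(text):
    """Return True if the entire text looks like 415-555-4242."""
    # fullmatch() only succeeds when the pattern covers the whole string.
    return PHONE_PATTERN.fullmatch(text) is not None


print(is_phone_number('415-555-4242'))  # True
print(is_phone_number('Moshi moshi'))   # False
```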
import re #This is the regex module
Always remember to import this module. Otherwise you'll get a NameError saying that re is not defined.
phoneNumberRegex = re.compile(r'\d{3}-\d{3}-\d{4}')  # raw string, so backslashes are not treated as escapes
# search returns the FIRST match.
>>> text = "Susan's number is 415-555-3344 and mine is 123-333-4455"
>>> matches = phoneNumberRegex.search(text)
>>> phone_number = matches.group()
>>> phone_number
'415-555-3344'
# findall returns ALL matches as a list
>>> all_matches = phoneNumberRegex.findall(text)
>>> all_matches
['415-555-3344', '123-333-4455']
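Import the regex module with import re.
Create a Regex object with the re.compile() function. (Remember to use a raw string.)
Pass the string you want to search into the Regex object's search()/findall() method. This returns a Match object for search() and a list for findall().
For search(), call the Match object's group() method to return a string of the actual matched text.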
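The steps above can be sketched as one small function. The function name find_phone_numbers is hypothetical. Note that search() returns None when nothing matches, so the sketch checks for that before calling group().

```python
import re


def find_phone_numbers(text):
    """Return (first match or None, list of all matches) for US-style numbers."""
    # Step 2: compile the pattern from a raw string.
    phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

    # Step 3: search() gives a Match object (or None), findall() gives a list.
    first_match = phone_regex.search(text)
    all_numbers = phone_regex.findall(text)

    # Step 4: group() returns the matched text of the first hit.
    first_number = first_match.group() if first_match is not None else None
    return first_number, all_numbers


text = "Susan's number is 415-555-3344 and mine is 123-333-4455"
print(find_phone_numbers(text))
# ('415-555-3344', ['415-555-3344', '123-333-4455'])
print(find_phone_numbers('no numbers here'))
# (None, [])
```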
Adding parentheses creates groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d).
Then you can use the group() match object method to grab the matching text from just one group.
>>> text = "Susan's phone number is 415-555-3344 and mine is 123-333-4455"
>>> phoneNumberRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> matches = phoneNumberRegex.search(text)
>>> matches.group(1) # Index 1 returns the first group of parentheses
'415'
>>> matches.group(2) # Second group of parentheses
'555-3344'
>>> matches.group(0) # passing 0 or nothing will return the whole matched string
'415-555-3344'
Groups with the findall() method
>>> text = "Susan's phone number is 415-555-3344 and mine is 123-333-4455"
>>> phoneNumberRegex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
>>> matches = phoneNumberRegex.findall(text)
>>> matches
[('415', '555-3344'), ('123', '333-4455')]
>>> matches[0][1]
'555-3344'
>>> text = "Susan's phone number is 415-555-3344 and mine is 123-333-4455"
>>> phoneNumberRegex = re.compile(r'((\d{3})-(\d{3}-\d{4}))')
>>> matches = phoneNumberRegex.findall(text)
>>> matches
[('415-555-3344', '415', '555-3344'), ('123-333-4455', '123', '333-4455')]
Matching one of several alternatives (or matching) with | (the pipe character):
>>> text = "Batman and Batwoman were driving a Batmobile"
>>> batRegex = re.compile(r'Bat(man|woman|mobile|copter)')
>>> matches = batRegex.search(text)
>>> matches.group()
'Batman'
>>> text = "Batman and Batwoman were driving a Batmobile"
>>> batRegex = re.compile(r'(Bat(man|woman|mobile|copter))')
>>> matches = batRegex.findall(text)
>>> matches
[('Batman', 'man'), ('Batwoman', 'woman'), ('Batmobile', 'mobile')]
Hypertext Markup Language (HTML) is the format that web pages are written in.
You can find some beginner tutorials here:
http://htmldog.com/guides/html/beginner/
https://www.w3schools.com/whatis/whatis_html.asp
https://developer.mozilla.org/en-US/learn/html/
This is a paragraph
This is a heading
This is a heading too
This is a link
import sys # basic system library in python, used to exit the script prematurely
import requests
from bs4 import BeautifulSoup
from datetime import datetime # We will use this to print the time of the scraping
# Proxy dict of the form:
# proxy_dict = {
#     'http': 'http://myproxy:12345',
#     'https': 'https://myproxy:12345'
# }
proxy_dict = {} # Set proxies here if needed!
# The URL you want to scrape
url = "https://www.marketwatch.com/investing/stock/aapl"
# Get the content of the page using the requests library
resp = requests.get(url, stream=True, proxies=proxy_dict)
if resp.status_code != 200: # 200 means success
    print("ERROR! Could not get data from URL. Response: {}".format(resp))
    sys.exit(1) # exit prematurely because we could not get data
else:
    raw_html = resp.content
Full list of status codes here (Funny bonus: read about status 418).
Inspect raw_html to make sure the request worked.
raw_html
soup = BeautifulSoup(raw_html, 'html.parser')
165.32
Selector | Matches |
---|---|
soup.select('div') | All elements named <div> |
soup.select('#author') | Element with an 'id' attribute = 'author' |
soup.select('.notice') | All elements that use the 'class' attribute = 'notice' |
soup.select('div span') | All elements named <span> within a <div> element |
soup.select('input[name]') | All elements named <input> that have a 'name' attribute (with any value) |
soup.select('input[name=Kathrin]') | All elements named <input> that have a 'name' attribute = 'Kathrin' |
soup.select('p #author') will match any element that has an id attribute of author, as long as it is also inside a <p> element.
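To make the selector patterns concrete, here is a small sketch that runs a few of them against a hand-written HTML snippet. The snippet and its contents are made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written only to exercise the selectors.
html = """
<div><span>inside div</span></div>
<span>outside div</span>
<p>Written by <span id="author">Al</span></p>
<p class="notice">Read this</p>
<input name="Kathrin">
<input type="text">
"""
soup = BeautifulSoup(html, 'html.parser')

# <span> elements that sit inside a <div>
div_spans = [element.get_text() for element in soup.select('div span')]
# the element with id="author", but only if it is inside a <p>
author = [element.get_text() for element in soup.select('p #author')]
# elements with class="notice"
notices = [element.get_text() for element in soup.select('.notice')]
# <input> elements that have a name attribute, with any value
named_inputs = soup.select('input[name]')
# <input> elements whose name attribute equals Kathrin
kathrin_inputs = soup.select('input[name=Kathrin]')

print(div_spans)          # ['inside div']
print(author)             # ['Al']
print(notices)            # ['Read this']
print(len(named_inputs))  # 1
```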
The price we are looking for is in the element named bg-quote with the field attribute = 'Last'.
The way to achieve this is using the select statement as follows:
element_with_price = soup.select('bg-quote[field=Last]')
If you inspect element_with_price, you will notice that this returns a list.
This makes sense because our select could match more than one element! In our case the select is detailed enough to only return one, therefore we can just access the first element of the list instead:
element_with_price = soup.select('bg-quote[field=Last]')[0]
<p field="some_field">some text</p>
In the example above you might be interested in either the field ("some_field") or
the text (some text) between the <p> tags.
text = element.get_text() # Will return 'some text'
To read the value of an attribute, use get(..) instead of get_text():
field_name = element.get('field') # Will return 'some_field'
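A minimal sketch of the difference between get_text() and get(), assuming the example element is a <p> tag carrying a field attribute with the value some_field and the text some text. The HTML string is written out here only for illustration.

```python
from bs4 import BeautifulSoup

# Assumed example markup: one <p> tag with a 'field' attribute.
soup = BeautifulSoup('<p field="some_field">some text</p>', 'html.parser')
element = soup.select('p')[0]

text = element.get_text()          # the text between the tags
field_name = element.get('field')  # the value of the 'field' attribute
missing = element.get('id')        # an attribute that is absent gives None

print(text)        # some text
print(field_name)  # some_field
print(missing)     # None
```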
price_text = element_with_price.get_text()
Attention: This will always return a string. If you want the actual numerical value you will need to convert this to a float!
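One possible way to do the conversion is sketched below. The helper name parse_price is hypothetical, and the sketch assumes the page uses a dot as the decimal separator and commas as thousands separators, which may not hold for every site.

```python
def parse_price(price_text):
    """Convert scraped price text such as '1,165.32' to a float."""
    # Assumption: commas are thousands separators and can be dropped.
    cleaned = price_text.strip().replace(',', '')
    return float(cleaned)


print(parse_price('165.32'))      # 165.32
print(parse_price(' 1,165.32 '))  # 1165.32
```

If the text is not a number at all, float() raises a ValueError, which a scraper may want to catch.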
print("The Apple Share price at {} was {}$".format(datetime.now(), price_text))
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
numpy is a powerful scientific computing library
pandas implementation often needs numpy so you should import both
matplotlib.pyplot is for plotting data
import random
data = random.sample(range(10), 10)
ts = pd.Series(data)
pandas indexes the data for you as the integers 0..N
ts[6]
print(type(ts[6]))
ts[3:8]
print(type(ts[3:8]))
ts[ts>7]
Add a value to the Series as follows.
ts[10]=4
mask = ts > 7
ts[mask]
In one line
ts[ts > 7]
ts[(ts > 3) & (ts < 7)]
ts[~(ts==6)]
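Because the Series above is filled with random values, the results of these filters change on every run. The sketch below repeats the same boolean-mask operations on fixed, made-up data so the outcome can be checked by hand.

```python
import pandas as pd

# Fixed values instead of random.sample, so the results are reproducible.
ts = pd.Series([5, 8, 2, 9, 6, 0, 7, 3, 1, 4])

mask = ts > 7                       # Series of True/False, one per element
above_seven = ts[mask]              # keeps only the rows where mask is True
between = ts[(ts > 3) & (ts < 7)]   # & combines two conditions (and)
not_six = ts[~(ts == 6)]            # ~ negates a condition (not)

print(above_seven.tolist())  # [8, 9]
print(between.tolist())      # [5, 6, 4]
print(len(not_six))          # 9
```

The parentheses around each condition are needed because & and ~ bind more tightly than the comparison operators.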
from datetime import datetime
start = datetime(2018, 1, 1)
end = datetime(2018, 1, 10)
date_range = pd.date_range(start, end)
date_range
ts = pd.Series(data, index=date_range)
ts.plot()
data = {'red': random.sample(range(10), 10),
'blue': random.sample(range(10), 10),
'green': random.sample(range(10), 10)}
df = pd.DataFrame(data)
The keys of the dictionary have become the column names of the DataFrame.
Much like a Series, the index defaults to 0...N.
This can be changed by passing in an "index=" argument.
value = df['blue'][5]
You can extract a column. A column in a pandas DataFrame is a pandas Series.
column = df['red']
You can select multiple columns by using a list of the relevant column names. This will return a DataFrame.
multiple_columns = df[['red', 'blue']]
Rows can be selected with .loc[], which gets rows by index label.
There is also .iloc[], which gets rows by position (and hence only takes integers).
A row is a pandas Series.
df.loc[5]
Select multiple rows by passing a list to loc
df.loc[[3,6,9]]
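With the default 0...N index, .loc[] and .iloc[] return the same rows, which hides the difference between them. The sketch below uses a small made-up DataFrame with a string index, set through the index= argument, to show label-based and position-based selection side by side.

```python
import pandas as pd

# Made-up data with a custom (non-integer) index.
df = pd.DataFrame(
    {'red': [1, 2, 3], 'blue': [4, 5, 6]},
    index=['a', 'b', 'c'],
)

row_by_label = df.loc['b']      # the row whose index label is 'b'
row_by_position = df.iloc[1]    # the second row, whatever its label
two_rows = df.loc[['a', 'c']]   # a list of labels returns a DataFrame

print(row_by_label['red'])                   # 2
print(row_by_label.equals(row_by_position))  # True
print(two_rows.shape)                        # (2, 2)
```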
df = pd.read_csv('datasets/dataset1.csv')
Inspect the data....
df.columns
df.shape
df.head(10)
df.sample(10)
df[['Ticker', 'MarketCap']]
df.loc[:, ['Ticker','MarketCap']]
df.iloc[:, [0,3]]
df.filter(['Ticker','MarketCap'], axis=1)
Filter rows
df.loc[[345,400], :]
df.iloc[[345,400], :]
df.filter(items=[345, 400], axis=0)
sum
df['MarketCap'].sum()
mean
df['TotalReturnYTD'].mean()
count
df['TotalReturnYTD'].count()
number of unique values
df['CountryOfDomicile'].nunique()
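The aggregations above depend on datasets/dataset1.csv, which is not reproduced here. As a stand-in, the sketch below builds a tiny DataFrame with the same column names. All values are invented for illustration, including one missing return, to show how each aggregation treats missing data.

```python
import pandas as pd

# Invented sample rows; only the column names come from the examples above.
df = pd.DataFrame({
    'Ticker': ['AAA', 'BBB', 'CCC', 'DDD'],
    'MarketCap': [100.0, 250.0, 50.0, 600.0],
    'TotalReturnYTD': [0.10, -0.05, None, 0.25],  # None becomes NaN
    'CountryOfDomicile': ['US', 'US', 'DE', 'JP'],
})

total_market_cap = df['MarketCap'].sum()           # adds up the column
mean_return = df['TotalReturnYTD'].mean()          # NaN is skipped by default
return_count = df['TotalReturnYTD'].count()        # counts non-missing values only
country_count = df['CountryOfDomicile'].nunique()  # number of distinct values

print(total_market_cap)  # 1000.0
print(return_count)      # 3
print(country_count)     # 3
```

Here mean_return is approximately 0.1, the average of the three returns that are present.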