Breakthrough Tech Lab 2
Part 2 of a series of notebooks for teaching ML to early college students in an 8-week summer lab session.
Lab 2 - Working with Structured Data
The goal of this lab session is to analyze and understand a large dataset efficiently. The dataset we will work with describes cities in North America and their climates. The module will discuss the challenges of loading data, finding the parts we are interested in, and visualizing data output.
The main technical tool we will be working with is a library known as Pandas. Despite the silly name, Pandas is a super popular library for data analysis. It is used in many technology companies for loading and manipulating data.
Review
Before we get started, let us review some of the Python code that we saw last class.
We first saw a bunch of different types, such as numbers, strings, and lists.
number1 = 50.5
string1 = "New York"
list1 = ["Queens", "Brooklyn", "Manhattan", "Staten Island", "The Bronx"]
We then saw some more complex types like dates and counters.
import datetime
date1 = datetime.datetime.now()
date1
As there are so many different types in Python, we discussed how important it was to use Google and StackOverflow to find examples.
from collections import Counter
counter = Counter(["A", "B", "A", "A", "C", "C", "B", "A", "A", "A"])
counter.most_common()
Next, we focused on if and for, the two most important control elements in Python.
- if lets us decide which block of code to run
- for lets us run the same code for each element in a list
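For example, here is a minimal if sketch (the numbers are made up):
value = 25
if value > 20:
    print("big value")
else:
    print("small value")
And here is for in action: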
for val in range(10):
print(val)
Finally, we discussed the special case of strings. There are many useful ways to find values in strings and create new strings.
"first " + "second"
str1 = "first|second"
str1.split("|")
Review Exercise¶
Print only the values in this list that are greater than 20.
list1 = [5, 50, 15, 60, 4, 80]
# FILLME
pass
Unit A
This week is all about data tables. Data tables are a common way of representing facts anywhere from newspaper articles to scientific studies.
For instance, as a running example, let us consider this table from Wikipedia.
You may have used data tables before in spreadsheets. For example, we can put that Wikipedia table in a spreadsheet.
In this spreadsheet we can do lots of things.
Student Question: Do you know how to do the following?
- Change the column names
- Delete a row
- Make a graph
- Add a new column
What about more advanced ideas? Can you:
- Sort by a column?
- Add a new column that changes the previous column?
- Take the sum of a row?
- Find the highest value in a row?
In this lab we will work with real-world data to learn how to calculate important properties.
Pandas
The data that we are working with is located in the file "Cities.csv". You can get this file from the internet by running a short download command.
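For example, here is one way to download it, a minimal sketch using Python's built-in urllib:
import urllib.request
urllib.request.urlretrieve("https://srush.github.io/BT-AI/notebooks/Cities.csv", "Cities.csv")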
This file is raw data as a text file. We can see the output in raw form.
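Here is a small sketch that prints the first few lines of the downloaded file:
with open("Cities.csv") as f:
    for line in f.readlines()[:5]:
        print(line.strip())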
We can see that "csv" stands for "comma-separated values", as each element of the file is split using a comma.
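This is the same split idea we reviewed above; each line of the file can be broken apart on commas (this line is just an illustration):
"City,Country,Population".split(",")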
Pandas is like a super-powered spreadsheet.
import pandas as pd
To load data into the library we use the following command. Here df refers to the "DataFrame", which is what Pandas calls a spreadsheet.
df = pd.read_csv("https://srush.github.io/BT-AI/notebooks/Cities.csv")
df
Just like in a spreadsheet, a Pandas DataFrame has multiple columns representing the underlying elements in the data. Each of these has a name, which we can see here.
df.columns
To see the elements in a single column, we can use square brackets to select just one column.
df["City"]
Student Question: Can you print out another column in the table?
# FILLME
pass
Alternatively, if we want to see just a single row, we can use the loc command.
df.loc[1]
If we want to select several rows we can also pass in a list.
list_of_rows = [1, 5, 6]
df.loc[list_of_rows]
Student Question: Can you print out the rows of Philadelphia and Los Angeles?
# FILLME
pass
Filters
These commands are relatively basic, though, and easy to do in a standard spreadsheet. The main power of Pandas comes from the ability to select rows based on complex filters.
For instance, if you were in a spreadsheet, how would you select only the rows that correspond to cities in Mexico? It's possible but a bit challenging.
In Pandas, we first create a filter. This is kind of like an if statement that gets applied to every row. It creates a variable that remembers which rows passed the filter test.
Filtering
- Decide on the conditional statement for your filter.
- Define a filter variable for your dataframe.
- Apply the filter and rename the dataframe.
Step 1. Our filter is that we want the Country column to be Mexico.
Step 2. We create a filter variable with this conditional. Notice that the filter has True for every city in Mexico and False otherwise.
filter = df["Country"] == "Mexico"
filter
Step 3. We then apply the filter to select the rows that we would like to keep around.
cities_in_mexico_df = df.loc[filter]
cities_in_mexico_df
We need to be careful to give this a new name. It does not change the original dataframe; it just shows us the rows we asked for.
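As a quick sanity check, we can compare the number of rows in each (using Python's built-in len):
len(df), len(cities_in_mexico_df)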
Filtering is a really important step because it lets us calculate other properties.
For example, we can then count the number of cities in Mexico.
total_cities_in_mexico = cities_in_mexico_df["City"].count()
total_cities_in_mexico
Or we can count the population of the biggest cities in Mexico.
total_population_in_mexico = cities_in_mexico_df["Population"].sum()
total_population_in_mexico
Filters can also be more complex. You can check for any of the different properties you might check for in a standard if statement.
For instance, here we want to keep cities that are in either the US or Canada. The symbol | means either-or.
filter = (df["Country"] == "United States") | (df["Country"] == "Canada")
us_or_canada_df = df.loc[filter]
us_or_canada_df
Student Question: How many of the cities are in the US or Canada?
# FILLME
pass
Here is a list of the different statements that we commonly use.
Filter | Symbol |
---|---|
Or | \| |
And | & |
Not | ~ |
Equal | == |
Less | < |
Greater | > |
Contains | .str.contains |
Is one of | .isin |
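For example, here is a small sketch using the Not symbol from this table (column names as in our dataframe above):
filter = ~(df["Country"] == "Mexico")
not_mexico_df = df.loc[filter]
not_mexico_df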
Note: I didn't know many of these by heart. Don't be afraid to Google "how to filter by ... in pandas" if you get stuck.
Group Exercise A
Question 1
Filters can be of many different types. For instance, when working with numerical fields we can have filters based on greater-than and less-than comparisons.
Write a filter that keeps only cities with greater than a million people.
# FILLME
pass
How many are there?
# FILLME
pass
(Be sure to print it out to check that it worked!)
Question 2
Several cities in North America include the word "City" in their name. Write a filter to find the cities that have "City" in their name.
# FILLME
pass
What is the smallest city on this list?
# FILLME
pass
Question 3
Most of the cities on the list are in Canada, Mexico, or the US.
Can you write a filter to find the cities that are not in any of these countries?
# FILLME
pass
What is the largest city in this list?
# FILLME
pass
Question 4
We can also apply filters that look for two properties at the same time.
Can you write a filter to find the cities in the US with over a million people?
# FILLME
pass
How many are there?
# FILLME
pass
Unit B
In this unit we will look at three more advanced Pandas functions. Unlike filters, which just remove rows, these will allow us to manipulate our data to compute new properties and even new columns.
Group-Bys
We saw above how to compute the total number of cities in Mexico on our list. We did this by first filtering and then "aggregating" by calling count(). Here is a reminder.
filter = df["Country"] == "Mexico"
cities_in_mexico_df = df.loc[filter]
total_cities_in_mexico = cities_in_mexico_df["City"].count()
total_cities_in_mexico
However, what if we also want to know the number of cities in Canada, the US, and all the other countries on our list? We can do this with a group-by operation.
GroupBy
- GroupBy - Determine the subset of data to use
- Aggregation - Compute a property over the group
Step 1. Group By
grouped = df.groupby(["Country"])
Step 2. Aggregate
count_of_cities = grouped["City"].count()
count_of_cities
Here is another example. This one computes the population of the largest city in each country.
max_pop = grouped["Population"].max()
max_pop
Student Question: Can you compute the city with the minimum population on the list for each country?
# FILLME
pass
Manipulating Tables
Another useful aspect of tables is to add in new columns. Adding new columns allows us to group by additional properties, create advanced filters, or make pretty graphs.
The easiest way to add a new column in Pandas is to write a function that tells us how to create the new value from the other columns in the table. If you remember from last class, a function looked something like this.
# Returns whether the country is in the US or Canada
def in_us_or_canada(country):
if country == "United States":
return "US/Canada"
if country == "Canada":
return "US/Canada"
return "Not US/Canada"
Now we can add a new column by setting it equal to the result of our function applied to an existing column. We do this by calling Pandas map with the function and the column of interest. This line of code will call our function for each row of the Country column. Notice how it creates a new column.
df["US_or_Canada"] = df["Country"].map(in_us_or_canada)
df
df.columns
We can then use this column in a group by.
grouped = df.groupby(["US_or_Canada"])
count_of_cities = grouped["City"].count()
count_of_cities
A similar technique can be used to manipulate the data in a column to change certain values. For instance, we might want to remove the final " City" from cities like "New York City".
def change_name(str1):
return str1.replace(" City", "")
df["City"] = df["City"].map(change_name)
df
Joining Together Tables
Pandas becomes much more powerful when we start to have many different tables that relate to each other. For this example we will consider another table that provides the locations of these cities.
Let's load this table into a new variable.
all_cities_df = pd.read_csv("https://srush.github.io/BT-AI/notebooks/AllCities.csv")
all_cities_df
This table has most of the cities in our dataset. But it also includes a lot of other cities outside of North America. For example, we can filter for the cities in Germany.
filter = all_cities_df["Country"] == "Germany"
europe_df = all_cities_df.loc[filter]
europe_df
In order to use this new information, let's merge it into our table. We just need to tell Pandas which columns are shared between the two tables.
df = df.merge(all_cities_df, on=["City", "Country"])
df
Group Exercise B
Question 1
The following are abbreviation codes for the countries in our data table.
abbrev = {
"United States": "US",
"Mexico" : "MX",
"Canada" : "CA",
"Haiti" : "HAT",
"Jamaica" : "JM",
"Cuba" : "CU",
"Honduras" : "HO",
"Nicaragua" : "NR",
"Dominican Republic" : "DR",
"Guatemala" : "G",
}
Can you add a new column to the table called "Abbrev" that lists the abbreviation code for each city's country?
# FILLME
pass
Question 2
Our table has the Latitude and Longitude of all the major North American Cities.
Can you find out where New York is located? How about Detroit, Las Vegas, and Portland?
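# FILLME
pass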
Question 3
Currently in the table the latitude and longitude are represented as string types, because they have N / S and E / W in their values. These two functions will fix that issue.
def latitude_to_number(latitude_string):
str1 = latitude_string
if str1[-1] == "N":
return float(str1[:-1])
else:
return -float(str1[:-1])
def longitude_to_number(longitude_string):
str1 = longitude_string.replace("W", "")
return -float(str1)
lat = latitude_to_number("190N")
lat
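We can check the longitude function the same way (the value here is just for illustration):
lng = longitude_to_number("74W")
lng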
Can you use these functions to fix the Latitude and Longitude columns to instead use numeric values?
# FILLME
pass
Question 4
After completing Question 3, use group by to compute the latitude of the most southern city in each country of the table.
# FILLME
pass
Visualization
Next class we will dive deeper into plotting and visualization. But let's finish with a little demo to show off all the tables we created.
First, we import some libraries.
import altair as alt
from vega_datasets import data
states = alt.topo_feature(data.us_10m.url, feature='states')
background = alt.Chart(states).mark_geoshape().project('albersUsa')
Now we can plot.
states = alt.topo_feature(data.world_110m.url, feature='countries')
chart = alt.Chart(states).mark_geoshape(
fill='lightgray',
stroke='white'
).properties(
width=500,
height=300
).project('orthographic', rotate=[95, -42, 0])
# Change to True to overlay the cities (requires the numeric Latitude/Longitude from Question 3)
if False:
    points = alt.Chart(df).mark_circle().encode(
        longitude='Longitude',
        latitude='Latitude',
        size="Population",
        tooltip=['City','Population']
    )
    chart += points
chart