Breakthrough Tech Lab 2


Part 2 of a series of notebooks for teaching ML to early college students in an 8 week summer lab session.

Lab 2 - Working with Structured Data

The target of this lab session is to analyze and understand a large dataset efficiently. The dataset we will work with is a dataset of cities in the US and their climates. The module will discuss the challenges of loading data, finding the parts we are interested in, and visualizing data output.

pandas

The main technical tool we will be working with is a library known as Pandas. Despite the silly name, Pandas is a super popular library for data analysis. It is used in many technology companies for loading and manipulating data.

Review

Before we get started let us review some of the Python code that we saw last class.

python

We first saw a bunch of different types such as numbers, strings, and lists

In [1]:
number1 = 50.5
string1 = "New York"
list1 = ["Queens", "Brooklyn", "Manhattan", "Staten Island", "The Bronx"]

We then saw some more complex types likes dates and counters.

In [2]:
import datetime
date1 = datetime.datetime.now()
date1
Out[2]:
datetime.datetime(2021, 6, 10, 13, 50, 57, 271951)

As there are so many different types in Python, we discussed how important it was to use Google and StackOverflow to find examples.

In [3]:
from collections import Counter
counter = Counter(["A", "B", "A", "A", "C", "C", "B", "A", "A", "A"])
counter.most_common()
Out[3]:
[('A', 6), ('B', 2), ('C', 2)]

Next, we focused on if and for the two most important control elements in python.

  • if lets us decide which block of code to run
  • for lets us run the same code for each element in a list
In [4]:
for val in range(10):
    print(val)
0
1
2
3
4
5
6
7
8
9

Finally we discussed the special case of strings. There are many useful ways to find values in strings and create new strings.

In [5]:
"first " + "second"
Out[5]:
'first second'
In [6]:
str1 = "first|second"
In [7]:
str1.split("|")
Out[7]:
['first', 'second']

Review Exercise

Print only the values in this list that are greater than 20.

In [8]:
list1 = [5, 50, 15, 60, 4, 80]
In [9]:
#📝📝📝📝 FILLME
pass

Unit A

This week is all about data tables. Data tables are a common way of representing facts anywhere from newspaper articles to scientific studies.

For instance, as a running example let us consider this table from Wikipedia.

new york

You may have used datatables before in spreadsheets. For example we can put that wikipedia table in a spreadsheet.

In this spreadsheet we can do lots of things.

👩‍🎓Student question: Do you know how to do the following?

  • Change the column names
  • Delete a row
  • Make a graph
  • Add a new column

What about more advanced ideas. Can you?

  • Sort by a column?
  • Add a new column that changes the previous column?
  • Take the sum of a row?
  • Find the highest value in a row?

In this lab we will work with real-world data to learn how to calculate important properties.

Pandas

The data that we are working with is located in the file "Cities.csv". You can get this file from the internet by running this command.

This file is raw data as a text file. We can see the output in raw form.

We can see that "csv" stands for "comma separated values" as each element of the file is split using a comma.

Pandas is as a super-powered spreadsheet.

In [10]:
import pandas as pd

To load data in the library we use the following command. Here df refers to the "DataFrame" which is what Pandas calls a spreadsheet.

In [11]:
df = pd.read_csv("https://srush.github.io/BT-AI/notebooks/Cities.csv")
df
Out[11]:
Rank City Country Population
0 0 Mexico City Mexico 8918653
1 1 New York City United States 8550405
2 2 Los Angeles United States 3971883
3 3 Toronto Canada 2826498
4 4 Chicago United States 2720546
... ... ... ... ...
90 90 Surrey Canada 526004
91 91 Ciudad López Mateos Mexico 523296
92 92 Tultitlán Mexico 520557
93 93 Fresno United States 520052
94 94 Carrefour Haiti 501768

95 rows × 4 columns

Just like in a spreadsheet Pandas has multiple columns representing the underlying elements in the data. These each have a name here.

In [12]:
df.columns
Out[12]:
Index(['Rank', 'City', 'Country', 'Population'], dtype='object')

To see just the elements in a single column we can use square brackets to see just only column.

In [13]:
df["City"]
Out[13]:
0             Mexico City
1           New York City
2             Los Angeles
3                 Toronto
4                 Chicago
             ...         
90                 Surrey
91    Ciudad López Mateos
92              Tultitlán
93                 Fresno
94              Carrefour
Name: City, Length: 95, dtype: object

👩‍🎓Student Question: Can you print out another column in the table?

In [14]:
#📝📝📝📝 FILLME
pass

Alternatively if we want to see just a single row we can use the loc command.

In [15]:
df.loc[1]
Out[15]:
Rank                      1
City          New York City
Country       United States
Population          8550405
Name: 1, dtype: object

If we want to select several rows we can also pass in a list.

In [16]:
list_of_rows = [1, 5, 6]
df.loc[list_of_rows]
Out[16]:
Rank City Country Population
1 1 New York City United States 8550405
5 5 Houston United States 2296224
6 6 Havana Cuba 2117625

👩‍🎓Student Question: Can you print out the rows of Philadelphia and Los Angeles?

In [17]:
#📝📝📝📝 FILLME
pass

Filters

These commands are relatively basic though and easy to do in a standard spreadsheet. The main power of Pandas comes from the ability to select rows based on complex filters.

For instance, if you were in a spreadsheet, how would you select only the rows that correspond to cities in Mexico? It's possible but a bit challenging.

In Pandas, we first create a filter. This is kind of like an if statement that gets applied to every row. It creates a variable that remembers which rows passed the filter test.

Filtering

  1. Decide on the conditional statements in your filter.
  2. Define a filter varaible for your dataframe .
  3. Apply filter and rename the dataframe.

Step 1. Our filter is that we want the Country column to be Mexico

Step 2. We create a filter variable with this conditional. Notice that the filter has a 1 for every city in Mexico and a 0 otherwise.

In [18]:
filter = df["Country"] == "Mexico"
filter
Out[18]:
0      True
1     False
2     False
3     False
4     False
      ...  
90    False
91     True
92     True
93    False
94    False
Name: Country, Length: 95, dtype: bool

Step 3. We then apply the filter to select the rows that we would like to keep around.

In [19]:
cities_in_mexico_df = df.loc[filter]
cities_in_mexico_df
Out[19]:
Rank City Country Population
0 0 Mexico City Mexico 8918653
8 8 Ecatepec de Morelos Mexico 1677678
12 12 Guadalajara Mexico 1460148
13 13 Puebla Mexico 1437939
15 15 Juárez Mexico 1382753
16 16 León Mexico 1349224
18 18 Tijuana Mexico 1298475
21 21 Zapopan Mexico 1179681
22 22 Monterrey Mexico 1109171
24 24 Nezahualcóyotl Mexico 1039867
29 29 Naucalpan Mexico 970012
33 33 Mérida Mexico 892363
34 34 Querétaro Mexico 878931
35 35 Toluca Mexico 873536
37 37 Chihuahua Mexico 867736
43 43 Hermosillo Mexico 819999
44 44 Saltillo Mexico 791074
45 45 Aguascalientes Mexico 785942
47 47 San Luis Potosí Mexico 759854
48 48 Veracruz Mexico 752171
51 51 Culiacán Mexico 720798
53 53 Mexicali Mexico 717628
54 54 Cancún Mexico 711682
55 55 Acapulco Mexico 710261
56 56 Tlalnepantla Mexico 700734
58 58 Guadalupe Mexico 682880
61 61 Chimalhuacán Mexico 679811
66 66 Tlaquepaque Mexico 664193
69 69 Torreón Mexico 653150
72 72 Reynosa Mexico 631611
78 78 Morelia Mexico 608190
80 80 Tuxtla Gutiérrez Mexico 598710
81 81 Apodaca Mexico 597207
82 82 Durango Mexico 590893
87 87 Tonalá Mexico 536111
89 89 Cuautitlán Izcalli Mexico 531041
91 91 Ciudad López Mateos Mexico 523296
92 92 Tultitlán Mexico 520557

We need to be careful to give this a new name. It does not change the original dataframe it just shows us the rows we asked for.

Filtering is a really important step because it lets us calculate other properties.

For example, we can then count the number of cities in Mexico.

In [20]:
total_cities_in_mexico = cities_in_mexico_df["City"].count()
total_cities_in_mexico
Out[20]:
38

Or we can count the population of the biggest cities in Mexico.

In [21]:
total_population_in_mexico = cities_in_mexico_df["Population"].sum()
total_population_in_mexico
Out[21]:
40623960

Filters can also be more complex. You can check for any of the different properties you might check for in a standard if statement.

For instance, here we want to keep both cities in the US and in Canada. The symbol | means either-or.

In [22]:
filter = (df["Country"] == "United States") | (df["Country"] == "Canada")
us_or_canada_df = df.loc[filter]
us_or_canada_df
Out[22]:
Rank City Country Population
1 1 New York City United States 8550405
2 2 Los Angeles United States 3971883
3 3 Toronto Canada 2826498
4 4 Chicago United States 2720546
5 5 Houston United States 2296224
7 7 Montreal Canada 1753034
9 9 Philadelphia United States 1567442
10 10 Phoenix United States 1563025
11 11 San Antonio United States 1469845
14 14 San Diego United States 1394928
17 17 Dallas United States 1300092
19 19 Calgary Canada 1230915
25 25 San Jose United States 1026908
30 30 Ottawa Canada 956710
31 31 Austin United States 931830
32 32 Edmonton Canada 899447
36 36 Jacksonville United States 868031
38 38 San Francisco United States 864816
39 39 Indianapolis United States 853173
40 40 Columbus United States 850106
41 41 Fort Worth United States 833319
42 42 Charlotte United States 827097
46 46 Mississauga Canada 761300
52 52 Winnipeg Canada 718400
57 57 Seattle United States 684451
59 59 Denver United States 682545
60 60 El Paso United States 681124
62 62 Detroit United States 677116
63 63 Washington, D.C. United States 672228
65 65 Boston United States 667137
67 67 Memphis United States 655770
68 68 Nashville United States 654610
70 70 Vancouver Canada 648608
71 71 Portland United States 632309
73 73 Oklahoma City United States 631346
74 74 Las Vegas United States 623747
75 75 Baltimore United States 621849
76 76 Brampton Canada 620700
77 77 Louisville United States 615366
79 79 Milwaukee United States 600155
84 84 Albuquerque United States 559121
85 85 Hamilton Canada 556359
86 86 Quebec City Canada 540994
88 88 Tucson United States 531641
90 90 Surrey Canada 526004
93 93 Fresno United States 520052

👩‍🎓Student Question: How many of the cities are in the US or Canada?

In [23]:
#📝📝📝📝 FILLME
pass

Here is a list of the different statements that we commonly use.

Filter Symbol
Or |
And &
Not ~
Equal ==
Less <
Greater >
Greater >
In .str.contains
Is one of .isin

Note: I didn't know many of these by heart. Don't be afraid to google "how to filter by ... in pandas" if you get stuck.

Group Exercise A

Question 1

Filters can be of many different types. For instance, when working with numerical fields we can have filters based on greater-than and less-than comparisons.

Write a filter that keeps only cities with greater than a million people.

In [24]:
#📝📝📝📝 FILLME
pass

How many are there?

In [25]:
#📝📝📝📝 FILLME
pass

(Be sure to print it out to check that it worked!)

Question 2

Several cities in North America include the word "City" in their name. Write a filter to find the cities that have "City" in their name.

In [26]:
#📝📝📝📝 FILLME
pass

What is the smallest city on this list?

In [27]:
#📝📝📝📝 FILLME
pass

Question 3

Most of the cities on the list are in Canada, Mexico or the US.

Can you write a filter to find the cities that are not in any of these countries?

In [28]:
#📝📝📝📝 FILLME
pass

What is the largest city in this list?

In [29]:
#📝📝📝📝 FILLME
pass

Question 4

We can also apply filters that look for two properties at the same time.

Can you write a filter to find the cities in the US of over a million people?

In [30]:
#📝📝📝📝 FILLME
pass

How many are there?

In [31]:
#📝📝📝📝 FILLME
pass

Unit B

In this unit we will look at three more advanced Pandas functions. Unlike filters, which just remove rows, these will allow use to manipute our data to compute new properties and even new columns.

Group By's

We saw above how to compute the total number of cities in Mexico on our list. We did this by first filtering and then "aggregating" by calling count(). Here is a reminder.

In [32]:
filter = df["Country"] == "Mexico"
cities_in_mexico_df = df.loc[filter]
total_cities_in_mexico = cities_in_mexico_df["City"].count()
total_cities_in_mexico
Out[32]:
38

However, what if we also want to know the number of cities in Canada and US and all the other countries on our list. We can do this with a group-by operation

GroupBy

  1. GroupBy - Determine the subset of data to use
  2. Aggregation - Compute a property over the group

Step 1. Group By

In [33]:
grouped = df.groupby(["Country"])

Step 2. Aggregate

In [34]:
count_of_cities = grouped["City"].count()
count_of_cities
Out[34]:
Country
Canada                12
Cuba                   1
Dominican Republic     2
Guatemala              2
Haiti                  2
Honduras               2
Jamaica                1
Mexico                38
Nicaragua              1
United States         34
Name: City, dtype: int64

Here is another example. This one computes the population of the largest city in each country.

In [35]:
max_pop = grouped["Population"].max()
max_pop
Out[35]:
Country
Canada                2826498
Cuba                  2117625
Dominican Republic    1007997
Guatemala              994078
Haiti                  987310
Honduras              1190230
Jamaica                669627
Mexico                8918653
Nicaragua             1048134
United States         8550405
Name: Population, dtype: int64

👩‍🎓 Student Question: Can you compute the city with the minimum population on the list for each country?

In [36]:
#📝📝📝📝 FILLME
pass

Manipulating Tables

Another useful aspect of tables is is to add in new columns. Adding new columns allows us to group by additional properties, create advanced filters, or make pretty graphs.

The easiest way to add a new column in pandas is to write a function that tells us how to create the new column from the other columns in the table.

In order to add a new column, we need to write a function. If you remember last class, a function looked something like this.

In [37]:
# Returns if the country is in US or Canada
def in_us_or_canada(country):
    if country == "United States":
        return "US/Canada"
    if country == "Canada":
        return "US/Canada"
    return "Not US/Canada"

Now we can add a new column by setting that column equal to the country. We do this by calling Pandas map with the function and the column of interest. This line of code will call our function for each row of the Country column. Notice how it creates a new column.

In [38]:
df["US_or_Canada"] = df["Country"].map(in_us_or_canada)
df
Out[38]:
Rank City Country Population US_or_Canada
0 0 Mexico City Mexico 8918653 Not US/Canada
1 1 New York City United States 8550405 US/Canada
2 2 Los Angeles United States 3971883 US/Canada
3 3 Toronto Canada 2826498 US/Canada
4 4 Chicago United States 2720546 US/Canada
... ... ... ... ... ...
90 90 Surrey Canada 526004 US/Canada
91 91 Ciudad López Mateos Mexico 523296 Not US/Canada
92 92 Tultitlán Mexico 520557 Not US/Canada
93 93 Fresno United States 520052 US/Canada
94 94 Carrefour Haiti 501768 Not US/Canada

95 rows × 5 columns

In [39]:
df.columns
Out[39]:
Index(['Rank', 'City', 'Country', 'Population', 'US_or_Canada'], dtype='object')

We can then use this column in a group by.

In [40]:
grouped = df.groupby(["US_or_Canada"])
count_of_cities = grouped["City"].count()
count_of_cities
Out[40]:
US_or_Canada
Not US/Canada    49
US/Canada        46
Name: City, dtype: int64

A similar technique can be used to manipulate the data in a column to change certain values. For instance, we might want to remove the final " City" from cities like "New York"

In [41]:
def change_name(str1):
    return str1.replace(" City", "")
In [42]:
df["City"] = df["City"].map(change_name)
df
Out[42]:
Rank City Country Population US_or_Canada
0 0 Mexico Mexico 8918653 Not US/Canada
1 1 New York United States 8550405 US/Canada
2 2 Los Angeles United States 3971883 US/Canada
3 3 Toronto Canada 2826498 US/Canada
4 4 Chicago United States 2720546 US/Canada
... ... ... ... ... ...
90 90 Surrey Canada 526004 US/Canada
91 91 Ciudad López Mateos Mexico 523296 Not US/Canada
92 92 Tultitlán Mexico 520557 Not US/Canada
93 93 Fresno United States 520052 US/Canada
94 94 Carrefour Haiti 501768 Not US/Canada

95 rows × 5 columns

Joining Together Tables

Pandas becomes much more powerful when we start to have many different tables that relate to each other. For this example we will consider another table that provides the locations about these cities. You can see that here:

Lets load this table into a new variable.

In [43]:
all_cities_df = pd.read_csv("https://srush.github.io/BT-AI/notebooks/AllCities.csv")
all_cities_df
Out[43]:
Id City Country Longitude Latitude
0 0 A Coruña Spain 8.73W 42.59N
1 1 Aachen Germany 6.34E 50.63N
2 2 Aalborg Denmark 10.33E 57.05N
3 3 Aba Nigeria 8.07E 5.63N
4 4 Abadan Iran 48.00E 29.74N
... ... ... ... ... ...
3505 3505 Århus Denmark 10.33E 57.05N
3506 3506 Çorlu Turkey 27.69E 40.99N
3507 3507 Çorum Turkey 34.08E 40.99N
3508 3508 Öskemen Kazakhstan 82.39E 50.63N
3509 3509 Ürümqi China 87.20E 44.20N

3510 rows × 5 columns

This table has most of the cities in our dataset. But there are also a lot of other cities in this table outside of North America.

In [44]:
filter = all_cities_df["Country"] == "Germany" 
europe_df = all_cities_df.loc[filter]
europe_df
Out[44]:
Id City Country Longitude Latitude
1 1 Aachen Germany 6.34E 50.63N
187 187 Augsburg Germany 10.66E 47.42N
338 338 Bergisch Gladbach Germany 6.34E 50.63N
340 340 Berlin Germany 13.14E 52.24N
370 370 Bielefeld Germany 7.88E 52.24N
... ... ... ... ... ...
3343 3343 Wiesbaden Germany 8.87E 50.63N
3349 3349 Witten Germany 7.88E 52.24N
3351 3351 Wolfsburg Germany 10.51E 52.24N
3364 3364 Wuppertal Germany 6.34E 50.63N
3369 3369 Würzburg Germany 9.80E 49.03N

81 rows × 5 columns

In order to use this new information let's merge since it in to our table. We just need to tell pandas which are the shared columns between the two tables.

In [45]:
df = df.merge(all_cities_df, on=["City", "Country"])
df
Out[45]:
Rank City Country Population US_or_Canada Id Longitude Latitude
0 0 Mexico Mexico 8918653 Not US/Canada 1955 98.96W 20.09N
1 1 New York United States 8550405 US/Canada 2126 74.56W 40.99N
2 2 Los Angeles United States 3971883 US/Canada 1775 118.70W 34.56N
3 3 Toronto Canada 2826498 US/Canada 3140 80.50W 44.20N
4 4 Chicago United States 2720546 US/Canada 608 87.27W 42.59N
... ... ... ... ... ... ... ... ...
79 87 Tonalá Mexico 536111 Not US/Canada 3132 104.08W 20.09N
80 88 Tucson United States 531641 US/Canada 3171 111.20W 31.35N
81 89 Cuautitlán Izcalli Mexico 531041 Not US/Canada 719 98.96W 20.09N
82 93 Fresno United States 520052 US/Canada 960 119.34W 36.17N
83 94 Carrefour Haiti 501768 Not US/Canada 542 72.68W 18.48N

84 rows × 8 columns

Group Exercise B

Question 1

The following are the official abbreviation codes for the cities in our data table.

In [46]:
abbrev = {
    "United States": "US",
    "Mexico" : "MX",
    "Canada" : "CA",
    "Haiti" : "HAT",
    "Jamaica" : "JM",
    "Cuba" : "CU",
    "Honduras" : "HO",
    "Nicaragua" : "NR",
    "Dominican Republic" : "DR",
    "Guatemala" : "G",
    }

Can you add a new column to the table called "Abbrev" that lists the abbreviation code for that city?

In [47]:
#📝📝📝📝 FILLME
pass

Question 2

Our table has the Latitude and Longitude of all the major North American Cities.

Can you find out where New York is located? How about Detroit, Las Vegas, and Portland?

Question 3

Currently in the table the latitude and longitude are represented as string types, because they have N / S and E / W in their values. These two functions will fix that issue.

In [48]:
def latitude_to_number(latitude_string):
    str1 = latitude_string
    if str1[-1] == "N":
        return float(str1[:-1])        
    else:
        return -float(str1[:-1])    
In [49]:
def longitude_to_number(longitude_string):
    str1 = longitude_string.replace("W", "")
    return -float(str1)
In [50]:
lat = latitude_to_number("190N")
lat
Out[50]:
190.0

Can you use these functions to fix the Latitude and Longitude columns to instead use numeric values?

In [51]:
#📝📝📝📝 FILLME
pass

Question 4

After completing question 3 use group by and compute the Latitude of most southern city in each country of the table.

In [52]:
#📝📝📝📝 FILLME
pass

Visualization

Next class we will dive deeper into plotting and visualization. But let's finish with a little demo to show off all the tables we created.

First we import some libraries

In [53]:
import altair as alt
from vega_datasets import data
In [54]:
states = alt.topo_feature(data.us_10m.url, feature='states')
background = alt.Chart(states).mark_geoshape().project('albersUsa')

Now we can plot

In [55]:
states = alt.topo_feature(data.world_110m.url, feature='countries')
chart = alt.Chart(states).mark_geoshape(
        fill='lightgray',
        stroke='white'
    ).properties(
        width=500,
        height=300
    ).project('orthographic', rotate= [95, -42, 0])
if False:
    points = alt.Chart(df).mark_circle().encode(
        longitude='Longitude',
        latitude='Latitude',
        size="Population",
        tooltip=['City','Population']
    )
    chart += points
chart
Out[55]: