1 An Age Guessing Game

In the following game we will try to guess the age of the person in each image. The images are taken from UTKFace, a large-scale face dataset. The people shown in our game are between 15 and 90 years old.

Scan the following QR code to open the game on your mobile device, or click here:

https://uni-sofia.vercel.app/surveys/UKTfaces.

1.1 Data Import

Now that we have played the game, let’s import the data and see what we can learn from it. The following block of code will download the data, restructure it and display the first few rows. In these introductory classes you do not need to understand the code, but you can run it and see the results.

The data consists of the following columns:

FileId (character): The name of the image file.
UserId (character): A unique identifier for the user.
Guess (numeric): The age guessed by the user.
Age (numeric): The actual age of the person in the image.
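Race (character): The race category of the person in the image, as recorded in the dataset (for example White or Other).
Gender (character): The gender of the person in the image (F or M).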
Position (numeric): The position of the image in the game (0-16), 0 being the first image shown to the user, 1 the second, and so on. The images were shown in random order except for the first image which was always the same.
TimeStart (numeric): The time when the user saw the image (milliseconds from 1970-01-01).
TimeEnd (numeric): The time when the user submitted the guess (milliseconds from 1970-01-01).

Each row corresponds to a guess made by a user for an image.

	FileId	UserId	Guess	Age	Race	Gender	Position	TimeStart	TimeEnd
0	72_1_0_20170110141531648_jpg	2TkJQyslSFb5GXbdv1aCc2QAZO53	58	72	White	F	10	1728456243117	1728456250343
1	62_0_0_20170111210223707_jpg	2TkJQyslSFb5GXbdv1aCc2QAZO53	70	62	White	M	9	1728456238837	1728456243116
2	75_0_1_20170111205346848_jpg	2TkJQyslSFb5GXbdv1aCc2QAZO53	57	75	Other	M	7	1728456228654	1728456235495
3	21_1_2_20170105183505385_jpg	2TkJQyslSFb5GXbdv1aCc2QAZO53	19	21	Other	F	16	1728456288579	1728456295650
4	28_0_1_20170112211810813_jpg	2TkJQyslSFb5GXbdv1aCc2QAZO53	30	28	Other	M	11	1728456250345	1728456254708
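
The TimeStart and TimeEnd columns are easier to read once they are converted. The following is a minimal sketch, assuming only the two timing values copied from the first two rows above; the column names Start and Seconds are made up for this illustration and are not part of the dataset.

```python
import pandas as pd

# Two rows copied from the table above (only the timing columns are kept)
sample = pd.DataFrame({
    "TimeStart": [1728456243117, 1728456238837],
    "TimeEnd": [1728456250343, 1728456243116],
})

# Interpret the integers as milliseconds since 1970-01-01 (the Unix epoch)
sample["Start"] = pd.to_datetime(sample["TimeStart"], unit="ms")

# Time spent on each image in seconds ("Seconds" is a hypothetical column name)
sample["Seconds"] = (sample["TimeEnd"] - sample["TimeStart"]) / 1000

print(sample[["Start", "Seconds"]])
```

Subtracting the two timestamps gives the time the user needed for a guess, here about 7.2 and 4.3 seconds.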

1.2 First Steps with Python

Before we start working with the data, let’s make sure that we understand how to use Python and the Jupyter Notebook. The following block of code will show you how to perform basic arithmetic operations in Python, print the results, and store them in variables.

Running the Python Code

To run the code in the following cells, click on the cell and press Ctrl+Enter or click on the Run button (upper left side of the cell).

After running the cell, check the output below it. The value of the last expression in the cell is displayed automatically. If you want to display other values, use the print() function.

Variables in Visual Studio Code

Visual Studio Code allows you to see the values of the variables that you have defined in the code. Find the Variables tab on the top side of the notebook window and click on it. You will see the variables that you have defined and their values.

Variables in Google Colab

Google Colab also allows you to see the values of the variables that you have defined in the code. Click on the {x} icon on the left side of the page.

# Creates a new variable called x and assigns it the value of 2 to the power of 35

x = 2 ** 35

# Creates a new variable called y and assigns it the value of 8.12
y = 8.12

# Prints the result of dividing x by y
print(x / y)

# Prints the result of multiplying x by y (it is printed, because it is the last line of the cell)
x * y

4231494872.9064045

279001075548.16

1.3 Data Description

The first thing we want to know about a new dataset is its size. How many rows and columns does it have? We can use the shape attribute of the DataFrame to find out.

# Data shape is a tuple (think of it as a list of numbers for now): the first number is the number of rows, the second number is the number of columns
dt.shape

(714, 9)

# You can access the individual elements of the tuple using square brackets

# The first number is the number of rows
dt.shape[0]
print("There are", dt.shape[0], "rows in the dataset.")

# The second number is the number of columns
dt.shape[1]
print("There are", dt.shape[1], "columns in the dataset.")

There are 714 rows in the dataset.
There are 9 columns in the dataset.

# Trying to access the third element of the tuple raises an error (IndexError), because the tuple has only two elements
# dt.shape[2]
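
Because the shape is a tuple with exactly two elements, both can also be stored in variables in a single step. This is a small sketch on a stand-in DataFrame called toy (a hypothetical name, with three guesses copied from the table above), not on the full dataset.

```python
import pandas as pd

# A small stand-in DataFrame with three rows and two columns
toy = pd.DataFrame({"Guess": [58, 70, 57], "Age": [72, 62, 75]})

# Unpack the two elements of the shape tuple into two variables
n_rows, n_cols = toy.shape
print("Rows:", n_rows, "Columns:", n_cols)

# len() of a DataFrame also gives the number of rows
print(len(toy))
```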

# What are the columns in the dataset?

# The columns are stored in the columns attribute of the DataFrame. Again, you will get a list of column names (technically a pandas Index object, but for now you can think of it as a list)
dt.columns

Index(['FileId', 'UserId', 'Guess', 'Age', 'Race', 'Gender', 'Position',
       'TimeStart', 'TimeEnd'],
      dtype='object')

# How many unique users are there in the dataset? You can access a column of the DataFrame using the square bracket notation

# The unique method returns the unique values in a Series

dt["UserId"].unique()

array(['2TkJQyslSFb5GXbdv1aCc2QAZO53', '4W0FEKDKyjNZUuCf9uBNs8rfjjq1',
       '8Dv06RmVjaVJE4ZFNRMK5f1oQOp1', '8fPccGO10Bb2la4FkBqgl1rVWHg1',
       '8jKBM5SOv0fOly6vCBdHXtxBPQ23', '9j9Y0NUvW0MJRsoa266Y1kMZSuY2',
       'CPjrqG93VHSL9IGx30p4ZSXd4Qr1', 'CbS6ZrS8dyd0awFJRvhc7OBK5YA2',
       'DuGjUzQlaUOS2bOkD2nzCq3TaMI3', 'GmKPuKAbgoYQCPmlg8RwPNsBHM23',
       'HqIR4yBnDLax4jjFITVp6hVvqx52', 'Lec4YA560vapZcD5fIP8MJm5sl42',
       'MxAZfVjX3eZRJXHxkdj8TrkCldD2', 'NPtpMro4nOOJTqTNpt8PPHmbiFk2',
       'OBczO7vxLvN2FiLBkqnQvCM8NdF2', 'PVb3jSAxyMXAuQ2PeIpdqQ0ILp53',
       'PphGn6oTxNPO1u8YiwrruLNTzXE2', 'QEskanUwcUNT2gyBQFvSaVHeI4D3',
       'SQC3UTi7a1P7CGyqhy7mBe8Zgmj2', 'Tvwjrx7GfaaBRYS6TGadjBjO9HF2',
       'TzOCmmyl29OjE7JyQjqJqg7lD3u1', 'UCEzPzCCb7YfusqZVuUmGPIMFVV2',
       'UpgLIyOuiVSd3y8VQ2gnJ6ViCxw2', 'VeyVpTOLy7NR0V5ZCnbLipceCEG2',
       'W8MAPl0tmKbJAJ3abhcnQrqG5no1', 'WLi6ASLPTccgRUo1KG6ThcH80Nx1',
       'Y1fElir0oXeayPWZb5vLKph5lUo2', 'b0lWZmr0VDa5EFS4AJliWXDqFKP2',
       'e1mjyMk2cOR5XLP9YiO9OPEuapi1', 'fhvPybcsEhO9yui4zF4MYJiowGf2',
       'hASSWNNQmIZY92TSasx7bK4V5VE2', 'kDlJXIQowGTAiE4USAhMfqdgtHi1',
       'mZGwkLZFMQPbTisMJrmnNZCUHAB2', 'stRoVijSb9Rv7DV5f9Z39ccaHIN2',
       'tur4WIEQ4ocZzgF8lmsSCPK3HnW2', 'vOuQwWBYrBYKTzBPG4LESFZ5MBJ2',
       'w6UifmFZfRXtIIuAi76VfB7WiuF3', 'w7pLK2CfZoMtjHE17nJya29OiQd2',
       'wpzHrTsY6hdzWWw8a7DC2hjiDwD2', 'xOisU25aSoWx5B7tzUN6yz5bK8Z2',
       'xv627zbJwNSbB648KX6XfQyCO3d2', 'ybJOzEmZp7ZYeN5vtODtY1Ovu5d2'],
      dtype=object)

# Sometimes it is useful to sort the unique values. The np.sort function sorts the values in ascending order (from the smallest to the largest)
# The function is part of the numpy library, which is why we have imported it at the beginning of the notebook (import numpy as np)

np.sort(dt["UserId"].unique())

array(['2TkJQyslSFb5GXbdv1aCc2QAZO53', '4W0FEKDKyjNZUuCf9uBNs8rfjjq1',
       '8Dv06RmVjaVJE4ZFNRMK5f1oQOp1', '8fPccGO10Bb2la4FkBqgl1rVWHg1',
       '8jKBM5SOv0fOly6vCBdHXtxBPQ23', '9j9Y0NUvW0MJRsoa266Y1kMZSuY2',
       'CPjrqG93VHSL9IGx30p4ZSXd4Qr1', 'CbS6ZrS8dyd0awFJRvhc7OBK5YA2',
       'DuGjUzQlaUOS2bOkD2nzCq3TaMI3', 'GmKPuKAbgoYQCPmlg8RwPNsBHM23',
       'HqIR4yBnDLax4jjFITVp6hVvqx52', 'Lec4YA560vapZcD5fIP8MJm5sl42',
       'MxAZfVjX3eZRJXHxkdj8TrkCldD2', 'NPtpMro4nOOJTqTNpt8PPHmbiFk2',
       'OBczO7vxLvN2FiLBkqnQvCM8NdF2', 'PVb3jSAxyMXAuQ2PeIpdqQ0ILp53',
       'PphGn6oTxNPO1u8YiwrruLNTzXE2', 'QEskanUwcUNT2gyBQFvSaVHeI4D3',
       'SQC3UTi7a1P7CGyqhy7mBe8Zgmj2', 'Tvwjrx7GfaaBRYS6TGadjBjO9HF2',
       'TzOCmmyl29OjE7JyQjqJqg7lD3u1', 'UCEzPzCCb7YfusqZVuUmGPIMFVV2',
       'UpgLIyOuiVSd3y8VQ2gnJ6ViCxw2', 'VeyVpTOLy7NR0V5ZCnbLipceCEG2',
       'W8MAPl0tmKbJAJ3abhcnQrqG5no1', 'WLi6ASLPTccgRUo1KG6ThcH80Nx1',
       'Y1fElir0oXeayPWZb5vLKph5lUo2', 'b0lWZmr0VDa5EFS4AJliWXDqFKP2',
       'e1mjyMk2cOR5XLP9YiO9OPEuapi1', 'fhvPybcsEhO9yui4zF4MYJiowGf2',
       'hASSWNNQmIZY92TSasx7bK4V5VE2', 'kDlJXIQowGTAiE4USAhMfqdgtHi1',
       'mZGwkLZFMQPbTisMJrmnNZCUHAB2', 'stRoVijSb9Rv7DV5f9Z39ccaHIN2',
       'tur4WIEQ4ocZzgF8lmsSCPK3HnW2', 'vOuQwWBYrBYKTzBPG4LESFZ5MBJ2',
       'w6UifmFZfRXtIIuAi76VfB7WiuF3', 'w7pLK2CfZoMtjHE17nJya29OiQd2',
       'wpzHrTsY6hdzWWw8a7DC2hjiDwD2', 'xOisU25aSoWx5B7tzUN6yz5bK8Z2',
       'xv627zbJwNSbB648KX6XfQyCO3d2', 'ybJOzEmZp7ZYeN5vtODtY1Ovu5d2'],
      dtype=object)

# The nunique method returns the number of unique values in a Series

dt["UserId"].nunique()
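
The relationship between unique and nunique can be checked on a small example. The sketch below uses a made-up Series of shortened user identifiers (hypothetical values). When there are no missing values, the length of the unique array equals the result of nunique.

```python
import pandas as pd

# Hypothetical user identifiers, shortened for readability
users = pd.Series(["u2", "u1", "u2", "u3", "u1", "u2"], name="UserId")

# Array of distinct values, in the order of their first appearance
distinct = users.unique()

# Number of distinct values
n_distinct = users.nunique()

print(distinct)
print(n_distinct)
print(len(distinct) == n_distinct)
```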

1.4 Exercise

List the unique values of the FileId column. How many unique images are there in the dataset?

# Write your code here

1.5 Selecting Subsets of the Data

There are multiple ways to choose a part of the data. We can select rows by their index, columns by their name, or both. We can also use logical conditions to filter the data. Here we want to select only the rows belonging to one of the users.

# The == operator checks if two values are equal. It returns a boolean value (True or False). In this case, we are checking if the UserId is equal to "2TkJQyslSFb5GXbdv1aCc2QAZO53"
# for each row in the UserId column of the dataframe dt. As the dataframe is quite large, you will see the first 5 and the last 5 rows.

dt["UserId"] == "2TkJQyslSFb5GXbdv1aCc2QAZO53"

0       True
1       True
2       True
3       True
4       True
       ...  
709    False
710    False
711    False
712    False
713    False
Name: UserId, Length: 714, dtype: bool

# We can call the .sum() method of the result to count the number of True values. This will give us the number of rows where the UserId is equal to "2TkJQyslSFb5GXbdv1aCc2QAZO53"

(dt["UserId"] == "2TkJQyslSFb5GXbdv1aCc2QAZO53").sum()

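# Putting a boolean Series in square brackets returns only the rows where the condition is True (here for a different user)
# The .copy() call makes mydt an independent DataFrame, so that adding columns to it later does not affect dt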

mydt = dt[dt["UserId"] == "8jKBM5SOv0fOly6vCBdHXtxBPQ23"].copy()
mydt

	FileId	UserId	Guess	Age	Race	Gender	Position	TimeStart	TimeEnd
68	72_1_0_20170110141531648_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	69	72	White	F	7	1728466078201	1728466083168
69	62_0_0_20170111210223707_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	72	62	White	M	4	1728466058849	1728466065965
70	75_0_1_20170111205346848_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	69	75	Other	M	14	1728466118959	1728466125058
71	21_1_2_20170105183505385_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	32	21	Other	F	9	1728466093455	1728466098292
72	28_0_1_20170112211810813_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	36	28	Other	M	12	1728466106791	1728466112456
73	30_0_4_20170117202914440_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	43	30	Other	M	0	1728465970667	1728466037743
74	62_1_0_20170110175644800_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	68	62	White	F	5	1728466065967	1728466075283
75	16_0_0_20170110231841292_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	17	16	White	M	16	1728466134829	1728466141246
76	18_1_0_20170109214608184_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	22	18	White	F	6	1728466075284	1728466078199
77	75_0_3_20170111202756116_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	57	75	Other	M	1	1728466037744	1728466045288
78	56_0_0_20170111201143803_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	56	56	White	M	15	1728466125060	1728466134828
79	90_1_2_20170110183708997_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	78	90	Other	F	3	1728466053299	1728466058847
80	76_1_2_20170110182935621_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	74	76	Other	F	13	1728466112458	1728466118957
81	26_1_1_20170116232657066_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	29	26	Other	F	11	1728466102073	1728466106789
82	16_0_0_20170110231617005_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	28	16	White	M	10	1728466098294	1728466102072
83	25_0_1_20170113184508496_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	39	25	Other	M	8	1728466083169	1728466093453
84	16_1_0_20170109214013596_jpg	8jKBM5SOv0fOly6vCBdHXtxBPQ23	16	16	White	F	2	1728466045290	1728466053297
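
Selecting rows by a condition and columns by name can be combined in one step with .loc. The following is a minimal sketch on a made-up miniature of the data; the DataFrame toy and the user identifiers "a" and "b" are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the guessing data
toy = pd.DataFrame({
    "UserId": ["a", "a", "b", "b"],
    "Guess": [58, 70, 69, 72],
    "Age": [72, 62, 72, 62],
    "Gender": ["F", "M", "F", "M"],
})

# Boolean mask: True for the rows that belong to user "b"
is_user_b = toy["UserId"] == "b"

# Rows by condition, columns by name
subset = toy.loc[is_user_b, ["Guess", "Age"]]
print(subset)

# Two conditions combined with & (each condition needs its own parentheses)
b_female = toy[(toy["UserId"] == "b") & (toy["Gender"] == "F")]
print(b_female)
```

Note that the selected rows keep their original index labels, which is why the output above starts at 68 rather than at 0.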

1.6 Select Your Data

Create a new DataFrame called mydt containing only the rows where the UserId is equal to your unique identifier (it is shown on the results page of the game). To avoid typing the whole identifier, look at the sorted unique values of the UserId and copy the one that corresponds to you.

1.7 Counting Things

A lot of the time we will want to compute some simple counts. In our example, we may want to count the number of guesses made by each user, the number of guesses that overestimated the age, or the number of correct guesses. We can accomplish this using logical conditions and the sum method of series (or the np.sum function).
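
As an illustration of such counts, the sketch below uses a made-up table with two hypothetical users. It counts the guesses per user with value_counts, one possible approach, and the exactly correct guesses with a logical condition and sum.

```python
import pandas as pd

# Hypothetical miniature of the guessing data
toy = pd.DataFrame({
    "UserId": ["a", "a", "a", "b", "b"],
    "Guess": [58, 70, 21, 69, 56],
    "Age": [72, 62, 21, 72, 56],
})

# Number of guesses (rows) made by each user
guesses_per_user = toy["UserId"].value_counts()
print(guesses_per_user)

# Number of exactly correct guesses in the whole toy table
n_correct = (toy["Guess"] == toy["Age"]).sum()
print(n_correct)
```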

1.8 Counting Overestimations

How many times did you overestimate the age of the person in the images?

# Creates a new column called "Overestimates" that is True if the "Guess" is greater than the "Age" and False otherwise
mydt["Overestimates"] = (mydt["Guess"] > mydt["Age"])

# Creates a new column called "Underestimates" that is True if the "Guess" is less than the "Age" and False otherwise
mydt["Underestimates"] = (mydt["Guess"] < mydt["Age"])

# Creates a new column called "NoError" that is True if the "Guess" is equal to the "Age" and False otherwise
mydt["NoError"] = (mydt["Guess"] == mydt["Age"])

# Sums the values in the "Overestimates" column to get the number of overestimates
mydt["Overestimates"].sum()

# Exercise: In how many images did you underestimate the age?
# Write your code here

# Why does using the .sum() method actually work here?

x_logical = np.array([True, False, True, True, False])
x_logical

array([ True, False,  True,  True, False])

# Test it by changing the values (True/False) in the assignment of x_logical and running the cell again (you also need to run the cell with the definition of x_logical)
x_logical.sum()
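
The reason is that Python and NumPy treat True as 1 and False as 0 in arithmetic, so adding up boolean values counts the True ones. A short sketch with a separate array, called flags here to keep it apart from x_logical:

```python
import numpy as np

flags = np.array([True, False, True, True, False])

# In arithmetic, True behaves like 1 and False like 0
print(True + True)
print(flags.astype(int))

# So the sum counts the True values and the mean gives their share
n_true = flags.sum()
share_true = flags.mean()
print(n_true, share_true)
```

The same idea applies to a boolean column of a DataFrame: its sum is the number of rows where the condition holds, and its mean is the proportion of such rows.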

# Exercise: How many female images have you looked at during the guessing game?

# Exercise: How many male images have you guessed?

# Compare your results with the output of the following code

mydt["Gender"].value_counts()

Gender
M    9
F    8
Name: count, dtype: int64