1  An Age Guessing Game

Open in Google Colab: Open in Colab

In the following game we will try to guess the age of the person in the image. The images are taken from the UTKFace dataset, a large-scale face dataset. The age of the persons in our game ranges from 15 to 90 years.

Scan the following QR code to open the game in your mobile device or click here:

https://uni-sofia.vercel.app/surveys/UKTfaces.

1.1 Data Import

Now that we have played the game, let’s import the data and see what we can learn from it. The following block of code will download the data, restructure it and display the first few rows. In these introductory classes you do not need to understand the code, but you can run it and see the results.

The data consists of the following columns:

  • FileId (character): The name of the image file.
  • UserId (character): A unique identifier for the user.
  • Guess (numeric): The age guessed by the user.
  • Age (numeric): The actual age of the person in the image.
  • Gender (character): The gender of the person in the image.
  • Position (numeric): The position of the image in the game (0-16), 0 being the first image shown to the user, 1 the second, and so on. The images were shown in random order except for the first image which was always the same.
  • TimeStart (numeric): The time when the user saw the image (milliseconds from 1970-01-01).
  • TimeEnd (numeric): The time when the user submitted the guess (milliseconds from 1970-01-01).

Each row corresponds to a guess made by a user for an image.

FileId UserId Guess Age Race Gender Position TimeStart TimeEnd
0 72_1_0_20170110141531648_jpg 2TkJQyslSFb5GXbdv1aCc2QAZO53 58 72 White F 10 1728456243117 1728456250343
1 62_0_0_20170111210223707_jpg 2TkJQyslSFb5GXbdv1aCc2QAZO53 70 62 White M 9 1728456238837 1728456243116
2 75_0_1_20170111205346848_jpg 2TkJQyslSFb5GXbdv1aCc2QAZO53 57 75 Other M 7 1728456228654 1728456235495
3 21_1_2_20170105183505385_jpg 2TkJQyslSFb5GXbdv1aCc2QAZO53 19 21 Other F 16 1728456288579 1728456295650
4 28_0_1_20170112211810813_jpg 2TkJQyslSFb5GXbdv1aCc2QAZO53 30 28 Other M 11 1728456250345 1728456254708

1.2 First Steps with Python

Before we start working with the data, let’s make sure that we understand how to use Python and the Jupyter Notebook. The following block of code will show you how to perform basic arithmetic operations in Python, print the results, and store them in variables.

Running the Python Code

To run the code in the following cells, click on the cell and press Ctrl+Enter or click on the Run button (upper left side of the cell).

After running the cell, check the output below the cell. The last statement in the cell will be printed. If you want to print other values, use the print() function.

Variables in Visual Studio Code

Visual Studio Code allows you to see the values of the variables that you have defined in the code. Find the Variables tab on the top side of the notebook window and click on it. You will see the variables that you have defined and their values.

Variables in Google Colab

Google Colab also allows you to see the values of the variables that you have defined in the code. Click on the {x} icon on the right side of page.

# Creates a new variable called x and assigns it the value of 2 to the power of 35

x = 2 ** 35

# Creates a new variable called y and assigns it the value of 8.12
y = 8.12

# Prints the result of dividing x by y
print(x / y)

# Prints the result of multiplying x by y (it is printed, because it is the last line of the cell)
x * y
4231494872.9064045
279001075548.16

1.3 Data Description

The first thing we want to know about a new dataset is its size. How many rows and columns does it have? We can use the shape attribute of the DataFrame to find out.

# Data shape is a tuple (think of it as a list of numbers for now): the first number is the number of rows, the second number is the number of columns
dt.shape
(714, 9)
# You can access the individual elements of the tuple using square brackets

# The first number is the number of rows
dt.shape[0]
print("There are", dt.shape[0], "rows in the dataset.")

# The second number is the number of columns
dt.shape[1]
print("There are", dt.shape[1], "columns in the dataset.")
There are 714 rows in the dataset.
There are 9 columns in the dataset.
# # Trying to access the third element of the tuple will result in an error (IndexError), because there are only two elements in the list (technically a tuple, but for now you can think of it as a list)
# dt.shape[2]
# What are the columns in the dataset?

# The columns are stored in the columns attribute of the DataFrame. Again, you will a list of column names (technically a pandas Index object, but for now you can think of it as a list)
dt.columns
Index(['FileId', 'UserId', 'Guess', 'Age', 'Race', 'Gender', 'Position',
       'TimeStart', 'TimeEnd'],
      dtype='object')
# How many unique users are there in the dataset? You can access the columns in the using the brackets notation

# The unique function returns the unique values in a Series

dt["UserId"].unique()
array(['2TkJQyslSFb5GXbdv1aCc2QAZO53', '4W0FEKDKyjNZUuCf9uBNs8rfjjq1',
       '8Dv06RmVjaVJE4ZFNRMK5f1oQOp1', '8fPccGO10Bb2la4FkBqgl1rVWHg1',
       '8jKBM5SOv0fOly6vCBdHXtxBPQ23', '9j9Y0NUvW0MJRsoa266Y1kMZSuY2',
       'CPjrqG93VHSL9IGx30p4ZSXd4Qr1', 'CbS6ZrS8dyd0awFJRvhc7OBK5YA2',
       'DuGjUzQlaUOS2bOkD2nzCq3TaMI3', 'GmKPuKAbgoYQCPmlg8RwPNsBHM23',
       'HqIR4yBnDLax4jjFITVp6hVvqx52', 'Lec4YA560vapZcD5fIP8MJm5sl42',
       'MxAZfVjX3eZRJXHxkdj8TrkCldD2', 'NPtpMro4nOOJTqTNpt8PPHmbiFk2',
       'OBczO7vxLvN2FiLBkqnQvCM8NdF2', 'PVb3jSAxyMXAuQ2PeIpdqQ0ILp53',
       'PphGn6oTxNPO1u8YiwrruLNTzXE2', 'QEskanUwcUNT2gyBQFvSaVHeI4D3',
       'SQC3UTi7a1P7CGyqhy7mBe8Zgmj2', 'Tvwjrx7GfaaBRYS6TGadjBjO9HF2',
       'TzOCmmyl29OjE7JyQjqJqg7lD3u1', 'UCEzPzCCb7YfusqZVuUmGPIMFVV2',
       'UpgLIyOuiVSd3y8VQ2gnJ6ViCxw2', 'VeyVpTOLy7NR0V5ZCnbLipceCEG2',
       'W8MAPl0tmKbJAJ3abhcnQrqG5no1', 'WLi6ASLPTccgRUo1KG6ThcH80Nx1',
       'Y1fElir0oXeayPWZb5vLKph5lUo2', 'b0lWZmr0VDa5EFS4AJliWXDqFKP2',
       'e1mjyMk2cOR5XLP9YiO9OPEuapi1', 'fhvPybcsEhO9yui4zF4MYJiowGf2',
       'hASSWNNQmIZY92TSasx7bK4V5VE2', 'kDlJXIQowGTAiE4USAhMfqdgtHi1',
       'mZGwkLZFMQPbTisMJrmnNZCUHAB2', 'stRoVijSb9Rv7DV5f9Z39ccaHIN2',
       'tur4WIEQ4ocZzgF8lmsSCPK3HnW2', 'vOuQwWBYrBYKTzBPG4LESFZ5MBJ2',
       'w6UifmFZfRXtIIuAi76VfB7WiuF3', 'w7pLK2CfZoMtjHE17nJya29OiQd2',
       'wpzHrTsY6hdzWWw8a7DC2hjiDwD2', 'xOisU25aSoWx5B7tzUN6yz5bK8Z2',
       'xv627zbJwNSbB648KX6XfQyCO3d2', 'ybJOzEmZp7ZYeN5vtODtY1Ovu5d2'],
      dtype=object)
dt["UserId"].nunique()
42
# Sometimes it is useful to sort the unique values. The np.sort function sorts the values in ascending order (from the smallest to the largest)
# The function is part of the numpy library, which is why we have imported it at the beginning of the notebook (import numpy as np)

np.sort(dt["UserId"].unique())
array(['2TkJQyslSFb5GXbdv1aCc2QAZO53', '4W0FEKDKyjNZUuCf9uBNs8rfjjq1',
       '8Dv06RmVjaVJE4ZFNRMK5f1oQOp1', '8fPccGO10Bb2la4FkBqgl1rVWHg1',
       '8jKBM5SOv0fOly6vCBdHXtxBPQ23', '9j9Y0NUvW0MJRsoa266Y1kMZSuY2',
       'CPjrqG93VHSL9IGx30p4ZSXd4Qr1', 'CbS6ZrS8dyd0awFJRvhc7OBK5YA2',
       'DuGjUzQlaUOS2bOkD2nzCq3TaMI3', 'GmKPuKAbgoYQCPmlg8RwPNsBHM23',
       'HqIR4yBnDLax4jjFITVp6hVvqx52', 'Lec4YA560vapZcD5fIP8MJm5sl42',
       'MxAZfVjX3eZRJXHxkdj8TrkCldD2', 'NPtpMro4nOOJTqTNpt8PPHmbiFk2',
       'OBczO7vxLvN2FiLBkqnQvCM8NdF2', 'PVb3jSAxyMXAuQ2PeIpdqQ0ILp53',
       'PphGn6oTxNPO1u8YiwrruLNTzXE2', 'QEskanUwcUNT2gyBQFvSaVHeI4D3',
       'SQC3UTi7a1P7CGyqhy7mBe8Zgmj2', 'Tvwjrx7GfaaBRYS6TGadjBjO9HF2',
       'TzOCmmyl29OjE7JyQjqJqg7lD3u1', 'UCEzPzCCb7YfusqZVuUmGPIMFVV2',
       'UpgLIyOuiVSd3y8VQ2gnJ6ViCxw2', 'VeyVpTOLy7NR0V5ZCnbLipceCEG2',
       'W8MAPl0tmKbJAJ3abhcnQrqG5no1', 'WLi6ASLPTccgRUo1KG6ThcH80Nx1',
       'Y1fElir0oXeayPWZb5vLKph5lUo2', 'b0lWZmr0VDa5EFS4AJliWXDqFKP2',
       'e1mjyMk2cOR5XLP9YiO9OPEuapi1', 'fhvPybcsEhO9yui4zF4MYJiowGf2',
       'hASSWNNQmIZY92TSasx7bK4V5VE2', 'kDlJXIQowGTAiE4USAhMfqdgtHi1',
       'mZGwkLZFMQPbTisMJrmnNZCUHAB2', 'stRoVijSb9Rv7DV5f9Z39ccaHIN2',
       'tur4WIEQ4ocZzgF8lmsSCPK3HnW2', 'vOuQwWBYrBYKTzBPG4LESFZ5MBJ2',
       'w6UifmFZfRXtIIuAi76VfB7WiuF3', 'w7pLK2CfZoMtjHE17nJya29OiQd2',
       'wpzHrTsY6hdzWWw8a7DC2hjiDwD2', 'xOisU25aSoWx5B7tzUN6yz5bK8Z2',
       'xv627zbJwNSbB648KX6XfQyCO3d2', 'ybJOzEmZp7ZYeN5vtODtY1Ovu5d2'],
      dtype=object)
# The nunique function returns the number of unique values in a Series

dt["UserId"].nunique()
42

1.4 Exercise

List the unique values of the FileId column. How many unique images are there in the dataset?

# Write your code here and 

1.5 Selecting Subsets of the Data

There are multiple ways to choose a part of the data. We can select rows by their index, columns by their name, or both. We can also use logical conditions to filter the data. Here we want to select only the rows belonging to one of the users.

# The == operator checks if two values are equal. It returns a boolean value (True or False). In this case, we are checking if the UserId is equal to "2TkJQyslSFb5GXbdv1aCc2QAZO53"
# for each row in the UserId column of the dataframe dt. As the dataframe is quite large, you will see the first 5 and the last 5 rows.

dt["UserId"] == "2TkJQyslSFb5GXbdv1aCc2QAZO53"
0       True
1       True
2       True
3       True
4       True
       ...  
709    False
710    False
711    False
712    False
713    False
Name: UserId, Length: 714, dtype: bool
# We can call the .sum() method of the result to count the number of True values. This will give us the number of rows where the UserId is equal to "2TkJQyslSFb5GXbdv1aCc2QAZO53"

(dt["UserId"] == "2TkJQyslSFb5GXbdv1aCc2QAZO53").sum()
17
# Putting this boolean series in square brackets returns only the rows where the condition is True

mydt = dt[dt["UserId"] == "8jKBM5SOv0fOly6vCBdHXtxBPQ23"].copy()
mydt
FileId UserId Guess Age Race Gender Position TimeStart TimeEnd
68 72_1_0_20170110141531648_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 69 72 White F 7 1728466078201 1728466083168
69 62_0_0_20170111210223707_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 72 62 White M 4 1728466058849 1728466065965
70 75_0_1_20170111205346848_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 69 75 Other M 14 1728466118959 1728466125058
71 21_1_2_20170105183505385_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 32 21 Other F 9 1728466093455 1728466098292
72 28_0_1_20170112211810813_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 36 28 Other M 12 1728466106791 1728466112456
73 30_0_4_20170117202914440_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 43 30 Other M 0 1728465970667 1728466037743
74 62_1_0_20170110175644800_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 68 62 White F 5 1728466065967 1728466075283
75 16_0_0_20170110231841292_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 17 16 White M 16 1728466134829 1728466141246
76 18_1_0_20170109214608184_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 22 18 White F 6 1728466075284 1728466078199
77 75_0_3_20170111202756116_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 57 75 Other M 1 1728466037744 1728466045288
78 56_0_0_20170111201143803_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 56 56 White M 15 1728466125060 1728466134828
79 90_1_2_20170110183708997_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 78 90 Other F 3 1728466053299 1728466058847
80 76_1_2_20170110182935621_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 74 76 Other F 13 1728466112458 1728466118957
81 26_1_1_20170116232657066_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 29 26 Other F 11 1728466102073 1728466106789
82 16_0_0_20170110231617005_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 28 16 White M 10 1728466098294 1728466102072
83 25_0_1_20170113184508496_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 39 25 Other M 8 1728466083169 1728466093453
84 16_1_0_20170109214013596_jpg 8jKBM5SOv0fOly6vCBdHXtxBPQ23 16 16 White F 2 1728466045290 1728466053297

1.6 Select Your Data

Create a new DataFrame called mydt containing only the rows where the UserId is equal to your unique identifier (it is shown on the results page of the game). To avoid typing the whole identifier, look at the sorted unique values of the UserId and copy the one that corresponds to you.

1.7 Counting Things

A lot of the time we will want to compute some simple counts. In our example, we may want to count the number of guesses made by each user, the number of guesses that overestimated the age, or the number of correct guesses. We can accomplish this using logical conditions and the sum method of series (or the np.sum function).

1.8 Counting Overestimations

How many times did you overestimate the age of the person in the images?

# Creates a new column called "Overestimates" that is True if the "Guess" is greater than the "Age" and False otherwise
mydt["Overestimates"] = (mydt["Guess"] > mydt["Age"])

# Creates a new column called "Underestimates" that is True if the "Guess" is less than the "Age" and False otherwise
mydt["Underestimates"] = (mydt["Guess"] < mydt["Age"])

# Creates a new column called "NoError" that is True if the "Guess" is equal to the "Age" and False otherwise
mydt["NoError"] = (mydt["Guess"] == mydt["Age"])

# Sums the values in the "Overestimates" column to get the number of overestimates
mydt["NoError"].sum()
2
# Exercise: In how many images did you underestimate the age?
# Write your code here
# Why does using the .sum() method actually work here?

x_logical = np.array([True, False, True, True, False])
x_logical
array([ True, False,  True,  True, False])
# Test it by changing the values (True/False) in the assignment of x_logical and running the cell again (you also need to run the cell with the definition of x_logical)
x_logical.sum()
3
# Exercise: How many female images have you looked at during the guessing game?
# Exercise: How many male images have you guessed?
# Compare your results with the output of the following code

mydt["Gender"].value_counts()
Gender
M    9
F    8
Name: count, dtype: int64