| | FileId | UserId | Guess | Age | Race | Gender | Position | TimeStart | TimeEnd |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 72_1_0_20170110141531648_jpg | 2TkJQyslSFb5GXbdv1aCc2QAZO53 | 58 | 72 | White | F | 10 | 1728456243117 | 1728456250343 |
| 1 | 62_0_0_20170111210223707_jpg | 2TkJQyslSFb5GXbdv1aCc2QAZO53 | 70 | 62 | White | M | 9 | 1728456238837 | 1728456243116 |
| 2 | 75_0_1_20170111205346848_jpg | 2TkJQyslSFb5GXbdv1aCc2QAZO53 | 57 | 75 | Other | M | 7 | 1728456228654 | 1728456235495 |
| 3 | 21_1_2_20170105183505385_jpg | 2TkJQyslSFb5GXbdv1aCc2QAZO53 | 19 | 21 | Other | F | 16 | 1728456288579 | 1728456295650 |
| 4 | 28_0_1_20170112211810813_jpg | 2TkJQyslSFb5GXbdv1aCc2QAZO53 | 30 | 28 | Other | M | 11 | 1728456250345 | 1728456254708 |
1 An Age Guessing Game
In the following game we will try to guess the age of the person in each image. The images are taken from the UTKFace dataset, a large-scale face dataset. The ages of the people in our game range from 15 to 90 years.
Scan the following QR code to open the game on your mobile device or click here:
https://uni-sofia.vercel.app/surveys/UKTfaces.

1.1 Data Import
Now that we have played the game, let’s import the data and see what we can learn from it. The following block of code will download the data, restructure it and display the first few rows. In these introductory classes you do not need to understand the code, but you can run it and see the results.
The data consists of the following columns:
- FileId (character): The name of the image file.
- UserId (character): A unique identifier for the user.
- Guess (numeric): The age guessed by the user.
- Age (numeric): The actual age of the person in the image.
- Race (character): The race of the person in the image.
- Gender (character): The gender of the person in the image.
- Position (numeric): The position of the image in the game (0-16), 0 being the first image shown to the user, 1 the second, and so on. The images were shown in random order except for the first image which was always the same.
- TimeStart (numeric): The time when the user saw the image (milliseconds from 1970-01-01).
- TimeEnd (numeric): The time when the user submitted the guess (milliseconds from 1970-01-01).
Each row corresponds to a guess made by a user for an image.
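The loading code itself is not reproduced here. As a stand-in, the sketch below rebuilds the first three rows of the table by hand and shows how the millisecond timestamps can be turned into readable dates and response times. The variable name `sample` and the derived columns `Start` and `Seconds` are assumptions made for illustration; they are not part of the course data.

```python
import pandas as pd

# Three rows copied by hand from the table shown earlier (illustration only)
sample = pd.DataFrame({
    "FileId": [
        "72_1_0_20170110141531648_jpg",
        "62_0_0_20170111210223707_jpg",
        "75_0_1_20170111205346848_jpg",
    ],
    "UserId": ["2TkJQyslSFb5GXbdv1aCc2QAZO53"] * 3,
    "Guess": [58, 70, 57],
    "Age": [72, 62, 75],
    "Race": ["White", "White", "Other"],
    "Gender": ["F", "M", "M"],
    "Position": [10, 9, 7],
    "TimeStart": [1728456243117, 1728456238837, 1728456228654],
    "TimeEnd": [1728456250343, 1728456243116, 1728456235495],
})

# TimeStart and TimeEnd count milliseconds since 1970-01-01, hence unit="ms"
sample["Start"] = pd.to_datetime(sample["TimeStart"], unit="ms")

# Time spent on each image, converted from milliseconds to seconds
sample["Seconds"] = (sample["TimeEnd"] - sample["TimeStart"]) / 1000

print(sample[["Guess", "Age", "Start", "Seconds"]])
```

Subtracting the two timestamp columns works row by row, so each guess gets its own response time.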
1.2 First Steps with Python
Before we start working with the data, let’s make sure that we understand how to use Python and the Jupyter Notebook. The following block of code will show you how to perform basic arithmetic operations in Python, print the results, and store them in variables.
To run the code in the following cells, click on the cell and press Ctrl+Enter or click on the Run button (upper left side of the cell).
After running the cell, check the output below the cell. The last statement in the cell will be printed. If you want to print other values, use the print() function.
Visual Studio Code allows you to see the values of the variables that you have defined in the code. Find the Variables tab on the top side of the notebook window and click on it. You will see the variables that you have defined and their values.
Google Colab also allows you to see the values of the variables that you have defined in the code. Click on the {x} icon on the left side of the page.
# Creates a new variable called x and assigns it the value of 2 to the power of 35
x = 2 ** 35
# Creates a new variable called y and assigns it the value of 8.12
y = 8.12
# Prints the result of dividing x by y
print(x / y)
# Prints the result of multiplying x by y (it is printed, because it is the last line of the cell)
x * y
4231494872.9064045
279001075548.16
1.3 Data Description
The first thing we want to know about a new dataset is its size. How many rows and columns does it have? We can use the shape attribute of the DataFrame to find out.
# Data shape is a tuple (think of it as a list of numbers for now): the first number is the number of rows, the second number is the number of columns
dt.shape
(714, 9)
# You can access the individual elements of the tuple using square brackets
# The first number is the number of rows
dt.shape[0]
print("There are", dt.shape[0], "rows in the dataset.")
# The second number is the number of columns
dt.shape[1]
print("There are", dt.shape[1], "columns in the dataset.")
There are 714 rows in the dataset.
There are 9 columns in the dataset.
# Trying to access the third element of the tuple results in an error (IndexError), because the tuple has only two elements
# dt.shape[2]
# What are the columns in the dataset?
# The columns are stored in the columns attribute of the DataFrame. Again, you will get a list of column names (technically a pandas Index object, but for now you can think of it as a list)
dt.columns
Index(['FileId', 'UserId', 'Guess', 'Age', 'Race', 'Gender', 'Position',
'TimeStart', 'TimeEnd'],
dtype='object')
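Because shape is an ordinary tuple, its two elements can also be unpacked into two names in one step. The following is a minimal sketch on a small made-up DataFrame; the name `toy` and its values are assumptions for illustration, not the game data.

```python
import pandas as pd

# A tiny made-up DataFrame with 3 rows and 2 columns
toy = pd.DataFrame({"Guess": [58, 70, 57], "Age": [72, 62, 75]})

# Unpack the shape tuple: first element is rows, second is columns
n_rows, n_cols = toy.shape
print("Rows:", n_rows, "Columns:", n_cols)

# len() of a DataFrame is its number of rows; len() of its columns is the number of columns
print(len(toy), len(toy.columns))

# The Index of column names can be converted to a plain Python list
print(list(toy.columns))
```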
# How many unique users are there in the dataset? You can access a column of the DataFrame using the bracket notation
# The unique function returns the unique values in a Series
dt["UserId"].unique()
array(['2TkJQyslSFb5GXbdv1aCc2QAZO53', '4W0FEKDKyjNZUuCf9uBNs8rfjjq1',
'8Dv06RmVjaVJE4ZFNRMK5f1oQOp1', '8fPccGO10Bb2la4FkBqgl1rVWHg1',
'8jKBM5SOv0fOly6vCBdHXtxBPQ23', '9j9Y0NUvW0MJRsoa266Y1kMZSuY2',
'CPjrqG93VHSL9IGx30p4ZSXd4Qr1', 'CbS6ZrS8dyd0awFJRvhc7OBK5YA2',
'DuGjUzQlaUOS2bOkD2nzCq3TaMI3', 'GmKPuKAbgoYQCPmlg8RwPNsBHM23',
'HqIR4yBnDLax4jjFITVp6hVvqx52', 'Lec4YA560vapZcD5fIP8MJm5sl42',
'MxAZfVjX3eZRJXHxkdj8TrkCldD2', 'NPtpMro4nOOJTqTNpt8PPHmbiFk2',
'OBczO7vxLvN2FiLBkqnQvCM8NdF2', 'PVb3jSAxyMXAuQ2PeIpdqQ0ILp53',
'PphGn6oTxNPO1u8YiwrruLNTzXE2', 'QEskanUwcUNT2gyBQFvSaVHeI4D3',
'SQC3UTi7a1P7CGyqhy7mBe8Zgmj2', 'Tvwjrx7GfaaBRYS6TGadjBjO9HF2',
'TzOCmmyl29OjE7JyQjqJqg7lD3u1', 'UCEzPzCCb7YfusqZVuUmGPIMFVV2',
'UpgLIyOuiVSd3y8VQ2gnJ6ViCxw2', 'VeyVpTOLy7NR0V5ZCnbLipceCEG2',
'W8MAPl0tmKbJAJ3abhcnQrqG5no1', 'WLi6ASLPTccgRUo1KG6ThcH80Nx1',
'Y1fElir0oXeayPWZb5vLKph5lUo2', 'b0lWZmr0VDa5EFS4AJliWXDqFKP2',
'e1mjyMk2cOR5XLP9YiO9OPEuapi1', 'fhvPybcsEhO9yui4zF4MYJiowGf2',
'hASSWNNQmIZY92TSasx7bK4V5VE2', 'kDlJXIQowGTAiE4USAhMfqdgtHi1',
'mZGwkLZFMQPbTisMJrmnNZCUHAB2', 'stRoVijSb9Rv7DV5f9Z39ccaHIN2',
'tur4WIEQ4ocZzgF8lmsSCPK3HnW2', 'vOuQwWBYrBYKTzBPG4LESFZ5MBJ2',
'w6UifmFZfRXtIIuAi76VfB7WiuF3', 'w7pLK2CfZoMtjHE17nJya29OiQd2',
'wpzHrTsY6hdzWWw8a7DC2hjiDwD2', 'xOisU25aSoWx5B7tzUN6yz5bK8Z2',
'xv627zbJwNSbB648KX6XfQyCO3d2', 'ybJOzEmZp7ZYeN5vtODtY1Ovu5d2'],
dtype=object)
dt["UserId"].nunique()
42
# Sometimes it is useful to sort the unique values. The np.sort function sorts the values in ascending order (from the smallest to the largest)
# The function is part of the numpy library, which is why we have imported it at the beginning of the notebook (import numpy as np)
np.sort(dt["UserId"].unique())
array(['2TkJQyslSFb5GXbdv1aCc2QAZO53', '4W0FEKDKyjNZUuCf9uBNs8rfjjq1',
'8Dv06RmVjaVJE4ZFNRMK5f1oQOp1', '8fPccGO10Bb2la4FkBqgl1rVWHg1',
'8jKBM5SOv0fOly6vCBdHXtxBPQ23', '9j9Y0NUvW0MJRsoa266Y1kMZSuY2',
'CPjrqG93VHSL9IGx30p4ZSXd4Qr1', 'CbS6ZrS8dyd0awFJRvhc7OBK5YA2',
'DuGjUzQlaUOS2bOkD2nzCq3TaMI3', 'GmKPuKAbgoYQCPmlg8RwPNsBHM23',
'HqIR4yBnDLax4jjFITVp6hVvqx52', 'Lec4YA560vapZcD5fIP8MJm5sl42',
'MxAZfVjX3eZRJXHxkdj8TrkCldD2', 'NPtpMro4nOOJTqTNpt8PPHmbiFk2',
'OBczO7vxLvN2FiLBkqnQvCM8NdF2', 'PVb3jSAxyMXAuQ2PeIpdqQ0ILp53',
'PphGn6oTxNPO1u8YiwrruLNTzXE2', 'QEskanUwcUNT2gyBQFvSaVHeI4D3',
'SQC3UTi7a1P7CGyqhy7mBe8Zgmj2', 'Tvwjrx7GfaaBRYS6TGadjBjO9HF2',
'TzOCmmyl29OjE7JyQjqJqg7lD3u1', 'UCEzPzCCb7YfusqZVuUmGPIMFVV2',
'UpgLIyOuiVSd3y8VQ2gnJ6ViCxw2', 'VeyVpTOLy7NR0V5ZCnbLipceCEG2',
'W8MAPl0tmKbJAJ3abhcnQrqG5no1', 'WLi6ASLPTccgRUo1KG6ThcH80Nx1',
'Y1fElir0oXeayPWZb5vLKph5lUo2', 'b0lWZmr0VDa5EFS4AJliWXDqFKP2',
'e1mjyMk2cOR5XLP9YiO9OPEuapi1', 'fhvPybcsEhO9yui4zF4MYJiowGf2',
'hASSWNNQmIZY92TSasx7bK4V5VE2', 'kDlJXIQowGTAiE4USAhMfqdgtHi1',
'mZGwkLZFMQPbTisMJrmnNZCUHAB2', 'stRoVijSb9Rv7DV5f9Z39ccaHIN2',
'tur4WIEQ4ocZzgF8lmsSCPK3HnW2', 'vOuQwWBYrBYKTzBPG4LESFZ5MBJ2',
'w6UifmFZfRXtIIuAi76VfB7WiuF3', 'w7pLK2CfZoMtjHE17nJya29OiQd2',
'wpzHrTsY6hdzWWw8a7DC2hjiDwD2', 'xOisU25aSoWx5B7tzUN6yz5bK8Z2',
'xv627zbJwNSbB648KX6XfQyCO3d2', 'ybJOzEmZp7ZYeN5vtODtY1Ovu5d2'],
dtype=object)
# The nunique function returns the number of unique values in a Series
dt["UserId"].nunique()
42
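To see how unique, np.sort and nunique relate to each other, it can help to try them on a Series small enough to check by eye. This is a sketch with made-up labels; the name `users` and its values are assumptions, not real user identifiers.

```python
import numpy as np
import pandas as pd

# Hypothetical user labels with repetitions
users = pd.Series(["b", "a", "b", "c", "a"])

# unique() keeps each value once, in order of first appearance
unique_users = users.unique()
print(unique_users)

# np.sort() returns a sorted copy of the unique values
print(np.sort(unique_users))

# The length of the unique values equals nunique()
print(len(unique_users), users.nunique())
```

Note that unique does not sort its result, which is why np.sort is applied as a separate step.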
1.4 Exercise
List the unique values of the FileId column. How many unique images are there in the dataset?
# Write your code here
1.5 Selecting Subsets of the Data
There are multiple ways to choose a part of the data. We can select rows by their index, columns by their name, or both. We can also use logical conditions to filter the data. Here we want to select only the rows belonging to one of the users.
# The == operator checks if two values are equal. It returns a boolean value (True or False). In this case, we are checking if the UserId is equal to "2TkJQyslSFb5GXbdv1aCc2QAZO53"
# for each row in the UserId column of the dataframe dt. As the dataframe is quite large, you will see the first 5 and the last 5 rows.
dt["UserId"] == "2TkJQyslSFb5GXbdv1aCc2QAZO53"
0 True
1 True
2 True
3 True
4 True
...
709 False
710 False
711 False
712 False
713 False
Name: UserId, Length: 714, dtype: bool
# We can call the .sum() method of the result to count the number of True values. This will give us the number of rows where the UserId is equal to "2TkJQyslSFb5GXbdv1aCc2QAZO53"
(dt["UserId"] == "2TkJQyslSFb5GXbdv1aCc2QAZO53").sum()
17
# Putting this boolean series in square brackets returns only the rows where the condition is True
mydt = dt[dt["UserId"] == "8jKBM5SOv0fOly6vCBdHXtxBPQ23"].copy()
mydt
| | FileId | UserId | Guess | Age | Race | Gender | Position | TimeStart | TimeEnd |
|---|---|---|---|---|---|---|---|---|---|
| 68 | 72_1_0_20170110141531648_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 69 | 72 | White | F | 7 | 1728466078201 | 1728466083168 |
| 69 | 62_0_0_20170111210223707_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 72 | 62 | White | M | 4 | 1728466058849 | 1728466065965 |
| 70 | 75_0_1_20170111205346848_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 69 | 75 | Other | M | 14 | 1728466118959 | 1728466125058 |
| 71 | 21_1_2_20170105183505385_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 32 | 21 | Other | F | 9 | 1728466093455 | 1728466098292 |
| 72 | 28_0_1_20170112211810813_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 36 | 28 | Other | M | 12 | 1728466106791 | 1728466112456 |
| 73 | 30_0_4_20170117202914440_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 43 | 30 | Other | M | 0 | 1728465970667 | 1728466037743 |
| 74 | 62_1_0_20170110175644800_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 68 | 62 | White | F | 5 | 1728466065967 | 1728466075283 |
| 75 | 16_0_0_20170110231841292_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 17 | 16 | White | M | 16 | 1728466134829 | 1728466141246 |
| 76 | 18_1_0_20170109214608184_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 22 | 18 | White | F | 6 | 1728466075284 | 1728466078199 |
| 77 | 75_0_3_20170111202756116_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 57 | 75 | Other | M | 1 | 1728466037744 | 1728466045288 |
| 78 | 56_0_0_20170111201143803_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 56 | 56 | White | M | 15 | 1728466125060 | 1728466134828 |
| 79 | 90_1_2_20170110183708997_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 78 | 90 | Other | F | 3 | 1728466053299 | 1728466058847 |
| 80 | 76_1_2_20170110182935621_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 74 | 76 | Other | F | 13 | 1728466112458 | 1728466118957 |
| 81 | 26_1_1_20170116232657066_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 29 | 26 | Other | F | 11 | 1728466102073 | 1728466106789 |
| 82 | 16_0_0_20170110231617005_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 28 | 16 | White | M | 10 | 1728466098294 | 1728466102072 |
| 83 | 25_0_1_20170113184508496_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 39 | 25 | Other | M | 8 | 1728466083169 | 1728466093453 |
| 84 | 16_1_0_20170109214013596_jpg | 8jKBM5SOv0fOly6vCBdHXtxBPQ23 | 16 | 16 | White | F | 2 | 1728466045290 | 1728466053297 |
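The selection styles mentioned in this section (rows by index, columns by name, both at once, and logical conditions) can be sketched on a small made-up DataFrame. The name `toy`, the short user labels and the `Error` column are assumptions for illustration only.

```python
import pandas as pd

# A tiny made-up dataset with two hypothetical users
toy = pd.DataFrame({
    "UserId": ["u1", "u1", "u2", "u2"],
    "Guess": [58, 70, 69, 72],
    "Age": [72, 62, 72, 62],
})

# Rows by position: the first two rows
first_two = toy.iloc[0:2]

# Columns by name: a list of names returns a smaller DataFrame
guess_and_age = toy[["Guess", "Age"]]

# Rows by condition and one column by name, both at once
u2_guesses = toy.loc[toy["UserId"] == "u2", "Guess"]

# .copy() makes an independent DataFrame, so new columns do not affect toy
u2 = toy[toy["UserId"] == "u2"].copy()
u2["Error"] = u2["Guess"] - u2["Age"]

print(first_two)
print(u2_guesses.tolist())
print(u2)
```

Calling .copy() after filtering is a cautious habit: it makes clear that later changes to the subset are meant to stay in the subset.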
1.6 Select Your Data
Create a new DataFrame called mydt containing only the rows where the UserId is equal to your unique identifier (it is shown on the results page of the game). To avoid typing the whole identifier, look at the sorted unique values of the UserId and copy the one that corresponds to you.
1.7 Counting Things
A lot of the time we will want to compute some simple counts. In our example, we may want to count the number of guesses made by each user, the number of guesses that overestimated the age, or the number of correct guesses. We can accomplish this using logical conditions and the sum method of series (or the np.sum function).
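Counting per user can be sketched as follows. The data are made up, and the names `toy`, `guesses_per_user` and `correct_per_user` are assumptions for illustration; value_counts and groupby are one possible way to get per-user counts, not necessarily the approach used later in the course.

```python
import pandas as pd

# A tiny made-up dataset with two hypothetical users
toy = pd.DataFrame({
    "UserId": ["u1", "u1", "u1", "u2", "u2"],
    "Guess": [58, 70, 56, 69, 16],
    "Age": [72, 62, 56, 72, 16],
})

# Number of guesses made by each user
guesses_per_user = toy["UserId"].value_counts()
print(guesses_per_user)

# Number of correct guesses overall: True counts as 1, False as 0
n_correct = (toy["Guess"] == toy["Age"]).sum()
print(n_correct)

# Number of correct guesses for each user: sum the True values within each group
correct_per_user = (toy["Guess"] == toy["Age"]).groupby(toy["UserId"]).sum()
print(correct_per_user)
```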
1.8 Counting Overestimations
How many times did you overestimate the age of the person in the images?
# Creates a new column called "Overestimates" that is True if the "Guess" is greater than the "Age" and False otherwise
mydt["Overestimates"] = (mydt["Guess"] > mydt["Age"])
# Creates a new column called "Underestimates" that is True if the "Guess" is less than the "Age" and False otherwise
mydt["Underestimates"] = (mydt["Guess"] < mydt["Age"])
# Creates a new column called "NoError" that is True if the "Guess" is equal to the "Age" and False otherwise
mydt["NoError"] = (mydt["Guess"] == mydt["Age"])
# Sums the values in the "NoError" column to get the number of exact guesses
mydt["NoError"].sum()
2
# Exercise: In how many images did you underestimate the age?
# Write your code here
# Why does using the .sum() method actually work here?
x_logical = np.array([True, False, True, True, False])
x_logical
array([ True, False, True, True, False])
# Test it by changing the values (True/False) in the assignment of x_logical and running the cell again (you also need to run the cell with the definition of x_logical)
x_logical.sum()
3
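The reason summing works is that Python and NumPy treat True as 1 and False as 0 in arithmetic. A short sketch of this behaviour; the name `flags` is an assumption chosen for illustration.

```python
import numpy as np

# Booleans behave like the integers 1 and 0 in arithmetic
print(True + True)  # prints 2

flags = np.array([True, False, True, True, False])

# Converting to integers makes the 1s and 0s visible
print(flags.astype(int))

# The method and the function give the same count of True values
print(flags.sum(), np.sum(flags))

# ~ flips every value, so this counts the False values instead
print((~flags).sum())
```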
# Exercise: How many female images have you looked at during the guessing game?
# Exercise: How many male images have you guessed?
# Compare your results with the output of the following code
mydt["Gender"].value_counts()
Gender
M 9
F 8
Name: count, dtype: int64
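The counts returned by value_counts are the same numbers you get by summing a logical condition for each category. A minimal sketch on made-up values; the name `gender` and its contents are assumptions, not your game data.

```python
import pandas as pd

# Hypothetical gender labels
gender = pd.Series(["F", "M", "M", "F", "M"])

# One count per category
counts = gender.value_counts()
print(counts)

# The same count for one category, using a logical condition and sum
n_female = (gender == "F").sum()
print(n_female)

# normalize=True returns shares instead of counts
shares = gender.value_counts(normalize=True)
print(shares)
```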
