Algorithmics & Data Structures — University of Geneva
Learning Objectives¶
By the end of this seminar you will be able to:
Understand the project scope, deliverables, and grading
Download and load the smartwatch dataset into Python
Explore the data programmatically using basic statistics
Implement your first algorithm on real data (find most active day)
Identify three analyses you will implement for your project (one per level)
Part 1: The Project¶
1. Goal¶
Working with real data is much more rewarding than working with synthetic examples. In this project you will work with physiological data collected by the Quality of Life Lab from twelve individuals using Withings ScanWatch smartwatches. The data contains step counts measured per hour and per day over several months.
You will apply the algorithms introduced in this course — sorting, searching, data structures, recursion, dynamic programming, graphs — to answer concrete questions derived from the dataset.
The focus is on:
Choosing appropriate data structures for the problem at hand
Implementing algorithms correctly — not just calling pandas built-ins
Arguing why your solution is efficient (time and space complexity)
Writing clean, testable Python code with a documented README
Deliverables and grading¶
The project contributes 30% of your final grade:
| Deliverable | Weight | Deadline |
|---|---|---|
| Code repository + README.md | 15% | 19 May 2026, 23:59 |
| Peer code review | 15% | 31 May 2026, 23:59 |
| Total | 30% | — |
Detailed grading rubric¶
Code repository + README (15%)¶
| Criterion | Weight inside 15% | Full-credit expectation |
|---|---|---|
| Correctness | 5% | Correct outputs on normal and edge cases |
| Algorithmic reasoning | 4% | Clear data-structure/algorithm choice + justified complexity |
| Tests | 2% | Meaningful tests, including edge cases |
| README quality | 3% | Clear explanation for each analysis |
| Code quality | 1% | Readable, structured, PEP-8-friendly code |
Peer code review (15%)¶
| Criterion | Weight inside 15% | Full-credit expectation |
|---|---|---|
| Coverage | 5% | Reviews all required analyses and README |
| Technical accuracy | 5% | Correct comments on correctness, complexity, tests |
| Actionable feedback | 3% | Specific suggestions with concrete fixes |
| Professionalism | 2% | Respectful tone and clear structure |
Definition of Done (minimum requirements)¶
Your submission is considered complete only if all are true:
At least 3 analyses: 1 Beginner + 1 Intermediate + 1 Advanced
Each analysis implemented as your own Python function(s)
Tests for core functions (including edge cases)
README section per analysis: problem, algorithm, why, complexity
Reproducible run instructions in the repository
researchdata/ is in .gitignore (no private data committed)
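A minimal .gitignore consistent with these requirements might look like this (the .venv/ and __pycache__/ entries are typical Python additions, not course requirements):

```text
# private smartwatch data, never commit
researchdata/
# Withings OAuth tokens (optional Section 4)
config.json
# Python environment artifacts
.venv/
__pycache__/
```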
2. Setup with the Provided Dataset (Required)¶
This setup is required for everyone. All project tasks and grading can be completed using this data only.
2.0 Quickstart (see below for more details)¶
Create a new folder named “your name - ALG project”
Create/activate your virtual environment
Download and extract. The raw data will be available in researchdata/
Add researchdata/ and the other Yareta-extracted folders to a .gitignore
Run Exercise 1 and confirm you can load one CSV
Run Exercise 2 and generate merged.csv
2.1 Download the data¶
The smartwatch dataset from 12 anonymised participants is available on Yareta:
https://yareta.unige.ch/archives/19bf7f3c-7d07-48be-8c03-9f754f766906
Download the archive and extract it into your project directory:
project/
└── researchdata/
├── 1487.csv
├── 1488.csv
└── ... (12 files total)
⚠️ Add researchdata/ to your .gitignore. The data is private — never commit it to git.
Before you start coding, read the provided documentation to understand the data:
Who contributed the data (organisation and name)?
What is the sample size?
How many days of data have been collected per participant?
2.2 Load the data into Python¶
The files are CSV format. Use pandas to load them:
import pandas as pd
# Load a single participant's data
df = pd.read_csv("researchdata/1487.csv")
print("Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
Part 2: Exercises¶
Exercise 1 — Load the dataset and print basic statistics¶
Load one participant file (e.g. 1487.csv) and answer these questions using pandas:
How many rows and columns does it have?
What are the column names and their data types?
Print the output of df.describe() — what do you notice about the step counts?
Are there any missing values? (df.isnull().sum())
Write a short (3-4 sentence) observation about what you see in the data.
Expected evidence before moving on:
screenshot or pasted output of shape, columns, describe(), missing values
your 3-4 sentence observation
import pandas as pd
# Load one participant
df = pd.read_csv("researchdata/1487.csv")
# --- YOUR CODE HERE ---
# 1. Print shape
# 2. Print column names and data types
# 3. Print describe()
# 4. Check for missing values
(Write your 3–4 sentence observation here.)
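As a sketch of what the checks above could look like when wrapped in a function (summarize and the demo columns date/steps are illustrative names; the real call would load researchdata/1487.csv):

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic statistics asked for in Exercise 1."""
    return {
        "shape": df.shape,                          # (rows, columns)
        "columns": df.columns.tolist(),             # column names
        "dtypes": df.dtypes.astype(str).to_dict(),  # data types
        "missing": df.isnull().sum().to_dict(),     # missing values per column
    }

# Demo on a tiny synthetic frame; with the real data you would use
# summarize(pd.read_csv("researchdata/1487.csv")).
demo = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"], "steps": [8200, None]})
print(summarize(demo))
```

The describe() output itself is best inspected interactively; the dict above just gathers the evidence the exercise asks you to record.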
Exercise 2 — Flatten all participant files into merged.csv¶
Load all 12 CSV files from the researchdata/ directory, add a participant_id column (the filename without .csv), and concatenate them into a single merged.csv file.
After saving, verify:
How many rows does merged.csv have?
How many unique participants are there?
Expected evidence before moving on:
merged.csv exists in your project root
printed total rows and participant count (should be 12 participants)
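One way the merging step could be implemented (a sketch: merge_participants is an illustrative name, and it assumes every participant file shares the same columns):

```python
import os
import pandas as pd

def merge_participants(directory: str) -> pd.DataFrame:
    """Load every CSV in `directory`, tag each row with a participant_id
    column (the filename without .csv), and concatenate them."""
    dfs = []
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".csv"):
            df = pd.read_csv(os.path.join(directory, filename))
            df["participant_id"] = filename[:-4]  # strip ".csv"
            dfs.append(df)
    return pd.concat(dfs, ignore_index=True)
```

Usage would be `merged = merge_participants("researchdata")` followed by `merged.to_csv("merged.csv", index=False)`.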
import pandas as pd
import os
# TODO: Load all CSV files from researchdata/,
# add participant_id column, concatenate and save to merged.csv.
dfs = []
for filename in os.listdir("researchdata"):
if filename.endswith(".csv"):
# --- YOUR CODE HERE ---
pass
# Concatenate all DataFrames
merged = pd.concat(dfs, ignore_index=True)
# Save
merged.to_csv("merged.csv", index=False)
print(f"Total rows: {len(merged):,}")
print(f"Participants: {merged['participant_id'].nunique()}")
print(f"Columns: {merged.columns.tolist()}")
Exercise 3 — Explore the data manually¶
Using merged.csv, compute the following without calling df.max(), df.min(), or df.mean().
Use explicit loops to practice algorithm thinking:
Which participant has the highest total step count (sum of all their daily steps)?
What is the average daily step count across all rows (all participants combined)?
What is the minimum daily step count, ignoring days with 0 steps?
Hint: use a dict to accumulate totals per participant for question 1.
Expected evidence before moving on:
printed results for all three questions
short note (1-2 lines) on why your loop approach is O(n)
import pandas as pd
merged = pd.read_csv("merged.csv")
# --- YOUR CODE HERE ---
# 1. Most active participant (total steps)
# 2. Average daily step count
# 3. Minimum daily step count (excluding 0-step days)
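A possible single-pass solution sketch (explore is an illustrative name; it assumes participant_id and steps columns as produced in Exercise 2):

```python
import pandas as pd

def explore(df: pd.DataFrame):
    """Answer the three Exercise 3 questions with explicit loops,
    without calling df.max(), df.min(), or df.mean()."""
    totals = {}                 # participant_id -> total steps
    running_sum, count = 0, 0   # for the overall average
    min_nonzero = None          # minimum, ignoring 0-step days
    for _, row in df.iterrows():
        steps = row["steps"]
        pid = row["participant_id"]
        totals[pid] = totals.get(pid, 0) + steps
        running_sum += steps
        count += 1
        if steps > 0 and (min_nonzero is None or steps < min_nonzero):
            min_nonzero = steps
    best = None
    for pid in totals:          # linear scan for the largest total
        if best is None or totals[pid] > totals[best]:
            best = pid
    return best, running_sum / count, min_nonzero
```

Each row is visited once and the final scan touches each participant once, so the whole thing is O(n) time with O(p) extra space for p participants.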
Exercise 4 — find_most_active_day()¶
Implement a function that returns the date with the highest total step count across all participants.
Constraints:
Do not use df.idxmax(), df.sort_values(), or df.groupby() — implement the search manually.
Your function should run in O(n) time where n is the number of rows.
What to document in your README (practice now):
Problem: find the most active day across all participants
Algorithm: linear scan, tracking max with a running variable
Why appropriate: single pass through data, O(n) time, O(d) space (d = distinct dates)
Complexity: O(n) time, O(d) space
Expected evidence before moving on:
function output and pandas cross-check both shown
one short README-style paragraph using the 4 bullet structure above
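As a reference for the overall shape (not the submission itself), the two-pass linear scan described above might look like this; most_active_day is an illustrative name, and the date/steps columns match the template that follows:

```python
import pandas as pd

def most_active_day(df: pd.DataFrame) -> tuple:
    """Aggregate steps per date, then track the max with a running variable.
    O(n) time, O(d) space for d distinct dates."""
    totals: dict = {}
    for _, row in df.iterrows():               # pass 1: aggregate per date
        totals[row["date"]] = totals.get(row["date"], 0) + row["steps"]
    best_date, best_steps = None, -1
    for date, total in totals.items():         # pass 2: running maximum
        if total > best_steps:
            best_date, best_steps = date, total
    return best_date, best_steps
```

Note how neither pass sorts anything: a single comparison per date is enough to keep the running maximum.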
import pandas as pd
merged = pd.read_csv("merged.csv")
def find_most_active_day(df: pd.DataFrame) -> tuple:
"""
Return the date with the highest total step count across all participants.
Do NOT use df.idxmax() or df.sort_values().
Implement using a loop and a dictionary for date aggregation.
Parameters
----------
df : pd.DataFrame
Must contain 'date' and 'steps' columns.
Returns
-------
tuple : (date_str, total_steps) for the most active day
Complexity
----------
Time: O(n) — one pass to aggregate, one pass to find max
Space: O(d) — where d is the number of distinct dates
"""
# Step 1: Aggregate total steps per date
totals: dict = {}
for _, row in df.iterrows():
date = row["date"]
steps = row["steps"]
# --- YOUR CODE HERE ---
pass
# Step 2: Find the date with the maximum total
# --- YOUR CODE HERE ---
best_date = None
best_steps = -1
return best_date, best_steps
date, steps = find_most_active_day(merged)
print(f"Most active day: {date}")
print(f"Total steps: {steps:,}")
# Cross-check with pandas (should agree)
check = merged.groupby("date")["steps"].sum().idxmax()
print(f"\nCross-check (pandas): {check} → {'✓ matches' if date == check else '✗ mismatch'}")
Exercise 5 — Longest streak of active days¶
Implement a function that finds the longest consecutive streak of days where a participant’s step count is at or above a given threshold (default: 8 000 steps).
This is a classic dynamic programming pattern:
Let streak[i] = length of the longest active streak ending at day i
Transition: streak[i] = streak[i-1] + 1 if steps[i] >= threshold, else 0
Answer: max(streak[i] for all i)
You only need to keep track of the current streak — no array needed.
What to document in your README:
Problem: longest streak of days above a step threshold
Algorithm: DP with a running counter (variant of Kadane’s algorithm)
Why appropriate: O(n) time, O(1) extra space — optimal for a sequential scan
Complexity: O(n) time, O(1) space
Expected evidence before moving on:
all 4 provided tests pass
one sentence explaining why this is DP and not brute force
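Because streak[i] depends only on streak[i-1], the whole array collapses to a running counter. A minimal sketch (longest_streak_sketch is an illustrative name; your submission implements longest_streak in the template below):

```python
def longest_streak_sketch(steps_per_day, threshold=8000):
    """Running-counter DP: `current` is the streak ending at the current day,
    `best` is the answer so far. O(n) time, O(1) space."""
    best = current = 0
    for steps in steps_per_day:
        current = current + 1 if steps >= threshold else 0  # the DP transition
        if current > best:
            best = current
    return best
```

The empty-input case falls out for free: the loop body never runs and best stays 0.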
def longest_streak(steps_per_day: list, threshold: int = 8000) -> int:
"""
Return the length of the longest consecutive streak of days where
step count >= threshold.
Parameters
----------
steps_per_day : list of int
Daily step counts in chronological order.
threshold : int
Minimum steps to count as an active day (default: 8000).
Returns
-------
int : length of the longest streak
Complexity
----------
Time: O(n)
Space: O(1)
"""
# --- YOUR CODE HERE ---
pass
# --- Tests ---
# Test 1: provided example
data1 = [9000, 7500, 8200, 8100, 9500, 6000, 8800, 9100, 7000, 8300]
result1 = longest_streak(data1, threshold=8000)
print(f"Test 1: {result1} (expected 3 — days 2,3,4: 8200, 8100, 9500)")
# Test 2: all above threshold
data2 = [9000, 8500, 8100, 9500]
result2 = longest_streak(data2, threshold=8000)
print(f"Test 2: {result2} (expected 4 — all days active)")
# Test 3: none above threshold
data3 = [5000, 6000, 7000]
result3 = longest_streak(data3, threshold=8000)
print(f"Test 3: {result3} (expected 0 — no active days)")
# Test 4: empty input
result4 = longest_streak([], threshold=8000)
print(f"Test 4: {result4} (expected 0 — empty input)")
3. Project Questions You Can Choose to Solve¶
You must complete at least 3 analyses total:
1 Beginner
1 Intermediate
1 Advanced
You may choose from the examples below or propose your own questions.
3.1 What defines each level?¶
The level is defined by the algorithmic technique used, not by the topic.
Beginner (Weeks 0-4 skills): loops, conditionals, lists/dicts, basic sorting, simple counting/aggregation.
Intermediate (Weeks 5-7 skills): structured algorithm design (e.g., sliding window, hashing for fast lookup, sorting + search, baseline vs improved approach).
Advanced (Weeks 8+ skills): graph algorithms, dynamic programming, or more advanced optimization/modeling choices with clear justification.
3.2 Beginner Level (start implementing now)¶
Goal: Show correct problem solving with core Python and basic algorithmic thinking.
Requirements (all must be met):
Solve a clearly stated question using explicit algorithmic logic (not only library one-liners such as max(), min() or pd.sort_values()).
Use at least one of: linear scan, counting with dict, manual aggregation, simple sorting or list comprehensions.
Report complexity informally (e.g., “single pass over rows -> O(n)”).
Example questions (choose one or design your own):
Which date has the highest total step count?
Which participant has the highest average daily steps?
On how many days does each participant exceed a chosen threshold (e.g., 10,000 steps)?
Rank participants by total steps using a simple baseline sort (e.g., bubble sort).
Find each participant’s most active day (maximum value with linear scan).
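For instance, the threshold question above can be answered with a single pass and a dict accumulator; days_over_threshold and the (participant_id, steps) pair format are illustrative, and in practice you would iterate over the rows of merged.csv:

```python
def days_over_threshold(rows, threshold=10000):
    """Count, per participant, how many days exceed `threshold` steps.
    `rows` is an iterable of (participant_id, steps) pairs.
    One pass over the rows -> O(n) time, O(p) space for p participants."""
    counts = {}
    for pid, steps in rows:
        if steps > threshold:
            counts[pid] = counts.get(pid, 0) + 1
    return counts
```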
3.3 Intermediate Level (after algorithmics starts, after ~week 6)¶
Goal: Compare at least two approaches and show why one is better.
Requirements (all must be met):
Implement one intermediate technique (e.g., sliding window, hash-based lookup, sorting + binary search).
Include a baseline approach and an improved approach.
Explain complexity of both and justify the improvement.
Example questions (choose one or design your own):
Detect 7-day trends in activity using a sliding window moving average.
Detect unusually active/inactive days using a rule and efficient lookup strategy.
Rank users by multiple metrics and compare two ranking implementations.
Find repeated activity patterns over fixed windows (baseline nested loops vs optimized method).
Query “top-k days” per participant (full sort baseline vs more efficient strategy).
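The sliding-window question above illustrates the baseline-vs-improved comparison well. A sketch (moving_average is an illustrative name): the O(n) version maintains a running window sum instead of re-summing every window, which is the O(n·w) baseline you would compare against:

```python
def moving_average(steps, window=7):
    """Moving average over consecutive windows of `window` days.
    O(n) time, O(1) extra space beyond the output list."""
    if len(steps) < window:
        return []
    window_sum = sum(steps[:window])             # first window, summed once
    averages = [window_sum / window]
    for i in range(window, len(steps)):
        window_sum += steps[i] - steps[i - window]  # slide: add new day, drop old
        averages.append(window_sum / window)
    return averages
```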
3.4 Advanced Level (later in the course, after ~week 8)¶
Goal: Model the problem with a stronger algorithmic paradigm.
Requirements (all must be met):
Use at least one advanced method (graph traversal, shortest path, dynamic programming, etc.).
Justify why this modeling choice fits the question.
State time and space complexity clearly.
Example questions (choose one or design your own):
Longest active streak under a chosen definition (DP-style formulation).
Build a participant similarity graph and find communities/components (BFS/DFS).
Model activity-state transitions and compute shortest/least-cost transitions (Dijkstra-style).
Detect longest stable pattern segment under constraints (DP or optimized state tracking).
Compare graph-based vs non-graph baseline for a relationship question.
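For the similarity-graph question, the component search could be a BFS like the sketch below; connected_components and the adjacency-dict format are assumptions, and building the graph (deciding when two participants count as "similar") is the modeling choice you would need to justify:

```python
from collections import deque

def connected_components(graph):
    """BFS over an adjacency dict {node: [neighbours]}.
    Returns a list of components. O(V + E) time, O(V) space."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        queue, comp = deque([start]), []
        seen.add(start)
        while queue:                       # standard BFS from `start`
            node = queue.popleft()
            comp.append(node)
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return components
```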
3.5 Propose Your Own Question (encouraged)¶
You are strongly encouraged to propose your own analysis question.
To classify your question, check:
Technique: Which algorithmic idea from the course does it require?
Complexity: Can you explain runtime/space in a meaningful way?
Evidence: Can you show results on the provided wearable dataset?
Comparison: (Intermediate/Advanced) Can you compare against a simpler baseline?
If these are clear, your custom question is valid.
Example README¶
You can find a sample submission with a possible layout of a README file here.
4. Optional: Use Your Own Withings Data¶
This section is not required. It allows you to run your algorithms on your own activity data and compare yourself with the cohort.
Step 1: Install the Withings App¶
Download Withings Health Mate from the App Store or Google Play and create an account. Ensure synchronisation with your phone sensors is enabled.
Step 2: OAuth Authentication¶
Withings requires OAuth 2.0 to access your personal data. Run the script below to complete the authentication flow. You will need to provide a callback URL — use this hosted page:
https://withings-token-show.lovable.app/
Step 3: Run the authentication script¶
# withings_auth.py — save this to a file and run: python withings_auth.py
# This cell only prints the script content; it does not execute OAuth.
import json, random, webbrowser, requests
CLIENT_ID = "549043381e026880160c3f4c34984c8d7677df0d9acf73cc4e629c2fb938e3fe"
CONSUMER_SECRET = "ed922958cb33597d6c8f536b8c9c779460904a12d9edc9436a1f8fb7ea84b04b"
CALLBACK_URL = "https://withings-token-show.lovable.app/"
AUTH_URL = "https://account.withings.com/oauth2_user/authorize2"
TOKEN_URL = "https://wbsapi.withings.net/v2/oauth2"
SCOPE = "user.activity"
CONFIG_FILE = "config.json"
def main():
url = (f"{AUTH_URL}?response_type=code&client_id={CLIENT_ID}"
f"&redirect_uri={CALLBACK_URL}&scope={SCOPE}"
f"&state={random.randint(0, 999999999)}")
print("Opening browser for Withings login...")
webbrowser.open(url)
code = input("\nPaste the code you received: ").strip()
data = {"action": "requesttoken", "grant_type": "authorization_code",
"client_id": CLIENT_ID, "client_secret": CONSUMER_SECRET,
"code": code, "redirect_uri": CALLBACK_URL}
resp = requests.post(TOKEN_URL, data=data)
resp.raise_for_status()
tokens = resp.json()["body"]
config = {"client_id": CLIENT_ID, "consumer_secret": CONSUMER_SECRET,
"access_token": tokens["access_token"],
"refresh_token": tokens["refresh_token"],
"userid": tokens["userid"]}
with open(CONFIG_FILE, "w") as f:
json.dump(config, f, indent=2)
print(f"Tokens saved to {CONFIG_FILE}")
print("Important: add config.json to your .gitignore!")
if __name__ == "__main__":
main()
Summary¶
| Topic | Key takeaway |
|---|---|
| Grading | 30% total: 15% code + 15% peer review with explicit rubrics |
| Definition of done | 3 analyses (B/I/A) + tests + README + reproducible repo |
| Dataset | Withings ScanWatch, 12 participants, local researchdata/ folder |
| README | For each analysis: problem · algorithm · why · complexity |
| Data privacy | Never commit researchdata/ files to git |
| Peer review | Week 12, specific and actionable technical feedback |
Action items before next seminar¶
Download dataset, complete Exercise 1, and record observations
Build merged.csv from all participants (Exercise 2)
Complete at least one Beginner analysis with tests
Write one README-style analysis section using the worked example
University of Geneva — Algorithmics & Data Structures