
Seminar 4: Introduction to the Project

Algorithmics & Data Structures — University of Geneva


Learning Objectives

By the end of this seminar you will be able to:

  • Understand the project scope, deliverables, and grading

  • Download and load the smartwatch dataset into Python

  • Explore the data programmatically using basic statistics

  • Implement your first algorithm on real data (find most active day)

  • Identify three analyses you will implement for your project (one per level)


Part 1: The Project

1. Goal

Working with real data is much more rewarding than synthetic examples. In this project you will work with physiological data collected by the Quality of Life Lab from twelve individuals using Withings ScanWatch smartwatches. The data contains step counts measured per hour and per day over several months.

You will apply the algorithms introduced in this course — sorting, searching, data structures, recursion, dynamic programming, graphs — to answer concrete questions derived from the dataset.

The focus is on:

  • Choosing appropriate data structures for the problem at hand

  • Implementing algorithms correctly — not just calling pandas built-ins

  • Arguing why your solution is efficient (time and space complexity)

  • Writing clean, testable Python code with a documented README

Deliverables and grading

The project contributes 30% of your final grade:

Deliverable                    Weight    Deadline
Code repository + README.md    15%       19 May 2026, 23:59
Peer code review               15%       31 May 2026, 23:59
Total                          30%

Detailed grading rubric

Code repository + README (15%)

Criterion                Weight inside 15%    Full-credit expectation
Correctness              5%                   Correct outputs on normal and edge cases
Algorithmic reasoning    4%                   Clear data-structure/algorithm choice + justified complexity
Tests                    2%                   Meaningful tests, including edge cases
README quality           3%                   Clear explanation for each analysis
Code quality             1%                   Readable, structured, PEP-8-friendly code

Peer code review (15%)

Criterion              Weight inside 15%    Full-credit expectation
Coverage               5%                   Reviews all required analyses and README
Technical accuracy     5%                   Correct comments on correctness, complexity, tests
Actionable feedback    3%                   Specific suggestions with concrete fixes
Professionalism        2%                   Respectful tone and clear structure

Definition of Done (minimum requirements)

Your submission is considered complete only if all of the following are true:

  • At least 3 analyses: 1 Beginner + 1 Intermediate + 1 Advanced

  • Each analysis implemented as your own Python function(s)

  • Tests for core functions (including edge cases)

  • README section per analysis: problem, algorithm, why, complexity

  • Reproducible run instructions in the repository

  • researchdata/ is in .gitignore (no private data committed)


2. Setup with the Provided Dataset (Required)

This setup is required for everyone. All project tasks and grading can be completed using this data only.

2.0 Quickstart (see below for more details)

  1. Create a new project folder named “your name - ALG project”

  2. Create/activate your virtual environment

  3. Download and extract. The raw data will be available in researchdata/

  4. Add researchdata/ and the other Yareta-extracted folders to your .gitignore

  5. Run Exercise 1 and confirm you can load one CSV

  6. Run Exercise 2 and generate merged.csv

2.1 Download the data

The smartwatch dataset from 12 anonymised participants is available on Yareta:

https://yareta.unige.ch/archives/19bf7f3c-7d07-48be-8c03-9f754f766906

Download the archive and extract it into your project directory:

project/
  └── researchdata/
        ├── 1487.csv
        ├── 1488.csv
        └── ...  (12 files total)

⚠️ Add researchdata/ to your .gitignore. The data is private — never commit it to git.
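A minimal .gitignore covering this could look like the sketch below. The config.json entry anticipates the optional Withings setup in Section 4, and .venv/ is just one common virtual-environment name; adjust to your own setup.

```text
# private raw data — never commit
researchdata/
# derived from the raw data, safer to ignore as well
merged.csv
# personal OAuth tokens (optional, Section 4)
config.json
# local virtual environment
.venv/
```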

Before you start coding, read the provided documentation to understand the data:

  • Who contributed the data (organisation and name)?

  • What is the sample size?

  • How many days of data have been collected per participant?

2.2 Load the data into Python

The files are CSV format. Use pandas to load them:

import pandas as pd

# Load a single participant's data
df = pd.read_csv("researchdata/1487.csv")

print("Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)

Part 2: Exercises


Exercise 1 — Load the dataset and print basic statistics

Load one participant file (e.g. 1487.csv) and answer these questions using pandas:

  1. How many rows and columns does it have?

  2. What are the column names and their data types?

  3. Print the output of df.describe() — what do you notice about the step counts?

  4. Are there any missing values? (df.isnull().sum())

Write a short (3-4 sentence) observation about what you see in the data.

Expected evidence before moving on:

  • screenshot or pasted output of shape, columns, describe(), missing values

  • your 3-4 sentence observation

import pandas as pd

# Load one participant
df = pd.read_csv("researchdata/1487.csv")

# --- YOUR CODE HERE ---
# 1. Print shape

# 2. Print column names and data types

# 3. Print describe()

# 4. Check for missing values

(Write your 3–4 sentence observation here.)


Exercise 2 — Flatten all participant files into merged.csv

Load all 12 CSV files from the researchdata/ directory, add a participant_id column (the filename without .csv), and concatenate them into a single merged.csv file.

After saving, verify:

  • How many rows does merged.csv have?

  • How many unique participants are there?

Expected evidence before moving on:

  • merged.csv exists in your project root

  • printed total rows and participant count (should be 12 participants)

import pandas as pd
import os

# TODO: Load all CSV files from researchdata/,
# add participant_id column, concatenate and save to merged.csv.

dfs = []

for filename in os.listdir("researchdata"):
    if filename.endswith(".csv"):
        # --- YOUR CODE HERE ---
        pass

# Concatenate all DataFrames
merged = pd.concat(dfs, ignore_index=True)

# Save
merged.to_csv("merged.csv", index=False)

print(f"Total rows:       {len(merged):,}")
print(f"Participants:     {merged['participant_id'].nunique()}")
print(f"Columns:          {merged.columns.tolist()}")

Exercise 3 — Explore the data manually

Using merged.csv, compute the following without calling df.max(), df.min(), or df.mean(). Use explicit loops to practise algorithmic thinking:

  1. Which participant has the highest total step count (sum of all their daily steps)?

  2. What is the average daily step count across all rows (all participants combined)?

  3. What is the minimum daily step count, ignoring days with 0 steps?

Hint: use a dict to accumulate totals per participant for question 1.
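The dict-accumulation pattern the hint refers to can be sketched on toy data (the keys and values below are made up, not the project dataset):

```python
# Toy (participant, steps) pairs; values are invented for illustration
records = [("a", 3), ("b", 5), ("a", 2), ("c", 1), ("b", 4)]

totals = {}  # participant -> running total
for key, value in records:
    totals[key] = totals.get(key, 0) + value  # O(1) average per update

# Second pass over the dict to find the largest total explicitly
best_key, best_total = None, float("-inf")
for key, total in totals.items():
    if total > best_total:
        best_key, best_total = key, total

print(totals)                # {'a': 5, 'b': 9, 'c': 1}
print(best_key, best_total)  # b 9
```

One pass over the n records plus one pass over the (at most n) dict entries keeps the whole computation O(n).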

Expected evidence before moving on:

  • printed results for all three questions

  • short note (1-2 lines) on why your loop approach is O(n)

import pandas as pd

merged = pd.read_csv("merged.csv")

# --- YOUR CODE HERE ---

# 1. Most active participant (total steps)


# 2. Average daily step count


# 3. Minimum daily step count (excluding 0-step days)

Exercise 4 — find_most_active_day()

Implement a function that returns the date with the highest total step count across all participants.

Constraints:

  • Do not use df.idxmax(), df.sort_values(), or df.groupby() — implement the search manually.

  • Your function should run in O(n) time where n is the number of rows.

What to document in your README (practice now):

  • Problem: find the most active day across all participants

  • Algorithm: linear scan, tracking max with a running variable

  • Why appropriate: single pass through data, O(n) time, O(d) space (d = distinct dates)

  • Complexity: O(n) time, O(d) space

Expected evidence before moving on:

  • function output and pandas cross-check both shown

  • one short README-style paragraph using the 4 bullet structure above

import pandas as pd

merged = pd.read_csv("merged.csv")


def find_most_active_day(df: pd.DataFrame) -> tuple:
    """
    Return the date with the highest total step count across all participants.

    Do NOT use df.idxmax() or df.sort_values().
    Implement using a loop and a dictionary for date aggregation.

    Parameters
    ----------
    df : pd.DataFrame
        Must contain 'date' and 'steps' columns.

    Returns
    -------
    tuple : (date_str, total_steps) for the most active day

    Complexity
    ----------
    Time:  O(n) — one pass to aggregate, one pass to find max
    Space: O(d) — where d is the number of distinct dates
    """
    # Step 1: Aggregate total steps per date
    totals: dict = {}
    for _, row in df.iterrows():
        date = row["date"]
        steps = row["steps"]
        # --- YOUR CODE HERE ---
        pass

    # Step 2: Find the date with the maximum total
    # --- YOUR CODE HERE ---
    best_date = None
    best_steps = -1

    return best_date, best_steps


date, steps = find_most_active_day(merged)
print(f"Most active day: {date}")
print(f"Total steps:     {steps:,}")

# Cross-check with pandas (should agree)
check = merged.groupby("date")["steps"].sum().idxmax()
print(f"\nCross-check (pandas): {check}  →  {'✓ matches' if date == check else '✗ mismatch'}")

Exercise 5 — Longest streak of active days

Implement a function that finds the longest consecutive streak of days where a participant’s step count is at or above a given threshold (default: 8,000 steps).

This is a classic dynamic programming pattern:

  • Let streak[i] = length of the longest active streak ending at day i

  • Transition: streak[i] = streak[i-1] + 1 if steps[i] >= threshold, else 0

  • Answer: max(streak[i] for all i)

You only need to keep track of the current streak — no array needed.
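The same running-counter scan can be sketched on unrelated toy data, here finding the longest run of positive numbers (same transition, different predicate; do not copy this as your Exercise 5 answer):

```python
values = [1, -2, 3, 4, 5, -1, 2, 2]  # toy sequence, not step counts

best = current = 0
for v in values:
    current = current + 1 if v > 0 else 0  # extend the streak or reset it
    if current > best:
        best = current

print(best)  # 3 (the run 3, 4, 5)
```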

What to document in your README:

  • Problem: longest streak of days above a step threshold

  • Algorithm: DP with a running counter (variant of Kadane’s algorithm)

  • Why appropriate: O(n) time, O(1) extra space — optimal for a sequential scan

  • Complexity: O(n) time, O(1) space

Expected evidence before moving on:

  • all 4 provided tests pass

  • one sentence explaining why this is DP and not brute force

def longest_streak(steps_per_day: list, threshold: int = 8000) -> int:
    """
    Return the length of the longest consecutive streak of days where
    step count >= threshold.

    Parameters
    ----------
    steps_per_day : list of int
        Daily step counts in chronological order.
    threshold : int
        Minimum steps to count as an active day (default: 8000).

    Returns
    -------
    int : length of the longest streak

    Complexity
    ----------
    Time:  O(n)
    Space: O(1)
    """
    # --- YOUR CODE HERE ---
    pass


# --- Tests ---
# Test 1: provided example
data1 = [9000, 7500, 8200, 8100, 9500, 6000, 8800, 9100, 7000, 8300]
result1 = longest_streak(data1, threshold=8000)
print(f"Test 1: {result1}  (expected 3 — days 2,3,4: 8200, 8100, 9500)")

# Test 2: all above threshold
data2 = [9000, 8500, 8100, 9500]
result2 = longest_streak(data2, threshold=8000)
print(f"Test 2: {result2}  (expected 4 — all days active)")

# Test 3: none above threshold
data3 = [5000, 6000, 7000]
result3 = longest_streak(data3, threshold=8000)
print(f"Test 3: {result3}  (expected 0 — no active days)")

# Test 4: empty input
result4 = longest_streak([], threshold=8000)
print(f"Test 4: {result4}  (expected 0 — empty input)")

3. Project Questions You Can Choose to Solve

You must complete at least 3 analyses total:

  • 1 Beginner

  • 1 Intermediate

  • 1 Advanced

You may choose from the examples below or propose your own questions.

3.1 What defines each level?

The level is defined by the algorithmic technique used, not by the topic.

  • Beginner (Weeks 0-4 skills): loops, conditionals, lists/dicts, basic sorting, simple counting/aggregation.

  • Intermediate (Weeks 5-7 skills): structured algorithm design (e.g., sliding window, hashing for fast lookup, sorting + search, baseline vs improved approach).

  • Advanced (Weeks 8+ skills): graph algorithms, dynamic programming, or more advanced optimization/modeling choices with clear justification.


3.2 Beginner Level (start implementing now)

Goal: Show correct problem solving with core Python and basic algorithmic thinking.

Requirements (all must be met):

  1. Solve a clearly stated question using explicit algorithmic logic (not only library one-liners such as max(), min() or pd.sort_values()).

  2. Use at least one of: linear scan, counting with dict, manual aggregation, simple sorting or list comprehensions.

  3. Report complexity informally (e.g., “single pass over rows -> O(n)”).

Example questions (choose one or design your own):

  1. Which date has the highest total step count?

  2. Which participant has the highest average daily steps?

  3. On how many days does each participant exceed a chosen threshold (e.g., 10,000 steps)?

  4. Rank participants by total steps using a simple baseline sort (e.g., bubble sort).

  5. Find each participant’s most active day (maximum value with linear scan).


3.3 Intermediate Level (after ~week 6, once the core algorithmics material has been covered)

Goal: Compare at least two approaches and show why one is better.

Requirements (all must be met):

  1. Implement one intermediate technique (e.g., sliding window, hash-based lookup, sorting + binary search).

  2. Include a baseline approach and an improved approach.

  3. Explain complexity of both and justify the improvement.
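As an illustration of the "sorting + binary search" technique mentioned in requirement 1, here is a generic sketch on invented step totals (the function name and data are ours, not part of the assignment):

```python
def binary_search(sorted_xs, target):
    """Return the index of target in sorted_xs, or -1 if absent. O(log n)."""
    lo, hi = 0, len(sorted_xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_xs[mid] == target:
            return mid
        if sorted_xs[mid] < target:
            lo = mid + 1   # discard the left half
        else:
            hi = mid - 1   # discard the right half
    return -1

xs = sorted([8200, 5000, 9500, 7000, 10300])  # O(n log n) once, then fast lookups
print(binary_search(xs, 9500))  # 3
print(binary_search(xs, 6000))  # -1
```

The baseline here would be a linear scan per lookup, O(n) each; after one O(n log n) sort, each lookup drops to O(log n).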

Example questions (choose one or design your own):

  1. Detect 7-day trends in activity using a sliding window moving average.

  2. Detect unusually active/inactive days using a rule and efficient lookup strategy.

  3. Rank users by multiple metrics and compare two ranking implementations.

  4. Find repeated activity patterns over fixed windows (baseline nested loops vs optimized method).

  5. Query “top-k days” per participant (full sort baseline vs more efficient strategy).
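For the sliding-window question, a baseline-vs-improved pair might be sketched as follows on toy data (function names are ours; the running version assumes len(xs) >= k):

```python
def moving_avg_naive(xs, k):
    # Baseline: recompute each window sum from scratch -> O(n * k)
    return [sum(xs[i:i + k]) / k for i in range(len(xs) - k + 1)]

def moving_avg_running(xs, k):
    # Improved: maintain one running sum, updated in O(1) per step -> O(n)
    out = []
    window_sum = sum(xs[:k])
    out.append(window_sum / k)
    for i in range(k, len(xs)):
        window_sum += xs[i] - xs[i - k]  # add the new day, drop the oldest
        out.append(window_sum / k)
    return out

data = [4, 8, 6, 2, 10, 4]  # toy daily values
print(moving_avg_naive(data, 3))    # [6.0, 5.333..., 6.0, 5.333...]
print(moving_avg_running(data, 3))  # identical output
```

Both produce the same averages; only the improved version avoids re-summing each window, which is exactly the baseline-vs-improved comparison this level asks for.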


3.4 Advanced Level (later in the course, after ~week 8)

Goal: Model the problem with a stronger algorithmic paradigm.

Requirements (all must be met):

  1. Use at least one advanced method (graph traversal, shortest path, dynamic programming, etc.).

  2. Justify why this modeling choice fits the question.

  3. State time and space complexity clearly.

Example questions (choose one or design your own):

  1. Longest active streak under a chosen definition (DP-style formulation).

  2. Build a participant similarity graph and find communities/components (BFS/DFS).

  3. Model activity-state transitions and compute shortest/least-cost transitions (Dijkstra-style).

  4. Detect longest stable pattern segment under constraints (DP or optimized state tracking).

  5. Compare graph-based vs non-graph baseline for a relationship question.
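For the similarity-graph example, the connected-components part can be sketched with BFS on a toy adjacency dict (the graph below is invented; in the project, nodes would be participants and edges would be similarity links):

```python
from collections import deque

graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}

def components(adj):
    """Return the connected components of an undirected graph. O(V + E)."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:                     # standard BFS from each unseen node
            node = queue.popleft()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        comps.append(sorted(comp))
    return comps

print(components(graph))  # [[1, 2, 3], [4, 5], [6]]
```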


3.5 Propose Your Own Question (encouraged)

You are strongly encouraged to propose your own analysis question.

To classify your question, check:

  1. Technique: Which algorithmic idea from the course does it require?

  2. Complexity: Can you explain runtime/space in a meaningful way?

  3. Evidence: Can you show results on the provided wearable dataset?

  4. Comparison: (Intermediate/Advanced) Can you compare against a simpler baseline?

If these are clear, your custom question is valid.

Example README

You can find a sample submission with a possible layout of a README file here.


4. Optional: Use Your Own Withings Data

This section is not required. It allows you to run your algorithms on your own activity data and compare yourself with the cohort.

Step 1: Install the Withings App

Download Withings Health Mate from the App Store or Google Play and create an account. Ensure synchronisation with your phone sensors is enabled.

Step 2: OAuth Authentication

Withings requires OAuth 2.0 to access your personal data. Run the script below to complete the authentication flow. You will need to provide a callback URL — use this hosted page:

https://withings-token-show.lovable.app/

Step 3: Run the authentication script

# withings_auth.py — save this to a file and run: python withings_auth.py
# Running it opens a browser window for the Withings OAuth login.

import json, random, webbrowser, requests

CLIENT_ID       = "549043381e026880160c3f4c34984c8d7677df0d9acf73cc4e629c2fb938e3fe"
CONSUMER_SECRET = "ed922958cb33597d6c8f536b8c9c779460904a12d9edc9436a1f8fb7ea84b04b"
CALLBACK_URL    = "https://withings-token-show.lovable.app/"
AUTH_URL        = "https://account.withings.com/oauth2_user/authorize2"
TOKEN_URL       = "https://wbsapi.withings.net/v2/oauth2"
SCOPE           = "user.activity"
CONFIG_FILE     = "config.json"

def main():
    url = (f"{AUTH_URL}?response_type=code&client_id={CLIENT_ID}"
           f"&redirect_uri={CALLBACK_URL}&scope={SCOPE}"
           f"&state={random.randint(0, 999999999)}")
    print("Opening browser for Withings login...")
    webbrowser.open(url)
    code = input("\nPaste the code you received: ").strip()
    data = {"action": "requesttoken", "grant_type": "authorization_code",
            "client_id": CLIENT_ID, "client_secret": CONSUMER_SECRET,
            "code": code, "redirect_uri": CALLBACK_URL}
    resp   = requests.post(TOKEN_URL, data=data)
    resp.raise_for_status()
    tokens = resp.json()["body"]
    config = {"client_id": CLIENT_ID, "consumer_secret": CONSUMER_SECRET,
              "access_token": tokens["access_token"],
              "refresh_token": tokens["refresh_token"],
              "userid": tokens["userid"]}
    with open(CONFIG_FILE, "w") as f:
        json.dump(config, f, indent=2)
    print(f"Tokens saved to {CONFIG_FILE}")
    print("Important: add config.json to your .gitignore!")

if __name__ == "__main__":
    main()

Summary

Topic                 Key takeaway
Grading               30% total: 15% code + 15% peer review with explicit rubrics
Definition of done    3 analyses (B/I/A) + tests + README + reproducible repo
Dataset               Withings ScanWatch, 12 participants, local researchdata/ folder
README                For each analysis: problem · algorithm · why · complexity
Data privacy          Never commit researchdata/ files to git
Peer review           Week 12, specific and actionable technical feedback

Action items before next seminar

  • Download dataset, complete Exercise 1, and record observations

  • Build merged.csv from all participants (Exercise 2)

  • Complete at least one Beginner analysis with tests

  • Write one README-style analysis section using the worked example


University of Geneva — Algorithmics & Data Structures