Algorithmics & Data Structures — University of Geneva
Learning Objectives¶
By the end of this seminar you will be able to:
Understand the project scope, deliverables, and grading
Download and load the smartwatch dataset into Python
Explore the data programmatically using basic statistics
Implement your first algorithm on real data (find most active day)
Identify three analyses you will implement for your project (one per level)
Part 1: The Project¶
1. Goal¶
Working with real data is much more rewarding than working with synthetic examples. In this project you will work with physiological data collected by the Quality of Life Lab from twelve individuals using Withings ScanWatch smartwatches. The data contains step counts measured per hour and per day over several months.
You will apply the algorithms introduced in this course — sorting, searching, data structures, recursion, dynamic programming, graphs — to answer concrete questions derived from the dataset.
The focus is on:
Choosing appropriate data structures for the problem at hand
Implementing algorithms correctly — not just calling pandas built-ins
Arguing why your solution is efficient (time and space complexity)
Writing clean, testable Python code with a documented README
Deliverables and grading¶
The project contributes 30% of your final grade:
| Deliverable | Weight | Deadline |
|---|---|---|
| Code repository + README.md | 15% | 19 May 2026, 23:59 |
| Peer code review | 15% | 31 May 2026, 23:59 |
| Total | 30% | — |
Detailed grading rubric¶
Code repository + README (15%)¶
| Criterion | Weight inside 15% | Full-credit expectation |
|---|---|---|
| Correctness | 5% | Correct outputs on normal and edge cases |
| Algorithmic reasoning | 4% | Clear data-structure/algorithm choice + justified complexity |
| Tests | 2% | Meaningful tests, including edge cases |
| README quality | 3% | Clear explanation for each analysis |
| Code quality | 1% | Readable, structured, PEP-8-friendly code |
Peer code review (15%)¶
| Criterion | Weight inside 15% | Full-credit expectation |
|---|---|---|
| Coverage | 5% | Reviews all required analyses and README |
| Technical accuracy | 5% | Correct comments on correctness, complexity, tests |
| Actionable feedback | 3% | Specific suggestions with concrete fixes |
| Professionalism | 2% | Respectful tone and clear structure |
Definition of Done (minimum requirements)¶
Your submission is considered complete only if all are true:
At least 3 analyses: 1 Beginner + 1 Intermediate + 1 Advanced
Each analysis implemented as your own Python function(s)
Tests for core functions (including edge cases)
README section per analysis: problem, algorithm, why, complexity
Reproducible run instructions in the repository
researchdata/ is in .gitignore (no private data committed)
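A minimal .gitignore consistent with these requirements might look like this (the .venv/ and __pycache__/ entries are typical Python additions, not course requirements):

```text
# private smartwatch data, never commit
researchdata/
# Withings OAuth tokens (optional Section 4)
config.json
# Python environment artifacts
.venv/
__pycache__/
```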
2. Setup with the Provided Dataset (Required)¶
This setup is required for everyone. All project tasks and grading can be completed using this data only.
2.0 Quickstart (see below for more details)¶
Create a new folder named “your name - ALG project”
Create/activate your virtual environment
Download and extract. The raw data will be available in researchdata/
Add researchdata/ and the other Yareta-extracted folders to a .gitignore
Run Exercise 1 and confirm you can load one CSV
Run Exercise 2 and generate merged.csv
2.1 Download the data¶
The smartwatch dataset from 12 anonymised participants is available on Yareta:
https://yareta.unige.ch/archives/19bf7f3c-7d07-48be-8c03-9f754f766906
Download the archive and extract it into your project directory:
project/
└── researchdata/
├── 1487.csv
├── 1488.csv
└── ... (12 files total)
⚠️ Add researchdata/ to your .gitignore. The data is private — never commit it to git.
Before you start coding, read the provided documentation to understand the data:
Who contributed the data (organisation and name)?
What is the sample size?
How many days of data have been collected per participant?
2.2 Load the data into Python¶
The files are CSV format. Use pandas to load them:
import pandas as pd
# Load a single participant's data
df = pd.read_csv("researchdata/1487.csv")
print("Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
Part 2: Exercises¶
Exercise 1 — Load the dataset and print basic statistics¶
Load one participant file (e.g. 1487.csv) and answer these questions using pandas:
How many rows and columns does it have?
What are the column names and their data types?
Print the output of df.describe() — what do you notice about the step counts?
Are there any missing values? (df.isnull().sum())
Write a short (3-4 sentence) observation about what you see in the data.
Expected evidence before moving on:
screenshot or pasted output of shape, columns, describe(), missing values
your 3-4 sentence observation
import pandas as pd
# Load one participant
df = pd.read_csv("researchdata/1487.csv")
# --- YOUR CODE HERE ---
# 1. Print shape
# 2. Print column names and data types
# 3. Print describe()
# 4. Check for missing values
(Write your 3–4 sentence observation here.)
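As a sketch of what the checks above could look like when wrapped in a function (summarize and the demo columns date/steps are illustrative names; the real call would load researchdata/1487.csv):

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic statistics asked for in Exercise 1."""
    return {
        "shape": df.shape,                          # (rows, columns)
        "columns": df.columns.tolist(),             # column names
        "dtypes": df.dtypes.astype(str).to_dict(),  # data types
        "missing": df.isnull().sum().to_dict(),     # missing values per column
    }

# Demo on a tiny synthetic frame; with the real data you would use
# summarize(pd.read_csv("researchdata/1487.csv")).
demo = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"], "steps": [8200, None]})
print(summarize(demo))
```

The describe() output itself is best inspected interactively; the dict above just gathers the evidence the exercise asks you to record.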
Exercise 2 — Flatten all participant files into merged.csv¶
Load all 12 CSV files from the researchdata/ directory, add a participant_id column (the filename without .csv), and concatenate them into a single merged.csv file.
After saving, verify:
How many rows does merged.csv have?
How many unique participants are there?
Expected evidence before moving on:
merged.csv exists in your project root
printed total rows and participant count (should be 12 participants)
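One way the merging step could be implemented (a sketch: merge_participants is an illustrative name, and it assumes every participant file shares the same columns):

```python
import os
import pandas as pd

def merge_participants(directory: str) -> pd.DataFrame:
    """Load every CSV in `directory`, tag each row with a participant_id
    column (the filename without .csv), and concatenate them."""
    dfs = []
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".csv"):
            df = pd.read_csv(os.path.join(directory, filename))
            df["participant_id"] = filename[:-4]  # strip ".csv"
            dfs.append(df)
    return pd.concat(dfs, ignore_index=True)
```

Usage would be `merged = merge_participants("researchdata")` followed by `merged.to_csv("merged.csv", index=False)`.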
import pandas as pd
import os
# TODO: Load all CSV files from researchdata/,
# add participant_id column, concatenate and save to merged.csv.
dfs = []
for filename in os.listdir("researchdata"):
if filename.endswith(".csv"):
# --- YOUR CODE HERE ---
pass
# Concatenate all DataFrames
merged = pd.concat(dfs, ignore_index=True)
# Save
merged.to_csv("merged.csv", index=False)
print(f"Total rows: {len(merged):,}")
print(f"Participants: {merged['participant_id'].nunique()}")
print(f"Columns: {merged.columns.tolist()}")
Exercise 3 — Explore the data manually¶
Using merged.csv, compute the following without calling df.max(), df.min(), or df.mean().
Use explicit loops to practice algorithm thinking:
Which participant has the highest total step count (sum of all their daily steps)?
What is the average daily step count across all rows (all participants combined)?
What is the minimum daily step count, ignoring days with 0 steps?
Hint: use a dict to accumulate totals per participant for question 1.
Expected evidence before moving on:
printed results for all three questions
short note (1-2 lines) on why your loop approach is O(n)
import pandas as pd
merged = pd.read_csv("merged.csv")
# --- YOUR CODE HERE ---
# 1. Most active participant (total steps)
# 2. Average daily step count
# 3. Minimum daily step count (excluding 0-step days)
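A possible single-pass solution sketch (explore is an illustrative name; it assumes participant_id and steps columns as produced in Exercise 2):

```python
import pandas as pd

def explore(df: pd.DataFrame):
    """Answer the three Exercise 3 questions with explicit loops,
    without calling df.max(), df.min(), or df.mean()."""
    totals = {}                 # participant_id -> total steps
    running_sum, count = 0, 0   # for the overall average
    min_nonzero = None          # minimum, ignoring 0-step days
    for _, row in df.iterrows():
        steps = row["steps"]
        pid = row["participant_id"]
        totals[pid] = totals.get(pid, 0) + steps
        running_sum += steps
        count += 1
        if steps > 0 and (min_nonzero is None or steps < min_nonzero):
            min_nonzero = steps
    best = None
    for pid in totals:          # linear scan for the largest total
        if best is None or totals[pid] > totals[best]:
            best = pid
    return best, running_sum / count, min_nonzero
```

Each row is visited once and the final scan touches each participant once, so the whole thing is O(n) time with O(p) extra space for p participants.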
Exercise 4 — find_most_active_day()¶
Implement a function that returns the date with the highest total step count across all participants.
Constraints:
Do not use df.idxmax(), df.sort_values(), or df.groupby() — implement the search manually.
Your function should run in O(n) time where n is the number of rows.
What to document in your README (practice now):
Problem: find the most active day across all participants
Algorithm: linear scan, tracking max with a running variable
Why appropriate: single pass through data, O(n) time, O(d) space (d = distinct dates)
Complexity: O(n) time, O(d) space
Expected evidence before moving on:
function output and pandas cross-check both shown
one short README-style paragraph using the 4 bullet structure above
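As a reference for the overall shape (not the submission itself), the two-pass linear scan described above might look like this; most_active_day is an illustrative name, and the date/steps columns match the template that follows:

```python
import pandas as pd

def most_active_day(df: pd.DataFrame) -> tuple:
    """Aggregate steps per date, then track the max with a running variable.
    O(n) time, O(d) space for d distinct dates."""
    totals: dict = {}
    for _, row in df.iterrows():               # pass 1: aggregate per date
        totals[row["date"]] = totals.get(row["date"], 0) + row["steps"]
    best_date, best_steps = None, -1
    for date, total in totals.items():         # pass 2: running maximum
        if total > best_steps:
            best_date, best_steps = date, total
    return best_date, best_steps
```

Note how neither pass sorts anything: a single comparison per date is enough to keep the running maximum.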
import pandas as pd
merged = pd.read_csv("merged.csv")
def find_most_active_day(df: pd.DataFrame) -> tuple:
"""
Return the date with the highest total step count across all participants.
Do NOT use df.idxmax() or df.sort_values().
Implement using a loop and a dictionary for date aggregation.
Parameters
----------
df : pd.DataFrame
Must contain 'date' and 'steps' columns.
Returns
-------
tuple : (date_str, total_steps) for the most active day
Complexity
----------
Time: O(n) — one pass to aggregate, one pass to find max
Space: O(d) — where d is the number of distinct dates
"""
# Step 1: Aggregate total steps per date
totals: dict = {}
for _, row in df.iterrows():
date = row["date"]
steps = row["steps"]
# --- YOUR CODE HERE ---
pass
# Step 2: Find the date with the maximum total
# --- YOUR CODE HERE ---
best_date = None
best_steps = -1
return best_date, best_steps
date, steps = find_most_active_day(merged)
print(f"Most active day: {date}")
print(f"Total steps: {steps:,}")
# Cross-check with pandas (should agree)
check = merged.groupby("date")["steps"].sum().idxmax()
print(f"\nCross-check (pandas): {check} → {'✓ matches' if date == check else '✗ mismatch'}")
Exercise 5 — Longest streak of active days¶
Implement a function that finds the longest consecutive streak of days where a participant’s step count is at or above a given threshold (default: 8 000 steps).
This is a classic dynamic programming pattern:
Let streak[i] = length of the longest active streak ending at day i
Transition: streak[i] = streak[i-1] + 1 if steps[i] >= threshold, else 0
Answer: max(streak[i] for all i)
You only need to keep track of the current streak — no array needed.
What to document in your README:
Problem: longest streak of days above a step threshold
Algorithm: DP with a running counter (variant of Kadane’s algorithm)
Why appropriate: O(n) time, O(1) extra space — optimal for a sequential scan
Complexity: O(n) time, O(1) space
Expected evidence before moving on:
all 4 provided tests pass
one sentence explaining why this is DP and not brute force
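Because streak[i] depends only on streak[i-1], the whole array collapses to a running counter. A minimal sketch (longest_streak_sketch is an illustrative name; your submission implements longest_streak in the template below):

```python
def longest_streak_sketch(steps_per_day, threshold=8000):
    """Running-counter DP: `current` is the streak ending at the current day,
    `best` is the answer so far. O(n) time, O(1) space."""
    best = current = 0
    for steps in steps_per_day:
        current = current + 1 if steps >= threshold else 0  # the DP transition
        if current > best:
            best = current
    return best
```

The empty-input case falls out for free: the loop body never runs and best stays 0.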
def longest_streak(steps_per_day: list, threshold: int = 8000) -> int:
"""
Return the length of the longest consecutive streak of days where
step count >= threshold.
Parameters
----------
steps_per_day : list of int
Daily step counts in chronological order.
threshold : int
Minimum steps to count as an active day (default: 8000).
Returns
-------
int : length of the longest streak
Complexity
----------
Time: O(n)
Space: O(1)
"""
# --- YOUR CODE HERE ---
pass
# --- Tests ---
# Test 1: provided example
data1 = [9000, 7500, 8200, 8100, 9500, 6000, 8800, 9100, 7000, 8300]
result1 = longest_streak(data1, threshold=8000)
print(f"Test 1: {result1} (expected 3 — days 2,3,4: 8200, 8100, 9500)")
# Test 2: all above threshold
data2 = [9000, 8500, 8100, 9500]
result2 = longest_streak(data2, threshold=8000)
print(f"Test 2: {result2} (expected 4 — all days active)")
# Test 3: none above threshold
data3 = [5000, 6000, 7000]
result3 = longest_streak(data3, threshold=8000)
print(f"Test 3: {result3} (expected 0 — no active days)")
# Test 4: empty input
result4 = longest_streak([], threshold=8000)
print(f"Test 4: {result4} (expected 0 — empty input)")
3. Project Questions You Can Choose to Solve¶
You must complete at least 3 analyses total:
1 Beginner
1 Intermediate
1 Advanced
You may choose from the examples below or propose your own questions.
3.1 What defines each level?¶
The level is defined by the algorithmic technique used, not by the topic.
Beginner (Weeks 0-4 skills): loops, conditionals, lists/dicts, basic sorting, simple counting/aggregation.
Intermediate (Weeks 5-7 skills): structured algorithm design (e.g., sliding window, hashing for fast lookup, sorting + search, baseline vs improved approach).
Advanced (Weeks 8+ skills): graph algorithms, dynamic programming, or more advanced optimization/modeling choices with clear justification.
3.2 Beginner Level (start implementing now)¶
Goal: Show correct problem solving with core Python and basic algorithmic thinking.
Requirements (all must be met):
Solve a clearly stated question using explicit algorithmic logic (not only library one-liners such as max(), min() or pd.sort_values()).
Use at least one of: linear scan, counting with dict, manual aggregation, simple sorting or list comprehensions.
Report complexity informally (e.g., “single pass over rows -> O(n)”).
Example questions (choose one or design your own):
Which date has the highest total step count?
Which participant has the highest average daily steps?
On how many days does each participant exceed a chosen threshold (e.g., 10,000 steps)?
Rank participants by total steps using a simple baseline sort (e.g., bubble sort).
Find each participant’s most active day (maximum value with linear scan).
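For instance, the threshold question above can be answered with a single pass and a dict accumulator; days_over_threshold and the (participant_id, steps) pair format are illustrative, and in practice you would iterate over the rows of merged.csv:

```python
def days_over_threshold(rows, threshold=10000):
    """Count, per participant, how many days exceed `threshold` steps.
    `rows` is an iterable of (participant_id, steps) pairs.
    One pass over the rows -> O(n) time, O(p) space for p participants."""
    counts = {}
    for pid, steps in rows:
        if steps > threshold:
            counts[pid] = counts.get(pid, 0) + 1
    return counts
```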
3.3 Intermediate Level (after algorithmics starts, after ~week 6)¶
Goal: Compare at least two approaches and show why one is better.
Requirements (all must be met):
Implement one intermediate technique (e.g., sliding window, hash-based lookup, sorting + binary search).
Include a baseline approach and an improved approach.
Explain complexity of both and justify the improvement.
Example questions (choose one or design your own):
Detect 7-day trends in activity using a sliding window moving average.
Detect unusually active/inactive days using a rule and efficient lookup strategy.
Rank users by multiple metrics and compare two ranking implementations.
Find repeated activity patterns over fixed windows (baseline nested loops vs optimized method).
Query “top-k days” per participant (full sort baseline vs more efficient strategy).
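The sliding-window question above illustrates the baseline-vs-improved comparison well. A sketch (moving_average is an illustrative name): the O(n) version maintains a running window sum instead of re-summing every window, which is the O(n·w) baseline you would compare against:

```python
def moving_average(steps, window=7):
    """Moving average over consecutive windows of `window` days.
    O(n) time, O(1) extra space beyond the output list."""
    if len(steps) < window:
        return []
    window_sum = sum(steps[:window])             # first window, summed once
    averages = [window_sum / window]
    for i in range(window, len(steps)):
        window_sum += steps[i] - steps[i - window]  # slide: add new day, drop old
        averages.append(window_sum / window)
    return averages
```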
3.4 Advanced Level (later in the course, after ~week 8)¶
Goal: Model the problem with a stronger algorithmic paradigm.
Requirements (all must be met):
Use at least one advanced method (graph traversal, shortest path, dynamic programming, etc.).
Justify why this modeling choice fits the question.
State time and space complexity clearly.
Example questions (choose one or design your own):
Longest active streak under a chosen definition (DP-style formulation).
Build a participant similarity graph and find communities/components (BFS/DFS).
Model activity-state transitions and compute shortest/least-cost transitions (Dijkstra-style).
Detect longest stable pattern segment under constraints (DP or optimized state tracking).
Compare graph-based vs non-graph baseline for a relationship question.
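For the similarity-graph question, the component search could be a BFS like the sketch below; connected_components and the adjacency-dict format are assumptions, and building the graph (deciding when two participants count as "similar") is the modeling choice you would need to justify:

```python
from collections import deque

def connected_components(graph):
    """BFS over an adjacency dict {node: [neighbours]}.
    Returns a list of components. O(V + E) time, O(V) space."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        queue, comp = deque([start]), []
        seen.add(start)
        while queue:                       # standard BFS from `start`
            node = queue.popleft()
            comp.append(node)
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return components
```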
3.5 Propose Your Own Question (encouraged)¶
You are strongly encouraged to propose your own analysis question.
To classify your question, check:
Technique: Which algorithmic idea from the course does it require?
Complexity: Can you explain runtime/space in a meaningful way?
Evidence: Can you show results on the provided wearable dataset?
Comparison: (Intermediate/Advanced) Can you compare against a simpler baseline?
If these are clear, your custom question is valid.
Example README¶
You can find a sample submission with a possible layout of a README file here.
4. Optional: Use Your Own Withings Data¶
This section is not required. It allows you to run your algorithms on your own activity data and compare yourself with the cohort.
Step 1: Install the Withings App¶
Download Withings Health Mate from the App Store or Google Play and create an account. Ensure synchronisation with your phone sensors is enabled.
Step 2: OAuth Authentication¶
Withings requires OAuth 2.0 to access your personal data. Run the script below to complete the authentication flow. You will need to provide a callback URL — use this hosted page:
https://withings-token-show.lovable.app/
Step 3: Run the authentication script¶
# withings_auth.py — save this to a file and run: python withings_auth.py
# This cell only prints the script content; it does not execute OAuth.
import json, random, webbrowser, requests
CLIENT_ID = "549043381e026880160c3f4c34984c8d7677df0d9acf73cc4e629c2fb938e3fe"
CONSUMER_SECRET = "ed922958cb33597d6c8f536b8c9c779460904a12d9edc9436a1f8fb7ea84b04b"
CALLBACK_URL = "https://withings-token-show.lovable.app/"
AUTH_URL = "https://account.withings.com/oauth2_user/authorize2"
TOKEN_URL = "https://wbsapi.withings.net/v2/oauth2"
SCOPE = "user.activity"
CONFIG_FILE = "config.json"
def main():
url = (f"{AUTH_URL}?response_type=code&client_id={CLIENT_ID}"
f"&redirect_uri={CALLBACK_URL}&scope={SCOPE}"
f"&state={random.randint(0, 999999999)}")
print("Opening browser for Withings login...")
webbrowser.open(url)
code = input("\nPaste the code you received: ").strip()
data = {"action": "requesttoken", "grant_type": "authorization_code",
"client_id": CLIENT_ID, "client_secret": CONSUMER_SECRET,
"code": code, "redirect_uri": CALLBACK_URL}
resp = requests.post(TOKEN_URL, data=data)
resp.raise_for_status()
tokens = resp.json()["body"]
config = {"client_id": CLIENT_ID, "consumer_secret": CONSUMER_SECRET,
"access_token": tokens["access_token"],
"refresh_token": tokens["refresh_token"],
"userid": tokens["userid"]}
with open(CONFIG_FILE, "w") as f:
json.dump(config, f, indent=2)
print(f"Tokens saved to {CONFIG_FILE}")
print("Important: add config.json to your .gitignore!")
if __name__ == "__main__":
main()
Summary¶
| Topic | Key takeaway |
|---|---|
| Grading | 30% total: 15% code + 15% peer review with explicit rubrics |
| Definition of done | 3 analyses (B/I/A) + tests + README + reproducible repo |
| Dataset | Withings ScanWatch, 12 participants, local researchdata/ folder |
| README | For each analysis: problem · algorithm · why · complexity |
| Data privacy | Never commit researchdata/ files to git |
| Peer review | Week 12, specific and actionable technical feedback |
Action items before next seminar¶
Download dataset, complete Exercise 1, and record observations
Build merged.csv from all participants (Exercise 2)
Complete at least one Beginner analysis with tests
Write one README-style analysis section using the worked example
University of Geneva — Algorithmics & Data Structures