University of Geneva — Algorithmics & Data Structures
Today’s Objectives¶
Why analysis matters in real systems
Big-O notation and growth classes
Python data structure costs (list, dict, set)
Implementation trade-offs in practice
Fair benchmarking and interpretation
1) Spotify Motivation¶
Spotify-scale constraints:
640M+ active users
100M+ tracks
sub-second product expectations
Two core complexity questions:
How do we check if a song is already in a user’s library?
When should recommendations be computed (on read vs precomputed)?
The key lesson: complexity is choosing where the work happens.
# Two equivalent algorithms, very different scaling
def is_liked_list(song, liked_songs):
    for s in liked_songs:  # O(n)
        if s == song:
            return True
    return False

def is_liked_set(song, liked_song_set):
    return song in liked_song_set  # O(1) average

liked_list = ["Shape of You", "Blinding Lights", "Levitating"]
liked_set = set(liked_list)
print(is_liked_list("Levitating", liked_list))
print(is_liked_set("Levitating", liked_set))
2) Big-O: Core Idea¶
Drop constants: 5n -> O(n)
Drop lower-order terms: n^2 + n -> O(n^2)
Keep dominant growth term: 3n^2 + 10n + 7 -> O(n^2)
At large scale, growth class matters more than micro-optimizations.
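To make "dominant growth term" concrete, here is a small sketch (not from the lecture) that prints the raw value of each term side by side; even a 5x constant on the linear term cannot keep up with the quadratic term:

```python
import math

# Raw values of common growth terms at increasing n.
# The quadratic term dominates at scale, regardless of the
# constant factor on the linear term.
for n in [10, 1_000, 1_000_000]:
    print(f"n={n:>9,}  5n={5 * n:>13,}  "
          f"n log2 n={n * math.log2(n):>16,.0f}  n^2={n * n:>16,}")
```

By n = 1,000,000 the n^2 column is roughly 200,000 times larger than 5n, which is why Big-O ignores the constant.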
3) Reading Big-O from Code¶
One loop over n items -> O(n)
Nested loops over n and n -> O(n^2)
Hash membership (x in set) -> O(1) average
# O(n): playlist scan
def contains_scan(playlist, target):
    for song in playlist:
        if song == target:
            return True
    return False

# O(n^2): pairwise work
def all_pairs_work(n):
    c = 0
    for i in range(n):
        for j in range(n):
            c += 1
    return c

# O(1) avg: hash lookup
def contains_hash(song_id, library_set):
    return song_id in library_set
4) Spotify Playlist Equivalence Problem¶
Given two playlists p1 and p2, decide whether they contain exactly the same songs:
same song IDs
same multiplicities
order does not matter
We’ll compare four designs (A/B/C/D) plus a conceptual brute-force baseline.
from collections import Counter

# Design A: The Nested Scan -> O(n^2)
# Note: a plain "song in p2" check ignores multiplicities
# (e.g. [1, 1, 2] vs [1, 2, 2] would wrongly pass), so we
# consume one copy from a working list per match.
def playlist_equiv_A(p1, p2):
    if len(p1) != len(p2):
        return False
    remaining = list(p2)
    for song in p1:
        if song not in remaining:  # O(n) scan per song
            return False
        remaining.remove(song)     # O(n) removal keeps multiplicities honest
    return True

# Design B: Sort and compare -> O(n log n)
def playlist_equiv_B(p1, p2):
    return sorted(p1) == sorted(p2)

# Design C: Frequency dict -> O(n)
def playlist_equiv_C(p1, p2):
    if len(p1) != len(p2):
        return False
    freq = {}
    for s in p1:
        freq[s] = freq.get(s, 0) + 1
    for s in p2:
        if freq.get(s, 0) == 0:
            return False
        freq[s] -= 1
    return True

# Design D: Counter map -> O(n)
def playlist_equiv_D(p1, p2):
    return Counter(p1) == Counter(p2)

# Quick checks
print(playlist_equiv_A([1, 2, 2, 3], [2, 1, 3, 2]))
print(playlist_equiv_B([1, 2, 3], [3, 2, 1]))
print(playlist_equiv_C([1, 1, 2], [1, 2, 1]))
print(playlist_equiv_D([1, 2, 2], [2, 1, 2]))
print(playlist_equiv_D([1, 2, 2], [1, 2, 3]))
Design X (conceptual): brute force permutations¶
Brute force tries every permutation of one playlist and compares to the other.
A permutation means one possible ordering of the same items.
Example with songs [A, B, C]:
[A, B, C], [A, C, B], [B, A, C], [B, C, A], [C, A, B], [C, B, A]
So for 3 songs, there are 6 possible orderings.
This is O(n!), where n! (“n factorial”) is the number of permutations of n distinct items:
n! = n * (n-1) * (n-2) * ... * 2 * 1
Examples:
3! = 3 * 2 * 1 = 6
5! = 5 * 4 * 3 * 2 * 1 = 120
10! = 3,628,800
The number of possibilities explodes extremely fast, which is why this approach is only a cautionary baseline.
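To make Design X tangible, here is a hedged sketch of the brute-force checker using itertools.permutations (the function name is ours, not part of the lecture code), plus a look at how fast n! grows:

```python
import math
from itertools import permutations

# Conceptual Design X: try every ordering of p1 and compare to p2.
# O(n!) work; only usable for tiny playlists, shown purely as a
# cautionary baseline.
def playlist_equiv_bruteforce(p1, p2):
    if len(p1) != len(p2):
        return False
    return any(list(perm) == p2 for perm in permutations(p1))

print(playlist_equiv_bruteforce([1, 2, 2], [2, 1, 2]))  # -> True
print(playlist_equiv_bruteforce([1, 2, 2], [1, 2, 3]))  # -> False

# The factorial blow-up:
for n in [3, 5, 10, 15]:
    print(f"{n}! = {math.factorial(n):,}")
```

Even at n = 15 there are over a trillion orderings, so this design never leaves the whiteboard.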
Generate Fake Song Titles¶
Next, you can use the script below to generate synthetic song titles so we can build large playlists.
# =================================================================
# DATA GENERATION (The "Black Box")
# DO NOT WORRY about how this code works!
# Its only job is to create thousands of fake song titles for our tests.
# =================================================================
import random
import time
import matplotlib.pyplot as plt
adjectives = [
"Silent", "Golden", "Crazy", "Broken", "Shining", "Lost", "Fading", "Electric", "Wild", "Lonely",
"Burning", "Hidden", "Distant", "Dark", "Bright", "Cold", "Sweet", "Eternal", "Fragile", "Majestic",
"Secret", "Velvet", "Radiant", "Fallen", "Mystic", "Glowing", "Twilight", "Endless", "Frozen", "Luminous",
"Melting", "Stormy", "Silent", "Velvet", "Shadowy", "Infinite", "Crystal", "Silver", "Golden", "Blazing",
"Gentle", "Rapid", "Thunderous", "Soft", "Vivid", "Electric", "Fierce", "Whispering", "Haunting", "Sacred"
]
nouns = [
"Heart", "Dream", "Sky", "Night", "Light", "Fire", "Rain", "Shadow", "Star", "World",
"River", "Moon", "Sun", "Ocean", "Flame", "Stone", "Wind", "Garden", "Forest", "Wave",
"Mist", "Song", "Journey", "Path", "Secret", "Whisper", "Storm", "Vision", "Time", "Soul",
"Desire", "Memory", "Ember", "Frost", "Echo", "Horizon", "Pulse", "Spark", "Tide", "Realm",
"Mirror", "Dreamscape", "Candle", "Phoenix", "Labyrinth", "Aura", "Galaxy", "Comet", "Petal", "Skyline"
]
verbs = [
"Dancing", "Running", "Falling", "Flying", "Waiting", "Hiding", "Chasing", "Burning", "Shining", "Breaking",
"Crying", "Laughing", "Rising", "Fading", "Screaming", "Gliding", "Spinning", "Whispering", "Shivering", "Drifting",
"Exploding", "Twisting", "Crawling", "Leaping", "Sliding", "Soaring", "Floating", "Collapsing", "Shattering", "Singing",
"Hoping", "Longing", "Waiting", "Glowing", "Trembling", "Calling", "Flickering", "Wandering", "Smiling", "Shaking",
"Flowing", "Burning", "Breaking", "Melting", "Raging", "Cascading", "Dimming", "Hovering", "Rising", "Quivering"
]
def generate_song_title():
    pattern = random.choice([
        "{adj} {noun}",
        "{verb} {noun}",
        "{adj} {verb}",
        "{noun} of {noun}",
        "{adj} {noun} {verb}",
    ])
    return pattern.format(
        adj=random.choice(adjectives),
        noun=random.choice(nouns),
        verb=random.choice(verbs),
    )

def generate_unique_song_titles(total=5000):
    titles = set()
    attempts = 0
    max_attempts = total * 20
    while len(titles) < total and attempts < max_attempts:
        titles.add(generate_song_title())
        attempts += 1
    return list(titles)

def generate_playlist(n, catalog, unique_ratio=0.8):
    """Generates a list of n song titles."""
    num_unique = max(1, int(n * unique_ratio))
    pool = random.sample(catalog, k=min(num_unique, len(catalog)))
    return [random.choice(pool) for _ in range(n)]
5) Real Benchmark Setup¶
We now move from static numbers to a real benchmark.
Next, we will call this helper to generate a synthetic song-title catalog and run Designs A-D on sampled playlists.
# Build catalog + measure A/B/C/D on generated song titles
TOTAL_SONGS = 3000
catalog = generate_unique_song_titles(total=TOTAL_SONGS)
print(f"Catalog size: {len(catalog)} unique synthetic song titles")
sizes = [50, 100, 200, 400, 800, 1200]
reps = 3
curves = {
"A - Nested O(n^2)": [],
"B - Sort O(n log n)": [],
"C - Dict O(n)": [],
"D - Counter O(n)": [],
}
for n in sizes:
    p1 = generate_playlist(n, catalog)
    p2 = p1.copy()
    random.shuffle(p2)
    methods = [
        ("A - Nested O(n^2)", playlist_equiv_A),
        ("B - Sort O(n log n)", playlist_equiv_B),
        ("C - Dict O(n)", playlist_equiv_C),
        ("D - Counter O(n)", playlist_equiv_D),
    ]
    for label, fn in methods:
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter()
            fn(p1, p2)
            best = min(best, time.perf_counter() - t0)
        curves[label].append(best * 1000)

print("Benchmark complete. Curves ready for plotting.")
Now that we have measured runtimes, we can plot them as follows:
plt.figure(figsize=(10, 5))
for label, vals in curves.items():
    plt.plot(sizes, vals, marker="o", linewidth=2, label=label)
plt.title("Playlist Equivalence: Real benchmark on synthetic song titles")
plt.xlabel("Playlist length n")
plt.ylabel("Best runtime over 3 reps (ms)")
plt.grid(alpha=0.3)
plt.legend()
plt.show()
a_slowdown = curves["A - Nested O(n^2)"][-1] / curves["A - Nested O(n^2)"][0]
d_slowdown = curves["D - Counter O(n)"][-1] / curves["D - Counter O(n)"][0]
print(f"Design A slowdown (n={sizes[-1]} vs n={sizes[0]}): {a_slowdown:.1f}x")
print(f"Design D slowdown (n={sizes[-1]} vs n={sizes[0]}): {d_slowdown:.1f}x")
6) Why We Count Operations (Not Only Seconds)¶
Raw runtime depends on:
hardware,
background processes,
Python/runtime state.
That is why algorithm analysis focuses on machine-independent growth.
For each algorithm, we model work with a function T(n):
pick a basic operation (comparison, assignment, lookup),
count how many times it runs,
keep the dominant growth term (Big-O).
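A minimal sketch of this counting approach (names here are illustrative, not part of the lecture code): instrument the linear playlist scan so T(n) is the number of equality comparisons rather than a wall-clock time.

```python
# Count the basic operation (equality comparison) instead of timing.
def contains_scan_counted(playlist, target):
    comparisons = 0
    for song in playlist:
        comparisons += 1
        if song == target:
            return True, comparisons
    return False, comparisons

# Worst case (target absent): T(n) = n comparisons -> O(n).
print(contains_scan_counted(["a", "b", "c", "d"], "zzz"))  # -> (False, 4)
# Early exit: fewer comparisons, but worst-case growth is what Big-O keeps.
print(contains_scan_counted(["a", "b", "c", "d"], "b"))    # -> (True, 2)
```

The count is the same on any machine, which is exactly the machine independence the section argues for.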
7) Hidden Loop Reminder + Check Your Understanding¶
Some Python code looks constant-time but hides linear work.
Typical hidden-loop examples:
x in my_list -> scans list (O(n))
max(my_list) -> scans list (O(n))
my_list.count(v) -> scans list (O(n))
Quick checks:
Which list operation is not O(1)?
Why is x in set faster than x in list at scale?
Can a slower algorithm still win for tiny inputs?
Why does Spotify refresh Discover Weekly weekly instead of per app-open?
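The second quick check can be verified directly. A hedged timing sketch (not from the lecture) comparing list vs set membership for an absent element, which forces the list to scan all n items while the set does a single hash probe:

```python
import time

n = 200_000
data_list = list(range(n))
data_set = set(data_list)
target = -1  # absent, so the list scan must touch all n items

t0 = time.perf_counter()
target in data_list          # O(n) hidden loop
list_ms = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
target in data_set           # O(1) average hash lookup
set_ms = (time.perf_counter() - t0) * 1000

print(f"list membership: {list_ms:.3f} ms, set membership: {set_ms:.4f} ms")
```

Exact numbers vary by machine, but the gap widens as n grows, which is the whole point of the hidden-loop reminder.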
Part 2: Notebook Exercises¶
Each exercise prompt is directly above its code cell.
Exercise 01: The Hidden Loop Hunt¶
For each snippet below:
Write the Big-O complexity.
Mark the specific Python keyword that is actually a “Hidden Loop.”
liked_songs = ["Blinding Lights", "Levitating", "As It Was"]
play_counts = [120, 340, 85, 510]
active_users = ["alice", "bob", "carol", "dina"]
premium_members = {"alice", "carol"}
def send_email(user):
    print(f"Sent premium message to {user}")

# Snippet 1
if "Levitating" in liked_songs:
    print("Found it!")

# Snippet 2
most_played = max(play_counts)
print("Most played:", most_played)

# Snippet 3
for user in active_users:
    if user in premium_members:
        send_email(user)
Exercise 02: The Accountant (Design C)¶
Complete the scaffold below to implement the dictionary-based counting strategy.
def equiv_dict(p1, p2):
    """
    Design C: The Accountant (Dictionary)
    Complexity: O(n)
    """
    if len(p1) != len(p2):
        return False
    counts = {}

    # TODO 1: Count songs in p1
    # Hint: Use counts[song] = counts.get(song, 0) + 1
    for song in p1:
        pass  # Your code here

    # TODO 2: Check p2 against the counts
    for song in p2:
        if song not in counts or counts[song] == 0:
            return False
        counts[song] -= 1
    return True
Exercise 03: Performance Analysis¶
Look at the graph you generated below.
At what value of n (playlist length) did you start to “feel” a delay in the Nested Scan?
Using one of your measured timings and O(n^2) scaling, estimate how many hours the nested scan would take for a playlist of one million songs.
Why is the O(n log n) line almost invisible at the bottom of the graph compared to the O(n^2) line?
# Ex 03 starter — Benchmark sorted() vs nested loop (n from 10 to 5000)
def equiv_nested(p1, p2):
    """
    Design A: The Nested Scan
    Complexity: O(n^2)
    """
    if len(p1) != len(p2):
        return False
    for song in p1:          # Loop 1
        if song not in p2:   # Loop 2 (hidden inside "in")
            return False
    return True

def equiv_sorted(p1, p2):
    # O(n log n)
    return sorted(p1) == sorted(p2)
sizes = [10, 50, 100, 250, 500, 1000, 2500, 5000]
nested_ms, sorted_ms = [], []
for n in sizes:
    p1 = [random.randint(1, 100000) for _ in range(n)]
    p2 = p1.copy()
    random.shuffle(p2)

    t0 = time.perf_counter()
    equiv_nested(p1, p2)
    nested_ms.append((time.perf_counter() - t0) * 1000)

    t0 = time.perf_counter()
    equiv_sorted(p1, p2)
    sorted_ms.append((time.perf_counter() - t0) * 1000)
plt.figure(figsize=(10, 5))
plt.plot(sizes, nested_ms, marker="o", linewidth=2, label="Nested scan O(n^2)")
plt.plot(sizes, sorted_ms, marker="o", linewidth=2, label="sorted() compare O(n log n)")
plt.title("Ex 03: sorted() vs nested loop")
plt.xlabel("n")
plt.ylabel("Runtime (ms)")
plt.grid(alpha=0.3)
plt.legend()
plt.show()
Ex 04 — O(n) duplicate check¶
Design and test an O(n) algorithm to detect duplicates in a playlist.
# Ex 04 starter — Design an O(n) duplicate checker for playlists
def has_duplicates_playlist(song_ids):
    # TODO: design O(n) solution
    # Hint: track seen song IDs in a set
    return False
# Tests
print(has_duplicates_playlist([10, 20, 30, 40]) is False)
print(has_duplicates_playlist([10, 20, 30, 20]) is True)
print(has_duplicates_playlist([]) is False)
Reflection prompt¶
For each exercise, write 2-3 lines:
What was the dominant operation?
What growth class did you observe?
Which design would you choose at Spotify scale, and why?
Key Takeaways¶
Start with the system question: where should the work happen?
Big-O guides design choices before production failures.
Benchmarking validates theory; theory explains benchmarks.
At scale, data-structure choice is often the main performance lever.