University of Geneva · CUI BSc · Algorithmics & Data Structures

Analysis & Complexity

Seminar 5 · Scaling Beyond Small Data

Today's Roadmap

1. The Scaling Wall: why code that works for 10 items fails for 10 million
2. Big-O Notation: learning to ignore the "noise" and see the "shape"
3. Machine Independence: measuring steps instead of milliseconds
4. Case Study: designing Spotify-scale search algorithms
Spotify by the numbers: 640M+ active users · 100M+ tracks · < 1s page load

How can any query finish in under 1 second when n = 640 million?
Source: Spotify SEC filing, Q3 2024

Why Analysis Matters

  • Real systems break when algorithms scale poorly.
  • At Spotify scale (n = 640M), even a "fast" O(n) scan is too slow.
  • If checking each user took 0.0001 s, a single full scan would take roughly 17.8 hours.
  • We need algorithms that stay fast even as the world's data explodes.
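The 17.8-hour figure is simple arithmetic; a minimal back-of-envelope check, using the numbers from the slide:

```python
# Back-of-envelope check of the claim above.
n = 640_000_000          # active users (Spotify, Q3 2024)
cost_per_check = 0.0001  # assumed seconds per user, as on the slide

total_seconds = n * cost_per_check
total_hours = total_seconds / 3600
print(f"{total_hours:.1f} hours")  # roughly 17.8 hours for one linear scan
```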

Two Algorithms. Same Result. Same Speed?

Scenario: When you like a song, Spotify must instantly check: is it already there?

Algorithm A — List
liked_songs = ["Shape of You", "Blinding Lights", ...]

def is_liked_A(song):
    for s in liked_songs:
        if s == song:
            return True
    return False
Algorithm B — Set
liked_songs = {"Shape of You", "Blinding Lights", ...}

def is_liked_B(song):
    return song in liked_songs

In Python, both return the same answer. But the strategy is different.

Results: The Gap is Massive

Search time for a song in a list vs a set as n grows.
The Membership Test
# Alg A: List -> Scans items 1-by-1
"SongX" in songs_list

# Alg B: Set -> Goes directly to item
"SongX" in songs_set
n (Songs)    List Time    Set Time
1,000        0.03 ms      0.0002 ms
10,000       0.30 ms      0.0002 ms
100,000      3.00 ms      0.0002 ms
1,000,000    30.00 ms     0.0002 ms
List scales with n
10x more data = 10x more time. This is O(n).
Set is Constant
Time stays flat regardless of size. This is O(1).
Relative Scale: If the Set lookup took 1 second, the List scan would take about 42 hours at n = 1M (a 150,000× gap, per the table above).
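You can reproduce the trend yourself; a minimal sketch using `timeit` (absolute numbers will differ on your machine, but the list time should grow roughly linearly while the set time stays flat):

```python
import timeit

for n in (1_000, 10_000, 100_000):
    songs_list = [f"song{i}" for i in range(n)]
    songs_set = set(songs_list)
    target = "not-in-library"  # worst case: the song is absent

    # Time 100 membership tests on each structure.
    t_list = timeit.timeit(lambda: target in songs_list, number=100)
    t_set = timeit.timeit(lambda: target in songs_set, number=100)
    print(f"n={n:>7}: list {t_list:.5f}s  set {t_set:.5f}s")
```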

The Problem with "Time"

Why seconds are not a reliable metric for theory.
import time

start = time.perf_counter()
result = my_function(data)
elapsed = time.perf_counter() - start

# Problems:
# 1. Hardware: My Mac is faster than your laptop.
# 2. Noise: Is your browser open? Is your battery low?
# 3. Language: Python vs. C++ overhead.

The Solution: Counting Steps

  • Instead of seconds, we count Basic Operations.
  • 1 Step: Variable assignment (`x = 10`)
  • 1 Step: Comparing two values (`a < b`)
  • 1 Step: Accessing a list index (`my_list[0]`)
  • This gives us a formula T(n) that works regardless of the computer.
  • *Big-O* describes the worst-case scenario: it bounds how fast the number of steps can grow as the input size grows.
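To make step counting concrete, here is a hypothetical linear search with its comparisons counted explicitly (the exact constant in T(n) depends on which operations you count, which is precisely why Big-O throws the constant away):

```python
def count_comparisons(items, target):
    """Linear search that returns how many comparisons it performed."""
    steps = 0
    for item in items:
        steps += 1            # one comparison per iteration
        if item == target:
            return steps      # found early: fewer steps
    return steps              # worst case: n comparisons

print(count_comparisons(list(range(10)), -1))  # absent target: 10 comparisons
print(count_comparisons([5, 6, 7], 6))         # found at position 2: 2 comparisons
```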

The Math: Simplifying the Story

  • The Shape: g(n)
  • We look for the "blueprint" of growth.
  • Nested loops = n²
  • Single loops = n
  • The Scale: c
  • We use a multiplier to "stretch" the blueprint.
  • We don't care if it is 3n or 100n.
  • They are both "Linear" growth.

Big-O: The Safety Ceiling

For all sufficiently large n, the function f(n) is "trapped" under the ceiling of c · g(n).
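Formally, the "ceiling" metaphor corresponds to the standard definition:

```latex
f(n) \in O(g(n)) \iff \exists\, c > 0,\ \exists\, n_0 \ge 0
\ \text{ such that }\ f(n) \le c \cdot g(n) \ \text{ for all } n \ge n_0
```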

Big-O: Physical Analogies

  • O(1): A single, direct operation, e.g. a set lookup (instant), because we know the exact address of the item.
  • O(log n): Tearing a phonebook in half repeatedly until you find the right page. Doubling the book's size only adds one extra tear.
  • O(n log n): You have n loose pages of a phonebook dropped on the floor in a random pile. To put them in order, you divide the pile into two, then those into two (that's the log n part), and at every level of splitting, you have to look at every page to decide which pile it belongs in (that's the n part).
  • O(n): Reading every name in the phonebook until you find the right one.
  • O(n²): For every person in the book, you have to read the whole book again.
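The phonebook-tearing analogy can be checked numerically; a minimal sketch counting how many halvings it takes to get down to a single page (doubling n adds exactly one tear):

```python
def tears_needed(pages):
    """How many times can we halve `pages` before reaching a single page?"""
    tears = 0
    while pages > 1:
        pages //= 2
        tears += 1
    return tears

print(tears_needed(1_000))  # 9 tears
print(tears_needed(2_000))  # 10 tears: double the book, only one more tear
```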

The Growth Classes

Notation     Name           Speed        Example
O(1)         Constant       Instant      Set lookup (already seen)
O(log n)     Logarithmic    Excellent    Binary Search (week 7)
O(n)         Linear         Good         List search (already seen)
O(n log n)   Linearithmic   Decent       Sorting / Divide and Conquer (week 9)
O(n²)        Quadratic      Dangerous    Nested loops (already seen)

The "Hidden Loop" Trap

Python makes some O(n) operations look like O(1).
my_list = [10, 20, 30, 40]

if 50 in my_list:   # This is a hidden Loop! O(n)
    print("Found")

max(my_list)        # This is a hidden Loop! O(n)
my_list.count(10)   # This is a hidden Loop! O(n)
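When a hidden-loop membership test sits inside another loop, the standard fix is to pay O(n) once to build a set, then enjoy O(1) lookups; a minimal sketch with made-up data:

```python
my_list = [10, 20, 30, 40]
targets = [50, 30, 99]

# Slow: O(len(targets) * n) -- a hidden loop inside a loop
found_slow = [t for t in targets if t in my_list]

# Fast: O(n) once to build the set, then O(1) per lookup
my_set = set(my_list)
found_fast = [t for t in targets if t in my_set]

print(found_slow == found_fast)  # True: same answer, different cost
```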

Case Study: Playlist Equivalence

  • Problem: Given two playlists, are they identical?
  • Must have the same songs and same counts (multiplicity).
  • Order does not matter.
  • Goal: Find the most efficient way to verify this at scale.

Design A: Nested Scan -> O(n²)

Pattern: Outer loop * Inner loop.

The "Double Loop" Logic
def equiv_A(p1, p2):
    if len(p1) != len(p2):
        return False
    for song in p1:                           # Loop 1
        if p1.count(song) != p2.count(song):  # Hidden loops inside!
            return False
    return True

This works for 10 songs, but grinds to a halt at 10,000.

Design B: Sort & Compare -> O(n log n)

If we organize the data first, comparison becomes easy.

The "Organizer" Logic
def equiv_B(p1, p2):
    # Sorting: O(n log n)
    # Comparison: O(n)
    return sorted(p1) == sorted(p2)

Standard approach for many real-world problems.

Design C: Dictionary Count -> O(n)

Use a Hash Map to count occurrences in one pass.

The "Accountant" Logic
def equiv_C(p1, p2):
    if len(p1) != len(p2):
        return False
    counts = {}
    for s in p1:                   # count every song in p1
        counts[s] = counts.get(s, 0) + 1
    for s in p2:                   # check p2 against counts
        if counts.get(s, 0) == 0:
            return False
        counts[s] -= 1
    return True

Linear speed: This is how Spotify handles millions of tracks.

Design D: The Pythonic Way -> O(n)

Python provides optimized objects for these common tasks.

Using Built-in Tools
from collections import Counter

def equiv_D(p1, p2):
    return Counter(p1) == Counter(p2)

Under the hood, this uses the dictionary logic from Design C.
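A quick sanity check that the designs agree, using hypothetical playlists (equiv_B and equiv_D reproduced from the slides above):

```python
from collections import Counter

def equiv_B(p1, p2):
    return sorted(p1) == sorted(p2)

def equiv_D(p1, p2):
    return Counter(p1) == Counter(p2)

a = ["Shape of You", "Blinding Lights", "Shape of You"]
b = ["Blinding Lights", "Shape of You", "Shape of You"]  # same multiset, new order
c = ["Blinding Lights", "Shape of You"]                  # missing one duplicate

print(equiv_B(a, b), equiv_D(a, b))  # True True
print(equiv_B(a, c), equiv_D(a, c))  # False False
```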

The Results: When N grows...

  • At n=50: All strategies feel the same.
  • At n=200: Quadratic O(n²) begins to lag.
  • At n=1,000,000: O(n²) is impossible; O(n) finishes instantly.
  • Lesson: The curve shape is the only thing that matters at scale.

Notebook Exercises

  • 1. Identify hidden loops in 5 code snippets and justify their Big-O.
  • 2. Implement the "Accountant" (Design C) manually in your notebook.
  • 3. Benchmark `sorted()` vs a nested loop for n from 10 to 5,000.
  • 4. Design an O(n) solution to check for duplicates in a playlist.

Key Takeaways

  • Scale changes everything: What works for 10 items fails for 10M.
  • Big-O is a decision tool, not just math.
  • Dictionaries and Sets are performance superpowers (they provide O(1) lookups).
  • Identify the single most expensive operation: always optimize the highest growth term first (n² before n).

Thank You!

Next: Mastering Sorting Algorithms