University of Geneva · CUI BSc · Algorithmics & Data Structures

Analysis & Complexity

Seminar 5 · Scaling Beyond Small Data

Today's Roadmap

1. The Scaling Wall: why code that works for 10 items fails for 10 million
2. Big-O Notation: learning to ignore the "noise" and see the "shape"
3. Machine Independence: measuring steps instead of milliseconds
4. Case Study: designing Spotify-scale search algorithms
Spotify by the numbers: 640M+ active users · 100M+ tracks · < 1s page load

How can any query finish in under 1 second when n = 640 million?
Source: Spotify SEC filing, Q3 2024

Why Analysis Matters

  • Real systems break when algorithms scale poorly.
  • At Spotify scale (n = 640M), even a "fast" O(n) scan is too slow.
  • If checking each user took 0.0001 s, a single full scan would take roughly 17.8 hours.
  • We need algorithms that stay fast even as the world's data explodes.
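The 17.8-hour figure is simple arithmetic; a minimal back-of-envelope check, using the numbers from the slide:

```python
# Back-of-envelope check of the claim above.
n = 640_000_000          # active users (Spotify, Q3 2024)
cost_per_check = 0.0001  # assumed seconds per user, as on the slide

total_seconds = n * cost_per_check
total_hours = total_seconds / 3600
print(f"{total_hours:.1f} hours")  # roughly 17.8 hours for one linear scan
```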

Two Algorithms. Same Result. Same Speed?

Scenario: When you like a song, Spotify must instantly check: is it already there?

Algorithm A — List
liked_songs = ["Shape of You", "Blinding Lights", ...]

def is_liked_A(song):
    for s in liked_songs:
        if s == song:
            return True
    return False
Algorithm B — Set
liked_songs = {"Shape of You", "Blinding Lights", ...}

def is_liked_B(song):
    return song in liked_songs

In Python, both return the same answer. But the strategy is different.

Results: The Gap is Massive

Search time for a song in a list vs a set as n grows.
The Membership Test
# Alg A: List -> Scans items 1-by-1
"SongX" in songs_list

# Alg B: Set -> Goes directly to item
"SongX" in songs_set
n (Songs)    List Time    Set Time
1,000        0.03 ms      0.0002 ms
10,000       0.30 ms      0.0002 ms
100,000      3.00 ms      0.0002 ms
1,000,000    30.00 ms     0.0002 ms
List scales with n
10x more data = 10x more time. This is O(n).
Set is Constant
Time stays flat regardless of size. This is O(1).
Relative Scale: If the Set lookup took 1 second, the List scan would take about 42 hours at n = 1M (a 150,000× gap, per the table above).
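You can reproduce the trend yourself; a minimal sketch using `timeit` (absolute numbers will differ on your machine, but the list time should grow roughly linearly while the set time stays flat):

```python
import timeit

for n in (1_000, 10_000, 100_000):
    songs_list = [f"song{i}" for i in range(n)]
    songs_set = set(songs_list)
    target = "not-in-library"  # worst case: the song is absent

    # Time 100 membership tests on each structure.
    t_list = timeit.timeit(lambda: target in songs_list, number=100)
    t_set = timeit.timeit(lambda: target in songs_set, number=100)
    print(f"n={n:>7}: list {t_list:.5f}s  set {t_set:.5f}s")
```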

The Problem with "Time"

Why seconds are not a reliable metric for theory.
import time

start = time.perf_counter()
result = my_function(data)
elapsed = time.perf_counter() - start

# Problems:
# 1. Hardware: My Mac is faster than your laptop.
# 2. Noise: Is your browser open? Is your battery low?
# 3. Language: Python vs. C++ overhead.

The Solution: Counting Steps

  • Instead of seconds, we count Basic Operations.
  • 1 Step: Variable assignment (`x = 10`)
  • 1 Step: Comparing two values (`a < b`)
  • 1 Step: Accessing a list index (`my_list[0]`)
  • This gives us a formula T(n) that works regardless of the computer.
  • *Big-O* describes the worst-case scenario: it bounds how fast the number of steps can grow as the input size grows.
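To make step counting concrete, here is a hypothetical linear search with its comparisons counted explicitly (the exact constant in T(n) depends on which operations you count, which is precisely why Big-O throws the constant away):

```python
def count_comparisons(items, target):
    """Linear search that returns how many comparisons it performed."""
    steps = 0
    for item in items:
        steps += 1            # one comparison per iteration
        if item == target:
            return steps      # found early: fewer steps
    return steps              # worst case: n comparisons

print(count_comparisons(list(range(10)), -1))  # absent target: 10 comparisons
print(count_comparisons([5, 6, 7], 6))         # found at position 2: 2 comparisons
```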

The Math: Simplifying the Story

  • The Shape: g(n)
  • We look for the "blueprint" of growth.
  • Nested loops = n²
  • Single loops = n
  • The Scale: c
  • We use a multiplier to "stretch" the blueprint.
  • We don't care if it is 3n or 100n.
  • They are both "Linear" growth.

Big-O: The Safety Ceiling

For all sufficiently large n, the function f(n) is "trapped" under the ceiling of c · g(n).
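Formally, the "ceiling" metaphor corresponds to the standard definition:

```latex
f(n) \in O(g(n)) \iff \exists\, c > 0,\ \exists\, n_0 \ge 0
\ \text{ such that }\ f(n) \le c \cdot g(n) \ \text{ for all } n \ge n_0
```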

Big-O: Physical Analogies

  • O(1): A single, direct operation, e.g. a set lookup (instant), because we know the exact address of the item.
  • O(log n): Tearing a phonebook in half repeatedly until you find the right page. Doubling the book's size only adds one extra tear.
  • O(n log n): You have n loose pages of a phonebook dropped on the floor in a random pile. To put them in order, you divide the pile into two, then those into two (that's the log n part), and at every level of splitting, you have to look at every page to decide which pile it belongs in (that's the n part).
  • O(n): Reading every name in the phonebook until you find the right one.
  • O(n²): For every person in the book, you have to read the whole book again.
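The phonebook-tearing analogy can be checked numerically; a minimal sketch counting how many halvings it takes to get down to a single page (doubling n adds exactly one tear):

```python
def tears_needed(pages):
    """How many times can we halve `pages` before reaching a single page?"""
    tears = 0
    while pages > 1:
        pages //= 2
        tears += 1
    return tears

print(tears_needed(1_000))  # 9 tears
print(tears_needed(2_000))  # 10 tears: double the book, only one more tear
```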

The Growth Classes

Notation     Name           Speed        Example
O(1)         Constant       Instant      Set lookup (already seen)
O(log n)     Logarithmic    Excellent    Binary Search (week 7)
O(n)         Linear         Good         List search (already seen)
O(n log n)   Linearithmic   Decent       Sorting / Divide and Conquer (week 9)
O(n²)        Quadratic      Dangerous    Nested loops (already seen)

The "Hidden Loop" Trap

Python makes some O(n) operations look like O(1).
my_list = [10, 20, 30, 40]

if 50 in my_list:   # This is a hidden Loop! O(n)
    print("Found")

max(my_list)        # This is a hidden Loop! O(n)
my_list.count(10)   # This is a hidden Loop! O(n)
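When a hidden-loop membership test sits inside another loop, the standard fix is to pay O(n) once to build a set, then enjoy O(1) lookups; a minimal sketch with made-up data:

```python
my_list = [10, 20, 30, 40]
targets = [50, 30, 99]

# Slow: O(len(targets) * n) -- a hidden loop inside a loop
found_slow = [t for t in targets if t in my_list]

# Fast: O(n) once to build the set, then O(1) per lookup
my_set = set(my_list)
found_fast = [t for t in targets if t in my_set]

print(found_slow == found_fast)  # True: same answer, different cost
```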

Case Study: Playlist Equivalence

  • Problem: Given two playlists, are they identical?
  • Must have the same songs and same counts (multiplicity).
  • Order does not matter.
  • Goal: Find the most efficient way to verify this at scale.

Design A: Nested Scan -> O(n²)

Pattern: Outer loop * Inner loop.

The "Double Loop" Logic
def equiv_A(p1, p2):
    if len(p1) != len(p2):
        return False
    for song in p1:                           # Loop 1
        if p1.count(song) != p2.count(song):  # Hidden loops inside!
            return False
    return True

This works for 10 songs, but grinds to a halt at 10,000.

Design B: Sort & Compare -> O(n log n)

If we organize the data first, comparison becomes easy.

The "Organizer" Logic
def equiv_B(p1, p2):
    # Sorting: O(n log n)
    # Comparison: O(n)
    return sorted(p1) == sorted(p2)

Standard approach for many real-world problems.

Design C: Dictionary Count -> O(n)

Use a Hash Map to count occurrences in one pass.

The "Accountant" Logic
def equiv_C(p1, p2):
    if len(p1) != len(p2):
        return False
    counts = {}
    for s in p1:                   # count every song in p1
        counts[s] = counts.get(s, 0) + 1
    for s in p2:                   # check p2 against counts
        if counts.get(s, 0) == 0:
            return False
        counts[s] -= 1
    return True

Linear speed: This is how Spotify handles millions of tracks.

Design D: The Pythonic Way -> O(n)

Python provides optimized objects for these common tasks.

Using Built-in Tools
from collections import Counter

def equiv_D(p1, p2):
    return Counter(p1) == Counter(p2)

Under the hood, this uses the dictionary logic from Design C.
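A quick sanity check that the designs agree, using hypothetical playlists (equiv_B and equiv_D reproduced from the slides above):

```python
from collections import Counter

def equiv_B(p1, p2):
    return sorted(p1) == sorted(p2)

def equiv_D(p1, p2):
    return Counter(p1) == Counter(p2)

a = ["Shape of You", "Blinding Lights", "Shape of You"]
b = ["Blinding Lights", "Shape of You", "Shape of You"]  # same multiset, new order
c = ["Blinding Lights", "Shape of You"]                  # missing one duplicate

print(equiv_B(a, b), equiv_D(a, b))  # True True
print(equiv_B(a, c), equiv_D(a, c))  # False False
```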

The Results: When N grows...

  • At n=50: All strategies feel the same.
  • At n=200: Quadratic O(n²) begins to lag.
  • At n=1,000,000: O(n²) is impossible; O(n) finishes instantly.
  • Lesson: The curve shape is the only thing that matters at scale.

Notebook Exercises

  • 1. Identify hidden loops in 5 code snippets and justify their Big-O.
  • 2. Implement the "Accountant" (Design C) manually in your notebook.
  • 3. Benchmark `sorted()` vs a nested loop for n from 10 to 5,000.
  • 4. Design an O(n) solution to check for duplicates in a playlist.

Key Takeaways

  • Scale changes everything: What works for 10 items fails for 10M.
  • Big-O is a decision tool, not just math.
  • Dictionaries and Sets are performance superpowers (they provide O(1) lookups).
  • Identify the single most expensive operation: always optimize the highest growth term first (n² before n).

Thank You!

Next: Mastering Sorting Algorithms