University of Geneva · CUI BSc · Algorithmics & Data Structures
Analysis & Complexity
Seminar 5 · Scaling Beyond Small Data
1 / 21
Today's Roadmap
1
The Scaling Wall
Why code that works for 10 items fails for 10 million
2
Big-O Notation
Learning to ignore the "noise" and see the "shape"
3
Machine Independence
Measuring steps instead of milliseconds
4
Case Study
Designing Spotify-scale search algorithms
Spotify
How can any query finish in under 1 second when n = 640 million?
Sources: Spotify SEC Q3 2024
2 / 21
Why Analysis Matters
- Real systems break when algorithms scale poorly.
- At Spotify scale (n = 640M), even a "fast" O(n) scan is too slow.
- If we checked every user for 0.0001s, a single query would take 17 hours.
- We need algorithms that stay fast even as the world's data explodes.
3 / 21
Two Algorithms. Same Result. Same Speed?
Scenario: When you like a song, Spotify must instantly check: is it already there?
liked_songs = ["Shape of You", "Blinding Lights", ...]
def is_liked_A(song):
for s in liked_songs:
if s == song:
return True
return False
liked_songs = {"Shape of You", "Blinding Lights", ...}
def is_liked_B(song):
return song in liked_songs
In Python, both return the same answer. But the strategy is different.
4 / 21
Results: The Gap is Massive
Search time for a song in a list vs a set as n grows.
The Membership Test
# Alg A: List -> Scans items 1-by-1
"SongX" in songs_list
# Alg B: Set -> Goes directly to item
"SongX" in songs_set
| n (Songs) | List Time | Set Time |
| 1,000 | 0.03 ms | 0.0002 ms |
| 10,000 | 0.30 ms | 0.0002 ms |
| 100,000 | 3.00 ms | 0.0002 ms |
| 1,000,000 | 30.00 ms | 0.0002 ms |
List scales with n
10x more data = 10x more time. This is O(n).
Set is Constant
Time stays flat regardless of size. This is O(1).
Relative Scale: If the Set takes 1 second, the List takes 4 hours at n=1M.
5 / 21
The Problem with "Time"
Why seconds are not a reliable metric for theory.
import time
start = time.perf_counter()
result = my_function(data)
elapsed = time.perf_counter() - start
# Problems:
# 1. Hardware: My Mac is faster than your laptop.
# 2. Noise: Is your browser open? Is your battery low?
# 3. Language: Python vs. C++ overhead.
6 / 21
The Solution: Counting Steps
- Instead of seconds, we count Basic Operations.
- 1 Step: Variable assignment (`x = 10`)
- 1 Step: Comparing two values (`a < b`)
- 1 Step: Accessing a list index (`my_list[0]`)
- This gives us a formula T(n) that works regardless of the computer.
- *Big-O* is a measure of the worst-case scenario explaing how fast something can grow if the input size grows.
7 / 21
The Math: Simplifying the Story
- The Shape: g(n)
- We look for the "blueprint" of growth.
- Nested loops = n²
- Single loops = n
- The Scale: c
- We use a multiplier to "stretch" the blueprint.
- We don't care if it is 3n or 100n.
- They are both "Linear" growth.
8 / 21
Big-O: The Safety Ceiling
The function f(n) is "trapped" under the ceiling of c * g(n).
9 / 21
Big-O: Physical Analogies
- O(1): A single, direct operation i.e. a set lookup (Instant) as we know the exact address of the item.
- O(log n): Tearing a phonebook in half repeatedly until you find the right page. Doubling the book's size only adds one extra tear.
- O(n log n): You have n loose pages of a phonebook dropped on the floor in a random pile. To put them in order, you divide the pile into two, then those into two (that's the log n part), and at every level of splitting, you have to look at every page to decide which pile it belongs in (that's the n part).
- O(n): Reading every name in the phonebook until you find the right one.
- O(n²): For every person in the book, you have to read the whole book again.
10 / 21
The Growth Classes
| Notation | Name | Speed | Example |
| O(1) | Constant | Instant | Set lookup (already seen) |
| O(log n) | Logarithmic | Excellent | Binary Searc (week 7) |
| O(n) | Linear | Good | List search (already seen) |
| O(n log n) | Linearithmic | Decent | Sorting / Devine and Conquer (week 9) |
| O(n²) | Quadratic | Dangerous | Nested Loops (already seen) |
11 / 21
The "Hidden Loop" Trap
Python makes some O(n) operations look like O(1).
my_list = [10, 20, 30, 40]
if 50 in my_list: # This is a hidden Loop! O(n)
print("Found")
max(my_list) # This is a hidden Loop! O(n)
my_list.count(10) # This is a hidden Loop! O(n)
12 / 21
Case Study: Playlist Equivalence
- Problem: Given two playlists, are they identical?
- Must have the same songs and same counts (multiplicity).
- Order does not matter.
- Goal: Find the most efficient way to verify this at scale.
13 / 21
Design A: Nested Scan -> O(n²)
Pattern: Outer loop * Inner loop.
def equiv_A(p1, p2):
for song in p1: # Loop 1
if song not in p2: # Hidden Loop 2!
return False
return True
This works for 10 songs, but crashes for 10,000.
14 / 21
Design B: Sort & Compare -> O(n log n)
If we organize the data first, comparison becomes easy.
def equiv_B(p1, p2):
# Sorting: O(n log n)
# Comparison: O(n)
return sorted(p1) == sorted(p2)
Standard approach for many real-world problems.
15 / 21
Design C: Dictionary Count -> O(n)
Use a Hash Map to count occurrences in one pass.
def equiv_C(p1, p2):
counts = {}
for s in p1:
counts[s] = counts.get(s, 0) + 1
# ... check p2 against counts ...
return True
Linear speed: This is how Spotify handles millions of tracks.
16 / 21
Design D: The Pythonic Way -> O(n)
Python provides optimized objects for these common tasks.
from collections import Counter
def equiv_D(p1, p2):
return Counter(p1) == Counter(p2)
Under the hood, this uses the dictionary logic from Design C.
17 / 21
The Results: When N grows...
- At n=50: All strategies feel the same.
- At n=200: Quadratic O(n²) begins to lag.
- At n=1,000,000: O(n²) is impossible; O(n) finishes instantly.
- Lesson: The curve shape is the only thing that matters at scale.
18 / 21
Notebook Exercises
- 1. Identify hidden loops in 5 code snippets and justify their Big-O.
- 2. Implement the "Accountant" (Design C) manually in your notebook.
- 3. Benchmark `sorted()` vs a nested loop for n from 10 to 5,000.
- 4. Design an O(n) solution to check for duplicates in a playlist.
19 / 21
Key Takeaways
- Scale changes everything: What works for 10 items fails for 10M.
- Big-O is a decision tool, not just math.
- Dictionaries and Sets are performance superpowers (they provide O(1) lookups).
- Identify the "single most expensive operation": Always optimize for the highest growth term (n² before n).
20 / 21
University of Geneva · CUI BSc · Algorithmics & Data Structures
Thank You!
Next: Mastering Sorting Algorithms
21 / 21