Let's Build MapReduce from Scratch
We will see how to build a MapReduce system from scratch and understand how it works along the way.
Last week, I posted about reading the MapReduce paper and covered its core details. I was intrigued enough to try out its ideas by building a mini version.
You can get the complete code here:
https://github.com/venkat1017/mapreduce-go/tree/main/mapreduce-go
Share for Good Karma 😇
You can read the complete paper summary here:
Paper I read this week: MapReduce
🚀🚀 Welcome to our weekly roundup, where I unpack fascinating ideas from the latest research papers I’ve explored!
Here is a quick summary and what I learned from it.
The Big Problem: Google had tons of data and needed to run computations on it – indexing the web, analyzing logs, etc. Doing this on one super-powerful computer wasn't feasible or cost-effective. They had large clusters of regular, commodity machines.
The challenge?
Writing programs that could reliably run across hundreds or thousands of these machines, handle failures gracefully, and manage all the data movement was incredibly complex. Programmers were spending too much time on the infrastructure (distribution, fault tolerance) and not enough on the actual computation logic.
And so MapReduce was born.
The key idea was to abstract away the messy parts of distributed computing. They proposed a simple programming model where the programmer only needs to define two functions: map and reduce.
Map function: It’s like processing individual items in a large collection. It takes an input key-value pair and produces a set of intermediate key-value pairs. For example, if you're counting words in documents, the map function might take a document name (key) and its content (value), and output each word (key) with a count of '1' (value). (document1, "see spot run") -> (see, 1), (spot, 1), (run, 1)
Reduce function: This function takes an intermediate key and all the values associated with that key, and merges those values. Continuing the word count example, the reduce function would get a word (key) and a list of all the '1's generated for that word across all mappers, and it would sum them up to get the final count. (spot, [1, 1, 1]) -> (spot, 3)
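To make these two functions concrete, here is a minimal word-count sketch in Go (the language of the companion repo). The KeyValue type and the Map/Reduce signatures are illustrative assumptions, not the exact API of the paper or the repo.
package wordcount

import (
	"strconv"
	"strings"
)

// KeyValue is an illustrative intermediate key-value pair.
type KeyValue struct {
	Key   string
	Value string
}

// Map takes a document name and its contents and emits (word, "1") pairs.
func Map(document string, contents string) []KeyValue {
	words := strings.Fields(contents)
	pairs := make([]KeyValue, 0, len(words))
	for _, w := range words {
		pairs = append(pairs, KeyValue{Key: w, Value: "1"})
	}
	return pairs
}

// Reduce takes a word and all the "1"s emitted for it and returns the total count.
func Reduce(key string, values []string) string {
	return strconv.Itoa(len(values))
}
For the input (document1, "see spot run"), Map emits (see, "1"), (spot, "1"), (run, "1"); and Reduce("spot", ["1", "1", "1"]) returns "3".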
💪How it Works (The Execution Flow):
This is where the magic happens behind the scenes, handled by the MapReduce library/framework itself:
Input Splitting: The huge input dataset (like many large files) is automatically split into smaller, manageable chunks (e.g., 16-64MB).
Map Phase: Multiple copies of the user's program are started on different machines (workers) in the cluster. One machine acts as the "master," coordinating everything. Each worker is assigned one or more input splits. It reads the split, parses key-value pairs, and runs the user's map function on each pair. The output (intermediate key-value pairs) is temporarily stored on the local disk of the map worker.
Shuffle and Sort (The Crucial Link): This is the heart of the data aggregation.
The locations of the buffered intermediate pairs are passed back to the master.
The master tells the reduce workers where to pull this data from.
Reduce workers remotely read the buffered data from the map workers' local disks.
Crucially, the framework sorts these intermediate pairs by key and groups all values belonging to the same key together. So, all the (spot, 1) pairs from different mappers end up together before being sent to a reducer (see the small grouping sketch after this list).
Reduce Phase: The master assigns unique intermediate keys to specific reduce workers. Each reduce worker iterates through the sorted intermediate data it received. For each unique key, it invokes the user's reduce function with the key and the list of all associated values.
Output: The output of the reduce function is appended to a final output file for that reduce partition, typically stored in a distributed file system.
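Here is a small sketch of that framework-side sort-and-group step in Go; the KeyValue type and the emit callback are illustrative assumptions, not the paper's actual interface.
package shuffle

import "sort"

// KeyValue is the same illustrative pair type used in the word-count sketch.
type KeyValue struct {
	Key   string
	Value string
}

// groupByKey sorts intermediate pairs by key and hands each key, together with
// all of its values, to the emit callback (which would call the user's reduce).
func groupByKey(pairs []KeyValue, emit func(key string, values []string)) {
	sort.Slice(pairs, func(i, j int) bool { return pairs[i].Key < pairs[j].Key })
	for i := 0; i < len(pairs); {
		j := i
		for j < len(pairs) && pairs[j].Key == pairs[i].Key {
			j++
		}
		values := make([]string, 0, j-i)
		for k := i; k < j; k++ {
			values = append(values, pairs[k].Value)
		}
		emit(pairs[i].Key, values)
		i = j
	}
}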
Key Points & Why It Was Revolutionary:
Simplicity: Programmers just write map and reduce, focusing on the logic, not the parallelism or fault tolerance.
Scalability: Need to process more data or do it faster? Just add more machines to the cluster. The framework handles the distribution.
Fault Tolerance: This was huge. If a worker machine crashes mid-task, the master detects it and reschedules the task on another machine. Since map and reduce tasks are generally stateless (or their state is managed), re-execution works well. If the master itself fails, it's more complex, but checkpoints can help. The use of a distributed file system for input/output also inherently handles data redundancy.
Universality: This model turned out to apply to a surprisingly wide range of problems beyond Google's initial use cases.
🚫Limitations:
While groundbreaking, MapReduce isn't perfect, especially compared to newer systems:
High Latency / Batch-Oriented: It's designed for large batch jobs. Reading/writing to disk between the map and reduce stages adds significant overhead and latency, making it unsuitable for real-time or interactive queries.
Inefficient for Iterative Algorithms: Many machine learning algorithms require multiple passes over the data. MapReduce forces disk I/O between each map-reduce step, which is very slow. Systems like Spark improve this drastically with in-memory processing.
Complex Workflows: Chaining multiple MapReduce jobs together can be cumbersome to manage.
Not Great for Streaming: It's fundamentally a batch system, not designed for processing continuous streams of data.
Verbosity: Sometimes requires a lot of boilerplate code even for simple tasks compared to higher-level APIs in newer frameworks.
Single Point of Failure (Master): While master failure can be handled, it's a potential bottleneck and complex failure point in the original design.
💡What Could Be/Has Been Improved?
Many limitations led to the development of newer systems, often building on MapReduce's core ideas:
In-Memory Computation: Systems like Apache Spark keep intermediate data in memory across stages whenever possible, dramatically speeding up iterative algorithms and interactive queries.
More Flexible Execution Models: Apache Tez and Spark use Directed Acyclic Graphs (DAGs) to represent computation, allowing more complex workflows than the rigid map-then-reduce structure, optimizing data flow and reducing disk I/O.
Stream Processing: Frameworks like Apache Flink, Apache Storm, and Spark Streaming are specifically designed for real-time data processing.
Higher-Level APIs: Tools like Hive and Pig provided SQL-like and scripting interfaces on top of MapReduce, making it accessible to non-programmers. Newer systems like Spark offer richer APIs (DataFrames, Datasets) that are often more concise and optimized.
Resource Management: YARN (Yet Another Resource Negotiator) decoupled resource management from the MapReduce processing model, allowing other frameworks (like Spark, Flink) to run on the same Hadoop cluster resources.
In essence, MapReduce was a pivotal paper and system. It democratized large-scale data processing by providing a simple abstraction over complex distributed infrastructure. While many modern workloads use more advanced tools like Spark or Flink that address their limitations (especially around latency and iterative processing), the fundamental concepts of mapping, shuffling/sorting by key, and reducing remain influential in the world of big data. It laid the groundwork for much of what came after.
Let’s build MapReduce from scratch
Goal: Create a framework that lets a user run a simple map and reduce function across many machines on a large dataset, handling distribution and failures automatically.
Assumptions:
We have a cluster of machines connected via a network.
We can run processes on remote machines (like using SSH or a similar mechanism).
There's a way to store large input files and write large output files, potentially accessible by all machines (think a shared network filesystem or a basic distributed file system concept).
One machine will act as the "Master," and the others will be "Workers."
Tutorial: Building MapReduce From Scratch
Phase 1: Setting the Stage - The Master and the User Input
User Invocation:
The user starts the MapReduce job, typically from their machine or the Master node.
They provide:
The location of the input files (e.g., /data/input/dataset.*)
The location for the output files (e.g., /data/output/results)
The code/location of their map function.
The code/location of their reduce function.
The number of map tasks desired (M).
The number of reduce tasks desired (R).
// User initiates the job
job_config = {
    input_path: "/data/input/",
    output_path: "/data/output/",
    map_function: "user_map.py",
    reduce_function: "user_reduce.py",
    num_map_tasks: 100,  // M
    num_reduce_tasks: 20 // R
}
master.SubmitJob(job_config)
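In Go, the same configuration might be expressed as a plain struct; these field names are illustrative assumptions and not the companion repo's actual API.
package mapreduce

// JobConfig mirrors the pseudocode job_config above.
type JobConfig struct {
	InputPath      string // e.g. "/data/input/"
	OutputPath     string // e.g. "/data/output/"
	MapFunction    string // location of the user's map code
	ReduceFunction string // location of the user's reduce code
	NumMapTasks    int    // M
	NumReduceTasks int    // R
}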
Master Initialization:
The Master node receives the job configuration.
Input Splitting: The first crucial step is dividing the input data. The Master reads the input files (or metadata about them) and splits them into M chunks. A simple way is to split by byte offsets in the files (e.g., split a 1GB file into 16 splits of 64MB each). Each split will become one Map task. (A small Go sketch of this appears after the pseudocode below.)
Task Management: The Master creates internal data structures to track the state of M map tasks and R reduce tasks. Initially, all tasks are marked as idle.
Worker Management: The Master needs to know which Worker machines are available. Let's assume it has a list of worker hostnames/IPs. It will also track the status of each worker (idle, busy).
// Master Node Logic
Master {
    job_config
    map_tasks    = [ {id: i, state: 'idle', input_split: ... } for i in 0..M-1 ]
    reduce_tasks = [ {id: j, state: 'idle' } for j in 0..R-1 ]
    available_workers = [ worker1, worker2, ... ]
    task_assignments  = {} // { worker_id -> task_id }

    function SubmitJob(config) {
        this.job_config = config
        splits = SplitInputFiles(config.input_path, config.num_map_tasks)
        // Initialize map_tasks with splits
        // Initialize reduce_tasks
        StartAssigningTasks()
    }

    function StartAssigningTasks() {
        // Loop: Find idle workers and assign idle tasks
    }
}
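The input-splitting step described above can be sketched in Go by cutting each file into roughly equal byte ranges. This is illustrative only: a real system would also align splits to record boundaries, and the type and function names here are assumptions.
package master

import "os"

// InputSplit identifies a byte range of one input file; one split = one map task.
type InputSplit struct {
	Path   string
	Offset int64
	Length int64
}

// splitFile cuts a file into numSplits byte ranges of roughly equal size.
func splitFile(path string, numSplits int) ([]InputSplit, error) {
	if numSplits < 1 {
		numSplits = 1
	}
	info, err := os.Stat(path)
	if err != nil {
		return nil, err
	}
	size := info.Size()
	chunk := (size + int64(numSplits) - 1) / int64(numSplits) // ceiling division

	splits := make([]InputSplit, 0, numSplits)
	for off := int64(0); off < size; off += chunk {
		length := chunk
		if off+length > size {
			length = size - off
		}
		splits = append(splits, InputSplit{Path: path, Offset: off, Length: length})
	}
	return splits, nil
}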
Phase 2: The Map Phase - Processing Input Splits
Assigning Map Tasks:
The Master loops through its list of tasks. It finds an idle map task and an idle worker.
It remotely instructs the chosen Worker to start processing that map task, providing the input split details and the location of the user's map function code.
The Master marks the task as in progress and the worker as busy.
// Master Node Logic (inside StartAssigningTasks loop)
if (idle_map_task exists AND idle_worker exists) {
    task = GetIdleMapTask()
    worker = GetIdleWorker()
    AssignTaskToWorker(worker, task) // Sends RPC/command to worker
    task.state = 'in-progress'
    worker.state = 'busy'
    task_assignments[worker.id] = task.id
}
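As a concrete version of the AssignTaskToWorker call above, here is a minimal sketch using Go's standard net/rpc package. The service name Worker.ExecuteMapTask and the argument fields are assumptions for illustration; a real implementation could just as well use HTTP or gRPC.
package master

import "net/rpc"

// MapTaskArgs describes one map task; field names are illustrative.
type MapTaskArgs struct {
	TaskID     int
	InputSplit string // e.g., file path plus byte range
	NumReduce  int    // R, needed by the worker for partitioning
}

type MapTaskReply struct {
	Accepted bool
}

// assignMapTask pushes a map task to a worker over RPC and reports whether
// the worker accepted it. An error lets the Master mark the worker as failed.
func assignMapTask(workerAddr string, args MapTaskArgs) (bool, error) {
	client, err := rpc.Dial("tcp", workerAddr)
	if err != nil {
		return false, err
	}
	defer client.Close()

	var reply MapTaskReply
	if err := client.Call("Worker.ExecuteMapTask", args, &reply); err != nil {
		return false, err
	}
	return reply.Accepted, nil
}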
Map Worker Execution:
A Worker receives instructions from the Master.
It retrieves the user's map function code (if not already present).
It reads the assigned input split (key-value pairs).
For each key-value pair in its split, it calls the user's map(key, value) function.
Crucial Step: Partitioning: The map function produces intermediate key-value pairs (k', v'). The Worker doesn't just write these randomly. It needs to decide which future Reduce task will handle k'. It does this using a partitioning function, typically a hash: partition = hash(k') % R.
The Worker writes each intermediate (k', v') pair to a local temporary file specific to the calculated partition. So, a single Map worker creates R intermediate files on its local disk (e.g., temp-map-{map_task_id}-{partition_0}, temp-map-{map_task_id}-{partition_1}, ..., temp-map-{map_task_id}-{partition_R-1}).
Once the Map worker finishes processing its split, it informs the Master of completion and the locations of its R intermediate files.
// Worker Node Logic
Worker {
    function ExecuteMapTask(task_info, R) {
        map_code   = GetMapCode(task_info.map_function)
        input_data = ReadInputSplit(task_info.input_split)
        intermediate_files = [ CreateLocalFile(...) for _ in 0..R-1 ]

        for key, value in input_data:
            intermediate_pairs = map_code.run(key, value)
            for k_prime, v_prime in intermediate_pairs:
                partition_index = hash(k_prime) % R
                AppendToFile(intermediate_files[partition_index], (k_prime, v_prime))

        // Notify Master: Task complete, locations of intermediate_files
        master.NotifyMapComplete(task_info.id, GetFileLocations(intermediate_files))
        this.state = 'idle'
    }
}
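The partition = hash(k') % R step can be implemented in Go with the standard library's FNV hash; a small sketch (the specific hash function is an implementation choice, not something the paper mandates):
package worker

import "hash/fnv"

// ihash maps an intermediate key to a reduce partition in [0, nReduce).
func ihash(key string, nReduce int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()&0x7fffffff) % nReduce
}
Every map worker must use the same partition function, so that all pairs for a given key land in the same reduce partition no matter which mapper produced them.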
Phase 3: The Shuffle and Sort - Connecting Map and Reduce
Master Tracking Intermediate Files:
The Master receives completion notifications from Map workers. It marks the map task as completed.
Crucially, it stores the locations of the intermediate files reported by each Map worker, grouping them by partition number. For example, it knows that Reduce task j will eventually need temp-map-{map_task_0}-{partition_j}, temp-map-{map_task_1}-{partition_j}, ..., temp-map-{map_task_M-1}-{partition_j}.
Assigning Reduce Tasks:
Once all Map tasks are complete, the Master can start assigning Reduce tasks.
It finds an idle reduce task (j) and an idle worker.
It instructs the chosen Worker to start processing that reduce task. Importantly, it tells the worker the locations of all intermediate files corresponding to partition j (which are scattered across the disks of the Map workers).
The Master marks the reduce task j as in-progress and the worker as busy.
// Master Node Logic
function NotifyMapComplete(map_task_id, file_locations) {
    map_tasks[map_task_id].state = 'completed'
    StoreIntermediateFileLocations(map_task_id, file_locations)
    // Check if all map tasks are done; if so, start assigning reduce tasks
}

// Inside StartAssigningTasks loop, after all map tasks are 'completed':
if (idle_reduce_task exists AND idle_worker exists) {
    task = GetIdleReduceTask() // e.g., task for partition 'j'
    worker = GetIdleWorker()
    // Get locations of all intermediate files for partition 'j'
    intermediate_locations = GetLocationsForPartition(task.id)
    AssignTaskToWorker(worker, task, intermediate_locations) // Send RPC/command
    task.state = 'in-progress'
    worker.state = 'busy'
}
Phase 4: The Reduce Phase - Aggregating Results
Reduce Worker Execution (Shuffle, Sort, Reduce):
A Worker receives instructions for the Reduce task j.
Shuffle: It uses the provided locations to connect to the Map workers (or their local storage) and fetch all the relevant intermediate files (those belonging to partition j). This involves remote reads across the network.
Sort: As the data arrives (or after it has all arrived), the Reduce worker sorts the fetched key-value pairs by key. This is essential because the reduce function needs to process all values for a single key together. (k1, v1), (k2, v2), (k1, v3) becomes (k1, v1), (k1, v3), (k2, v2).
Reduce: The worker iterates through the sorted data. For each unique key k', it gathers all associated values [v'1, v'2, ...] and calls the user's reduce(k', [v'1, v'2, ...]) function.
The output of the reduce function is the final key-value pair. The worker appends this final pair to its designated final output file (e.g., /data/output/results-j).
Once the Reduce worker finishes processing all its keys, it informs the Master of completion.
// Worker Node Logic
Worker {
    function ExecuteReduceTask(task_info, intermediate_locations) {
        reduce_code = GetReduceCode(task_info.reduce_function)
        output_file = CreateOutputFile(job_config.output_path, task_info.id) // e.g., results-j

        // Shuffle Phase: Fetch data from locations
        all_intermediate_pairs = []
        for location in intermediate_locations:
            pairs = RemoteRead(location)
            all_intermediate_pairs.extend(pairs)

        // Sort Phase: Sort by key
        sorted_pairs = SortByKey(all_intermediate_pairs)

        // Reduce Phase: Group by key and reduce
        current_key = null
        current_values = []
        for k_prime, v_prime in sorted_pairs:
            if (k_prime != current_key and current_key is not null):
                // Process the previous key
                result = reduce_code.run(current_key, current_values)
                AppendToFile(output_file, result)
                current_values = [] // Reset for new key
            current_key = k_prime
            current_values.append(v_prime)

        // Process the last key
        if current_key is not null:
            result = reduce_code.run(current_key, current_values)
            AppendToFile(output_file, result)

        // Notify Master: Task complete
        master.NotifyReduceComplete(task_info.id)
        this.state = 'idle'
    }
}
Phase 5: Completion and Fault Tolerance
Job Completion:
The Master receives completion notifications for all Reduce tasks.
Once all M map tasks and R reduce tasks are completed, the Master marks the entire job as finished and can notify the user. The final output is available in the R output files (/data/output/results-0, ..., /data/output/results-R-1).
Handling Failures (Key Concept!):
Worker Failure: The Master periodically pings Workers (or Workers send heartbeats). If a Worker becomes unresponsive:
Any task (map or reduce) marked in-progress on that worker is reset to idle and becomes eligible for rescheduling on another worker (see the small heartbeat sketch after this list).
If the failed worker completed a map task, its intermediate output is now lost (since it was on local disk). The Master must reschedule that map task to another worker. This is okay because map tasks are typically designed to be idempotent (running them again produces the same result).
Completed reduce tasks usually write to the final distributed storage, so their output might be okay, but re-running the reduce task is often the simpler strategy if the output write wasn't atomic.
Master Failure: This is harder. The original paper suggests the Master periodically checkpoints its state (task statuses, file locations). If the Master fails, a new Master process can start, read the last checkpoint, and resume coordination. Implementing a robust Master failover is complex.
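For the worker-failure handling above, here is a minimal heartbeat-detection sketch in Go. The field names, the timeout handling, and the channel-based task queue are assumptions for illustration, not the paper's or the repo's exact design.
package master

import (
	"sync"
	"time"
)

// Master holds just enough state to detect dead workers.
type Master struct {
	mu            sync.Mutex
	lastHeartbeat map[string]time.Time // workerID -> time of last heartbeat
	inProgress    map[string]int       // workerID -> task it is running
	idleTasks     chan int             // buffered queue of tasks to (re)assign
}

// Heartbeat is invoked by workers (e.g., over RPC) every few seconds.
func (m *Master) Heartbeat(workerID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lastHeartbeat[workerID] = time.Now()
}

// monitorWorkers re-queues the task of any worker that has gone silent.
func (m *Master) monitorWorkers(timeout time.Duration) {
	for range time.Tick(timeout / 2) {
		m.mu.Lock()
		for workerID, last := range m.lastHeartbeat {
			if time.Since(last) > timeout {
				if taskID, ok := m.inProgress[workerID]; ok {
					m.idleTasks <- taskID // reschedule on another worker
					delete(m.inProgress, workerID)
				}
				delete(m.lastHeartbeat, workerID)
			}
		}
		m.mu.Unlock()
	}
}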
Key Concepts Highlighted by Building It:
Abstraction: The user only cares about map and reduce logic.
Data Locality (Attempted): The Master could try to assign Map tasks to workers that already have the data locally (if using a distributed file system that knows data placement), reducing network I/O for reads.
Intermediate Local Writes: Map outputs go to local disk first. This avoids overwhelming the network during the map phase but requires the shuffle phase later.
Partitioning: Distributes the intermediate keys among the Reduce tasks, enabling parallel reduction. The hash(key) % R is the core mechanism.
Shuffle and Sort: The critical, often bottleneck phase where data is transferred, grouped, and ordered for the reducers. Handled entirely by the framework.
Fault Tolerance: The Master's ability to detect worker failures and reschedule tasks is fundamental to making MapReduce work reliably on large, failure-prone clusters. Idempotency of tasks is important here.
Scalability: Add more workers, and the Master can assign more tasks in parallel (up to M and R).
That’s it for the day
While this is the simplest, initial version to learn from, there is more in the GitHub repo, and feel free to contribute new features.
Liked this article? Make sure to ❤️ click the like button.
Feedback or addition? Make sure to 💬 comment.
Know someone who would find this helpful? Make sure to 🔁 share this post.
Your support means a great deal!
Thank you! Let’s learn together!