Homework #8: MapReduce

DUE: Monday 3/03 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci290 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

Start up your VM. Install the mrjob Python package if you have not already, then issue the following command to create a working directory for the assignment:

cp -pr /opt/datacourse/assignments/hw08-template/ ~/hw08/

Execute a MapReduce job locally

To get you started, we have provided a sample file tweets_sm.txt containing a fairly small number of tweets, along with a simple program hashtag_count.py that counts the total number of tweets and the total number of hashtag uses.

You can execute the MapReduce program locally on your VM, without using a cluster, as follows:

python hashtag_count.py tweets_sm.txt

As a rule of thumb, always test and debug your MapReduce program locally on smaller datasets before attempting it on a big cluster on Amazon, since cluster time will cost you money!
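We have not reproduced hashtag_count.py here, but the core counting logic it performs can be sketched with the standard library alone. Note that the regex, function name, and return format below are illustrative assumptions, not the actual code in the template, which wraps logic like this in mrjob mapper/reducer steps:

```python
import re

# A hashtag is "#" followed by word characters; real tweet parsing
# may be more nuanced, so treat this pattern as an assumption.
HASHTAG = re.compile(r"#\w+")

def count_tweets_and_hashtags(lines):
    """Return (total tweets, total hashtag uses) for an iterable of tweet texts."""
    tweets = 0
    hashtag_uses = 0
    for line in lines:
        tweets += 1
        hashtag_uses += len(HASHTAG.findall(line))
    return tweets, hashtag_uses
```

In the MapReduce version, the per-line work (the body of the loop) becomes the mapper, and the summation becomes the reducer.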

Signing up for Amazon AWS and setting up mrjob/EMR

Executing your MapReduce program on Amazon EMR

Now, you are ready to run a program on Amazon EMR (Elastic MapReduce using Hadoop)! Just type the following command:

python hashtag_count.py -c mrjob.conf -r emr tweets_sm.txt
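The -c flag points mrjob at a configuration file. If you do not already have an mrjob.conf from the setup step, a minimal one looks roughly like the sketch below. Option names vary between mrjob versions, and all values shown are placeholder assumptions; fill in your own AWS credentials and consult the mrjob documentation for the exact keys your version expects:

```yaml
runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY_ID        # assumption: replace with your key
    aws_secret_access_key: YOUR_SECRET_KEY       # assumption: replace with your secret
    aws_region: us-east-1                        # assumption: pick your region
    num_ec2_instances: 4                         # assumption: small cluster size
    ec2_instance_type: m1.small                  # assumption: cheap instance type
```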


1. Finding the top 50 Twitter hashtags

In this exercise, you will write a MapReduce program to find the 50 most popular hashtags in a file containing approximately 3.5 million tweets (a couple of hours in the twitterverse).

We strongly recommend that you debug first locally before running on the full file in Amazon. You can make a copy of the example code:

cp hashtag_count.py topk.py

and edit topk.py to suit your needs. When running on Amazon, you can give the S3 file URLs to your python program directly, e.g.:

python topk.py -c mrjob.conf -r emr s3://cs290-spring2014/twitter/tweets.txt

Hint 1: Your approach should work on much bigger datasets than what we are using here. Keep in mind the tips from class about potential problems when computing on data in parallel.

Hint 2: You can create standard Python data structures within your MRJob methods or store them on the class instance. Keep in mind that these are held in memory by each mapper or reducer, so any such structure should be kept small.
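One way to act on both hints is to aggregate the full per-hashtag counts first (which parallelizes safely), and only then keep a bounded top-k structure in memory. The standard-library sketch below illustrates that logic; the function name, the lowercasing choice, and the use of heapq are our assumptions, not requirements of the assignment:

```python
import heapq
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def top_k_hashtags(lines, k=50):
    # Phase 1 (mapper + reducer territory): count every hashtag.
    # In a real MRJob this Counter's work is spread across reducers,
    # one key (hashtag) at a time, so no single process holds it all.
    counts = Counter()
    for line in lines:
        counts.update(tag.lower() for tag in HASHTAG.findall(line))
    # Phase 2: keep only the k largest entries. A k-sized heap is a
    # bounded in-memory structure, safe even for huge inputs.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])
```

In an MRJob program this typically becomes a two-step job: the first step emits (hashtag, count) pairs, and a second step with a single reducer selects the top 50.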

WHAT TO SUBMIT: Submit a plain-text file named topk.txt with your results. You need to submit the results for the big file (the small file is there in case you need to debug your program). Also submit your code (topk.py), with comments explaining how it works.