Hadoop MapReduce Examples

This package contains a subset of the Hadoop 0.20.2 example MapReduce programs adapted to use the new MapReduce API.

  • Authors: The Hadoop team, Herodotos Herodotou
  • Source Code: Download here
  • Dataset: Data generators are included in the source code above
  • Implementation: Java (New Hadoop API)
  • Documentation: See README file in source code for usage instructions. The programs included are:
    1. compress: A map/reduce program that compresses the input.
    2. grep: A map/reduce program that counts the matches of a regex in the input.
    3. randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
    4. randomwriter: A map/reduce program that writes 10GB of random data per node.
    5. secondarysort: An example defining a secondary sort to the reduce.
    6. teragen: Generates the input data for terasort.
    7. terasort: Runs terasort over the generated data.
    8. teravalidate: Checks the results of terasort.
    9. wordcount: A map/reduce program that counts the words in the input files (a sketch of this program appears after this list).
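
To give a feel for the structure of these programs, the sketch below shows a minimal wordcount written against the new Hadoop API (org.apache.hadoop.mapreduce). It illustrates the general pattern only and is not the exact code shipped in the package.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Emits (word, 1) for every token in an input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Sums the counts for each word; also used as the combiner.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // Job.getInstance(conf) in later releases
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }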

PigMix Benchmark

PigMix is a set of queries used to test Pig performance from release to release. There are queries that test latency and queries that test scalability. In addition, it includes a set of Java MapReduce programs that run equivalent MapReduce jobs directly. The original code was taken from https://issues.apache.org/jira/browse/PIG-200 and modified by Herodotos Herodotou to simplify usage and to use Hadoop's new API.

  • Authors: Daniel Dai, Alan Gates, Amir Youseffi, Herodotos Herodotou
  • Source Code: Download here
  • Dataset: The data generator is included in the source code above and can be executed using the pigmix/scripts/generate_data.sh script.
  • Implementation: Pig-Latin, Java (New Hadoop API)
  • Documentation: See enclosed README file for detailed instructions on how to generate the data and execute the benchmark. You will find further details at https://cwiki.apache.org/confluence/display/PIG/PigMix

Hive Performance Benchmark

This benchmark was taken from https://issues.apache.org/jira/browse/HIVE-396 and was slightly modified to make it easier to use. The queries appear in the Pavlo et al. paper A Comparison of Approaches to Large-Scale Data Analysis.

  • Authors: Yuntao Jia, Zheng Shao
  • Source Code: Download here
  • Dataset: The data generator is included in the source code above in the datagen directory
  • Implementation: HiveQL, Pig-Latin, Java (New Hadoop API)
  • Documentation: See enclosed README file for detailed instructions on how to generate the data and execute the benchmark

TPC-H Benchmark

TPC-H is an ad-hoc, decision support benchmark (http://www.tpc.org/tpch/). The attached source code contains (i) scripts for generating TPC-H data in parallel and loading them on HDFS, (ii) the Pig-Latin implementation of all TPC-H queries, and (iii) the HiveQL implementation of all TPC-H queries (see https://issues.apache.org/jira/browse/HIVE-600).

  • Authors: Herodotos Herodotou, Jie Li, Yuntao Jia
  • Source Code: Download here
  • Dataset: The data generator is included in the source code above in the datagen directory
  • Implementation: Pig-Latin, HiveQL
  • Documentation: See enclosed README file for detailed instructions on how to generate the data and execute the benchmark

Term Frequency-Inverse Document Frequency

TF-IDF (Term Frequency-Inverse Document Frequency) is a weight often used in information retrieval and text mining. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. For details, please visit: http://en.wikipedia.org/wiki/Tf%E2%80%93idf.

  • Authors: Herodotos Herodotou (Code adapted from Marcello de Sales)
  • Source Code: Download here
  • Dataset: Any set of text documents
  • Implementation: Java (New Hadoop API)
  • Documentation: See enclosed README file for detailed instructions on how to execute the jobs. TF-IDF is implemented as a pipeline of 3 MapReduce jobs (a sketch of the final computation appears after this list):
    1. Calculates word frequency in documents
    2. Calculates word counts for documents
    3. Counts the documents in the corpus and computes the TF-IDF
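
As a rough guide to what the pipeline produces (and not the exact code in the package), the final weight of a term t in a document d is tf(t, d) * log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number of documents containing t. A minimal sketch of that last computation, with assumed parameter names:

    // Sketch of the TF-IDF arithmetic performed by the final job
    // (illustrative only; names are assumptions, not the actual identifiers).
    public final class TfIdfWeight {

      /**
       * @param termCountInDoc occurrences of the term in the document (job 1)
       * @param termsInDoc     total number of terms in the document (job 2)
       * @param docsInCorpus   total number of documents in the corpus (job 3)
       * @param docsWithTerm   number of documents containing the term (job 3)
       */
      public static double weight(long termCountInDoc, long termsInDoc,
                                  long docsInCorpus, long docsWithTerm) {
        double tf = (double) termCountInDoc / termsInDoc;            // term frequency
        double idf = Math.log((double) docsInCorpus / docsWithTerm); // inverse document frequency
        return tf * idf;
      }

      private TfIdfWeight() {}
    }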

Behavioral Targeting (Model Generation)

This application builds a model that can be used to score ads for each user based on the user's search and click behavior. The model consists of a weight assigned to each (ad, keyword) pair that captures the correlation between the two: a larger weight signifies a greater probability that a user who has searched for that keyword will click on the ad. The job takes as input the path to a folder that contains 3 log files named clickLog, impressionsLog, and searchLog. The impressionsLog records when an ad (identified by the adId) was shown to a user (identified by the userId). The clickLog records when a user clicked on an ad. The searchLog records when a user searched for a particular keyword. The fields are assumed to be comma separated, in the order: timestamp, userId, adId (or keyword in the searchLog).
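
For illustration only, the sketch below shows how a mapper might parse one impressionsLog record following the comma-separated field order described above; the class name and the emitted (userId, adId) pairs are assumptions, not taken from the actual code.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Parses "timestamp,userId,adId" records from the impressionsLog and emits
    // (userId, adId) pairs; a searchLog record would carry a keyword instead of an adId.
    public class ImpressionLogMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      private final Text userId = new Text();
      private final Text adId = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {
          return; // skip malformed records
        }
        userId.set(fields[1].trim());
        adId.set(fields[2].trim());
        context.write(userId, adId);
      }
    }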

  • Authors: Rozemary Scarlat
  • Source Code: Download here
  • Dataset: The data generator is included in the source code above
  • Implementation: Java (New Hadoop API)
  • Documentation: See enclosed BT_description.pdf file for detailed instructions on how to execute the jobs.

Various Pig Workflows

This package includes Pig scripts and dataset generators for various Pig workflows.

  • Authors: Harold Lim
  • Source Code: Download here
  • Dataset: The data generators are included in the source code above
  • Implementation: Pig-Latin
  • Documentation: See enclosed README file for detailed instructions on how to execute the pig scripts. The Pig-Latin scripts included are:
    1. coauthor.pig: Finds the top 20 coauthorship pairs. The input data has "paperid \t authorid" records. It first creates a list of authorids for each paperid, then counts the author pairs, and finally finds the author pairs with the highest counts.
    2. pagerank.pig: Runs one iteration of PageRank. The input datasets are Rankings ("pageid \t rank") and Pages ("pageid \t comma-delimited links").
    3. pavlo.pig: The complex join task from the SIGMOD 2008 Pavlo et al. paper.
    4. tf-idf.pig: Computes TF-IDF. The input data has "doc \t word" records.
    5. tpch-17.pig: Query 17 from the TPC-H benchmark, slightly modified to have a different ordering of columns (this script assumes that the part key is the first column of both tables).
    6. common.pig: A seven-job workflow. The first job scans the data and performs initial processing. Two jobs then read, filter, and find the sum and maximum of the prices for each {order ID, part ID} and {order ID, supplier ID}, respectively. The results of these two jobs are further processed by separate jobs to find the overall sum and maximum prices for each {order ID}. Finally, the results are separately used to find the number of distinct aggregated prices.