Hadoop MapReduce Examples
This package contains a subset of the Hadoop 0.20.2 example MapReduce programs adapted to use the new MapReduce API.
- Authors: The Hadoop team, Herodotos Herodotou
- Source Code: Download here
- Dataset: Data generators are included in the source code above
- Implementation: Java (New Hadoop API)
- Documentation: See the README file in the source code for usage instructions.
The programs included are:
- compress: A map/reduce program that compresses the input.
- grep: A map/reduce program that counts the matches of a regex in the input.
- randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
- randomwriter: A map/reduce program that writes 10GB of random data per node.
- secondarysort: An example defining a secondary sort to the reduce.
- teragen: Generates the input data for terasort.
- terasort: Runs terasort.
- teravalidate: Checks the results of terasort.
- wordcount: A map/reduce program that counts the words in the input files.
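For a flavor of the new API that these examples target, here is a minimal WordCount sketch written against org.apache.hadoop.mapreduce. It is illustrative only and not necessarily identical to the wordcount program packaged above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in an input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts for each word; also usable as a combiner.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```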
PigMix Benchmark
PigMix is a set of queries used to test Pig performance from release to release. Some queries test latency and others test scalability. In addition, the package includes a set of Java MapReduce programs that run the equivalent jobs directly. The original code was taken from https://issues.apache.org/jira/browse/PIG-200 and modified by Herodotos Herodotou to simplify usage and to use Hadoop's new API.
- Authors: Daniel Dai, Alan Gates, Amir Youseffi, Herodotos Herodotou
- Source Code: Download here
- Dataset: The data generator is included in the source code above and can be executed using the pigmix/scripts/generate_data.sh script.
- Implementation: Pig-Latin, Java (New Hadoop API)
- Documentation: See enclosed README file for detailed instructions on how to generate the data and execute the benchmark. You will find further details at https://cwiki.apache.org/confluence/display/PIG/PigMix
Hive Performance Benchmark
This benchmark was taken from https://issues.apache.org/jira/browse/HIVE-396 and was slightly modified to make it easier to use. The queries appear in the paper "A Comparison of Approaches to Large-Scale Data Analysis" by Pavlo et al.
- Authors: Yuntao Jia, Zheng Shao
- Source Code: Download here
- Dataset: The data generator is included in the source code above in the datagen directory
- Implementation: HiveQL, Pig-Latin, Java (New Hadoop API)
- Documentation: See enclosed README file for detailed instructions on how to generate the data and execute the benchmark
TPC-H Benchmark
TPC-H is an ad-hoc, decision support benchmark (http://www.tpc.org/tpch/). The attached source code contains (i) scripts for generating TPC-H data in parallel and loading them into HDFS, (ii) the Pig-Latin implementation of all TPC-H queries, and (iii) the HiveQL implementation of all TPC-H queries (see https://issues.apache.org/jira/browse/HIVE-600).
- Authors: Herodotos Herodotou, Jie Li, Yuntao Jia
- Source Code: Download here
- Dataset: The data generator is included in the source code above in the datagen directory
- Implementation: Pig-Latin, HiveQL
- Documentation: See enclosed README file for detailed instructions on how to generate the data and execute the benchmark
Term Frequency-Inverse Document Frequency
The TF-IDF weight (Term Frequency-Inverse Document Frequency) is often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. For details, please visit http://en.wikipedia.org/wiki/Tf%E2%80%93idf.
- Authors: Herodotos Herodotou (Code adapted from Marcello de Sales)
- Source Code: Download here
- Dataset: Any set of text documents
- Implementation: Java (New Hadoop API)
- Documentation: See the enclosed README file for detailed instructions on how to execute the jobs.
TF-IDF is implemented as a pipeline of 3 MapReduce jobs:
- Calculates the frequency of each word in each document
- Calculates the total word count for each document
- Counts the documents in the corpus and computes the TF-IDF weights
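For orientation, the weight computed by the final job follows the standard formula tf-idf = tf x log(N / df). The sketch below is only illustrative; the class and method names are hypothetical and are not those used in the package.

```java
// Illustrative sketch of the TF-IDF weight computation performed by the
// final job; names are hypothetical, not the package's own classes.
public class TfIdf {

  /**
   * @param termFreq     occurrences of the word in the document
   * @param docLength    total number of words in the document
   * @param totalDocs    number of documents in the corpus (N)
   * @param docsWithTerm number of documents containing the word (df)
   * @return the TF-IDF weight of the word for that document
   */
  public static double weight(long termFreq, long docLength,
                              long totalDocs, long docsWithTerm) {
    double tf = (double) termFreq / docLength;                 // normalized term frequency
    double idf = Math.log((double) totalDocs / docsWithTerm);  // inverse document frequency
    return tf * idf;
  }

  public static void main(String[] args) {
    // A word appearing 3 times in a 100-word document, in 10 of 1000 documents:
    System.out.println(weight(3, 100, 1000, 10)); // ~0.138
  }
}
```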
Behavioral Targeting (Model Generation)
This application builds a model that can be used to score ads for each user based on the user's search and click behavior. The model consists of a weight assigned to each (ad, keyword) pair that captures the correlation between the two: a larger weight signifies a greater probability that a user who has searched for that keyword will click on the ad. The job takes as input the path to a folder that contains 3 log files named clickLog, impressionsLog, and searchLog. The impressionsLog records when an ad (identified by an adId) was shown to a user (identified by a userId). The clickLog records when a user clicked on an ad. The searchLog records when a user searched for a particular keyword. The log entries are comma-separated, and the order of the fields is: timestamp, userId, adId (or keyword in the searchLog).
- Authors: Rozemary Scarlat
- Source Code: Download here
- Dataset: The data generator is included in the source code above
- Implementation: Java (New Hadoop API)
- Documentation: See enclosed BT_description.pdf file for detailed instructions on how to execute the jobs.
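The record layout described above suggests a mapper along the following lines for the click log. This is a hypothetical sketch (the class name and key/value choices are illustrative), not one of the classes shipped with the job; see BT_description.pdf for the actual implementation.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: parses comma-separated click-log records of the form
// "timestamp,userId,adId" and emits (userId, adId) pairs.
public class ClickLogMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Text userId = new Text();
  private final Text adId = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length < 3) {
      return; // skip malformed records
    }
    userId.set(fields[1].trim());
    adId.set(fields[2].trim());
    context.write(userId, adId);
  }
}
```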
Various Pig Workflows
This package includes Pig scripts and dataset generators for various Pig workflows.
- Authors: Harold Lim
- Source Code: Download here
- Dataset: The data generators are included in the source code above
- Implementation: Pig-Latin
- Documentation: See the enclosed README file for detailed instructions on how to execute the Pig scripts.
The Pig-Latin scripts included are:
- coauthor.pig: Finds the top 20 coauthorship pairs. The input data has "paperid \t authorid" records. The script first creates a list of authorids for each paperid, then counts the occurrences of each author pair, and finally finds the author pairs with the highest counts (a rough Java analogue of the pairing step appears after this list).
- pagerank.pig: Runs one iteration of PageRank. The input datasets are Rankings ("pageid \t rank") and Pages ("pageid \t comma-delimited links").
- pavlo.pig: The complex join task from the SIGMOD 2009 paper by Pavlo et al.
- tf-idf.pig: Computes TF-IDF. The input data has "doc \t word" records.
- tpch-17.pig: Query 17 from the TPC-H benchmark, slightly modified to have a different ordering of columns (this script assumes that the part key is the first column of both tables).
- common.pig: A seven-job workflow. The first job scans and performs an initial processing of the data. Two jobs then read and filter the data and compute the sum and the maximum of the prices for each {order ID, part ID} pair and each {order ID, supplier ID} pair, respectively. The results of these two jobs are further processed by separate jobs that compute the overall sum and maximum prices for each order ID. Finally, the results are separately used to find the number of distinct aggregated prices.
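As a rough illustration of the grouping-and-pairing step described for coauthor.pig, here is a hypothetical Java reducer analogue (the package ships the Pig-Latin script, not this code). It assumes a preceding map phase that emitted (paperid, authorid) pairs from the "paperid \t authorid" records; a follow-up job would sum the counts per pair and keep the 20 largest.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative Java analogue of one step of coauthor.pig: for every paperid,
// emit each unordered pair of its authorids with a count of 1.
public class AuthorPairReducer extends Reducer<Text, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void reduce(Text paperId, Iterable<Text> authorIds, Context context)
      throws IOException, InterruptedException {
    List<String> authors = new ArrayList<String>();
    for (Text author : authorIds) {
      authors.add(author.toString());
    }
    Collections.sort(authors); // canonical order so (a,b) and (b,a) collapse
    for (int i = 0; i < authors.size(); i++) {
      for (int j = i + 1; j < authors.size(); j++) {
        context.write(new Text(authors.get(i) + "," + authors.get(j)), ONE);
      }
    }
  }
}
```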