Streaming

Contents

Streaming#

Hadoop Streaming#

hadoop jar /home/hadoop/hadoop-streaming-2.9.2.jar \
  -input /input_dir \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

Streaming Command Options#

hadoop command [genericOptions] [streamingOptions]

Parameter	Optional/Required	Description
-input directoryname or filename	Required	Input location for mapper
-output directoryname	Required	Output location for reducer
-mapper executable or JavaClassName	Optional	Mapper executable. If not specified, IdentityMapper is used as the default
-reducer executable or JavaClassName	Optional	Reducer executable. If not specified, IdentityReducer is used as the default
-file filename	Optional	Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName	Optional	Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName	Optional	Class you supply should take key/value pairs of Text class. If not specified, TextOutputformat is used as the default
-partitioner JavaClassName	Optional	Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName	Optional	Combiner executable for map output
-cmdenv name=value	Optional	Pass environment variable to streaming commands
-inputreader	Optional	For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose	Optional	Verbose output
-lazyOutput	Optional	Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to Context.write
-numReduceTasks	Optional	Specify the number of reducers
-mapdebug	Optional	Script to call when map task fails
-reducedebug	Optional	Script to call when reduce task fails

Specifying a Java Class as the Mapper/Reducer#

hadoop jar /home/hadoop/hadoop-streaming-2.9.2.jar \
  -input /input_dir \
  -output myOutputDir \
  -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
  -reducer /usr/bin/wc

References#

Hadoop Streaming