Package org.apache.hadoop.tools.rumen
Rumen is a data extraction and analysis tool built for Apache Hadoop. Rumen
mines job history logs to extract meaningful data and stores it in an
easily-parsed format. The default output format of Rumen is JSON; Rumen uses
the Jackson library to create JSON objects.
The following classes can be used to programmatically invoke Rumen:
- JobConfigurationParser

  A parser to parse and filter out interesting properties from the job
  configuration. Some of the commonly used interesting properties are
  enumerated in JobConfPropertyNames.

  Sample code:

      // An example to parse and filter out the job name
      String conf_filename = ..; // assume the job configuration filename here

      // construct a list of interesting properties
      List<String> interestedProperties = new ArrayList<String>();
      interestedProperties.add("mapreduce.job.name");

      JobConfigurationParser jcp =
          new JobConfigurationParser(interestedProperties);

      InputStream in = new FileInputStream(conf_filename);
      Properties parsedProperties = jcp.parse(in);

  Note: A single instance of JobConfigurationParser can be used to parse
  multiple job configuration files.
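Conceptually, the parser reduces a large job configuration to just the requested keys. Below is a minimal, self-contained sketch of that filtering step using only java.util.Properties (no Hadoop dependencies; the class name FilterProps and the property values are illustrative, not part of Rumen):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class FilterProps {
    // Keep only the properties whose names appear in the interesting list,
    // mimicking what JobConfigurationParser does for a job configuration file.
    static Properties filter(Properties all, List<String> interesting) {
        Properties kept = new Properties();
        for (String key : interesting) {
            String value = all.getProperty(key);
            if (value != null) {
                kept.setProperty(key, value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("mapreduce.job.name", "word-count");   // requested
        conf.setProperty("mapreduce.job.queuename", "default"); // not requested
        List<String> interesting = new ArrayList<String>();
        interesting.add("mapreduce.job.name");

        Properties parsed = filter(conf, interesting);
        System.out.println(parsed.getProperty("mapreduce.job.name")); // word-count
        System.out.println(parsed.size()); // 1
    }
}
```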
- JobHistoryParser

  A parser that parses job history files. It is an interface; the actual
  implementations are defined as an Enum in JobHistoryParserFactory. Note
  that RewindableInputStream is a wrapper class around InputStream that
  makes the input stream rewindable.

  Sample code:

      // An example to parse a current job history file, i.e. a job history
      // file for which the version is known
      String filename = ..; // assume the job history filename here
      InputStream in = new FileInputStream(filename);
      HistoryEvent event = null;
      JobHistoryParser parser = new CurrentJHParser(in);
      event = parser.nextEvent();
      // process all the events
      while (event != null) {
        // ... process the event
        event = parser.nextEvent();
      }
      // close the parser and the underlying stream
      parser.close();

  JobHistoryParserFactory provides a
  JobHistoryParserFactory.getParser(org.apache.hadoop.tools.rumen.RewindableInputStream)
  API to get a parser for parsing the job history file. Note that this API
  can be used if the job history version is unknown.

  Sample code:

      // An example to parse a job history file for which the version is
      // not known, i.e. using JobHistoryParserFactory.getParser()
      String filename = ..; // assume the job history filename here
      InputStream in = new FileInputStream(filename);
      RewindableInputStream ris = new RewindableInputStream(in);

      // JobHistoryParserFactory will check and return a parser that can
      // parse the file
      JobHistoryParser parser = JobHistoryParserFactory.getParser(ris);

      // now use the parser to parse the events
      HistoryEvent event = parser.nextEvent();
      while (event != null) {
        // ... process the event
        event = parser.nextEvent();
      }
      parser.close();

  Note: Create one instance to parse a job history log and close it after
  use.
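The "rewindable" behaviour that version detection relies on can be pictured with the standard mark/reset support on java.io.BufferedInputStream: read a small prefix to sniff the format, then rewind so the chosen parser sees the stream from the beginning. A self-contained sketch of the idea (this is not Rumen's actual RewindableInputStream implementation; RewindSketch and peek are illustrative names):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class RewindSketch {
    // Peek at the first n bytes of the stream, then rewind so a parser
    // can re-read them from the start; conceptually what version
    // detection on a job history stream needs.
    static String peek(BufferedInputStream in, int n) throws IOException {
        in.mark(n);                      // remember the current position
        byte[] head = new byte[n];
        int read = in.read(head, 0, n);  // consume a prefix to inspect it
        in.reset();                      // rewind to the marked position
        return new String(head, 0, Math.max(read, 0), "US-ASCII");
    }

    public static void main(String[] args) throws IOException {
        BufferedInputStream in = new BufferedInputStream(
            new ByteArrayInputStream("Meta VERSION=\"1\" .".getBytes("US-ASCII")));
        System.out.println(peek(in, 4));      // Meta  (sniff the header)
        System.out.println((char) in.read()); // M     (stream was rewound)
    }
}
```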
- TopologyBuilder

  Builds the cluster topology based on the job history events. Every job
  history file consists of events, each of which can be represented using
  HistoryEvent. These events can be passed to TopologyBuilder using
  TopologyBuilder.process(org.apache.hadoop.mapreduce.jobhistory.HistoryEvent).
  A cluster topology can be represented using LoggedNetworkTopology. Once
  all the job history events are processed, the cluster topology can be
  obtained using TopologyBuilder.build().

  Sample code:

      // Building topology for a job history file represented using
      // 'filename' and the corresponding configuration file represented
      // using 'conf_filename'
      String filename = ..;      // assume the job history filename here
      String conf_filename = ..; // assume the job configuration filename here

      InputStream jobConfInputStream = new FileInputStream(conf_filename);
      InputStream jobHistoryInputStream = new FileInputStream(filename);

      TopologyBuilder tb = new TopologyBuilder();

      // construct a list of interesting properties
      List<String> interestingProperties = new ArrayList<String>();
      // add the interesting properties here
      interestingProperties.add("mapreduce.job.name");

      JobConfigurationParser jcp =
          new JobConfigurationParser(interestingProperties);

      // parse the configuration file
      tb.process(jcp.parse(jobConfInputStream));

      // read the job history file and pass it to the TopologyBuilder
      JobHistoryParser parser = new CurrentJHParser(jobHistoryInputStream);
      HistoryEvent e;

      // read and process all the job history events
      while ((e = parser.nextEvent()) != null) {
        tb.process(e);
      }

      LoggedNetworkTopology topology = tb.build();
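A two-level cluster topology (root, racks, hosts) boils down to grouping host names under their rack as events arrive. A dependency-free sketch of that grouping step (TopologySketch is an illustrative name, not a Rumen class; the rack/host strings are made up):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopologySketch {
    // Group host names under their rack: the shape a two-level cluster
    // topology (root -> racks -> hosts) reduces to.
    static Map<String, List<String>> build(String[][] rackHostPairs) {
        Map<String, List<String>> racks = new LinkedHashMap<String, List<String>>();
        for (String[] pair : rackHostPairs) {
            String rack = pair[0], host = pair[1];
            List<String> hosts = racks.get(rack);
            if (hosts == null) {
                hosts = new ArrayList<String>();
                racks.put(rack, hosts);
            }
            if (!hosts.contains(host)) {
                hosts.add(host); // a host appears once per rack
            }
        }
        return racks;
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"/rack1", "node-a"}, {"/rack1", "node-b"},
            {"/rack2", "node-c"}, {"/rack1", "node-a"} // duplicate sighting
        };
        Map<String, List<String>> topology = build(pairs);
        System.out.println(topology.get("/rack1")); // [node-a, node-b]
        System.out.println(topology.get("/rack2")); // [node-c]
    }
}
```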
- JobBuilder

  Summarizes a job history file. JobHistoryUtils provides the
  JobHistoryUtils.extractJobID(String) API for extracting the job id from
  job history or job configuration files, which can be used for
  instantiating JobBuilder. JobBuilder generates a LoggedJob object via
  JobBuilder.build(). See LoggedJob for more details.

  Sample code:

      // An example to summarize a current job history file 'filename'
      // and the corresponding configuration file 'conf_filename'
      String filename = ..;      // assume the job history filename here
      String conf_filename = ..; // assume the job configuration filename here

      InputStream jobConfInputStream = new FileInputStream(conf_filename);
      InputStream jobHistoryInputStream = new FileInputStream(filename);

      String jobID = TraceBuilder.extractJobID(filename);
      JobBuilder jb = new JobBuilder(jobID);

      // construct a list of interesting properties
      List<String> interestingProperties = new ArrayList<String>();
      // add the interesting properties here
      interestingProperties.add("mapreduce.job.name");

      JobConfigurationParser jcp =
          new JobConfigurationParser(interestingProperties);

      // parse the configuration file
      jb.process(jcp.parse(jobConfInputStream));

      // parse the job history file
      JobHistoryParser parser = new CurrentJHParser(jobHistoryInputStream);
      try {
        HistoryEvent e;
        // read and process all the job history events
        while ((e = parser.nextEvent()) != null) {
          jb.process(e);
        }
      } finally {
        parser.close();
      }

      LoggedJob job = jb.build();

  Note: The order of parsing the job configuration file and the job history
  file is not important. Create one instance to parse the history file and
  job configuration.
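The process-then-build pattern JobBuilder follows (feed events one at a time, then ask for the summary) can be sketched with a tiny event counter. The Event enum and SummaryBuilder class below are hypothetical stand-ins for HistoryEvent and JobBuilder, not Hadoop APIs:

```java
import java.util.ArrayList;
import java.util.List;

public class BuilderSketch {
    // Hypothetical stand-in for HistoryEvent: just an event-type label.
    enum Event { MAP_FINISHED, REDUCE_FINISHED, JOB_FINISHED }

    // Accumulates events, then produces a summary on build(), mirroring
    // the JobBuilder.process(...) / JobBuilder.build() call shape.
    static class SummaryBuilder {
        private int maps, reduces;
        private boolean finished;

        void process(Event e) {
            switch (e) {
                case MAP_FINISHED:    maps++;    break;
                case REDUCE_FINISHED: reduces++; break;
                case JOB_FINISHED:    finished = true; break;
            }
        }

        String build() {
            return "maps=" + maps + " reduces=" + reduces + " finished=" + finished;
        }
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<Event>();
        events.add(Event.MAP_FINISHED);
        events.add(Event.MAP_FINISHED);
        events.add(Event.REDUCE_FINISHED);
        events.add(Event.JOB_FINISHED);

        SummaryBuilder sb = new SummaryBuilder();
        for (Event e : events) {
            sb.process(e); // same call shape as JobBuilder.process(event)
        }
        System.out.println(sb.build()); // maps=2 reduces=1 finished=true
    }
}
```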
- DefaultOutputter

  Implements Outputter and writes JSON objects in text format to the output
  file. DefaultOutputter can be initialized with the output filename.

  Sample code:

      // An example to summarize a current job history file represented by
      // 'filename' and the configuration file represented using
      // 'conf_filename'. Also output the job summary to 'out.json' along
      // with the cluster topology to 'topology.json'.
      String filename = ..;      // assume the job history filename here
      String conf_filename = ..; // assume the job configuration filename here

      Configuration conf = new Configuration();
      DefaultOutputter outputter = new DefaultOutputter();
      outputter.init("out.json", conf);

      InputStream jobConfInputStream = new FileInputStream(conf_filename);
      InputStream jobHistoryInputStream = new FileInputStream(filename);

      // extract the job-id from the filename
      String jobID = TraceBuilder.extractJobID(filename);
      JobBuilder jb = new JobBuilder(jobID);
      TopologyBuilder tb = new TopologyBuilder();

      // construct a list of interesting properties
      List<String> interestingProperties = new ArrayList<String>();
      // add the interesting properties here
      interestingProperties.add("mapreduce.job.name");

      JobConfigurationParser jcp =
          new JobConfigurationParser(interestingProperties);

      // parse the configuration file
      tb.process(jcp.parse(jobConfInputStream));

      // read the job history file and pass the events to the
      // JobBuilder and the TopologyBuilder
      JobHistoryParser parser = new CurrentJHParser(jobHistoryInputStream);
      HistoryEvent e;
      while ((e = parser.nextEvent()) != null) {
        jb.process(e);
        tb.process(e);
      }

      LoggedJob j = jb.build();

      // serialize the job summary in json (text) format
      outputter.output(j);
      // close
      outputter.close();

      outputter.init("topology.json", conf);

      // get the cluster topology using TopologyBuilder
      LoggedNetworkTopology topology = tb.build();

      // serialize the cluster topology in json (text) format
      outputter.output(topology);
      // close
      outputter.close();
- JobTraceReader

  A reader for reading LoggedJobs serialized using DefaultOutputter.
  LoggedJob provides various APIs for extracting job details. The most
  commonly used ones are:

      LoggedJob.getMapTasks()    : Get the map tasks
      LoggedJob.getReduceTasks() : Get the reduce tasks
      LoggedJob.getOtherTasks()  : Get the setup/cleanup tasks
      LoggedJob.getOutcome()     : Get the job's outcome
      LoggedJob.getSubmitTime()  : Get the job's submit time
      LoggedJob.getFinishTime()  : Get the job's finish time

  Sample code:

      // An example to read job summaries from a trace file 'out.json'
      JobTraceReader reader = new JobTraceReader("out.json");
      LoggedJob job = reader.getNext();
      while (job != null) {
        // ... process job level information
        for (LoggedTask task : job.getMapTasks()) {
          // process all the map tasks in the job
          for (LoggedTaskAttempt attempt : task.getAttempts()) {
            // process all the map task attempts in the job
          }
        }
        // get the next job
        job = reader.getNext();
      }
      reader.close();
- ClusterTopologyReader

  A reader to read a LoggedNetworkTopology serialized using
  DefaultOutputter. ClusterTopologyReader can be initialized using the
  serialized topology filename. ClusterTopologyReader.get() can be used to
  get the LoggedNetworkTopology.

  Sample code:

      // An example to read the cluster topology from a topology output
      // file 'topology.json'
      ClusterTopologyReader reader = new ClusterTopologyReader("topology.json");
      LoggedNetworkTopology topology = reader.get();
      for (LoggedNetworkTopology t : topology.getChildren()) {
        // process the cluster topology
      }
      reader.close();
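A LoggedNetworkTopology is a tree, so the getChildren() loop above generalizes to a recursive walk at any depth. A self-contained sketch of such a walk, using a tiny placeholder Node type rather than Rumen's LoggedNetworkTopology (TopologyWalk and its Node class are illustrative names):

```java
import java.util.ArrayList;
import java.util.List;

public class TopologyWalk {
    // Minimal stand-in for a topology node: a name plus children.
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<Node>();
        Node(String name) { this.name = name; }
        Node add(Node child) { children.add(child); return this; }
    }

    // Collect leaf names (hosts) by walking the tree depth-first, the
    // same shape as recursing over getChildren() on a topology object.
    static void leaves(Node n, List<String> out) {
        if (n.children.isEmpty()) {
            out.add(n.name);
            return;
        }
        for (Node c : n.children) {
            leaves(c, out);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("<root>")
            .add(new Node("/rack1").add(new Node("node-a")).add(new Node("node-b")))
            .add(new Node("/rack2").add(new Node("node-c")));
        List<String> hosts = new ArrayList<String>();
        leaves(root, hosts);
        System.out.println(hosts); // [node-a, node-b, node-c]
    }
}
```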
Class Summary

  AbstractClusterStory - AbstractClusterStory provides a partial implementation of ClusterStory by parsing the topology tree.
  Anonymizer
  CDFPiecewiseLinearRandomGenerator
  CDFRandomGenerator - An instance of this class generates random values that conform to the embedded LoggedDiscreteCDF.
  ClusterStory - ClusterStory represents all configurations of a MapReduce cluster, including nodes, network topology, and slot configurations.
  ClusterTopologyReader - Reads the JSON-encoded cluster topology and produces the parsed LoggedNetworkTopology object.
  CurrentJHParser - JobHistoryParser that parses JobHistory files.
  DeepCompare - Classes that implement this interface can deep-compare (for equality only, not order) with another instance.
  DeepInequalityException - We use this exception class in the unit test, and we do a deep comparison when we run the test.
  DefaultInputDemuxer - DefaultInputDemuxer acts as a pass-through demuxer.
  DefaultOutputter<T> - The default Outputter that outputs to a plain file.
  DeskewedJobTraceReader
  Folder
  Hadoop20JHParser - JobHistoryParser to parse job histories for hadoop 0.20 (META=1).
  HadoopLogsAnalyzer - Deprecated.
  InputDemuxer - InputDemuxer demultiplexes the input files into individual input streams.
  Job20LineHistoryEventEmitter
  JobBuilder - JobBuilder builds one job.
  JobConfigurationParser - JobConfigurationParser parses the job configuration xml file and extracts configuration properties.
  JobHistoryParser - JobHistoryParser defines the interface of a Job History file parser.
  JobHistoryParserFactory - JobHistoryParserFactory is a singleton class that attempts to determine the version of job history and return a proper parser.
  JobHistoryUtils - Job History related utils for handling multiple formats of history logs of different hadoop versions, like Pre21 history logs and current history logs.
  JobStory - JobStory represents the runtime information available for a completed Map-Reduce job.
  JobStoryProducer - JobStoryProducer produces the sequence of JobStorys.
  JobTraceReader - Reads JSON-encoded job traces and produces LoggedJob instances.
  JsonObjectMapperWriter<T> - Simple wrapper around JsonGenerator to write objects in JSON format.
  LoggedDiscreteCDF - A LoggedDiscreteCDF is a discrete approximation of a cumulative distribution function, with this class set up to meet the requirements of the Jackson JSON parser/generator.
  LoggedJob - A LoggedJob is a representation of a hadoop job, with the details of this class set up to meet the requirements of the Jackson JSON parser/generator.
  LoggedLocation - A LoggedLocation is a representation of a point in a hierarchical network, represented as a series of membership names, broadest first.
  LoggedNetworkTopology - A LoggedNetworkTopology represents a tree that in turn represents a hierarchy of hosts.
  LoggedSingleRelativeRanking - A LoggedSingleRelativeRanking represents an X-Y coordinate of a single point in a discrete CDF.
  LoggedTask - A LoggedTask represents a [hadoop] task that is part of a hadoop job.
  LoggedTaskAttempt - A LoggedTaskAttempt represents an attempt to run a hadoop task in a hadoop job.
  MachineNode - MachineNode represents the configuration of a cluster node.
  MachineNode.Builder - Builder for a NodeInfo object.
  MapAttempt20LineHistoryEventEmitter
  MapTaskAttemptInfo - MapTaskAttemptInfo represents the information with regard to a map task attempt.
  Node - Node represents a node in the cluster topology.
  Outputter<T> - Interface to output a sequence of objects of type T.
  ParsedHost
  ParsedJob - This is a wrapper class around LoggedJob.
  ParsedTask - This is a wrapper class around LoggedTask.
  ParsedTaskAttempt - This is a wrapper class around LoggedTaskAttempt.
  Pre21JobHistoryConstants - Job History related constants for Hadoop releases prior to 0.21.
  Pre21JobHistoryConstants.Values - This enum contains some of the values commonly used by history log events.
  RackNode - RackNode represents a rack node in the cluster topology.
  RandomSeedGenerator - The purpose of this class is to generate new random seeds from a master seed.
  ReduceAttempt20LineHistoryEventEmitter
  ReduceTaskAttemptInfo - ReduceTaskAttemptInfo represents the information with regard to a reduce task attempt.
  ResourceUsageMetrics - Captures the resource usage metrics.
  RewindableInputStream - A simple wrapper class to make any input stream "rewindable".
  Task20LineHistoryEventEmitter
  TaskAttempt20LineEventEmitter
  TaskAttemptInfo - TaskAttemptInfo is a collection of statistics about a particular task-attempt gleaned from the job-history of the job.
  TaskInfo
  TopologyBuilder - Building the cluster topology.
  TraceBuilder - The main driver of the Rumen Parser.
  TreePath - This describes a path from a node to the root.
  ZombieCluster - ZombieCluster rebuilds the cluster topology using the information obtained from job history logs.
  ZombieJob - ZombieJob is a layer above LoggedJob raw JSON objects.
  ZombieJobProducer - Producing JobStorys from a job trace.