Analyzing Apache Access Logs in Apache Hive ~ Java, Java EE & Java Script

Monday, January 25, 2016

Analyzing Apache Access Logs in Apache Hive

The following statistics are analyzed:

A count of the response codes returned from the server.
The content size of responses returned from the server to each host.
The top ten most popular URLs in the Apache log.
The average, min, and max content size of responses returned from the server.
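
Before moving to Hive, it can help to see what these four statistics look like on a handful of lines. The following is a minimal sketch using awk, not the Hive approach this post follows. The sample lines, the function names, and the field positions (Common Log Format, split on whitespace) are assumptions made for illustration, and "content size per host" is assumed here to mean the total bytes sent to each host.

```shell
#!/bin/sh
# Hypothetical sample lines in Common Log Format (not taken from the real log file).
sample_log() {
cat <<'EOF'
10.0.0.1 - - [25/Jan/2016:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1000
10.0.0.1 - - [25/Jan/2016:10:00:01 +0000] "GET /about.html HTTP/1.1" 200 3000
10.0.0.2 - - [25/Jan/2016:10:00:02 +0000] "GET /index.html HTTP/1.1" 404 200
EOF
}

# With whitespace splitting: $1 = host, $7 = URL, $9 = status, $10 = bytes.

# Count of each response code.
count_by_status() {
  awk '{ n[$9]++ } END { for (s in n) print s, n[s] }' | sort
}

# Total content size sent to each host (a "-" size counts as 0).
bytes_by_host() {
  awk '{ b[$1] += $10 + 0 } END { for (h in b) print h, b[h] }' | sort
}

# Top ten URLs by number of requests, most requested first.
top_urls() {
  awk '{ n[$7]++ } END { for (u in n) print n[u], u }' |
    sort -k1,1nr -k2,2 | head -n 10
}

# Average, min, and max content size over all responses.
size_stats() {
  awk '{
         size = $10 + 0
         sum += size
         if (NR == 1 || size < min) min = size
         if (NR == 1 || size > max) max = size
       }
       END { if (NR > 0) print sum / NR, min, max }'
}

sample_log | count_by_status
sample_log | bytes_by_host
sample_log | top_urls
sample_log | size_stats
```

To try it on the downloaded file, replace `sample_log |` with a redirect from the unzipped log, for example `count_by_status < access_log`, where `access_log` is a placeholder for the actual file name.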

The steps to process data with Apache Hive

Before following the steps below, install the Cloudera QuickStart VM 5.5 and VMware Player. Hadoop 2.6, Java 1.7, Eclipse Luna, Hive, HBase, Spark, and all required libraries are included in the Cloudera VM.

Download the Apache log file from http://www.monitorware.com/en/logsamples/apache.php and unzip it.
Create the input directory /user/cloudera/hive/input in HDFS.

hadoop fs -mkdir -p /user/cloudera/hive/input