In HowTo: Rotate Logs to S3 and HowTo: Rotate Apache Logs to S3 Based on EC2 Tags we rotated Apache logs to S3, placing them in a directory layout that facilitates running a Hive script against the logs. In this article we demonstrate such a Hive script.

Command used by logrotate to upload the Apache access_log to S3

/usr/bin/s3cmd \
-m text/plain \
sync \
/var/log/apache2/access_log-* \
s3://${BUCKET}/apache/access_log/site=${SITE}/instance=${INSTANCE_ID}/

With BUCKET set to logging-bucket, SITE set to www.example.com, and INSTANCE_ID left empty (``), here are some example paths for the access_log files:

s3://logging-bucket/apache/access_log/site=www.example.com/instance=/access_log-2013-08-24-1377303374
s3://logging-bucket/apache/access_log/site=www.example.com/instance=/access_log-2013-10-24-1382591855
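The `site=` and `instance=` segments in these paths are the Hive-style `key=value` partition layout that the table definition below relies on. A minimal Python sketch of how the prefix is assembled (values mirror the article's example, with INSTANCE_ID left empty as in the paths above):

```python
# Assemble the S3 prefix that s3cmd syncs to; the site=/instance=
# path segments become Hive partition columns later on.
BUCKET = "logging-bucket"
SITE = "www.example.com"
INSTANCE_ID = ""  # left empty, as in the example paths above

prefix = (
    f"s3://{BUCKET}/apache/access_log/"
    f"site={SITE}/instance={INSTANCE_ID}/"
)
print(prefix)
```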

The first thing we will want to do is start an interactive Hive session; see HowTo: AWS CLI Elastic MapReduce - Interactive Hive for one way of starting an interactive Hive Elastic MapReduce cluster.

Once in Hive, we'll set the input location and load hive_contrib.jar:

hive>

set INPUT=s3n://logging-bucket/apache;

add jar /home/hadoop/hive/lib/hive_contrib.jar;

All Sites

We can create a table that will load all the sites:

hive>

CREATE EXTERNAL TABLE IF NOT EXISTS raw_access_log(
remote_host STRING,
remote_logname STRING,
remote_user STRING,
request_time STRING,
first_line STRING,
http_status STRING,
bytes STRING
)
PARTITIONED BY (site string, instance string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
)
LOCATION '${hiveconf:INPUT}/access_log/'
;

ALTER TABLE raw_access_log RECOVER PARTITIONS;
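To see what the SerDe's input.regex actually captures, here is a small Python check, run against a sample Common Log Format line with hypothetical values (an assumption: unescaping Hive's doubled backslashes yields the pattern below):

```python
import re

# Python equivalent of the SerDe's "input.regex" after unescaping
# Hive's doubled backslashes (\\[ -> \[ and so on).
LOG_RE = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) '
    r'([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)'
)

# A sample Common Log Format line (hypothetical values).
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326')

m = LOG_RE.fullmatch(line)
# The seven groups line up with the seven STRING columns:
# remote_host, remote_logname, remote_user, request_time,
# first_line, http_status, bytes.
print(m.groups())
```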

One thing to note is that the partition keys (site and instance) become columns available in the table:

hive>

describe raw_access_log;

Output

OK
remote_host             string                  from deserializer
remote_logname          string                  from deserializer
remote_user             string                  from deserializer
request_time            string                  from deserializer
first_line              string                  from deserializer
http_status             string                  from deserializer
bytes                   string                  from deserializer
site                    string                  None
instance                string                  None

# Partition Information
# col_name              data_type               comment

site                    string                  None
instance                string                  None
Time taken: 0.722 seconds, Fetched: 15 row(s)

Change Column Types

Another issue is that all the types are strings. It would be nice if we could have the request_time be a timestamp and have bytes actually be a number.

There are two options: create a view that casts the types, or create a new table and cast those two columns while populating it.

View

hive>

CREATE VIEW access_log_view AS
SELECT
    remote_host,
    remote_logname,
    remote_user,
    CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')) AS TIMESTAMP) AS request_time,
    first_line,
    http_status,
    CAST(REGEXP_REPLACE(bytes, '-', '0') AS BIGINT) AS bytes,
    site,
    instance
FROM
    raw_access_log
;
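The two casts above can be sanity-checked outside Hive. A minimal Python sketch of the same conversions, using a hypothetical log line's values (the strptime pattern mirrors the Java-style format string passed to UNIX_TIMESTAMP):

```python
from datetime import datetime

# request_time as the RegexSerDe captured it, brackets included.
raw_time = '[10/Oct/2000:13:55:36 -0700]'
# Python analogue of the Hive pattern '[dd/MMM/yyyy:HH:mm:ss Z]'.
ts = datetime.strptime(raw_time, '[%d/%b/%Y:%H:%M:%S %z]')

# bytes is '-' when Apache logged no response body;
# the view maps that to 0 before casting to a number.
raw_bytes = '-'
nbytes = 0 if raw_bytes == '-' else int(raw_bytes)

print(ts.isoformat(), nbytes)
```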

Table

hive>

CREATE TABLE IF NOT EXISTS access_log_table (
    remote_host STRING,
    remote_logname STRING,
    remote_user STRING,
    request_time TIMESTAMP,
    first_line STRING,
    http_status STRING,
    bytes BIGINT,
    site STRING,
    instance STRING
)
;

INSERT OVERWRITE TABLE access_log_table
SELECT
    remote_host,
    remote_logname,
    remote_user,
    CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')) AS TIMESTAMP),
    first_line,
    http_status,
    CAST(REGEXP_REPLACE(bytes, '-', '0') AS BIGINT),
    site,
    instance
FROM
    raw_access_log
;

Example queries

Get the logs for www.example.com:

hive>

set SITE=www.example.com;
SELECT
    remote_host,
    remote_logname,
    remote_user,
    request_time,
    first_line,
    http_status,
    bytes
FROM
    access_log_view
WHERE
    site = '${hiveconf:SITE}'
;

References