03 November 2014

Hadoop and Pig come to the Salesforce Platform with Data Pipelines


Event Log Files is big - really, really big. This isn't your everyday CRM data where you may have hundreds of thousands of records or even a few million here and there. One organization I work with does approximately twenty million rows of event data per day using Event Log Files. That's approximately 600 million rows per month or 3.6 billion every half year.

Because the size of the data does matter, we need tools that can orchestrate and process this data for a variety of use cases. For instance, one best practice when working with Event Log Files is to de-normalize Ids into Name fields. Rather than reporting on the most adopted reports by Id, it's better to show the most adopted reports by Name.

There are many ways to handle this operation outside of the platform. However, on the platform there's really only been one way to handle this in the past with Batch Apex.

In pilot with the Winter '15 release (page 198 of the release notes), data pipelines provides a way for users to upload data into Hadoop and run Pig scripts against it. The advantage is that it handles many different data sources including sobjects, chatter files, and external objects at scale.

I worked with Prashant Kommireddi on the following scripts which help me understand which reports users are viewing:

1. Export user Ids and Names using SOQL into userMap.csv (Id,Name) which I upload to chatter files
-- 069B0000000NBbN = userMap file stored in chatter
user = LOAD 'force://chatter/069B0000000NBbN' using gridforce.hadoop.pig.loadstore.func.ForceStorage() as (Id, Name);
-- loop through user map to reduce Id from 18 to 15 characters to match the log lines
subUser = foreach user generate SUBSTRING(Id, 0, 15) as Id, Name;
-- storing into FFX to retrieve in next step
STORE subUser INTO 'ffx://userMap15.csv' using gridforce.hadoop.pig.loadstore.func.ForceStorage();

2. Export report Ids and Names using SOQL into reportMap.csv (Id,Name) which I upload to chatter files
-- 069B0000000NBbD = reportMap file stored in chatter
report = LOAD 'force://chatter/069B0000000NBbD' using gridforce.hadoop.pig.loadstore.func.ForceStorage() as (Id, Name);
-- loop through user map to reduce Id from 18 to 15 characters to match the log lines
subReport = foreach report generate SUBSTRING(Id, 0, 15) as Id, Name;
-- storing into FFX to retrieve in next step
STORE subReport INTO 'ffx://reportMap15.csv' using gridforce.hadoop.pig.loadstore.func.ForceStorage();

3. createReportExport - Upload ReportExport.csv to chatter files and run script to combine all three
-- Step 1: load users and store 15 char id
userMap = LOAD 'ffx://userMap15.csv' using gridforce.hadoop.pig.loadstore.func.ForceStorage() as (Id, Name);
-- Step 2: load reports and store 15 char id
reportMap = LOAD 'ffx://reportMap15.csv' using gridforce.hadoop.pig.loadstore.func.ForceStorage() as (Id, Name);
-- Step 3: load full schema from report export elf csv file
elf = LOAD 'force://chatter/069B0000000NB1r' using gridforce.hadoop.pig.loadstore.func.ForceStorage() as (EVENT_TYPE, TIMESTAMP, REQUEST_ID, ORGANIZATION_ID, USER_ID, RUN_TIME, CLIENT_IP, URI, CLIENT_INFO, REPORT_DESCRIPTION);
-- Step 4: remove '/' from URI field to create an Id map
cELF = foreach elf generate EVENT_TYPE, TIMESTAMP, REQUEST_ID, ORGANIZATION_ID, USER_ID, RUN_TIME, CLIENT_IP, SUBSTRING(URI, 1, 16) as URI, CLIENT_INFO, REPORT_DESCRIPTION;
-- Step 5: join all three files by the common user Id field
joinUserCELF = join userMap by Id, cELF by USER_ID;
joinReportMapELF = join reportMap by Id, cELF by URI;
finalJoin = join joinUserCELF by cELF::USER_ID, joinReportMapELF by cELF::USER_ID;
-- Step 6: generate output based on the expected column positions
elfPrunedOutput = foreach finalJoin generate $0, $1, $2, $3, $4, $5, $7, $8, $10, $11, $12, $13;
-- Step 7: store output in CSV
STORE elfPrunedOutput INTO 'force://chatter/reportExportMaster.csv' using gridforce.hadoop.pig.loadstore.func.ForceStorage();

By combining the power of data pipelines, I can transform the following Wave platform report from:

To:


To learn more about data pipelines and using Hadoop at Salesforce, download the Data Pipelines Implementation Guide (Pilot) and talk with your account executive about getting into the pilot.

No comments:

Post a Comment