03 November 2014

Hadoop and Pig come to the Salesforce Platform with Data Pipelines


Event Log Files data is big - really, really big. This isn't your everyday CRM data, where you may have hundreds of thousands of records or even a few million here and there. One organization I work with generates approximately twenty million rows of event data per day using Event Log Files. That's approximately 600 million rows per month, or 3.6 billion every half year.

Because the size of the data does matter, we need tools that can orchestrate and process it for a variety of use cases. For instance, one best practice when working with Event Log Files is to denormalize Ids into Name fields: rather than reporting on the most adopted reports by Id, it's better to show them by Name.

There are many ways to handle this operation outside of the platform. On the platform, however, Batch Apex has historically been the only option.
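The idea behind the denormalization is nothing exotic: join the raw event rows against an Id-to-Name lookup. A minimal Python sketch of the operation, using made-up Ids and names for illustration:

```python
# Id -> Name lookup, as exported from the org via SOQL (made-up data).
user_names = {
    "005B00000015vyz": "Alice",
    "005B00000015vzA": "Bob",
}

# Raw event rows carry only the opaque user Id.
events = [
    {"USER_ID": "005B00000015vyz", "URI": "/00OB0000000TH2h"},
    {"USER_ID": "005B00000015vzA", "URI": "/00OB0000000TH2h"},
]

# Swap each Id for its human-readable Name, falling back to the Id
# when no mapping exists.
denormalized = [
    {**row, "USER_NAME": user_names.get(row["USER_ID"], row["USER_ID"])}
    for row in events
]
```

At twenty million rows a day, though, doing this row by row on a single machine stops being practical, which is where Hadoop comes in.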

In pilot with the Winter '15 release (page 198 of the release notes), Data Pipelines provides a way for users to upload data into Hadoop and run Pig scripts against it. The advantage is that it handles many different data sources at scale, including sObjects, Chatter files, and external objects.

I worked with Prashant Kommireddi on the following scripts, which help me understand which reports users are viewing:

1. Export user Ids and Names using SOQL into userMap.csv (Id,Name), which I upload to Chatter Files
-- 069B0000000NBbN = userMap file stored in chatter
user = LOAD 'force://chatter/069B0000000NBbN' using gridforce.hadoop.pig.loadstore.func.ForceStorage() as (Id, Name);
-- truncate each user Id from 18 to 15 characters to match the log lines
subUser = foreach user generate SUBSTRING(Id, 0, 15) as Id, Name;
-- storing into FFX to retrieve in next step
STORE subUser INTO 'ffx://userMap15.csv' using gridforce.hadoop.pig.loadstore.func.ForceStorage();
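The truncation works because an 18-character Salesforce Id is just the 15-character case-sensitive Id with a 3-character case-insensitivity suffix appended, so taking the first 15 characters recovers the form that appears in the log lines. In Python terms (the sample Id below is made up):

```python
def to_15(sfid: str) -> str:
    """Equivalent of Pig's SUBSTRING(Id, 0, 15): drop the 3-character
    case-insensitivity suffix from an 18-character Salesforce Id."""
    return sfid[:15]

# A made-up 18-character Id and its 15-character form.
print(to_15("005B00000015vyzABC"))  # 005B00000015vyz
```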

2. Export report Ids and Names using SOQL into reportMap.csv (Id,Name), which I upload to Chatter Files
-- 069B0000000NBbD = reportMap file stored in chatter
report = LOAD 'force://chatter/069B0000000NBbD' using gridforce.hadoop.pig.loadstore.func.ForceStorage() as (Id, Name);
-- truncate each report Id from 18 to 15 characters to match the log lines
subReport = foreach report generate SUBSTRING(Id, 0, 15) as Id, Name;
-- storing into FFX to retrieve in next step
STORE subReport INTO 'ffx://reportMap15.csv' using gridforce.hadoop.pig.loadstore.func.ForceStorage();

3. createReportExport - upload the ReportExport event log file CSV to Chatter Files and run the following script to combine all three files
-- Step 1: load users and store 15 char id
userMap = LOAD 'ffx://userMap15.csv' using gridforce.hadoop.pig.loadstore.func.ForceStorage() as (Id, Name);
-- Step 2: load reports and store 15 char id
reportMap = LOAD 'ffx://reportMap15.csv' using gridforce.hadoop.pig.loadstore.func.ForceStorage() as (Id, Name);
-- Step 3: load full schema from report export elf csv file
elf = LOAD 'force://chatter/069B0000000NB1r' using gridforce.hadoop.pig.loadstore.func.ForceStorage() as (EVENT_TYPE, TIMESTAMP, REQUEST_ID, ORGANIZATION_ID, USER_ID, RUN_TIME, CLIENT_IP, URI, CLIENT_INFO, REPORT_DESCRIPTION);
-- Step 4: strip the leading '/' from the URI field, leaving the 15-character report Id
cELF = foreach elf generate EVENT_TYPE, TIMESTAMP, REQUEST_ID, ORGANIZATION_ID, USER_ID, RUN_TIME, CLIENT_IP, SUBSTRING(URI, 1, 16) as URI, CLIENT_INFO, REPORT_DESCRIPTION;
-- Step 5: join all three files by the common user Id field
joinUserCELF = join userMap by Id, cELF by USER_ID;
joinReportMapELF = join reportMap by Id, cELF by URI;
finalJoin = join joinUserCELF by cELF::USER_ID, joinReportMapELF by cELF::USER_ID;
-- Step 6: generate output based on the expected column positions
elfPrunedOutput = foreach finalJoin generate $0, $1, $2, $3, $4, $5, $7, $8, $10, $11, $12, $13;
-- Step 7: store output in CSV
STORE elfPrunedOutput INTO 'force://chatter/reportExportMaster.csv' using gridforce.hadoop.pig.loadstore.func.ForceStorage();
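To see what the joins in the script above produce, the same flow can be mirrored with plain Python dictionaries. This is only a sketch with hypothetical Ids, not the Data Pipelines API:

```python
# Step 1 and 2 outputs: 15-character Id -> Name lookups (made-up data).
user_map = {"005B00000015vyz": "Alice"}
report_map = {"00OB0000000TH2h": "Pipeline Report"}

# Step 3: one ReportExport event log line, with the schema abridged.
elf = [{"USER_ID": "005B00000015vyz", "URI": "/00OB0000000TH2h"}]

# Steps 4 and 5: strip the leading '/' to get the 15-character report Id,
# then join the row against both lookup maps.
joined = [
    {
        "USER_ID": row["USER_ID"],
        "USER_NAME": user_map[row["USER_ID"]],
        "REPORT_ID": row["URI"][1:16],
        "REPORT_NAME": report_map[row["URI"][1:16]],
    }
    for row in elf
]
```

The Pig version keys the output by column position instead of by name, which is why step 6 selects fields like $0 and $13 from the joined relation.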

By combining the power of data pipelines, I can transform the following Wave platform report from:

[screenshot: report adoption keyed by Id]

To:

[screenshot: report adoption keyed by Name]

To learn more about data pipelines and using Hadoop at Salesforce, download the Data Pipelines Implementation Guide (Pilot) and talk with your account executive about getting into the pilot.
