29 October 2015

LogDate vs CreatedDate - when to use one vs the other in an integration

Why are there two date-time fields for each Event Log File: LogDate and CreatedDate? Shouldn't one be good enough?

Seems like it should be a straightforward question, but it comes up frequently and the answer can effect how you integrate with Event Log Files.

Lets start with the definition of each:
  • LogDate tracks usage activity of a log file for a 24-hour period, from 12:00 a.m. to 11:59 p.m.
  • CreatedDate tracks when the log file was generated.
Why is this important? Why have these two different timestamps? Because having both ensure the reliability and eventual consistency of log delivery within a massively distributed system.

Each customer is co-located on a logical collection of servers we call a 'pod'. You can read more about pods and the multi-tenant architecture on the developer force blog.

There can be anywhere from ten to one hundred thousand customer organizations on a single pod. Each pod has numerous app servers which, at any given time, handle requests from any of those customer organizations. In such a large, distributed system, it's possible, though not frequently, for an app server to go down.

As a result, while a customer's transactions are being captured, if an app server does goes down, their transactions can be routed to another server seamlessly without affecting the end-user's experience or the integrity of the data. For a variety of reasons, app servers can go up or down throughout the day but what's important is that this activity doesn't affect the end-user's experience of the app or the integrity of the customer's data.

Each app server captures it's own log files throughout the day regardless of which customer's transactions are being handled by the server. Each log file therefore represents log entries for all customers who had transactions on that app server throughout the day. At the end of the day (~2am local server time), Salesforce ships the log files from active app servers to an HDFS where Hadoop jobs run. The Hadoop server generates the Event Log File content (~3am local server time) for each customer based on the app logs that were shipped earlier. This job is what generates Event Log File content which is accessible to the customer via the API.

It's possible that some log files will have to get shipped at a later date. This could have been from an app server that was offline during some part of the day that comes back on line after the log files are shipped for that day.  Therefore log files may be considered eventually consistent based on the log shipper or Hadoop job picking up a past file in a future job run. As a result, it's possible that Salesforce will catch up at a later point. We have built look back functionality to address this scenario. Every night when we run the Hadoop job, we check to see if new files exist for previous days and then re-generate new Event Log File content, overwriting the existing log files that were previously generated.

This is why we have both CreatedDate and LogDate fields - LogDate reflects the actual 24-hour period when the user activity occurred and CreatedDate reflects when the actual Event Log File was generated. So it's possible, due to look back functionality, that we will re-generate a previous LogDate's file and in the process, write more lines than we did on the previous day with the newly available app server files co-mingled with the original app server log files that were originally received.



This eventual consistency of log files may impact your integration with Event Log Files.

The easy way to integrate with Event Log Files is to use LogDate and write a daily query that simply asks for the last 'n' days of log files based on the LogDate:

Select LogFile from EventLogFile where LogDate=Last_n_Days:7

However, if you query on LogDate, it is possible to miss data that you might get from downloading it later. For instance, if you downloaded yesterday's log files and then re-download them tomorrow, you may actually have more log lines in the newer download. This is because some app log files may have caught up, overwriting the original log content with more log lines.

To ensure a more accurate query that also captures look back updates of the previous day's log files, you should use CreatedDate:

Select LogFile from EventLogFile where CreatedDate=Last_n_Days:7

This is a more complicated integration because you will have to keep track of the CreatedDate for each LogDate and EventType that was previously downloaded in the case that a CreatedDate is newer than a previously downloaded file. You may also need to handle event row de-duplication where you've already downloaded log lines from a previous download into an analytics tool like Splunk only to find additional log lines added in a subsequent download.

There's one option that simplifies this a little bit. You can overwrite past data every time you run your integration. This is what we do with some analytics apps that work with Event Log Files; the job automatically overwrites the last seven days worth of log data with each job rather than appending new data and de-duplicating older downloads.

This may seem antithetical but believe it or not, having look back is a really good thing because it increases the reliability and eventual consistency when working with logs to ensure you get all of the data you expect to be there.

No comments:

Post a Comment