Dissecting PASTA Audit event information

The PASTA Audit service collects information about user-related events that occur in PASTA, including data entity downloads. This information is stored in the Audit Manager's event database and is reported back to you through the PASTA Audit service "Get Audit Report" REST API method as an XML document composed of individual event audit records. Understanding this information is critical to determining what these events mean for your data.

<auditRecord>
  <oid>121607670</oid>
  <entryTime>2021-12-01T13:17:07</entryTime>
  <category>info</category>
  <service>DataPackageManager-1.0</service>
  <serviceMethod>readDataEntity</serviceMethod>
  <responseStatus>200</responseStatus>
  <resourceId>https://pasta.lternet.edu/package/data/eml/edi/522/7/d35591d6e18662290359e9e3076777e8</resourceId>
  <user>public</user>
  <userAgent>Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0</userAgent>
  <groups></groups>
  <authSystem>https://pasta.edirepository.org/authentication</authSystem>
  <entryText>Entity Name: lake_reservoir; Object Name: AGOSUS_RSVR_v1.1.csv; Data Format: text/csv</entryText>
</auditRecord>

Each audit record consists of 12 fields: one contains the event identifier, and the other 11 capture the event details. In the example above, the audit record describes a data entity read request for an EDI-scoped data object, made through a Firefox web browser by an anonymous (public) user on 1 December 2021 at 1:17 pm. Five fields are key to understanding this event:

  1. serviceMethod - what type of action occurred with this event? In this case, it was a "readDataEntity" action (in other words, download a data object). The PASTA Audit Manager can record 46 different actions in its event database.
  2. resourceId - what PASTA resource was requested? The "resourceId" field contains the internal PASTA resource identifier that the action operates on. In the above example the resource is an EDI-scoped data object, specifically object identifier "d35591d6e18662290359e9e3076777e8" of data package edi.522.7.
  3. userAgent - what type of application invoked the action? The "userAgent" field contains the HTTP request user agent string, which identifies the application that generated the request. Common user agents are web browsers, software applications that connect to the Internet, and search engine crawlers and robots. Whatever application generates the request, it is important to know that the user agent string is not required: it may be empty, or it may be a custom string that describes the originating application, such as "DataONE-Python/3.4.7 +http://dataone.org/". In this context, the user agent string serves as the primary information that distinguishes legitimate requests for data from those that come from search engines or other applications that are requesting data for non-scientific purposes. The user agent string from the above event indicates that the request came from the Firefox web browser, which uses Mozilla's Gecko browser engine. Based on this user agent, the data request is very likely legitimate.
  4. user - what user (who) initiated this action? The "user" field is information provided by PASTA's authentication service and captures the user's unique login identifier, whether it be an LDAP distinguished name, an ORCID identifier, or the "public" moniker used for non-authenticated (i.e., anonymous) users. Non-legitimate users are labeled as "robot" in the user field and are classified this way by filtering user agent strings through a robot-identifying algorithm.
  5. entryTime - when did the action occur? The "entryTime" field contains the PASTA system date-timestamp and is recorded in the Denver/Mountain timezone, which is either -6 or -7 hours from UTC depending on whether daylight saving time is in effect.
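
The five fields above can be pulled out of a record with a few lines of Python. This is a minimal sketch, not part of PASTA itself: the function name summarize_record is hypothetical, the record is an abbreviated copy of the example above, and the way the resourceId is split assumes the data entity URL layout shown in that example (other resource types may be laid out differently).

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Abbreviated copy of the example record above (only the five key fields).
RECORD_XML = """<auditRecord>
  <entryTime>2021-12-01T13:17:07</entryTime>
  <serviceMethod>readDataEntity</serviceMethod>
  <resourceId>https://pasta.lternet.edu/package/data/eml/edi/522/7/d35591d6e18662290359e9e3076777e8</resourceId>
  <user>public</user>
  <userAgent>Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0</userAgent>
</auditRecord>"""


def summarize_record(xml_text):
    """Return the five key fields of one auditRecord as a plain dict."""
    record = ET.fromstring(xml_text)
    # Empty elements have text None, so fall back to an empty string.
    fields = {child.tag: (child.text or "") for child in record}

    # Assumption: a data entity resourceId ends with
    # .../<scope>/<identifier>/<revision>/<entity id>, as in the example.
    parts = fields["resourceId"].rstrip("/").split("/")
    scope, identifier, revision, entity_id = parts[-4:]

    return {
        "action": fields["serviceMethod"],
        "package_id": f"{scope}.{identifier}.{revision}",
        "entity_id": entity_id,
        "user": fields["user"],
        "user_agent": fields["userAgent"],
        # Naive datetime: entryTime is Mountain time and carries no offset.
        "entry_time": datetime.fromisoformat(fields["entryTime"]),
    }


summary = summarize_record(RECORD_XML)
print(summary["package_id"], summary["action"], summary["entry_time"])
```

The timestamp is deliberately left as a naive local time; converting it to UTC requires knowing whether the -6 or -7 hour offset applied on that date.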

So now you know the critical aspects of an individual PASTA audit record, but how do you query the PASTA Audit service for meaningful and interesting data? The PASTA Audit "Get Audit Report" REST API method provides a set of filter parameters that narrow down your report content:

  - fromTime and toTime - the two most important parameters. They limit the event records to those that have an "entryTime" between the "fromTime" and "toTime" values.
  - serviceMethod - limits the report to event records for one particular service method. The "serviceMethod" uniquely identifies the type of action that occurred (the full list of 46 service methods can be seen on the Data Portal audit search interface).
  - user - limits the report to event records with the specified "user" identifier. Because "public" is by far the most common user identifier in the PASTA Audit event database table, filtering on "public" returns the most event records while excluding records with "robot" and other non-public user identifiers.
  - resourceId - filters on the PASTA resource identifier. Use the full identifier if you require information about a specific data package object, or a sub-string of the identifier to return a more generalized report.

Note that you must have a valid EDI user account to execute the "Get Audit Report" REST API method. Below is an example of using the "Get Audit Report" REST API method to return all data entity read events by the "public" user during 2021 for data packages that have the "EDI" scope and an identifier beginning with the number 1:

curl --user "uid=USER,o=EDI,dc=edirepository,dc=org:USER_PASSWORD" -X GET "https://pasta.lternet.edu/audit/report?serviceMethod=readDataEntity&fromTime=20210101T00:00:00&toTime=20220101T00:00:00&user=public&resourceId=/edi/1"
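
If you assemble report queries in a script, the same query string can be built programmatically. The sketch below only constructs the URL and sends nothing; the helper name build_report_url is hypothetical, and the endpoint and parameter names are taken from the curl example above.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl example above.
BASE_URL = "https://pasta.lternet.edu/audit/report"


def build_report_url(**filters):
    """Build a Get Audit Report URL, skipping filters that are None."""
    params = {name: value for name, value in filters.items() if value is not None}
    # Keep ':' and '/' unescaped so the result matches the curl example.
    return BASE_URL + "?" + urlencode(params, safe=":/")


url = build_report_url(
    serviceMethod="readDataEntity",
    fromTime="20210101T00:00:00",
    toTime="20220101T00:00:00",
    user="public",
    resourceId="/edi/1",
)
print(url)
```

The resulting URL can be passed to curl or to any HTTP client that supports basic authentication with your EDI account.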

When you analyze the PASTA Audit event report, you may still need to remove audit records that could not be excluded with the available report parameters. For example, to determine the number of events initiated from a web browser, you may want to remove all audit records whose user agent string does not match one produced by a web browser. Similarly, user agent strings that contain "DataONE" may be used to determine the number of events originating from the DataONE Generic Member Node.
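
This kind of post-filtering might look like the following sketch. The matching rules are simple heuristics chosen for illustration, not the robot-identifying algorithm PASTA uses, and the sample report (including its auditReport root element name and oid values) is made up for the example. The code only looks for auditRecord elements, so it does not depend on the root element name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample report; the root element name is an assumption.
SAMPLE_REPORT = """<auditReport>
  <auditRecord><oid>1</oid><userAgent>Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0</userAgent></auditRecord>
  <auditRecord><oid>2</oid><userAgent>DataONE-Python/3.4.7 +http://dataone.org/</userAgent></auditRecord>
  <auditRecord><oid>3</oid><userAgent></userAgent></auditRecord>
</auditReport>"""

# Heuristic tokens that often appear in crawler user agent strings.
ROBOT_TOKENS = ("bot", "crawler", "spider")


def classify_agent(user_agent):
    """Roughly classify a user agent string (heuristic, illustrative only)."""
    agent = (user_agent or "").strip().lower()
    if not agent:
        return "unknown"
    if "dataone" in agent:
        return "dataone"
    if any(token in agent for token in ROBOT_TOKENS):
        return "robot"
    # Most web browsers send a string that starts with "Mozilla/".
    if agent.startswith("mozilla/"):
        return "browser"
    return "other"


def count_by_agent(report_xml):
    """Count the audit records in a report by user agent class."""
    root = ET.fromstring(report_xml)
    counts = Counter(
        classify_agent(record.findtext("userAgent"))
        for record in root.iter("auditRecord")
    )
    return dict(counts)


print(count_by_agent(SAMPLE_REPORT))
```

Many crawlers also send strings that begin with "Mozilla/", which is why the robot check runs before the browser check; a production filter would need a more complete list of patterns.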

Lastly, PASTA Audit reports query a large database containing millions of records and may take some time to generate. In addition, a report containing many audit records produces a great deal of XML output, which may be slow to download completely. To limit the impact of general queries, use the API method parameters to restrict your report to only the records you require.
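
When a large report cannot be avoided, reading it as a stream keeps memory use modest on the client side. The sketch below uses Python's standard iterparse function to handle one audit record at a time; the function name iter_records is hypothetical, and the sample data (including the auditReport root element name and oid values) is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample report; the root element name is an assumption.
SAMPLE = b"""<auditReport>
  <auditRecord><oid>1</oid><user>public</user></auditRecord>
  <auditRecord><oid>2</oid><user>robot</user></auditRecord>
</auditReport>"""


def iter_records(stream):
    """Yield each auditRecord as a dict while the XML is still being read."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "auditRecord":
            record = {child.tag: (child.text or "") for child in elem}
            # Drop the record's children now that they have been copied.
            elem.clear()
            yield record


# Any binary file-like object works here, such as a saved report opened
# with open(path, "rb"); BytesIO stands in for one in this example.
records = list(iter_records(io.BytesIO(SAMPLE)))
print(len(records), [record["user"] for record in records])
```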