Why are all my incoming logs only matching 1 schema?

Last updated: September 3, 2024

Issue

I have multiple custom schemas for a log source, but all my incoming logs keep matching the same one.

Example: I have a log source which contains logs for two apps: an nginx server and a web app called Webby. The two apps output logs in different formats, so I defined two schemas for this log source, one for each app. However, all logs imported into Panther are parsed as nginx logs, even the ones from the web app.

Resolution

The issue can be solved in a few ways:

If you are using S3 to onboard your logs...

Consider using prefix filtering. Have each data source send its logs to a prefixed location within your S3 bucket. Then, when creating the log source in the Panther Console (or editing a previously created source), specify the prefixes for each log type. You can direct logs with specific prefixes to use individual schemas. This is the safest method and is recommended wherever possible.

Example: Suppose you've configured both logs to be saved in the root directory of your S3 bucket. Instead, configure the logs to be saved to separate folders, such as /nginx and /webby. Then, in the Panther Console, edit or create the log source. Under the "S3 Prefixes & Schemas" section, specify the folder names (these are the prefixes we're talking about).

Scroll to the "S3 Prefixes & Schemas" section. Enter the name of the folder where the particular logs are stored (for this case, "webby").
Click the "Add Schemas" button to add a new field where we can choose a schema for these logs.

In the new field which appears below the Prefix input, choose a schema. Only this schema will be used to parse logs coming from the /webby folder in our S3 bucket.
Since we need to configure a similar prefix filter for our nginx logs, click the link at the bottom which says "Add another Prefix & Schema combination". We can do this as many times as we need.

If you are not using S3...

If you can't use prefix filtering as described above, you need to make the log schemas mutually exclusive. Consider which fields the schemas have in common, and which are specific to one or the other. For each schema, find a field that is always present in that log type and never appears in logs matching any other schema, then mark that field as required (as detailed here). One exclusive required field per schema guarantees that any correctly formatted log matches only one schema.

Example: Suppose all nginx logs have a field called "level", which is never used in any Webby logs. Likewise, suppose Webby logs have a field called "webby_token_id", which never appears in nginx logs. By defining both fields as required in their respective schemas, you can prevent any false matches for incoming logs.

# nginx schema

version: 0
fields:
  - name: date
    type: timestamp
  - name: level
    type: integer
    required: true
  - name: message
    type: string
...

# webby schema

version: 0
fields:
  - name: date
    type: timestamp
  - name: webby_token_id
    type: string
    required: true
  - name: actor
    type: object
....
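One way to sanity-check this before uploading schemas is to confirm that each schema has at least one required field that no other schema declares. The sketch below is a standalone illustration and not part of Panther; the dictionaries simply mirror the two YAML examples above, and the helper name `exclusive_required_fields` is hypothetical.

```python
def exclusive_required_fields(schemas):
    """Map each schema name to its required fields that no other schema declares."""
    result = {}
    for name, fields in schemas.items():
        # Collect every field name declared by the *other* schemas.
        other_names = {
            f["name"]
            for other, other_fields in schemas.items()
            if other != name
            for f in other_fields
        }
        result[name] = sorted(
            f["name"]
            for f in fields
            if f.get("required") and f["name"] not in other_names
        )
    return result


# Hypothetical in-memory copies of the two schemas shown above.
SCHEMAS = {
    "nginx": [
        {"name": "date", "type": "timestamp"},
        {"name": "level", "type": "integer", "required": True},
        {"name": "message", "type": "string"},
    ],
    "webby": [
        {"name": "date", "type": "timestamp"},
        {"name": "webby_token_id", "type": "string", "required": True},
        {"name": "actor", "type": "object"},
    ],
}

if __name__ == "__main__":
    for schema, exclusive in exclusive_required_fields(SCHEMAS).items():
        status = "ok" if exclusive else "no exclusive required field"
        print(f"{schema}: {exclusive} ({status})")
```

A schema that comes back with an empty list has no exclusive required field, so logs meant for another schema could still match it.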

Alternatively, if the schemas share a field but each expects different values for it, use the allow and deny keys in your schema definitions (as explained here). For example, if Schema A and Schema B both have a field named "app_name", but in one it always equals "nginx" and in the other it always equals "apache", you can specify this in the schema definitions. The field's value then determines which schema is used for a particular incoming event.

Example: What if Webby also includes a field called "level"? Suppose it does, but in Webby's logs, the "level" field is a string enum with the values "INFO", "DEBUG", "WARN", etc., while the "level" field in nginx uses integers. We can configure the schemas so that the value of the "level" field decides which one applies.

# nginx schema

...
  - name: level
    type: integer
    required: true
    validate:
      deny: ["DEBUG", "INFO", "WARN", "ERROR"]
...

# webby schema

...
  - name: level
    type: string
    required: false
    validate:
      allow: ["DEBUG", "INFO", "WARN", "ERROR"]
...
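To illustrate the idea behind allow and deny lists, here is a minimal sketch of the "app_name" case described above. It is a simplified model written for this article, not Panther's validation code; `field_passes`, `candidate_schemas`, and the `RULES` table are hypothetical names.

```python
def field_passes(value, validate):
    """Apply hypothetical allow/deny lists to one field value."""
    if "allow" in validate and value not in validate["allow"]:
        return False
    if "deny" in validate and value in validate["deny"]:
        return False
    return True


# Hypothetical rules for the shared "app_name" field in Schema A and Schema B.
RULES = {
    "Schema A": {"allow": ["nginx"]},
    "Schema B": {"allow": ["apache"]},
}


def candidate_schemas(event):
    """Return the schemas whose app_name rule accepts this event."""
    return [
        name
        for name, validate in RULES.items()
        if "app_name" in event and field_passes(event["app_name"], validate)
    ]


if __name__ == "__main__":
    for app in ("nginx", "apache", "other"):
        print(app, candidate_schemas({"app_name": app}))
```

Because the two allow lists do not overlap, no event can be accepted by both schemas.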

A note on exclusivity: do your best to ensure schemas are mutually exclusive. Any given log should be able to match only one schema. If Log A matches only Schema A, but Log B matches both Schema A and Schema B, you risk Log B being parsed with the wrong schema. Panther does not scan through all schemas and choose the best match.

Panther matches an incoming event against the schemas associated with the event's log source in the following way:

If no events have been parsed yet in the current run, schemas are tried in alphabetical order.
If some logs have been parsed in the current run, the schema for the most recent successful classification is tried. If it doesn't work, other schemas are tried in order of last successful parsing.
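The two rules above can be modeled with a small simulation that shows why overly broad schemas send every log to the same schema. This is a toy model under stated assumptions, not Panther's implementation: matching is reduced to "all required fields are present", and "order of last successful parsing" is assumed to mean most recent first. The `Classifier` class and the sample events are hypothetical.

```python
def matches(event, required_fields):
    """Toy matcher: an event matches if it contains every required field."""
    return all(field in event for field in required_fields)


class Classifier:
    """Tries schemas in the order described above (simplified)."""

    def __init__(self, schemas):
        # schemas: dict of schema name -> list of required field names.
        self.schemas = schemas
        # Nothing parsed yet in this run: try schemas alphabetically.
        self.order = sorted(schemas)

    def classify(self, event):
        for name in list(self.order):
            if matches(event, self.schemas[name]):
                # Move the winner to the front so it is tried first next time.
                self.order.remove(name)
                self.order.insert(0, name)
                return name
        return None  # no schema matched


NGINX_EVENT = {"date": "2024-09-03T10:00:00Z", "level": 3, "message": "ok"}
WEBBY_EVENT = {"date": "2024-09-03T10:00:01Z", "webby_token_id": "abc", "actor": {}}
EVENTS = [NGINX_EVENT, WEBBY_EVENT, WEBBY_EVENT]

# Too broad: both schemas only require "date", so either one matches any event.
BROAD = {"nginx": ["date"], "webby": ["date"]}
# Exclusive required fields: each event can match only its own schema.
STRICT = {"nginx": ["date", "level"], "webby": ["date", "webby_token_id"]}

if __name__ == "__main__":
    for label, schemas in (("broad", BROAD), ("strict", STRICT)):
        classifier = Classifier(schemas)
        print(label, [classifier.classify(e) for e in EVENTS])
```

With the broad schemas, every event is classified as nginx, because nginx is tried first and always matches. With the strict schemas, each event lands on its own schema.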

Cause

Most often, this is caused by schemas whose criteria are too broad. If one or more schemas can match logs in multiple formats, logs can be parsed with the wrong schema.