My query using p_event_time in Panther runs slowly
Issue
When trying to query data in the Data Explorer when my Panther instance uses Snowflake (as opposed to Athena or a different backend database), with a query like the following, it takes a long time to run:
SELECT * FROM <reasonably_large_table>_variant ORDER BY p_event_time DESC LIMIT 10;
The query contains an ORDER BY
and LIMIT
clause, but no WHERE
constraint on p_event_time
Resolution
To accelerate queries like this, constrain the p_event_time
search as follows:
SELECT * FROM <reasonably_large table>_variant WHERE DATE_TRUNC('MINUTE', p_event_time) in (select MAX(DATE_TRUNC('MINUTE', p_event_time)) FROM <reasonably_large_table>_variant) ORDER BY p_event_time DESC LIMIT 10;
or even something simpler like:
SELECT * FROM <reasonably_large table>_variant WHERE p_occurs_since('24 hours') ORDER BY p_event_time DESC LIMIT 10;
Adding a WHERE clause will limit the results that are needed to be ordered. When you use a WHERE clause along with a cluster key like p_event_time, it is already pre-ordered, which means it only has to look in a few partitions versus having to look in every partition.
Cause
Row level, transactional DBs like MySQL or Postgres will execute queries like these relatively quickly if you have an index on p_event_time
because the index also serves as a sort key. However, in Snowflake, which is an OLAP DB, even though we cluster on p_event_time
, the micro partitions on that column themselves aren't sorted. As a result, the above query still needs to execute a full table scan, which can take very long.
For more information about why Snowflake doesn't sort micro-partitions, see Snowflake's documentation here.