When trying to query data in the Data Explorer when my Panther instance uses Snowflake (as opposed to Athena), with a query like the following, it takes a long time to run:
SELECT * FROM <reasonably_large_table>_variant
ORDER BY p_event_time DESC
LIMIT 10;
The query contains an ORDER BY
and LIMIT
clause, but no WHERE
constraint on p_event_time
To accelerate queries like this, constrain the p_event_time
search as follows:
SELECT * FROM <reasonably_large table>_variant
WHERE DATE_TRUNC('MINUTE', p_event_time) in (select MAX(DATE_TRUNC('MINUTE', p_event_time)) FROM <reasonably_large_table>_variant)
ORDER BY p_event_time DESC
LIMIT 10;
or even something simpler like:
SELECT * FROM <reasonably_large table>_variant
WHERE p_occurs_since('24 hours')
ORDER BY p_event_time DESC
LIMIT 10;
Adding a WHERE clause will limit the results that are needed to be ordered. When you use a WHERE clause along with a cluster key like p_event_time, it is already pre-ordered, which means it only has to look in a few partitions versus having to look in every partition.
Row level, transactional DBs like MySQL or Postgres will execute queries like these relatively quickly if you have an index on p_event_time
because the index also serves as a sort key. However, in Snowflake, which is an OLAP DB, even though we cluster on p_event_time
, the micro partitions on that column themselves aren't sorted. As a result, the above query still needs to execute a full table scan, which can take very long.
For more information about why Snowflake doesn't sort micro-partitions, see Snowflake's documentation here.