Reduce cost and improve query performance with Amazon Athena Query Result Reuse

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run on datasets at petabyte scale. You can use Athena to query your S3 data lake for use cases such as data exploration for machine learning (ML) and AI, business intelligence (BI) reporting, and ad hoc querying.

It’s not uncommon for datasets in data lakes to update only daily, or at most a few times per day, yet queries running on these datasets may be repeated more frequently. Previously, all queries resulted in a data scan, even when the same query was repeated. When the source data hasn’t changed, repeat queries run needlessly, producing the same results with higher data scan costs and query latency. Wouldn’t it be better if the results of a recent query could be reused instead?

Query Result Reuse is a new feature available in Athena engine version 3 that makes it possible to reuse the results of a previous query. This can improve performance and reduce cost for frequently run queries by skipping the scan of the source data and instead returning a previously calculated result directly. With Query Result Reuse, you can tell Athena that you want to reuse the results of a previous query run, with a maximum age setting that controls how recent a previous result has to be.

Athena automatically reuses any previous results that match your query and maximum age setting, or transparently runs the query again if no match is found. If a dataset changes a few times per day, you can, for example, tell Athena to reuse results that are up to an hour old to avoid rerunning most queries, but still get new results when you run a query soon after new data has become available.

In this post, we demonstrate how to reduce cost and improve query performance with the new Query Result Reuse feature.

When should you use Query Result Reuse?

We recommend using Query Result Reuse for every query where the source data doesn’t change frequently. You can configure the maximum age of results to reuse per query, or use the default, which is 60 minutes. In certain cases where queries include non-deterministic functions such as RAND(), the query fetches fresh data from the input source even if the Query Result Reuse feature is enabled.

Query Result Reuse allows results to be shared among users in a workgroup, as long as they have access to the tables and data. This means Query Result Reuse can benefit not only a single user, but also other users in the workgroup who might be running the same queries. One example where this can be especially helpful is when you have dashboards that are viewed by many users. The dashboard widgets run the same queries for all users, and are therefore accelerated by Query Result Reuse, when enabled.

Another example is when you have a dataset that is updated daily, and many users who all query the most recent data to create reports. Different people might run the same queries as part of their work; with Query Result Reuse, they can collectively avoid running the same query more than once, making everyone more productive and lowering overall cost by avoiding repeated scans of the same data.

Finally, if you have a historical dataset that is frequently queried, but never or very rarely updated, you can configure queries to reuse results that are up to 7 days old to maximize the chances of reusing results and avoid unnecessary costs.

How does Query Result Reuse work?

Query Result Reuse takes advantage of the fact that Athena writes query results to Amazon S3 as a CSV file. Before the introduction of Query Result Reuse, it was possible to reuse query results by reading these files directly. You could also use the ClientRequestToken parameter of the StartQueryExecution API to ensure queries are run only once, and subsequent runs return the same results. With Query Result Reuse, the process of reusing query results is easier and more versatile.

When Athena receives a query with Query Result Reuse enabled, it looks for a result for a query with the same query string that was run in the same workgroup. The query string needs to be identical in order to match.

Query Result Reuse is enabled on a per-query basis. When you run a query, you specify how old a result can be for it to be reused, from 1 minute up to 7 days. If the query has been run before, and a result exists that matches the request, it’s returned; otherwise the query is run and a new result is calculated. This new result is then available to be reused by subsequent queries.
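The age range described above can be captured in a small helper. The following is a minimal sketch, not part of the Athena API: build_result_reuse_config and MAX_AGE_LIMIT_MINUTES are hypothetical names, and the bounds simply mirror the 1 minute to 7 days range stated in this post. The returned dictionary has the shape of the ResultReuseConfiguration parameter of the StartQueryExecution API, which is discussed later in this post.

```python
# Hypothetical helper: validates the maximum age and builds the
# ResultReuseConfiguration structure used by StartQueryExecution.
MINUTES_PER_DAY = 24 * 60
MAX_AGE_LIMIT_MINUTES = 7 * MINUTES_PER_DAY  # 7 days expressed in minutes


def build_result_reuse_config(max_age_minutes=60, enabled=True):
    """Return a ResultReuseConfiguration dict for the given maximum age.

    The 1 minute to 7 days bounds are an assumption taken from the
    range described in this post.
    """
    if not 1 <= max_age_minutes <= MAX_AGE_LIMIT_MINUTES:
        raise ValueError(
            f"max_age_minutes must be between 1 and {MAX_AGE_LIMIT_MINUTES}"
        )
    return {
        "ResultReuseByAgeConfiguration": {
            "Enabled": enabled,
            "MaxAgeInMinutes": max_age_minutes,
        }
    }


# Example: accept results up to 7 days old for a rarely updated dataset.
historical_config = build_result_reuse_config(MAX_AGE_LIMIT_MINUTES)
```

Centralizing the validation in one place means an out-of-range age fails in your own code, before any request is sent to Athena.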

You can run the query multiple times with different settings for how old a result you can accept. Results can be reused within the same workgroup, even if a different user ran the query previously.

Before a query result is reused, Athena does a few checks to make sure that the user is still allowed to see the results. It checks that the user has access to the tables involved in the query and permission to read the result file on Amazon S3.

There are some situations where query results can’t be reused, for example if the query uses non-deterministic functions, or has AWS Lake Formation fine-grained access controls enabled. These limitations are described in more detail later in this post.

Run queries with Query Result Reuse

In this section, we demonstrate how to run queries with the Query Result Reuse feature via the Athena API, the Athena console, and the JDBC and ODBC drivers.

Run queries using the Athena API

For applications that use the Athena API through the AWS Command Line Interface (AWS CLI) or the AWS SDKs, the StartQueryExecution API call now has the additional parameter ResultReuseConfiguration, where you can enable Query Result Reuse and specify the maximum age of results. For example, when using the AWS CLI, you can run a query with Query Result Reuse enabled as follows:

aws athena start-query-execution \
  --work-group "my_work_group" \
  --query-string "SELECT * FROM my_table LIMIT 10" \
  --result-reuse-configuration \
    "ResultReuseByAgeConfiguration={Enabled=true,MaxAgeInMinutes=60}"

The following code shows how to do this with the AWS SDK for Python:

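import boto3

# Create an Athena client; the AWS Region comes from your configuration.
client = boto3.client('athena')

# Start the query and accept previous results that are up to 60 minutes old.
response = client.start_query_execution(
    WorkGroup='my_work_group',
    QueryString='SELECT * FROM my_table LIMIT 10',
    ResultReuseConfiguration={
        'ResultReuseByAgeConfiguration': {
            'Enabled': True,
            'MaxAgeInMinutes': 60
        }
    }
)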

These examples assume that my_work_group uses Athena engine v3, that the workgroup has an output location configured, and that the AWS Region has been set in the AWS CLI configuration.

When a query result is reused, you can see in the statistics section of the response from the GetQueryExecution API call that no data was scanned and that results were reused:

{
    "QueryExecution": {
        …
        "Statistics": {
            "EngineExecutionTimeInMillis": 272,
            "DataScannedInBytes": 0,
            "TotalExecutionTimeInMillis": 445,
            "QueryQueueTimeInMillis": 143,
            "ServiceProcessingTimeInMillis": 30,
            "ResultReuseInformation": {
                "ReusedPreviousResult": true
            }
        }
    }
}
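To check this programmatically, you can inspect the same fields in the parsed response. The following is a minimal sketch: was_result_reused is a hypothetical helper name, and the sample dictionary is a trimmed-down copy of the statistics shown above rather than a complete GetQueryExecution response.

```python
def was_result_reused(response):
    """Return True if a GetQueryExecution response reports a reused result.

    Missing keys are treated as "not reused", so the helper also works
    for responses that carry no ResultReuseInformation at all.
    """
    statistics = response.get("QueryExecution", {}).get("Statistics", {})
    reuse_info = statistics.get("ResultReuseInformation", {})
    return bool(reuse_info.get("ReusedPreviousResult", False))


# Trimmed-down sample based on the response shown above.
sample_response = {
    "QueryExecution": {
        "Statistics": {
            "DataScannedInBytes": 0,
            "ResultReuseInformation": {"ReusedPreviousResult": True},
        }
    }
}

print(was_result_reused(sample_response))

# With a real client, the response would come from something like:
#   client.get_query_execution(QueryExecutionId=query_execution_id)
# where query_execution_id is the ID returned by start_query_execution.
```

Treating missing keys as "not reused" keeps the helper safe to call on any query, including ones where Athena ran a fresh scan.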

Run queries using the Athena console

When you run queries on the Athena console, Query Result Reuse is now enabled by default. You can enable and disable Query Result Reuse in the query editor. You can also choose the pen icon to change the maximum age of results. This setting applies to all queries run on the Athena console.

The following screenshot shows an example query run against AWS CloudTrail logs with Query Result Reuse enabled.

When we ran the query again, the results showed up instantly, and we could see the message “using reused query results” in the Query results pane as confirmation that the results of our first query had been reused. The Data scanned statistic also showed “-” to indicate that no data was scanned.

Run queries using the JDBC and ODBC drivers

If you use the JDBC or ODBC driver to query Athena, you can now add enableResultReuse=1 to your connection parameters to enable Query Result Reuse, and use ageforResultReuse=60 to set the maximum age to 60 minutes. The drivers automatically apply the setting to all queries running in the context of the connection.

For more information on how to connect to Athena via JDBC and ODBC, refer to Connecting to Amazon Athena with ODBC and JDBC drivers.

Limitations and considerations

Query Result Reuse is supported for most Athena queries, but there are some limitations. We want to make sure that reusing results doesn’t create surprising situations, or expose results that a user shouldn’t have access to. For that reason, Athena always runs a fresh query in the following situations:

  • Non-deterministic functions – Some functions and expressions produce different results from query to query, such as CURRENT_TIME and RAND(). Results for queries that use temporal and non-deterministic expressions and functions aren’t reusable because that could create surprising and inconsistent results.
  • Fine-grained access controls – Row-level and column-level permissions are configured in Lake Formation, and Athena can’t know if these have changed since a previous query result was created. Users using the same workgroup could have different permissions, and checking all permissions would undo much of the cost and performance savings you get from Query Result Reuse.
  • Federated queries, user-defined functions (UDFs), and external Hive metastores – Users using the same workgroup can have different permissions to invoke the AWS Lambda functions that these features rely on. Athena isn’t able to check that a user who wants to reuse a result has permission to invoke these Lambda functions without running the query, which would negate the cost and performance savings.

Athena detects these conditions automatically and runs the query as if Query Result Reuse wasn’t enabled. You won’t get errors, but you can determine that Query Result Reuse wasn’t in effect by inspecting the query status (see our earlier examples).

Query Result Reuse is available in Athena engine version 3 only.

Conclusion

Query Result Reuse is a new feature in Athena that aims to reduce cost and query response times for datasets that change less frequently than they are queried. For teams that often run the same query, or have dashboards that are used more often than the data changes, Query Result Reuse can result in lower costs and faster results. It’s easy to get started with Query Result Reuse via the Athena console, API, and JDBC/ODBC; all you have to do is set the maximum age of results, and run your queries as usual.

We hope that you’ll like this new feature, and that it will save cost and improve performance for you and your team!


About the authors

Theo Tolv is a Senior Big Data Architect in the Athena team. He’s worked with small and big data for most of his career and often hangs out on Stack Overflow answering questions about Athena.

Vijay Jain is a Senior Product Manager in the Amazon Web Services (AWS) Athena team. He’s passionate about building scalable analytics technologies and products, working closely with enterprise customers. Outside of work, Vijay likes running and spending time with his family.
