
DAS-C01 AWS Certified Data Analytics - Specialty Questions and Answers

Questions 4

An event ticketing website has a data lake on Amazon S3 and a data warehouse on Amazon Redshift. Two datasets exist: events data and sales data. Each dataset has millions of records.

The entire events dataset is frequently accessed and is stored in Amazon Redshift. However, only the last 6 months of sales data is frequently accessed and is stored in Amazon Redshift. The rest of the sales data is available only in Amazon S3.

A data analytics specialist must create a report that shows the total revenue that each event has generated in the last 12 months. The report will be accessed thousands of times each week.

Which solution will meet these requirements with the LEAST operational effort?

Options:

A.

Create an AWS Glue job to access sales data that is older than 6 months from Amazon S3 and to access event and sales data from Amazon Redshift. Load the results into a new table in Amazon Redshift.

B.

Create a stored procedure to copy sales data that is older than 6 months and newer than 12 months from Amazon S3 to Amazon Redshift. Create a materialized view with the autorefresh option.

C.

Create an AWS Lambda function to copy sales data that is older than 6 months and newer than 12 months to an Amazon Kinesis Data Firehose delivery stream. Specify Amazon Redshift as the destination of the delivery stream. Create a materialized view with the autorefresh option.

D.

Create a materialized view in Amazon Redshift with the autorefresh option. Use Amazon Redshift Spectrum to include sales data that is older than 6 months.

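For reference, a minimal sketch of the building blocks behind option D, run through the Redshift Data API. All identifiers below are placeholders, and note that Amazon Redshift restricts which materialized views are eligible for auto refresh.

```python
# A sketch of option D: expose the S3 sales history through Redshift Spectrum
# and precompute the heavily read report as a materialized view.
import boto3

boto3.client("redshift-data").batch_execute_statement(
    ClusterIdentifier="analytics-cluster",  # placeholder identifiers
    Database="dev",
    DbUser="admin",
    Sqls=[
        # Sales data older than 6 months stays in S3, queried via Spectrum.
        """CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
           FROM DATA CATALOG DATABASE 'sales_lake'
           IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'""",
        # The report is read thousands of times per week, so serve it from a
        # precomputed materialized view.
        """CREATE MATERIALIZED VIEW mv_event_revenue AUTO REFRESH YES AS
           SELECT e.event_id, SUM(s.amount) AS total_revenue
           FROM events e
           JOIN (SELECT event_id, amount FROM sales
                 UNION ALL
                 SELECT event_id, amount FROM spectrum.sales_history) s
             ON e.event_id = s.event_id
           GROUP BY e.event_id""",
    ],
)
```
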
Questions 5

A transportation company uses IoT sensors attached to trucks to collect vehicle data for its global delivery fleet. The company currently sends the sensor data in small .csv files to Amazon S3. The files are then loaded into a 10-node Amazon Redshift cluster with two slices per node and queried using both Amazon Athena and Amazon Redshift. The company wants to optimize the files to reduce the cost of querying and also improve the speed of data loading into the Amazon Redshift cluster.

Which solution meets these requirements?

Options:

A.

Use AWS Glue to convert all the files from .csv to a single large Apache Parquet file. COPY the file into Amazon Redshift and query the file with Athena from Amazon S3.

B.

Use Amazon EMR to convert each .csv file to Apache Avro. COPY the files into Amazon Redshift and query the file with Athena from Amazon S3.

C.

Use AWS Glue to convert the files from .csv to a single large Apache ORC file. COPY the file into Amazon Redshift and query the file with Athena from Amazon S3.

D.

Use AWS Glue to convert the files from .csv to Apache Parquet to create 20 Parquet files. COPY the files into Amazon Redshift and query the files with Athena from Amazon S3.

Questions 6

A company analyzes historical data and needs to query data that is stored in Amazon S3. New data is generated daily as .csv files that are stored in Amazon S3. The company’s analysts are using Amazon Athena to perform SQL queries against a recent subset of the overall data. The amount of data that is ingested into Amazon S3 has increased substantially over time, and the query latency also has increased.

Which solutions could the company implement to improve query performance? (Choose two.)

Options:

A.

Use MySQL Workbench on an Amazon EC2 instance, and connect to Athena by using a JDBC or ODBC connector. Run the query from MySQL Workbench instead of Athena directly.

B.

Use Athena to extract the data and store it in Apache Parquet format on a daily basis. Query the extracted data.

C.

Run a daily AWS Glue ETL job to convert the data files to Apache Parquet and to partition the converted files. Create a periodic AWS Glue crawler to automatically crawl the partitioned data on a daily basis.

D.

Run a daily AWS Glue ETL job to compress the data files by using the .gzip format. Query the compressed data.

E.

Run a daily AWS Glue ETL job to compress the data files by using the .lzo format. Query the compressed data.

Questions 7

Three teams of data analysts use Apache Hive on an Amazon EMR cluster with the EMR File System (EMRFS) to query data stored within each team's Amazon S3 bucket. The EMR cluster has Kerberos enabled and is configured to authenticate users from the corporate Active Directory. The data is highly sensitive, so access must be limited to the members of each team.

Which steps will satisfy the security requirements?

Options:

A.

For the EMR cluster Amazon EC2 instances, create a service role that grants no access to Amazon S3. Create three additional IAM roles, each granting access to each team’s specific bucket. Add the additional IAM roles to the cluster’s EMR role for the EC2 trust policy. Create a security configuration mapping for the additional IAM roles to Active Directory user groups for each team.

B.

For the EMR cluster Amazon EC2 instances, create a service role that grants no access to Amazon S3. Create three additional IAM roles, each granting access to each team's specific bucket. Add the service role for the EMR cluster EC2 instances to the trust policies for the additional IAM roles. Create a security configuration mapping for the additional IAM roles to Active Directory user groups for each team.

C.

For the EMR cluster Amazon EC2 instances, create a service role that grants full access to Amazon S3. Create three additional IAM roles, each granting access to each team’s specific bucket. Add the service role for the EMR cluster EC2 instances to the trust policies for the additional IAM roles. Create a security configuration mapping for the additional IAM roles to Active Directory user groups for each team.

D.

For the EMR cluster Amazon EC2 instances, create a service role that grants full access to Amazon S3. Create three additional IAM roles, each granting access to each team's specific bucket. Add the service role for the EMR cluster EC2 instances to the trust policies for the base IAM roles. Create a security configuration mapping for the additional IAM roles to Active Directory user groups for each team.

Questions 8

A company wants to ingest clickstream data from its website into an Amazon S3 bucket. The streaming data is in JSON format. The data in the S3 bucket must be partitioned by product_id.

Which solution will meet these requirements MOST cost-effectively?

Options:

A.

Create an Amazon Kinesis Data Firehose delivery stream to ingest the streaming data into the S3 bucket. Enable dynamic partitioning. Specify the data field of product_id as one partitioning key.

B.

Create an AWS Glue streaming job to partition the data by product_id before delivering the data to the S3 bucket. Create an Amazon Kinesis Data Firehose delivery stream. Specify the AWS Glue job as the destination of the delivery stream.

C.

Create an Amazon Kinesis Data Firehose delivery stream to ingest the streaming data into the S3 bucket. Create an AWS Glue ETL job to read the data stream in the S3 bucket, partition the data by product_id, and write the data into another S3 bucket.

D.

Create an Amazon Kinesis Data Firehose delivery stream to ingest the streaming data into the S3 bucket. Create an Amazon EMR cluster that includes a job to read the data stream in the S3 bucket, partition the data by product_id, and write the data into another S3 bucket.

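A rough sketch of how option A's dynamic partitioning might be configured with boto3 follows. The stream, bucket, and role names are assumptions, and dynamic partitioning requires a buffer size of at least 64 MB.

```python
# A sketch of option A: a Firehose delivery stream that extracts product_id
# from each JSON record and writes it to a matching S3 prefix.
import boto3

boto3.client("firehose").create_delivery_stream(
    DeliveryStreamName="clickstream",          # placeholder names throughout
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "BucketARN": "arn:aws:s3:::clickstream-bucket",
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "DynamicPartitioningConfiguration": {"Enabled": True},
        # Records land under product_id=<value>/ prefixes.
        "Prefix": "product_id=!{partitionKeyFromQuery:product_id}/",
        "ErrorOutputPrefix": "errors/",
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "MetadataExtraction",
                "Parameters": [
                    {"ParameterName": "MetadataExtractionQuery",
                     "ParameterValue": "{product_id: .product_id}"},
                    {"ParameterName": "JsonParsingEngine",
                     "ParameterValue": "JQ-1.6"},
                ],
            }],
        },
    },
)
```
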
Questions 9

A company uses Amazon Redshift as its data warehouse. The Redshift cluster is not encrypted. A data analytics specialist needs to use hardware security module (HSM) managed encryption keys to encrypt the data that is stored in the Redshift cluster.

Which combination of steps will meet these requirements? (Select THREE.)

Options:

A.

Stop all write operations on the source cluster. Unload data from the source cluster.

B.

Copy the data to a new target cluster that is encrypted with AWS Key Management Service (AWS KMS).

C.

Modify the source cluster by activating AWS CloudHSM encryption. Configure Amazon Redshift to automatically migrate data to a new encrypted cluster.

D.

Modify the source cluster by activating encryption from an external HSM. Configure Amazon Redshift to automatically migrate data to a new encrypted cluster.

E.

Copy the data to a new target cluster that is encrypted with an HSM from AWS CloudHSM.

F.

Rename the source cluster and the target cluster after the migration so that the target cluster is using the original endpoint.

Questions 10

A data analyst is designing a solution to interactively query datasets with SQL using a JDBC connection. Users will join data stored in Amazon S3 in Apache ORC format with data stored in Amazon Elasticsearch Service (Amazon ES) and Amazon Aurora MySQL.

Which solution will provide the MOST up-to-date results?

Options:

A.

Use AWS Glue jobs to ETL data from Amazon ES and Aurora MySQL to Amazon S3. Query the data with Amazon Athena.

B.

Use Amazon DMS to stream data from Amazon ES and Aurora MySQL to Amazon Redshift. Query the data with Amazon Redshift.

C.

Query all the datasets in place with Apache Spark SQL running on an AWS Glue developer endpoint.

D.

Query all the datasets in place with Apache Presto running on Amazon EMR.

Questions 11

A data analyst notices the following error message while loading data to an Amazon Redshift cluster:

"The bucket you are attempting to access must be addressed using the specified endpoint."

What should the data analyst do to resolve this issue?

Options:

A.

Specify the correct AWS Region for the Amazon S3 bucket by using the REGION option with the COPY command.

B.

Change the Amazon S3 object's ACL to grant the S3 bucket owner full control of the object.

C.

Launch the Redshift cluster in a VPC.

D.

Configure the timeout settings according to the operating system used to connect to the Redshift cluster.

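A minimal sketch of option A, with placeholder cluster, table, and role names; the REGION clause tells COPY that the source bucket lives in a Region other than the cluster's.

```python
# A sketch of option A: point COPY at the bucket's actual Region so the
# cross-Region load resolves the correct S3 endpoint.
import boto3

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster",  # placeholder identifiers
    Database="dev",
    DbUser="admin",
    Sql="""
        COPY sales
        FROM 's3://cross-region-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV
        REGION 'us-west-2';
    """,
)
```
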
Questions 12

A company is hosting an enterprise reporting solution with Amazon Redshift. The application provides reporting capabilities to three main groups: an executive group to access financial reports, a data analyst group to run long-running ad-hoc queries, and a data engineering group to run stored procedures and ETL processes. The executive team requires queries to run with optimal performance. The data engineering team expects queries to take minutes.

Which Amazon Redshift feature meets the requirements for this task?

Options:

A.

Concurrency scaling

B.

Short query acceleration (SQA)

C.

Workload management (WLM)

D.

Materialized views

Questions 13

A financial company uses Apache Hive on Amazon EMR for ad-hoc queries. Users are complaining of sluggish performance.

A data analyst notes the following:

  • Approximately 90% of queries are submitted 1 hour after the market opens.
  • Hadoop Distributed File System (HDFS) utilization never exceeds 10%.

Which solution would help address the performance issues?

Options:

A.

Create instance fleet configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch CapacityRemainingGB metric. Create an automatic scaling policy to scale in the instance fleet based on the CloudWatch CapacityRemainingGB metric.

B.

Create instance fleet configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch YARNMemoryAvailablePercentage metric. Create an automatic scaling policy to scale in the instance fleet based on the CloudWatch YARNMemoryAvailablePercentage metric.

C.

Create instance group configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch CapacityRemainingGB metric. Create an automatic scaling policy to scale in the instance groups based on the CloudWatch CapacityRemainingGB metric.

D.

Create instance group configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch YARNMemoryAvailablePercentage metric. Create an automatic scaling policy to scale in the instance groups based on the CloudWatch YARNMemoryAvailablePercentage metric.

Questions 14

A retail company has 15 stores across 6 cities in the United States. Once a month, the sales team requests a visualization in Amazon QuickSight that provides the ability to easily identify revenue trends across cities and stores. The visualization also helps identify outliers that need to be examined with further analysis.

Which visual type in QuickSight meets the sales team's requirements?

Options:

A.

Geospatial chart

B.

Line chart

C.

Heat map

D.

Tree map

Questions 15

An airline has been collecting metrics on flight activities for analytics. A recently completed proof of concept demonstrates how the company provides insights to data analysts to improve on-time departures. The proof of concept used objects in Amazon S3, which contained the metrics in .csv format, and used Amazon Athena for querying the data. As the amount of data increases, the data analyst wants to optimize the storage solution to improve query performance.

Which options should the data analyst use to improve performance as the data lake grows? (Choose three.)

Options:

A.

Add a randomized string to the beginning of the keys in S3 to get more throughput across partitions.

B.

Use an S3 bucket in the same account as Athena.

C.

Compress the objects to reduce the data transfer I/O.

D.

Use an S3 bucket in the same Region as Athena.

E.

Preprocess the .csv data to JSON to reduce I/O by fetching only the document keys needed by the query.

F.

Preprocess the .csv data to Apache Parquet to reduce I/O by fetching only the data blocks needed for predicates.
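
As a reference for option F, here is a minimal sketch that converts a .csv table to Parquet with an Athena CTAS statement; the database, table, and bucket names are hypothetical.

```python
# A sketch of option F: one-time conversion of the .csv table to Parquet so
# Athena fetches only the column blocks each query needs.
import boto3

boto3.client("athena").start_query_execution(
    QueryString="""
        CREATE TABLE flight_metrics_parquet
        WITH (format = 'PARQUET',
              external_location = 's3://airline-data-lake/parquet/')
        AS SELECT * FROM flight_metrics_csv;
    """,
    QueryExecutionContext={"Database": "analytics"},   # placeholder names
    ResultConfiguration={"OutputLocation": "s3://airline-data-lake/athena-results/"},
)
```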

Questions 16

A data analytics specialist has a 50 GB data file in .csv format and wants to perform a data transformation task. The data analytics specialist is using the Amazon Athena CREATE TABLE AS SELECT (CTAS) statement to perform the transformation. The resulting output will be used to query the data from Amazon Redshift Spectrum.

Which CTAS statement should the data analytics specialist use to provide the MOST efficient performance?

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Questions 17

A marketing company has an application that stores event data in an Amazon RDS database. The company is replicating this data to Amazon Redshift for reporting and business intelligence (BI) purposes. New event data is continuously generated and ingested into the RDS database throughout the day and captured by a change data capture (CDC) replication task in AWS Database Migration Service (AWS DMS). The company requires that the new data be replicated to Amazon Redshift in near-real time.

Which solution meets these requirements?

Options:

A.

Use Amazon Kinesis Data Streams as the destination of the CDC replication task in AWS DMS. Use an AWS Glue streaming job to read changed records from Kinesis Data Streams and perform an upsert into the Redshift cluster.

B.

Use Amazon S3 as the destination of the CDC replication task in AWS DMS. Use the COPY command to load data into the Redshift cluster.

C.

Use Amazon DynamoDB as the destination of the CDC replication task in AWS DMS. Use the COPY command to load data into the Redshift cluster.

D.

Use Amazon Kinesis Data Firehose as the destination of the CDC replication task in AWS DMS. Use an AWS Glue streaming job to read changed records from Kinesis Data Firehose and perform an upsert into the Redshift cluster.

Questions 18

A company's data science team is designing a shared dataset repository on a Windows server. The data repository will store a large amount of training data that the data science team commonly uses in its machine learning models. The data scientists create a random number of new datasets each day.

The company needs a solution that provides persistent, scalable file storage and high levels of throughput and IOPS. The solution also must be highly available and must integrate with Active Directory for access control.

Which solution will meet these requirements with the LEAST development effort?

Options:

A.

Store datasets as files in an Amazon EMR cluster. Set the Active Directory domain for authentication.

B.

Store datasets as files in Amazon FSx for Windows File Server. Set the Active Directory domain for authentication.

C.

Store datasets as tables in a multi-node Amazon Redshift cluster. Set the Active Directory domain for authentication.

D.

Store datasets as global tables in Amazon DynamoDB. Build an application to integrate authentication with the Active Directory domain.

Questions 19

An ecommerce company stores customer purchase data in Amazon RDS. The company wants a solution to store and analyze historical data. The most recent 6 months of data will be queried frequently for analytics workloads. This data is several terabytes large. Once a month, historical data for the last 5 years must be accessible and will be joined with the more recent data. The company wants to optimize performance and cost.

Which storage solution will meet these requirements?

Options:

A.

Create a read replica of the RDS database to store the most recent 6 months of data. Copy the historical data into Amazon S3. Create an AWS Glue Data Catalog of the data in Amazon S3 and Amazon RDS. Run historical queries using Amazon Athena.

B.

Use an ETL tool to incrementally load the most recent 6 months of data into an Amazon Redshift cluster. Run more frequent queries against this cluster. Create a read replica of the RDS database to run queries on the historical data.

C.

Incrementally copy data from Amazon RDS to Amazon S3. Create an AWS Glue Data Catalog of the data in Amazon S3. Use Amazon Athena to query the data.

D.

Incrementally copy data from Amazon RDS to Amazon S3. Load and store the most recent 6 months of data in Amazon Redshift. Configure an Amazon Redshift Spectrum table to connect to all historical data.

Questions 20

A central government organization is collecting events from various internal applications using Amazon Managed Streaming for Apache Kafka (Amazon MSK). The organization has configured a separate Kafka topic for each application to separate the data. For security reasons, the Kafka cluster has been configured to only allow TLS encrypted data and it encrypts the data at rest.

A recent application update showed that one of the applications was configured incorrectly, resulting in writing data to a Kafka topic that belongs to another application. This resulted in multiple errors in the analytics pipeline as data from different applications appeared on the same topic. After this incident, the organization wants to prevent applications from writing to a topic different than the one they should write to.

Which solution meets these requirements with the least amount of effort?

Options:

A.

Create a different Amazon EC2 security group for each application. Configure each security group to have access to a specific topic in the Amazon MSK cluster. Attach the security group to each application based on the topic that the applications should read and write to.

B.

Install Kafka Connect on each application instance and configure each Kafka Connect instance to write to a specific topic only.

C.

Use Kafka ACLs and configure read and write permissions for each topic. Use the distinguished name of the clients' TLS certificates as the principal of the ACL.

D.

Create a different Amazon EC2 security group for each application. Create an Amazon MSK cluster and Kafka topic for each application. Configure each security group to have access to the specific cluster.

Questions 21

A media analytics company consumes a stream of social media posts. The posts are sent to an Amazon Kinesis data stream partitioned on user_id. An AWS Lambda function retrieves the records and validates the content before loading the posts into an Amazon Elasticsearch cluster. The validation process needs to receive the posts for a given user in the order they were received. A data analyst has noticed that, during peak hours, the social media platform posts take more than an hour to appear in the Elasticsearch cluster.

What should the data analyst do to reduce this latency?

Options:

A.

Migrate the validation process to Amazon Kinesis Data Firehose.

B.

Migrate the Lambda consumers from standard data stream iterators to an HTTP/2 stream consumer.

C.

Increase the number of shards in the stream.

D.

Configure multiple Lambda functions to process the stream.

Questions 22

A company is using an AWS Lambda function to run Amazon Athena queries against a cross-account AWS Glue Data Catalog. A query returns the following error:

HIVE METASTORE ERROR

The error message states that the response payload size exceeds the maximum allowed payload size. The queried table is already partitioned, and the data is stored in an Amazon S3 bucket in the Apache Hive partition format.

Which solution will resolve this error?

Options:

A.

Modify the Lambda function to upload the query response payload as an object into the S3 bucket. Include an S3 object presigned URL as the payload in the Lambda function response.

B.

Run the MSCK REPAIR TABLE command on the queried table.

C.

Create a separate folder in the S3 bucket. Move the data files that need to be queried into that folder. Create an AWS Glue crawler that points to the folder instead of the S3 bucket.

D.

Check the schema of the queried table for any characters that Athena does not support. Replace any unsupported characters with characters that Athena supports.

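A minimal sketch of option A follows, with hypothetical bucket, key, and event names: the Lambda function stores the oversized result in S3 and returns a presigned URL instead of the raw payload.

```python
# A sketch of option A: persist the large query result to S3 and hand back
# only a presigned URL, keeping the Lambda response payload tiny.
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    payload = event["queryResult"]  # assume the large result arrives here
    bucket, key = "query-results-bucket", "results/response.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(payload))
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=3600,  # link stays valid for one hour
    )
    return {"resultUrl": url}
```
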
Questions 23

A company using Amazon QuickSight Enterprise edition has thousands of dashboards, analyses, and datasets. The company struggles to manage and assign permissions for granting users access to various items within QuickSight. The company wants to make it easier to implement sharing and permissions management.

Which solution should the company implement to simplify permissions management?

Options:

A.

Use QuickSight folders to organize dashboards, analyses, and datasets. Assign individual users permissions to these folders.

B.

Use QuickSight folders to organize dashboards, analyses, and datasets. Assign group permissions by using these folders.

C.

Use AWS IAM resource-based policies to assign group permissions to QuickSight items.

D.

Use QuickSight user management APIs to provision group permissions based on dashboard naming conventions.

Questions 24

A company uses Amazon Redshift as its data warehouse. A new table includes some columns that contain sensitive data and some columns that contain non-sensitive data. The data in the table eventually will be referenced by several existing queries that run many times each day.

A data analytics specialist must ensure that only members of the company's auditing team can read the columns that contain sensitive data. All other users must have read-only access to the columns that contain non-sensitive data.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.

Grant the auditing team permission to read from the table. Load the columns that contain non-sensitive data into a second table. Grant the appropriate users read-only permissions to the second table.

B.

Grant all users read-only permissions to the columns that contain non-sensitive data. Use the GRANT SELECT command to allow the auditing team to access the columns that contain sensitive data.

C.

Grant all users read-only permissions to the columns that contain non-sensitive data. Attach an IAM policy to the auditing team with an explicit Allow action that grants access to the columns that contain sensitive data.

D.

Grant the auditing team permission to read from the table. Create a view of the table that includes the columns that contain non-sensitive data. Grant the appropriate users read-only permissions to that view.

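For reference, a minimal sketch of option B using Amazon Redshift column-level grants; the table, column, and group names are placeholders.

```python
# A sketch of option B: everyone reads the non-sensitive columns, while the
# auditing group can read the whole table.
import boto3

boto3.client("redshift-data").batch_execute_statement(
    ClusterIdentifier="analytics-cluster",  # placeholder identifiers
    Database="dev",
    DbUser="admin",
    Sqls=[
        # Column-level grant covering only the non-sensitive columns.
        "GRANT SELECT (order_id, order_date, region) ON customer_orders TO PUBLIC;",
        # Full-table grant for the auditing team.
        "GRANT SELECT ON customer_orders TO GROUP auditing_team;",
    ],
)
```
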
Questions 25

A machinery company wants to collect data from sensors. A data analytics specialist needs to implement a solution that aggregates the data in near-real time and saves the data to a persistent data store. The data must be stored in nested JSON format and must be queried from the data store with a latency of single-digit milliseconds.

Which solution will meet these requirements?

Options:

A.

Use Amazon Kinesis Data Streams to receive the data from the sensors. Use Amazon Kinesis Data Analytics to read the stream, aggregate the data, and send the data to an AWS Lambda function. Configure the Lambda function to store the data in Amazon DynamoDB.

B.

Use Amazon Kinesis Data Firehose to receive the data from the sensors. Use Amazon Kinesis Data Analytics to aggregate the data. Use an AWS Lambda function to read the data from Kinesis Data Analytics and store the data in Amazon S3.

C.

Use Amazon Kinesis Data Firehose to receive the data from the sensors. Use an AWS Lambda function to aggregate the data during capture. Store the data from Kinesis Data Firehose in Amazon DynamoDB.

D.

Use Amazon Kinesis Data Firehose to receive the data from the sensors. Use an AWS Lambda function to aggregate the data during capture. Store the data in Amazon S3.

Questions 26

An operations team notices that a few AWS Glue jobs for a given ETL application are failing. The AWS Glue jobs read a large number of small JSON files from an Amazon S3 bucket and write the data to a different S3 bucket in Apache Parquet format with no major transformations. Upon initial investigation, a data engineer notices the following error message in the History tab on the AWS Glue console: “Command Failed with Exit Code 1.”

Upon further investigation, the data engineer notices that the driver memory profile of the failed jobs crosses the safe threshold of 50% usage quickly and reaches 90–95% soon after. The average memory usage across all executors continues to be less than 4%.

The data engineer also notices the following error while examining the related Amazon CloudWatch Logs.

What should the data engineer do to solve the failure in the MOST cost-effective way?

Options:

A.

Change the worker type from Standard to G.2X.

B.

Modify the AWS Glue ETL code to use the ‘groupFiles’: ‘inPartition’ feature.

C.

Increase the fetch size setting by using AWS Glue dynamics frame.

D.

Modify maximum capacity to increase the total maximum data processing units (DPUs) used.

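A rough sketch of option B in an AWS Glue PySpark job follows; the S3 paths and group size are assumptions.

```python
# A sketch of option B: group many small JSON files into larger in-memory
# chunks so the Spark driver does not track millions of individual splits.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://source-bucket/json/"],  # placeholder path
        "groupFiles": "inPartition",
        "groupSize": "134217728",               # target ~128 MB per group
    },
    format="json",
)

glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://target-bucket/parquet/"},
    format="parquet",
)
```
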
Questions 27

A manufacturing company uses Amazon S3 to store its data. The company wants to use AWS Lake Formation to provide granular-level security on those data assets. The data is in Apache Parquet format. The company has set a deadline for a consultant to build a data lake.

How should the consultant create the MOST cost-effective solution that meets these requirements?

Options:

A.

Run Lake Formation blueprints to move the data to Lake Formation. Once Lake Formation has the data, apply permissions on Lake Formation.

B.

To create the data catalog, run an AWS Glue crawler on the existing Parquet data. Register the Amazon S3 path and then apply permissions through Lake Formation to provide granular-level security.

C.

Install Apache Ranger on an Amazon EC2 instance and integrate with Amazon EMR. Using Ranger policies, create role-based access control for the existing data assets in Amazon S3.

D.

Create multiple IAM roles for different users and groups. Assign IAM roles to different data assets in Amazon S3 to create table-based and column-based access controls.

Questions 28

A financial services firm is processing a stream of real-time data from an application by using Apache Kafka and Kafka MirrorMaker. These tools run on premises and stream data to Amazon Managed Streaming for Apache Kafka (Amazon MSK) in the us-east-1 Region. An Apache Flink consumer running on Amazon EMR enriches the data in real time and transfers the output files to an Amazon S3 bucket. The company wants to ensure that the streaming application is highly available across AWS Regions with an RTO of less than 2 minutes.

Which solution meets these requirements?

Options:

A.

Launch another Amazon MSK and Apache Flink cluster in the us-west-1 Region that is the same size as the original cluster in the us-east-1 Region. Simultaneously publish and process the data in both Regions. In the event of a disaster that impacts one of the Regions, switch to the other Region.

B.

Set up Cross-Region Replication from the Amazon S3 bucket in the us-east-1 Region to the us-west-1 Region. In the event of a disaster, immediately create Amazon MSK and Apache Flink clusters in the us-west-1 Region and start publishing data to this Region.

C.

Add an AWS Lambda function in the us-east-1 Region to read from Amazon MSK and write to a global Amazon DynamoDB table in on-demand capacity mode. Export the data from DynamoDB to Amazon S3 in the us-west-1 Region. In the event of a disaster that impacts the us-east-1 Region, immediately create Amazon MSK and Apache Flink clusters in the us-west-1 Region and start publishing data to this Region.

D.

Set up Cross-Region Replication from the Amazon S3 bucket in the us-east-1 Region to the us-west-1 Region. In the event of a disaster, immediately create Amazon MSK and Apache Flink clusters in the us-west-1 Region and start publishing data to this Region. Store 7 days of data in on-premises Kafka clusters and recover the data missed during the recovery time from the on-premises cluster.

Questions 29

A company has collected more than 100 TB of log files in the last 24 months. The files are stored as raw text in a dedicated Amazon S3 bucket. Each object has a key of the form year-month-day_log_HHmmss.txt where HHmmss represents the time the log file was initially created. A table was created in Amazon Athena that points to the S3 bucket. One-time queries are run against a subset of columns in the table several times an hour.

A data analyst must make changes to reduce the cost of running these queries. Management wants a solution with minimal maintenance overhead.

Which combination of steps should the data analyst take to meet these requirements? (Choose three.)

Options:

A.

Convert the log files to Apache Avro format.

B.

Add a key prefix of the form date=year-month-day/ to the S3 objects to partition the data.

C.

Convert the log files to Apache Parquet format.

D.

Add a key prefix of the form year-month-day/ to the S3 objects to partition the data.

E.

Drop and recreate the table with the PARTITIONED BY clause. Run the ALTER TABLE ADD PARTITION statement.

F.

Drop and recreate the table with the PARTITIONED BY clause. Run the MSCK REPAIR TABLE statement.

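As a reference for options B, C, and F, here is a minimal sketch (with hypothetical table and bucket names) of a partitioned Parquet table whose date=... prefixes MSCK REPAIR TABLE can discover automatically.

```python
# A sketch of the partitioned-table approach: define the table over the
# date=... prefixes, then let MSCK REPAIR TABLE register the partitions.
import boto3

athena = boto3.client("athena")
cfg = {"OutputLocation": "s3://log-bucket/athena-results/"}  # placeholder

athena.start_query_execution(
    QueryString="""
        CREATE EXTERNAL TABLE logs (
            message string,
            level string
        )
        PARTITIONED BY (`date` string)
        STORED AS PARQUET
        LOCATION 's3://log-bucket/logs/';
    """,
    ResultConfiguration=cfg,
)
# Scans the table location and adds every date=... prefix as a partition.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE logs;",
    ResultConfiguration=cfg,
)
```
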
Questions 30

A streaming application is reading data from Amazon Kinesis Data Streams and immediately writing the data to an Amazon S3 bucket every 10 seconds. The application is reading data from hundreds of shards. The batch interval cannot be changed due to a separate requirement. The data is being accessed by Amazon Athena. Users are seeing degradation in query performance as time progresses.

Which action can help improve query performance?

Options:

A.

Merge the files in Amazon S3 to form larger files.

B.

Increase the number of shards in Kinesis Data Streams.

C.

Add more memory and CPU capacity to the streaming application.

D.

Write the files to multiple S3 buckets.

Questions 31

A marketing company is storing its campaign response data in Amazon S3. A consistent set of sources has generated the data for each campaign. The data is saved into Amazon S3 as .csv files. A business analyst will use Amazon Athena to analyze each campaign’s data. The company needs the cost of ongoing data analysis with Athena to be minimized.

Which combination of actions should a data analytics specialist take to meet these requirements? (Choose two.)

Options:

A.

Convert the .csv files to Apache Parquet.

B.

Convert the .csv files to Apache Avro.

C.

Partition the data by campaign.

D.

Partition the data by source.

E.

Compress the .csv files.

Questions 32

A company has a process that writes two datasets in CSV format to an Amazon S3 bucket every 6 hours. The company needs to join the datasets, convert the data to Apache Parquet, and store the data within another bucket for users to query using Amazon Athena. The data also needs to be loaded to Amazon Redshift for advanced analytics. The company needs a solution that is resilient to the failure of any individual job component and can be restarted in case of an error.

Which solution meets these requirements with the LEAST amount of operational overhead?

Options:

A.

Use AWS Step Functions to orchestrate an Amazon EMR cluster running Apache Spark. Use PySpark to generate data frames of the datasets in Amazon S3, transform the data, join the data, write the data back to Amazon S3, and load the data to Amazon Redshift.

B.

Create an AWS Glue job using Python Shell that generates dynamic frames of the datasets in Amazon S3, transforms the data, joins the data, writes the data back to Amazon S3, and loads the data to Amazon Redshift. Use an AWS Glue workflow to orchestrate the AWS Glue job at the desired frequency.

C.

Use AWS Step Functions to orchestrate the AWS Glue job. Create an AWS Glue job using Python Shell that creates dynamic frames of the datasets in Amazon S3, transforms the data, joins the data, writes the data back to Amazon S3, and loads the data to Amazon Redshift.

D.

Create an AWS Glue job using PySpark that creates dynamic frames of the datasets in Amazon S3, transforms the data, joins the data, writes the data back to Amazon S3, and loads the data to Amazon Redshift. Use an AWS Glue workflow to orchestrate the AWS Glue job.

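For reference, a minimal sketch of the PySpark AWS Glue job in option D; the dataset paths, join key, and Glue connection name are assumptions.

```python
# A sketch of option D: join the two CSV datasets, write Parquet for Athena,
# and load the joined data into Amazon Redshift.
from awsglue.context import GlueContext
from awsglue.transforms import Join
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

orders = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://input-bucket/orders/"]},  # placeholders
    format="csv",
    format_options={"withHeader": True},
)
customers = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://input-bucket/customers/"]},
    format="csv",
    format_options={"withHeader": True},
)

joined = Join.apply(orders, customers, "customer_id", "customer_id")

# Parquet copy for Athena users.
glue_context.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://curated-bucket/joined/"},
    format="parquet",
)

# Load into Redshift through a Glue connection (placeholder names).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=joined,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "analytics.joined", "database": "dev"},
    redshift_tmp_dir="s3://curated-bucket/tmp/",
)
```
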
Questions 33

A hospital uses wearable medical sensor devices to collect data from patients. The hospital is architecting a near-real-time solution that can ingest the data securely at scale. The solution should also be able to remove the patient’s protected health information (PHI) from the streaming data and store the data in durable storage.

Which solution meets these requirements with the least operational overhead?

Options:

A.

Ingest the data using Amazon Kinesis Data Streams, which invokes an AWS Lambda function using Kinesis Client Library (KCL) to remove all PHI. Write the data in Amazon S3.

B.

Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Have Amazon S3 trigger an AWS Lambda function that parses the sensor data to remove all PHI in Amazon S3.

C.

Ingest the data using Amazon Kinesis Data Streams to write the data to Amazon S3. Have the data stream launch an AWS Lambda function that parses the sensor data and removes all PHI in Amazon S3.

D.

Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Implement a transformation AWS Lambda function that parses the sensor data to remove all PHI.

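A minimal sketch of the transformation Lambda function in option D, following the standard Kinesis Data Firehose record-transformation contract; the PHI field names are assumptions.

```python
# A sketch of option D: a Firehose transformation Lambda that strips PHI
# fields from each record before Firehose writes the data to Amazon S3.
import base64
import json

PHI_FIELDS = {"name", "address", "ssn"}  # assumed field names

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        data = json.loads(base64.b64decode(record["data"]))
        clean = {k: v for k, v in data.items() if k not in PHI_FIELDS}
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # Firehose keeps records marked Ok
            "data": base64.b64encode(json.dumps(clean).encode()).decode(),
        })
    return {"records": output}
```
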
Questions 34

A global pharmaceutical company receives test results for new drugs from various testing facilities worldwide. The results are sent in millions of 1 KB-sized JSON objects to an Amazon S3 bucket owned by the company. The data engineering team needs to process those files, convert them into Apache Parquet format, and load them into Amazon Redshift for data analysts to perform dashboard reporting. The engineering team uses AWS Glue to process the objects, AWS Step Functions for process orchestration, and Amazon CloudWatch for job scheduling.

More testing facilities were recently added, and the time to process files is increasing.

What will MOST efficiently decrease the data processing time?

Options:

A.

Use AWS Lambda to group the small files into larger files. Write the files back to Amazon S3. Process the files using AWS Glue and load them into Amazon Redshift tables.

B.

Use the AWS Glue dynamic frame file grouping option while ingesting the raw input files. Process the files and load them into Amazon Redshift tables.

C.

Use the Amazon Redshift COPY command to move the files from Amazon S3 into Amazon Redshift tables directly. Process the files in Amazon Redshift.

D.

Use Amazon EMR instead of AWS Glue to group the small input files. Process the files in Amazon EMR and load them into Amazon Redshift tables.

Questions 35

A company is migrating from an on-premises Apache Hadoop cluster to an Amazon EMR cluster. The cluster runs only during business hours. Due to a company requirement to avoid intraday cluster failures, the EMR cluster must be highly available. When the cluster is terminated at the end of each business day, the data must persist.

Which configurations would enable the EMR cluster to meet these requirements? (Choose three.)

Options:

A.

EMR File System (EMRFS) for storage

B.

Hadoop Distributed File System (HDFS) for storage

C.

AWS Glue Data Catalog as the metastore for Apache Hive

D.

MySQL database on the master node as the metastore for Apache Hive

E.

Multiple master nodes in a single Availability Zone

F.

Multiple master nodes in multiple Availability Zones

Questions 36

A technology company is creating a dashboard that will visualize and analyze time-sensitive data. The data will come in through Amazon Kinesis Data Firehose with the buffer interval set to 60 seconds. The dashboard must support near-real-time data.

Which visualization solution will meet these requirements?

Options:

A.

Select Amazon Elasticsearch Service (Amazon ES) as the endpoint for Kinesis Data Firehose. Set up a Kibana dashboard using the data in Amazon ES with the desired analyses and visualizations.

B.

Select Amazon S3 as the endpoint for Kinesis Data Firehose. Read data into an Amazon SageMaker Jupyter notebook and carry out the desired analyses and visualizations.

C.

Select Amazon Redshift as the endpoint for Kinesis Data Firehose. Connect Amazon QuickSight with SPICE to Amazon Redshift to create the desired analyses and visualizations.

D.

Select Amazon S3 as the endpoint for Kinesis Data Firehose. Use AWS Glue to catalog the data and Amazon Athena to query it. Connect Amazon QuickSight with SPICE to Athena to create the desired analyses and visualizations.

Questions 37

A large marketing company needs to store all of its streaming logs and create near-real-time dashboards. The dashboards will be used to help the company make critical business decisions and must be highly available.

Which solution meets these requirements?

Options:

A.

Store the streaming logs in Amazon S3 with replication to an S3 bucket in a different Availability Zone. Create the dashboards by using Amazon QuickSight.

B.

Deploy an Amazon Redshift cluster with at least three nodes in a VPC that spans two Availability Zones. Store the streaming logs and use the Redshift cluster as a source to create the dashboards by using Amazon QuickSight.

C.

Store the streaming logs in Amazon S3 with replication to an S3 bucket in a different Availability Zone. Every time a new log is added in the bucket, invoke an AWS Lambda function to update the dashboards in Amazon QuickSight.

D.

Store the streaming logs in Amazon OpenSearch Service deployed across three Availability Zones and with three dedicated master nodes. Create the dashboards by using OpenSearch Dashboards.

Questions 38

A financial company uses Amazon S3 as its data lake and has set up a data warehouse using a multi-node Amazon Redshift cluster. The data files in the data lake are organized in folders based on the data source of each data file. All the data files are loaded to one table in the Amazon Redshift cluster using a separate COPY command for each data file location. With this approach, loading all the data files into Amazon Redshift takes a long time to complete. Users want a faster solution with little or no increase in cost while maintaining the segregation of the data files in the S3 data lake.

Which solution meets these requirements?

Options:

A.

Use Amazon EMR to copy all the data files into one folder and issue a COPY command to load the data into Amazon Redshift.

B.

Load all the data files in parallel to Amazon Aurora, and run an AWS Glue job to load the data into Amazon Redshift.

C.

Use an AWS Glue job to copy all the data files into one folder and issue a COPY command to load the data into Amazon Redshift.

D.

Create a manifest file that contains the data file locations and issue a COPY command to load the data into Amazon Redshift.

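For reference, a minimal sketch of option D with hypothetical bucket and table names; the manifest enumerates files across the segregated folders so one COPY loads them all in parallel.

```python
# A sketch of option D: write a manifest that lists files from every source
# folder, then issue a single COPY that loads them all in parallel.
import json
import boto3

manifest = {
    "entries": [
        {"url": "s3://data-lake/source-a/part-000.csv", "mandatory": True},
        {"url": "s3://data-lake/source-b/part-000.csv", "mandatory": True},
    ]
}
boto3.client("s3").put_object(
    Bucket="data-lake",
    Key="manifests/load.manifest",
    Body=json.dumps(manifest),
)

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster",  # placeholder identifiers
    Database="dev",
    DbUser="admin",
    Sql="""
        COPY fact_table
        FROM 's3://data-lake/manifests/load.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV
        MANIFEST;
    """,
)
```
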
Questions 39

A company wants to improve the data load time of a sales data dashboard. Data has been collected as .csv files and stored within an Amazon S3 bucket that is partitioned by date. The data is then loaded to an Amazon Redshift data warehouse for frequent analysis. The data volume is up to 500 GB per day.

Which solution will improve the data loading performance?

Options:

A.

Compress .csv files and use an INSERT statement to ingest data into Amazon Redshift.

B.

Split large .csv files, then use a COPY command to load data into Amazon Redshift.

C.

Use Amazon Kinesis Data Firehose to ingest data into Amazon Redshift.

D.

Load the .csv files in an unsorted key order and vacuum the table in Amazon Redshift.

Questions 40

A media company is using Amazon QuickSight dashboards to visualize its national sales data. The dashboard is using a dataset with these fields: ID, date, time_zone, city, state, country, longitude, latitude, sales_volume, and number_of_items.

To modify ongoing campaigns, the company wants an interactive and intuitive visualization of which states across the country recorded a significantly lower sales volume compared to the national average.

Which addition to the company’s QuickSight dashboard will meet this requirement?

Options:

A.

A geospatial color-coded chart of sales volume data across the country.

B.

A pivot table of sales volume data summed up at the state level.

C.

A drill-down layer for state-level sales volume data.

D.

A drill through to other dashboards containing state-level sales volume data.

Questions 41

A company is planning to do a proof of concept for a machine learning (ML) project using Amazon SageMaker with a subset of existing on-premises data hosted in the company’s 3 TB data warehouse. For part of the project, AWS Direct Connect is established and tested. To prepare the data for ML, data analysts are performing data curation. The data analysts want to perform multiple steps, including mapping, dropping null fields, resolving choice, and splitting fields. The company needs the fastest solution to curate the data for this project.

Which solution meets these requirements?

Options:

A.

Ingest data into Amazon S3 using AWS DataSync and use Apache Spark scripts to curate the data in an Amazon EMR cluster. Store the curated data in Amazon S3 for ML processing.

B.

Create custom ETL jobs on-premises to curate the data. Use AWS DMS to ingest data into Amazon S3 for ML processing.

C.

Ingest data into Amazon S3 using AWS DMS. Use AWS Glue to perform data curation and store the data in Amazon S3 for ML processing.

D.

Take a full backup of the data store and ship the backup files using AWS Snowball. Upload Snowball data into Amazon S3 and schedule data curation jobs using AWS Batch to prepare the data for ML.

Questions 42

A company is designing a data warehouse to support business intelligence reporting. Users will access the executive dashboard heavily each Monday and Friday morning for 1 hour. These read-only queries will run on the active Amazon Redshift cluster, which runs on dc2.8xlarge compute nodes 24 hours a day, 7 days a week. There are three queues set up in workload management: Dashboard, ETL, and System. The Amazon Redshift cluster needs to process the queries without wait time.

What is the MOST cost-effective way to ensure that the cluster processes these queries?

Options:

A.

Perform a classic resize to place the cluster in read-only mode while adding an additional node to the cluster.

B.

Enable automatic workload management.

C.

Perform an elastic resize to add an additional node to the cluster.

D.

Enable concurrency scaling for the Dashboard workload queue.

Questions 43

A transport company wants to track vehicular movements by capturing geolocation records. The records are 10 B in size and up to 10,000 records are captured each second. Data transmission delays of a few minutes are acceptable, considering unreliable network conditions. The transport company decided to use Amazon Kinesis Data Streams to ingest the data. The company is looking for a reliable mechanism to send data to Kinesis Data Streams while maximizing the throughput efficiency of the Kinesis shards.

Which solution will meet the company’s requirements?

Options:

A.

Kinesis Agent

B.

Kinesis Producer Library (KPL)

C.

Kinesis Data Firehose

D.

Kinesis SDK

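The KPL itself is a Java library, so the following Python sketch only illustrates the underlying batching idea with the plain SDK; the KPL goes further by aggregating many small user records into single Kinesis records to maximize shard throughput. The stream and field names are assumptions.

```python
# A sketch of the batching idea behind the KPL: pack many 10-byte readings
# into one PutRecords call so each shard round trip carries more data.
import json
import boto3

kinesis = boto3.client("kinesis")

def send_batch(readings):
    # PutRecords accepts up to 500 records per request.
    records = [
        {"Data": json.dumps(r).encode(), "PartitionKey": r["vehicle_id"]}
        for r in readings
    ]
    return kinesis.put_records(StreamName="vehicle-geo", Records=records)
```
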
Questions 44

A retail company is building its data warehouse solution using Amazon Redshift. As a part of that effort, the company is loading hundreds of files into the fact table created in its Amazon Redshift cluster. The company wants the solution to achieve the highest throughput and optimally use cluster resources when loading data into the company’s fact table.

How should the company meet these requirements?

Options:

A.

Use multiple COPY commands to load the data into the Amazon Redshift cluster.

B.

Use S3DistCp to load multiple files into the Hadoop Distributed File System (HDFS) and use an HDFS connector to ingest the data into the Amazon Redshift cluster.

C.

Use LOAD commands equal to the number of Amazon Redshift cluster nodes and load the data in parallel into each node.

D.

Use a single COPY command to load the data into the Amazon Redshift cluster.

Questions 45

A regional energy company collects voltage data from sensors attached to buildings. To address any known dangerous conditions, the company wants to be alerted when a sequence of two voltage drops is detected within 10 minutes of a voltage spike at the same building. It is important to ensure that all messages are delivered as quickly as possible. The system must be fully managed and highly available. The company also needs a solution that will automatically scale up as it covers additional cities with this monitoring feature. The alerting system is subscribed to an Amazon SNS topic for remediation.

Which solution meets these requirements?

Options:

A.

Create an Amazon Managed Streaming for Kafka cluster to ingest the data, and use an Apache Spark Streaming with Apache Kafka consumer API in an automatically scaled Amazon EMR cluster to process the incoming data. Use the Spark Streaming application to detect the known event sequence and send the SNS message.

B.

Create a REST-based web service using Amazon API Gateway in front of an AWS Lambda function. Create an Amazon RDS for PostgreSQL database with sufficient Provisioned IOPS (PIOPS). In the Lambda function, store incoming events in the RDS database and query the latest data to detect the known event sequence and send the SNS message.

C.

Create an Amazon Kinesis Data Firehose delivery stream to capture the incoming sensor data. Use an AWS Lambda transformation function to detect the known event sequence and send the SNS message.

D.

Create an Amazon Kinesis data stream to capture the incoming sensor data and create another stream for alert messages. Set up AWS Application Auto Scaling on both. Create a Kinesis Data Analytics for Java application to detect the known event sequence, and add a message to the message stream. Configure an AWS Lambda function to poll the message stream and publish to the SNS topic.

Questions 46

A team of data scientists plans to analyze market trend data for their company’s new investment strategy. The trend data comes from five different data sources in large volumes. The team wants to utilize Amazon Kinesis to support their use case. The team uses SQL-like queries to analyze trends and wants to send notifications based on certain significant patterns in the trends. Additionally, the data scientists want to save the data to Amazon S3 for archival and historical re-processing, and use AWS managed services wherever possible. The team wants to implement the lowest-cost solution.

Which solution meets these requirements?

Options:

A.

Publish data to one Kinesis data stream. Deploy a custom application using the Kinesis Client Library (KCL) for analyzing trends, and send notifications using Amazon SNS. Configure Kinesis Data Firehose on the Kinesis data stream to persist data to an S3 bucket.

B.

Publish data to one Kinesis data stream. Deploy Kinesis Data Analytics to the stream for analyzing trends, and configure an AWS Lambda function as an output to send notifications using Amazon SNS. Configure Kinesis Data Firehose on the Kinesis data stream to persist data to an S3 bucket.

C.

Publish data to two Kinesis data streams. Deploy Kinesis Data Analytics to the first stream for analyzing trends, and configure an AWS Lambda function as an output to send notifications using Amazon SNS. Configure Kinesis Data Firehose on the second Kinesis data stream to persist data to an S3 bucket.

D.

Publish data to two Kinesis data streams. Deploy a custom application using the Kinesis Client Library (KCL) to the first stream for analyzing trends, and send notifications using Amazon SNS. Configure Kinesis Data Firehose on the second Kinesis data stream to persist data to an S3 bucket.

Questions 47

A company owns facilities with IoT devices installed across the world. The company is using Amazon Kinesis Data Streams to stream data from the devices to Amazon S3. The company's operations team wants to get insights from the IoT data to monitor data quality at ingestion. The insights need to be derived in near-real time, and the output must be logged to Amazon DynamoDB for further analysis.

Which solution meets these requirements?

Options:

A.

Connect Amazon Kinesis Data Analytics to analyze the stream data. Save the output to DynamoDB by using the default output from Kinesis Data Analytics.

B.

Connect Amazon Kinesis Data Analytics to analyze the stream data. Save the output to DynamoDB by using an AWS Lambda function.

C.

Connect Amazon Kinesis Data Firehose to analyze the stream data by using an AWS Lambda function. Save the output to DynamoDB by using the default output from Kinesis Data Firehose.

D.

Connect Amazon Kinesis Data Firehose to analyze the stream data by using an AWS Lambda function. Save the data to Amazon S3. Then run an AWS Glue job on schedule to ingest the data into DynamoDB.

Questions 48

A manufacturing company uses Amazon Connect to manage its contact center and Salesforce to manage its customer relationship management (CRM) data. The data engineering team must build a pipeline to ingest data from the contact center and CRM system into a data lake that is built on Amazon S3.

What is the MOST efficient way to collect data in the data lake with the LEAST operational overhead?

Options:

A.

Use Amazon Kinesis Data Streams to ingest Amazon Connect data and Amazon AppFlow to ingest Salesforce data.

B.

Use Amazon Kinesis Data Firehose to ingest Amazon Connect data and Amazon Kinesis Data Streams to ingest Salesforce data.

C.

Use Amazon Kinesis Data Firehose to ingest Amazon Connect data and Amazon AppFlow to ingest Salesforce data.

D.

Use Amazon AppFlow to ingest Amazon Connect data and Amazon Kinesis Data Firehose to ingest Salesforce data.

Questions 49

A large financial company is running its ETL process. Part of this process is to move data from Amazon S3 into an Amazon Redshift cluster. The company wants to use the most cost-efficient method to load the dataset into Amazon Redshift.

Which combination of steps would meet these requirements? (Choose two.)

Options:

A.

Use the COPY command with the manifest file to load data into Amazon Redshift.

B.

Use S3DistCp to load files into Amazon Redshift.

C.

Use temporary staging tables during the loading process.

D.

Use the UNLOAD command to upload data into Amazon Redshift.

E.

Use Amazon Redshift Spectrum to query files from Amazon S3.

Questions 50

A company has a mobile app that has millions of users. The company wants to enhance the mobile app by including interactive data visualizations that show user trends.

The data for visualization is stored in a large data lake with 50 million rows. Data that is used in the visualization should be no more than two hours old.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.

Run an hourly batch process that renders user-specific data visualizations as static images that are stored in Amazon S3.

B.

Precompute aggregated data hourly. Store the data in Amazon DynamoDB. Render the data by using the D3.js JavaScript library.

C.

Embed an Amazon QuickSight Enterprise edition dashboard into the mobile app by using the QuickSight Embedding SDK. Refresh data in SPICE hourly.

D.

Run Amazon Athena queries behind an Amazon API Gateway API. Render the data by using the D3.js JavaScript library.

Questions 51

A banking company is currently using an Amazon Redshift cluster with dense storage (DS) nodes to store sensitive data. An audit found that the cluster is unencrypted. Compliance requirements state that a database with sensitive data must be encrypted through a hardware security module (HSM) with automated key rotation.

Which combination of steps is required to achieve compliance? (Choose two.)

Options:

A.

Set up a trusted connection with HSM using a client and server certificate with automatic key rotation.

B.

Modify the cluster with an HSM encryption option and automatic key rotation.

C.

Create a new HSM-encrypted Amazon Redshift cluster and migrate the data to the new cluster.

D.

Enable HSM with key rotation through the AWS CLI.

E.

Enable Elliptic Curve Diffie-Hellman Ephemeral (ECDHE) encryption in the HSM.

Questions 52

A company hosts its analytics solution on premises. The analytics solution includes a server that collects log files. The analytics solution uses an Apache Hadoop cluster to analyze the log files hourly and to produce output files. All the files are archived to another server for a specified duration.

The company is expanding globally and plans to move the analytics solution to multiple AWS Regions in the AWS Cloud. The company must adhere to the data archival and retention requirements of each country where the data is stored.

Which solution will meet these requirements?

Options:

A.

Create an Amazon S3 bucket in one Region to collect the log files. Use S3 event notifications to invoke an AWS Glue job for log analysis. Store the output files in the target S3 bucket. Use S3 Lifecycle rules on the target S3 bucket to set an expiration period that meets the retention requirements of the country that contains the Region.

B.

Create a Hadoop Distributed File System (HDFS) file system on an Amazon EMR cluster in one Region to collect the log files. Set up a bootstrap action on the EMR cluster to run an Apache Spark job. Store the output files in a target Amazon S3 bucket. Schedule a job on one of the EMR nodes to delete files that no longer need to be retained.

C.

Create an Amazon S3 bucket in each Region to collect log files. Create an Amazon EMR cluster. Submit steps on the EMR cluster for analysis. Store the output files in a target S3 bucket in each Region. Use S3 Lifecycle rules on each target S3 bucket to set an expiration period that meets the retention requirements of the country that contains the Region.

D.

Create an Amazon Kinesis Data Firehose delivery stream in each Region to collect log data. Specify an Amazon S3 bucket in each Region as the destination. Use S3 Storage Lens for data analysis. Use S3 Lifecycle rules on each destination S3 bucket to set an expiration period that meets the retention requirements of the country that contains the Region.

Questions 53

A mobile gaming company wants to capture data from its gaming app and make the data available for analysis immediately. The data record size will be approximately 20 KB. The company is concerned about achieving optimal throughput from each device. Additionally, the company wants to develop a data stream processing application with dedicated throughput for each consumer.

Which solution would achieve this goal?

Options:

A.

Have the app call the PutRecords API to send data to Amazon Kinesis Data Streams. Use the enhanced fan-out feature while consuming the data.

B.

Have the app call the PutRecordBatch API to send data to Amazon Kinesis Data Firehose. Submit a support case to enable dedicated throughput on the account.

C.

Have the app use Amazon Kinesis Producer Library (KPL) to send data to Kinesis Data Firehose. Use the enhanced fan-out feature while consuming the data.

D.

Have the app call the PutRecords API to send data to Amazon Kinesis Data Streams. Host the stream-processing application on Amazon EC2 with Auto Scaling.

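Option A combines the PutRecords batch API on the producer side with enhanced fan-out on the consumer side, which gives each registered consumer a dedicated 2 MB/s of read throughput per shard. A minimal boto3 sketch follows; the stream name, consumer name, and sample payloads are assumptions.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed Region

# Producer side: PutRecords batches up to 500 records per call, which is
# far more efficient than one HTTP round trip per ~20 KB record.
events = [{"player_id": i, "score": i * 10} for i in range(100)]  # sample data
kinesis.put_records(
    StreamName="game-events",  # hypothetical stream name
    Records=[
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e["player_id"])}
        for e in events
    ],
)

# Consumer side: registering an enhanced fan-out consumer gives this
# application its own dedicated 2 MB/s of read throughput per shard
# instead of sharing the stream's read limit with other consumers.
stream_arn = kinesis.describe_stream(StreamName="game-events")[
    "StreamDescription"
]["StreamARN"]
kinesis.register_stream_consumer(
    StreamARN=stream_arn,
    ConsumerName="stream-processor",  # hypothetical consumer name
)
```
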
Questions 54

An education provider’s learning management system (LMS) is hosted in a 100 TB data lake that is built on Amazon S3. The provider’s LMS supports hundreds of schools. The provider wants to build an advanced analytics reporting platform using Amazon Redshift to handle complex queries with optimal performance. System users will query the most recent 4 months of data 95% of the time, while the remaining 5% of queries will use data from the previous 12 months.

Which solution meets these requirements in the MOST cost-effective way?

Options:

A.

Store the most recent 4 months of data in the Amazon Redshift cluster. Use Amazon Redshift Spectrum to query data in the data lake. Use S3 lifecycle management rules to store data from the previous 12 months in Amazon S3 Glacier storage.

B.

Leverage DS2 nodes for the Amazon Redshift cluster. Migrate all data from Amazon S3 to Amazon Redshift. Decommission the data lake.

C.

Store the most recent 4 months of data in the Amazon Redshift cluster. Use Amazon Redshift Spectrum to query data in the data lake. Ensure the S3 Standard storage class is in use with objects in the data lake.

D.

Store the most recent 4 months of data in the Amazon Redshift cluster. Use Amazon Redshift federated queries to join cluster data with the data lake to reduce costs. Ensure the S3 Standard storage class is in use with objects in the data lake.

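Option C keeps the hot 4 months in the cluster and reads older data in place from Amazon S3 through Redshift Spectrum. A sketch using the Redshift Data API follows; the schema, table, cluster, and IAM role names are illustrative assumptions.

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")  # assumed Region

# One-time setup: an external schema that maps to the data lake's Glue
# database. The database name and IAM role ARN are assumptions.
setup_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS lms_history
FROM DATA CATALOG
DATABASE 'lms_datalake'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
"""

# A query that unions hot data stored in the cluster with cold data
# read in place from Amazon S3 through Spectrum.
report_sql = """
SELECT school_id, COUNT(*) AS sessions
FROM lms_recent.sessions            -- last 4 months, local to the cluster
GROUP BY school_id
UNION ALL
SELECT school_id, COUNT(*) AS sessions
FROM lms_history.sessions_archive   -- older data, queried in S3 via Spectrum
GROUP BY school_id;
"""

for sql in (setup_sql, report_sql):
    rsd.execute_statement(
        ClusterIdentifier="lms-reporting",  # hypothetical cluster
        Database="analytics",
        DbUser="analyst",
        Sql=sql,
    )
```
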
Questions 55

A company wants to use a data lake that is hosted on Amazon S3 to provide analytics services for historical data. The data lake consists of 800 tables but is expected to grow to thousands of tables. More than 50 departments use the tables, and each department has hundreds of users. Different departments need access to specific tables and columns.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.

Create an IAM role for each department. Use AWS Lake Formation-based access control to grant each IAM role access to specific tables and columns. Use Amazon Athena to analyze the data.

B.

Create an Amazon Redshift cluster for each department. Use AWS Glue to ingest into the Redshift cluster only the tables and columns that are relevant to that department. Create Redshift database users. Grant the users access to the relevant department's Redshift cluster. Use Amazon Redshift to analyze the data.

C.

Create an IAM role for each department. Use AWS Lake Formation tag-based access control to grant each IAM role access to only the relevant resources. Create LF-tags that are attached to tables and columns. Use Amazon Athena to analyze the data.

D.

Create an Amazon EMR cluster for each department. Configure an IAM service role for each EMR cluster to access relevant S3 files. For each department's users, create an IAM role that provides access to the relevant EMR cluster. Use Amazon EMR to analyze the data.

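Option C scales to thousands of tables because permissions are expressed once per LF-tag rather than per table. A minimal boto3 sketch, with hypothetical tag values, table names, and role ARNs:

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")  # assumed Region

# Define an LF-tag once; the values are hypothetical department names.
lf.create_lf_tag(TagKey="department", TagValues=["marketing", "finance"])

# Attach the tag to a table (database and table names are assumptions).
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "datalake", "Name": "campaign_spend"}},
    LFTags=[{"TagKey": "department", "TagValues": ["marketing"]}],
)

# Grant a department role SELECT on everything tagged department=marketing;
# new tables inherit the grant as soon as they carry the tag.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/MarketingAnalyst"
    },
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "department", "TagValues": ["marketing"]}],
        }
    },
    Permissions=["SELECT"],
)
```
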
Questions 56

A company wants to collect and process event data from different departments in near-real time. Before storing the data in Amazon S3, the company needs to clean the data by standardizing the format of the address and timestamp columns. The data varies in size based on the overall load at each particular point in time. A single data record can range from 100 KB to 10 MB.

How should a data analytics specialist design the solution for data ingestion?

Options:

A.

Use Amazon Kinesis Data Streams. Configure a stream for the raw data. Use a Kinesis Agent to write data to the stream. Create an Amazon Kinesis Data Analytics application that reads data from the raw stream, cleanses it, and stores the output to Amazon S3.

B.

Use Amazon Kinesis Data Firehose. Configure a Firehose delivery stream with a preprocessing AWS Lambda function for data cleansing. Use a Kinesis Agent to write data to the delivery stream. Configure Kinesis Data Firehose to deliver the data to Amazon S3.

C.

Use Amazon Managed Streaming for Apache Kafka. Configure a topic for the raw data. Use a Kafka producer to write data to the topic. Create an application on Amazon EC2 that reads data from the topic by using the Apache Kafka consumer API, cleanses the data, and writes to Amazon S3.

D.

Use Amazon Simple Queue Service (Amazon SQS). Configure an AWS Lambda function to read events from the SQS queue and upload the events to Amazon S3.

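Option B relies on Kinesis Data Firehose's Lambda transformation hook, which hands the function base64-encoded records and expects each one back with a recordId, a result, and the transformed data. A sketch of such a cleansing function follows; the field names and cleansing rules are assumptions about the incoming records.

```python
import base64
import json

def handler(event, context):
    """Firehose data-transformation Lambda: standardize address and
    timestamp fields before delivery to Amazon S3."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Hypothetical cleansing rules: normalize the address casing and
        # reshape the timestamp into an ISO-8601-style string.
        payload["address"] = payload.get("address", "").strip().upper()
        payload["timestamp"] = payload.get("timestamp", "").replace(" ", "T")

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # "Dropped" or "ProcessingFailed" are also valid
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```
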
Questions 57

A marketing company is using Amazon EMR clusters for its workloads. The company manually installs third-party libraries on the clusters by logging in to the master nodes. A data analyst needs to create an automated solution to replace the manual process.

Which options can fulfill these requirements? (Choose two.)

Options:

A.

Place the required installation scripts in Amazon S3 and execute them using custom bootstrap actions.

B.

Place the required installation scripts in Amazon S3 and execute them through Apache Spark in Amazon EMR.

C.

Install the required third-party libraries in the existing EMR master node. Create an AMI out of that master node and use that custom AMI to re-create the EMR cluster.

D.

Use an Amazon DynamoDB table to store the list of required applications. Trigger an AWS Lambda function with DynamoDB Streams to install the software.

E.

Launch an Amazon EC2 instance with Amazon Linux and install the required third-party libraries on the instance. Create an AMI and use that AMI to create the EMR cluster.
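
Option A's bootstrap actions run a script from Amazon S3 on every node as the cluster starts, before applications come online. A minimal boto3 sketch; the script path, cluster sizing, and role names are illustrative assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed Region

# Launch a cluster whose nodes run an install script from S3 at startup,
# replacing the manual logins to the master node.
emr.run_job_flow(
    Name="analytics-cluster",          # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-third-party-libs",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/install-libs.sh"  # hypothetical
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```
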

Questions 58

An online retailer is rebuilding its inventory management system and inventory reordering system to automatically reorder products by using Amazon Kinesis Data Streams. The inventory management system uses the Kinesis Producer Library (KPL) to publish data to a stream. The inventory reordering system uses the Kinesis Client Library (KCL) to consume data from the stream. The stream has been configured to scale as needed. Just before production deployment, the retailer discovers that the inventory reordering system is receiving duplicated data.

Which factors could be causing the duplicated data? (Choose two.)

Options:

A.

The producer has a network-related timeout.

B.

The stream’s value for the IteratorAgeMilliseconds metric is too high.

C.

There was a change in the number of shards, record processors, or both.

D.

The AggregationEnabled configuration property was set to true.

E.

The max_records configuration property was set to a number that is too high.

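Producer retries (A) and changes to shards or record processors (C) are the classic sources of duplicate deliveries, so KCL consumers are generally expected to be idempotent. One common mitigation, sketched here as an assumption rather than anything the question prescribes, is a conditional write against a hypothetical DynamoDB dedup table keyed on the record's sequence number:

```python
import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb", region_name="us-east-1")  # assumed Region
TABLE = "processed-records"  # hypothetical dedup table

def process_once(record):
    """Process a Kinesis record idempotently: the conditional put fails
    if this sequence number was already handled, so a retried or replayed
    record is skipped instead of triggering a duplicate reorder."""
    try:
        ddb.put_item(
            TableName=TABLE,
            Item={"sequence_number": {"S": record["sequenceNumber"]}},
            ConditionExpression="attribute_not_exists(sequence_number)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate delivery; already processed
        raise
    reorder_inventory(record)

def reorder_inventory(record):
    pass  # placeholder for the real reordering business logic
```
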
Questions 59

An advertising company has a data lake that is built on Amazon S3. The company uses AWS Glue Data Catalog to maintain the metadata. The data lake is several years old and its overall size has increased exponentially as additional data sources and metadata are stored in the data lake. The data lake administrator wants to implement a mechanism to simplify permissions management between Amazon S3 and the Data Catalog to keep them in sync.

Which solution will simplify permissions management with minimal development effort?

Options:

A.

Set AWS Identity and Access Management (IAM) permissions for AWS Glue.

B.

Use AWS Lake Formation permissions.

C.

Manage AWS Glue and S3 permissions by using bucket policies.

D.

Use Amazon Cognito user pools.

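Option B centralizes both Data Catalog and underlying S3 data permissions in AWS Lake Formation, which is exactly the "keep them in sync" requirement. A minimal boto3 sketch, with a hypothetical bucket ARN, database, table, and role:

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")  # assumed Region

# Register the data lake location so Lake Formation can vend credentials
# for it; the bucket ARN is an illustrative assumption.
lf.register_resource(
    ResourceArn="arn:aws:s3:::ad-company-datalake",
    UseServiceLinkedRole=True,
)

# Grant table-level access in one place; Lake Formation then mediates
# both catalog and data access for the grantee.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/CampaignAnalyst"
    },
    Resource={"Table": {"DatabaseName": "ads", "Name": "impressions"}},
    Permissions=["SELECT"],
)
```
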
Exam Code: DAS-C01
Exam Name: AWS Certified Data Analytics - Specialty
Last Update: Dec 22, 2024
Questions: 207