MLS-C01 AWS Certified Machine Learning - Specialty Questions and Answers

Questions 4

A Machine Learning Specialist is developing a daily ETL workflow containing multiple ETL jobs The workflow consists of the following processes

* Start the workflow as soon as data is uploaded to Amazon S3

* When all the datasets are available in Amazon S3, start an ETL job to join the uploaded datasets with multiple terabyte-sized datasets already stored in Amazon S3

* Store the results of joining datasets in Amazon S3

* If one of the jobs fails, send a notification to the Administrator

Which configuration will meet these requirements?

Options:

Use AWS Lambda to trigger an AWS Step Functions workflow to wait for dataset uploads to complete in Amazon S3. Use AWS Glue to join the datasets Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure

Develop the ETL workflow using AWS Lambda to start an Amazon SageMaker notebook instance Use a lifecycle configuration script to join the datasets and persist the results in Amazon S3 Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure

Develop the ETL workflow using AWS Batch to trigger the start of ETL jobs when data is uploaded to Amazon S3 Use AWS Glue to join the datasets in Amazon S3 Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure

Use AWS Lambda to chain other Lambda functions to read and join the datasets in Amazon S3 as soon as the data is uploaded to Amazon S3 Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure

Buy Now

Answer:

Explanation:

To develop a daily ETL workflow containing multiple ETL jobs that can start as soon as data is uploaded to Amazon S3, the best configuration is to use AWS Lambda to trigger an AWS Step Functions workflow to wait for dataset uploads to complete in Amazon S3. Use AWS Glue to join the datasets. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.

AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. You can use Lambda to create functions that respond to events such as data uploads to Amazon S3. You can also use Lambda to invoke other AWS services such as AWS Step Functions and AWS Glue.

AWS Step Functions is a service that lets you coordinate multiple AWS services into serverless workflows. You can use Step Functions to create a state machine that defines the sequence and logic of your ETL workflow. You can also use Step Functions to handle errors and retries, and to monitor the execution status of your workflow.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics. You can use Glue to create and run ETL jobs that can join data from multiple sources in Amazon S3. You can also use Glue to catalog your data and make it searchable and queryable.

Amazon CloudWatch is a service that monitors your AWS resources and applications. You can use CloudWatch to create alarms that trigger actions when a metric or a log event meets a specified threshold. You can also use CloudWatch to send notifications to Amazon Simple Notification Service (SNS) topics, which can then deliver the notifications to subscribers such as email addresses or phone numbers.

Therefore, by using these services together, you can achieve the following benefits:

You can start the ETL workflow as soon as data is uploaded to Amazon S3 by using Lambda functions to trigger Step Functions workflows.

You can wait for all the datasets to be available in Amazon S3 by using Step Functions to poll the S3 buckets and check the data completeness.

You can join the datasets with terabyte-sized datasets in Amazon S3 by using Glue ETL jobs that can scale and parallelize the data processing.

You can store the results of joining datasets in Amazon S3 by using Glue ETL jobs to write the output to S3 buckets.

You can send a notification to the Administrator if one of the jobs fails by using CloudWatch alarms to monitor the Step Functions or Glue metrics and send SNS notifications in case of a failure.

Questions 5

A data scientist is designing a repository that will contain many images of vehicles. The repository must scale automatically in size to store new images every day. The repository must support versioning of the images. The data scientist must implement a solution that maintains multiple immediately accessible copies of the data in different AWS Regions.

Which solution will meet these requirements?

Options:

Amazon S3 with S3 Cross-Region Replication (CRR)

Amazon Elastic Block Store (Amazon EBS) with snapshots that are shared in a secondary Region

Amazon Elastic File System (Amazon EFS) Standard storage that is configured with Regional availability

AWS Storage Gateway Volume Gateway

Buy Now

Questions 6

A Machine Learning Specialist is using Apache Spark for pre-processing training data As part of the Spark pipeline, the Specialist wants to use Amazon SageMaker for training a model and hosting it Which of the following would the Specialist do to integrate the Spark application with SageMaker? (Select THREE)

Options:

Download the AWS SDK for the Spark environment

Install the SageMaker Spark library in the Spark environment.

Use the appropriate estimator from the SageMaker Spark Library to train a model.

Compress the training data into a ZIP file and upload it to a pre-defined Amazon S3 bucket.

Use the sageMakerModel. transform method to get inferences from the model hosted in SageMaker

Convert the DataFrame object to a CSV file, and use the CSV file as input for obtaining inferences from SageMaker.

Buy Now

Questions 7

A data scientist is trying to improve the accuracy of a neural network classification model. The data scientist wants to run a large hyperparameter tuning job in Amazon SageMaker.

However, previous smaller tuning jobs on the same model often ran for several weeks. The ML specialist wants to reduce the computation time required to run the tuning job.

Which actions will MOST reduce the computation time for the hyperparameter tuning job? (Select TWO.)

Options:

Use the Hyperband tuning strategy.

Increase the number of hyperparameters.

Set a lower value for the MaxNumberOfTrainingJobs parameter.

Use the grid search tuning strategy

Set a lower value for the MaxParallelTrainingJobs parameter.

Buy Now

Questions 8

During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates What is the MOST likely cause of this issue?

Options:

The class distribution in the dataset is imbalanced

Dataset shuffling is disabled

The batch size is too big

The learning rate is very high

Buy Now

Questions 9

An insurance company is developing a new device for vehicles that uses a camera to observe drivers' behavior and alert them when they appear distracted The company created approximately 10,000 training images in a controlled environment that a Machine Learning Specialist will use to train and evaluate machine learning models

During the model evaluation the Specialist notices that the training error rate diminishes faster as the number of epochs increases and the model is not accurately inferring on the unseen test images

Which of the following should be used to resolve this issue? (Select TWO)

Options:

Add vanishing gradient to the model

Perform data augmentation on the training data

Make the neural network architecture complex.

Use gradient checking in the model

Add L2 regularization to the model

Buy Now

Answer:

B, E

Explanation:

The issue described in the question is a sign of overfitting, which is a common problem in machine learning when the model learns the noise and details of the training data too well and fails to generalize to new and unseen data. Overfitting can result in a low training error rate but a high test error rate, which indicates poor performance and validity of the model. There are several techniques that can be used to prevent or reduce overfitting, such as data augmentation and regularization.

Data augmentation is a technique that applies various transformations to the original training data, such as rotation, scaling, cropping, flipping, adding noise, changing brightness, etc., to create new and diverse data samples. Data augmentation can increase the size and diversity of the training data, which can help the model learn more features and patterns and reduce the variance of the model. Data augmentation is especially useful for image data, as it can simulate different scenarios and perspectives that the model may encounter in real life. For example, in the question, the device uses a camera to observe drivers’ behavior, so data augmentation can help the model deal with different lighting conditions, angles, distances, etc. Data augmentation can be done using various libraries and frameworks, such as TensorFlow, PyTorch, Keras, OpenCV, etc12

Regularization is a technique that adds a penalty term to the model’s objective function, which is typically based on the model’s parameters. Regularization can reduce the complexity and flexibility of the model, which can prevent overfitting by avoiding learning the noise and details of the training data. Regularization can also improve the stability and robustness of the model, as it can reduce the sensitivity of the model to small fluctuations in the data. There are different types of regularization, such as L1, L2, dropout, etc., but they all have the same goal of reducing overfitting. L2 regularization, also known as weight decay or ridge regression, is one of the most common and effective regularization techniques. L2 regularization adds the squared norm of the model’s parameters multiplied by a regularization parameter (lambda) to the model’s objective function. L2 regularization can shrink the model’s parameters towards zero, which can reduce the variance of the model and improve the generalization ability of the model. L2 regularization can be implemented using various libraries and frameworks, such as TensorFlow, PyTorch, Keras, Scikit-learn, etc34

The other options are not valid or relevant for resolving the issue of overfitting. Adding vanishing gradient to the model is not a technique, but a problem that occurs when the gradient of the model’s objective function becomes very small and the model stops learning. Making the neural network architecture complex is not a solution, but a possible cause of overfitting, as a complex model can have more parameters and more flexibility to fit the training data too well. Using gradient checking in the model is not a technique, but a debugging method that verifies the correctness of the gradient computation in the model. Gradient checking is not related to overfitting, but to the implementation of the model.

Questions 10

A retail company stores 100 GB of daily transactional data in Amazon S3 at periodic intervals. The company wants to identify the schema of the transactional data. The company also wants to perform transformations on the transactional data that is in Amazon S3.

The company wants to use a machine learning (ML) approach to detect fraud in the transformed data.

Which combination of solutions will meet these requirements with the LEAST operational overhead? {Select THREE.)

Options:

Use Amazon Athena to scan the data and identify the schema.

Use AWS Glue crawlers to scan the data and identify the schema.

Use Amazon Redshift to store procedures to perform data transformations

Use AWS Glue workflows and AWS Glue jobs to perform data transformations.

Use Amazon Redshift ML to train a model to detect fraud.

Use Amazon Fraud Detector to train a model to detect fraud.

Buy Now

Questions 11

A health care company is planning to use neural networks to classify their X-ray images into normal and abnormal classes. The labeled data is divided into a training set of 1,000 images and a test set of 200 images. The initial training of a neural network model with 50 hidden layers yielded 99% accuracy on the training set, but only 55% accuracy on the test set.

What changes should the Specialist consider to solve this issue? (Choose three.)

Options:

Choose a higher number of layers

Choose a lower number of layers

Choose a smaller learning rate

Enable dropout

Include all the images from the test set in the training set

Enable early stopping

Buy Now

Questions 12

A real-estate company is launching a new product that predicts the prices of new houses. The historical data for the properties and prices is stored in .csv format in an Amazon S3 bucket. The data has a header, some categorical fields, and some missing values. The company’s data scientists have used Python with a common open-source library to fill the missing values with zeros. The data scientists have dropped all of the categorical fields and have trained a model by using the open-source linear regression algorithm with the default parameters.

The accuracy of the predictions with the current model is below 50%. The company wants to improve the model performance and launch the new product as soon as possible.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

Create a service-linked role for Amazon Elastic Container Service (Amazon ECS) with access to the S3 bucket. Create an ECS cluster that is based on an AWS Deep Learning Containers image. Write the code to perform the feature engineering. Train a logistic regression model for predicting the price, pointing to the bucket with the dataset. Wait for the training job to complete. Perform the inferences.

Create an Amazon SageMaker notebook with a new IAM role that is associated with the notebook. Pull the dataset from the S3 bucket. Explore different combinations of feature engineering transformations, regression algorithms, and hyperparameters. Compare all the results in the notebook, and deploy the most accurate configuration in an endpoint for predictions.

Create an IAM role with access to Amazon S3, Amazon SageMaker, and AWS Lambda. Create a training job with the SageMaker built-in XGBoost model pointing to the bucket with the dataset. Specify the price as the target feature. Wait for the job to complete. Load the model artifact to a Lambda function for inference on prices of new houses.

Create an IAM role for Amazon SageMaker with access to the S3 bucket. Create a SageMaker AutoML job with SageMaker Autopilot pointing to the bucket with the dataset. Specify the price as the target attribute. Wait for the job to complete. Deploy the best model for predictions.

Buy Now

Answer:

Explanation:

The solution D meets the requirements with the least operational overhead because it uses Amazon SageMaker Autopilot, which is a fully managed service that automates the end-to-end process of building, training, and deploying machine learning models. Amazon SageMaker Autopilot can handle data preprocessing, feature engineering, algorithm selection, hyperparameter tuning, and model deployment. The company only needs to create an IAM role for Amazon SageMaker with access to the S3 bucket, create a SageMaker AutoML job pointing to the bucket with the dataset, specify the price as the target attribute, and wait for the job to complete. Amazon SageMaker Autopilot will generate a list of candidate models with different configurations and performance metrics, and the company can deploy the best model for predictions1.

The other options are not suitable because:

Option A: Creating a service-linked role for Amazon Elastic Container Service (Amazon ECS) with access to the S3 bucket, creating an ECS cluster based on an AWS Deep Learning Containers image, writing the code to perform the feature engineering, training a logistic regression model for predicting the price, and performing the inferences will incur more operational overhead than using Amazon SageMaker Autopilot. The company will have to manage the ECS cluster, the container image, the code, the model, and the inference endpoint. Moreover, logistic regression may not be the best algorithm for predicting the price, as it is more suitable for binary classification tasks2.

Option B: Creating an Amazon SageMaker notebook with a new IAM role that is associated with the notebook, pulling the dataset from the S3 bucket, exploring different combinations of feature engineering transformations, regression algorithms, and hyperparameters, comparing all the results in the notebook, and deploying the most accurate configuration in an endpoint for predictions will incur more operational overhead than using Amazon SageMaker Autopilot. The company will have to write the code for the feature engineering, the model training, the model evaluation, and the model deployment. The company will also have to manually compare the results and select the best configuration3.

Option C: Creating an IAM role with access to Amazon S3, Amazon SageMaker, and AWS Lambda, creating a training job with the SageMaker built-in XGBoost model pointing to the bucket with the dataset, specifying the price as the target feature, loading the model artifact to a Lambda function for inference on prices of new houses will incur more operational overhead than using Amazon SageMaker Autopilot. The company will have to create and manage the Lambda function, the model artifact, and the inference endpoint. Moreover, XGBoost may not be the best algorithm for predicting the price, as it is more suitable for classification and ranking tasks4.

References:

1: Amazon SageMaker Autopilot

2: Amazon Elastic Container Service

3: Amazon SageMaker Notebook Instances

4: Amazon SageMaker XGBoost Algorithm

Questions 13

A Data Scientist is developing a binary classifier to predict whether a patient has a particular disease on a series of test results. The Data Scientist has data on 400 patients randomly selected from the population. The disease is seen in 3% of the population.

Which cross-validation strategy should the Data Scientist adopt?

Options:

A k-fold cross-validation strategy with k=5

A stratified k-fold cross-validation strategy with k=5

A k-fold cross-validation strategy with k=5 and 3 repeats

An 80/20 stratified split between training and validation

Buy Now

Questions 14

Example Corp has an annual sale event from October to December. The company has sequential sales data from the past 15 years and wants to use Amazon ML to predict the sales for this year's upcoming event. Which method should Example Corp use to split the data into a training dataset and evaluation dataset?

Options:

Pre-split the data before uploading to Amazon S3

Have Amazon ML split the data randomly.

Have Amazon ML split the data sequentially.

Perform custom cross-validation on the data

Buy Now

Questions 15

A company provisions Amazon SageMaker notebook instances for its data science team and creates Amazon VPC interface endpoints to ensure communication between the VPC and the notebook instances. All connections to the Amazon SageMaker API are contained entirely and securely using the AWS network. However, the data science team realizes that individuals outside the VPC can still connect to the notebook instances across the internet.

Which set of actions should the data science team take to fix the issue?

Options:

Modify the notebook instances' security group to allow traffic only from the CIDR ranges of the VPC. Apply this security group to all of the notebook instances' VPC interfaces.

Create an IAM policy that allows the sagemaker:CreatePresignedNotebooklnstanceUrl and sagemaker:DescribeNotebooklnstance actions from only the VPC endpoints. Apply this policy to all IAM users, groups, and roles used to access the notebook instances.

Add a NAT gateway to the VPC. Convert all of the subnets where the Amazon SageMaker notebook instances are hosted to private subnets. Stop and start all of the notebook instances to reassign only private IP addresses.

Change the network ACL of the subnet the notebook is hosted in to restrict access to anyone outside the VPC.

Buy Now

Answer:

Explanation:

The issue is that the notebook instances’ security group allows inbound traffic from any source IP address, which means that anyone with the authorized URL can access the notebook instances over the internet. To fix this issue, the data science team should modify the security group to allow traffic only from the CIDR ranges of the VPC, which are the IP addresses assigned to the resources within the VPC. This way, only the VPC interface endpoints and the resources within the VPC can communicate with the notebook instances. The data science team should apply this security group to all of the notebook instances’ VPC interfaces, which are the network interfaces that connect the notebook instances to the VPC.

The other options are not correct because:

Option B: Creating an IAM policy that allows the sagemaker:CreatePresignedNotebookInstanceUrl and sagemaker:DescribeNotebookInstance actions from only the VPC endpoints does not prevent individuals outside the VPC from accessing the notebook instances. These actions are used to generate and retrieve the authorized URL for the notebook instances, but they do not control who can use the URL to access the notebook instances. The URL can still be shared or leaked to unauthorized users, who can then access the notebook instances over the internet.

Option C: Adding a NAT gateway to the VPC and converting the subnets where the notebook instances are hosted to private subnets does not solve the issue either. A NAT gateway is used to enable outbound internet access from a private subnet, but it does not affect inbound internet access. The notebook instances can still be accessed over the internet if their security group allows inbound traffic from any source IP address. Moreover, stopping and starting the notebook instances to reassign only private IP addresses is not necessary, because the notebook instances already have private IP addresses assigned by the VPC interface endpoints.

Option D: Changing the network ACL of the subnet the notebook is hosted in to restrict access to anyone outside the VPC is not a good practice, because network ACLs are stateless and apply to the entire subnet. This means that the data science team would have to specify both the inbound and outbound rules for each IP address range that they want to allow or deny. This can be cumbersome and error-prone, especially if the VPC has multiple subnets and resources. It is better to use security groups, which are stateful and apply to individual resources, to control the access to the notebook instances.

References:

Connect to SageMaker Within your VPC - Amazon SageMaker

Security Groups for Your VPC - Amazon Virtual Private Cloud

VPC Interface Endpoints - Amazon Virtual Private Cloud

Questions 16

A media company with a very large archive of unlabeled images, text, audio, and video footage wishes to index its assets to allow rapid identification of relevant content by the Research team. The company wants to use machine learning to accelerate the efforts of its in-house researchers who have limited machine learning expertise.

Which is the FASTEST route to index the assets?

Options:

Use Amazon Rekognition, Amazon Comprehend, and Amazon Transcribe to tag data into distinct categories/classes.

Create a set of Amazon Mechanical Turk Human Intelligence Tasks to label all footage.

Use Amazon Transcribe to convert speech to text. Use the Amazon SageMaker Neural Topic Model (NTM) and Object Detection algorithms to tag data into distinct categories/classes.

Use the AWS Deep Learning AMI and Amazon EC2 GPU instances to create custom models for audio transcription and topic modeling, and use object detection to tag data into distinct categories/classes.

Buy Now

Questions 17

A company's machine learning (ML) specialist is building a computer vision model to classify 10 different traffic signs. The company has stored 100 images of each class in Amazon S3, and the company has another 10.000 unlabeled images. All the images come from dash cameras and are a size of 224 pixels * 224 pixels. After several training runs, the model is overfitting on the training data.

Which actions should the ML specialist take to address this problem? (Select TWO.)

Options:

Use Amazon SageMaker Ground Truth to label the unlabeled images

Use image preprocessing to transform the images into grayscale images.

Use data augmentation to rotate and translate the labeled images.

Replace the activation of the last layer with a sigmoid.

Use the Amazon SageMaker k-nearest neighbors (k-NN) algorithm to label the unlabeled images.

Buy Now

Answer:

C, E

Explanation:

Data augmentation is a technique to increase the size and diversity of the training data by applying random transformations such as rotation, translation, scaling, flipping, etc. This can help reduce overfitting and improve the generalization of the model. Data augmentation can be done using the Amazon SageMaker image classification algorithm, which supports various augmentation options such as horizontal_flip, vertical_flip, rotate, brightness, contrast, etc1

The Amazon SageMaker k-nearest neighbors (k-NN) algorithm is a supervised learning algorithm that can be used to label unlabeled data based on the similarity to the labeled data. The k-NN algorithm assigns a label to an unlabeled instance by finding the k closest labeled instances in the feature space and taking a majority vote among their labels. This can help increase the size and diversity of the training data and reduce overfitting. The k-NN algorithm can be used with the Amazon SageMaker image classification algorithm by extracting features from the images using a pre-trained model and then applying the k-NN algorithm on the feature vectors2

Using Amazon SageMaker Ground Truth to label the unlabeled images is not a good option because it is a manual and costly process that requires human annotators. Moreover, it does not address the issue of overfitting on the existing labeled data.

Using image preprocessing to transform the images into grayscale images is not a good option because it reduces the amount of information and variation in the images, which can degrade the performance of the model. Moreover, it does not address the issue of overfitting on the existing labeled data.

Replacing the activation of the last layer with a sigmoid is not a good option because it is not suitable for a multi-class classification problem. A sigmoid activation function outputs a value between 0 and 1, which can be interpreted as a probability of belonging to a single class. However, for a multi-class classification problem, the output should be a vector of probabilities that sum up to 1, which can be achieved by using a softmax activation function.

References:

1: Image classification algorithm - Amazon SageMaker

2: k-nearest neighbors (k-NN) algorithm - Amazon SageMaker

Questions 18

A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided.

Based on this information which model would have the HIGHEST accuracy?

Options:

Long short-term memory (LSTM) model with scaled exponential linear unit (SELL))

Logistic regression

Support vector machine (SVM) with non-linear kernel

Single perceptron with tanh activation function

Buy Now

Questions 19

A Data Scientist is training a multilayer perception (MLP) on a dataset with multiple classes. The target class of interest is unique compared to the other classes within the dataset, but it does not achieve and acceptable ecall metric. The Data Scientist has already tried varying the number and size of the MLP’s hidden layers,

which has not significantly improved the results. A solution to improve recall must be implemented as quickly as possible.

Which techniques should be used to meet these requirements?

Options:

Gather more data using Amazon Mechanical Turk and then retrain

Train an anomaly detection model instead of an MLP

Train an XGBoost model instead of an MLP

Add class weights to the MLP’s loss function and then retrain

Buy Now

Questions 20

A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset

Which tool should be used to improve the validation accuracy?

Options:

Amazon Comprehend syntax analysts and entity detection

Amazon SageMaker BlazingText allow mode

Natural Language Toolkit (NLTK) stemming and stop word removal

Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizers

Buy Now

Questions 21

A data scientist has a dataset of machine part images stored in Amazon Elastic File System (Amazon EFS). The data scientist needs to use Amazon SageMaker to create and train an image classification machine learning model based on this dataset. Because of budget and time constraints, management wants the data scientist to create and train a model with the least number of steps and integration work required.

How should the data scientist meet these requirements?

Options:

Mount the EFS file system to a SageMaker notebook and run a script that copies the data to an Amazon FSx for Lustre file system. Run the SageMaker training job with the FSx for Lustre file system as the data source.

Launch a transient Amazon EMR cluster. Configure steps to mount the EFS file system and copy the data to an Amazon S3 bucket by using S3DistCp. Run the SageMaker training job with Amazon S3 as the data source.

Mount the EFS file system to an Amazon EC2 instance and use the AWS CLI to copy the data to an Amazon S3 bucket. Run the SageMaker training job with Amazon S3 as the data source.

Run a SageMaker training job with an EFS file system as the data source.

Buy Now

Questions 22

A Machine Learning Specialist at a company sensitive to security is preparing a dataset for model training. The dataset is stored in Amazon S3 and contains Personally Identifiable Information (Pll). The dataset:

* Must be accessible from a VPC only.

* Must not traverse the public internet.

How can these requirements be satisfied?

Options:

Create a VPC endpoint and apply a bucket access policy that restricts access to the given VPC endpoint and the VPC.

Create a VPC endpoint and apply a bucket access policy that allows access from the given VPC endpoint and an Amazon EC2 instance.

Create a VPC endpoint and use Network Access Control Lists (NACLs) to allow traffic between only the given VPC endpoint and an Amazon EC2 instance.

Create a VPC endpoint and use security groups to restrict access to the given VPC endpoint and an Amazon EC2 instance.

Buy Now

Answer:

Explanation:

A VPC endpoint is a logical device that enables private connections between a VPC and supported AWS services. A VPC endpoint can be either a gateway endpoint or an interface endpoint. A gateway endpoint is a gateway that is a target for a specified route in the route table, used for traffic destined to a supported AWS service. An interface endpoint is an elastic network interface with a private IP address that serves as an entry point for traffic destined to a supported service1

In this case, the Machine Learning Specialist can create a gateway endpoint for Amazon S3, which is a supported service for gateway endpoints. A gateway endpoint for Amazon S3 enables the VPC to access Amazon S3 privately, without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. The traffic between the VPC and Amazon S3 does not leave the Amazon network2

To restrict access to the dataset stored in Amazon S3, the Machine Learning Specialist can apply a bucket access policy that allows access only from the given VPC endpoint and the VPC. A bucket access policy is a resource-based policy that defines who can access a bucket and what actions they can perform. A bucket access policy can use various conditions to control access, such as the source IP address, the source VPC, the source VPC endpoint, etc. In this case, the Machine Learning Specialist can use the aws:sourceVpce condition to specify the ID of the VPC endpoint, and the aws:sourceVpc condition to specify the ID of the VPC. This way, only the requests that originate from the VPC endpoint or the VPC can access the bucket that contains the dataset34

The other options are not valid or secure ways to satisfy the requirements. Creating a VPC endpoint and applying a bucket access policy that allows access from the given VPC endpoint and an Amazon EC2 instance is not a good option, as it does not restrict access to the VPC. An Amazon EC2 instance is a virtual server that runs in the AWS cloud. An Amazon EC2 instance can have a public IP address or a private IP address, depending on the network configuration. Allowing access from an Amazon EC2 instance does not guarantee that the instance is in the same VPC as the VPC endpoint, and may expose the dataset to unauthorized access. Creating a VPC endpoint and using Network Access Control Lists (NACLs) to allow traffic between only the given VPC endpoint and an Amazon EC2 instance is not a good option, as it does not restrict access to the VPC. NACLs are stateless firewalls that can control inbound and outbound traffic at the subnet level. NACLs can use rules to allow or deny traffic based on the protocol, port, and source or destination IP address. However, NACLs do not support VPC endpoints as a source or destination, and cannot filter traffic based on the VPC endpoint ID or the VPC ID. Therefore, using NACLs does not guarantee that the traffic is from the VPC endpoint or the VPC, and may expose the dataset to unauthorized access. Creating a VPC endpoint and using security groups to restrict access to the given VPC endpoint and an Amazon EC2 instance is not a good option, as it does not restrict access to the VPC. Security groups are stateful firewalls that can control inbound and outbound traffic at the instance level. Security groups can use rules to allow or deny traffic based on the protocol, port, and source or destination. However, security groups do not support VPC endpoints as a source or destination, and cannot filter traffic based on the VPC endpoint ID or the VPC ID. Therefore, using security groups does not guarantee that the traffic is from the VPC endpoint or the VPC, and may expose the dataset to unauthorized access.

Questions 23

A machine learning specialist is developing a regression model to predict rental rates from rental listings. A variable named Wall_Color represents the most prominent exterior wall color of the property. The following is the sample data, excluding all other variables:

The specialist chose a model that needs numerical input data.

Which feature engineering approaches should the specialist use to allow the regression model to learn from the Wall_Color data? (Choose two.)

Options:

Apply integer transformation and set Red = 1, White = 5, and Green = 10.

Add new columns that store one-hot representation of colors.

Replace the color name string by its length.

Create three columns to encode the color in RGB format.

Replace each color name by its training set frequency.

Buy Now

Questions 24

A data science team is working with a tabular dataset that the team stores in Amazon S3. The team wants to experiment with different feature transformations such as categorical feature encoding. Then the team wants to visualize the resulting distribution of the dataset. After the team finds an appropriate set of feature transformations, the team wants to automate the workflow for feature transformations.

Which solution will meet these requirements with the MOST operational efficiency?

Options:

Use Amazon SageMaker Data Wrangler preconfigured transformations to explore feature transformations. Use SageMaker Data Wrangler templates for visualization. Export the feature processing workflow to a SageMaker pipeline for automation.

Use an Amazon SageMaker notebook instance to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package the feature processing steps into an AWS Lambda function for automation.

Use AWS Glue Studio with custom code to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package the feature processing steps into an AWS Lambda function for automation.

Use Amazon SageMaker Data Wrangler preconfigured transformations to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualzation. Package each feature transformation step into a separate AWS Lambda function. Use AWS Step Functions for workflow automation.

Buy Now

Answer:

Explanation:

The solution A will meet the requirements with the most operational efficiency because it uses Amazon SageMaker Data Wrangler, which is a service that simplifies the process of data preparation and feature engineering for machine learning. The solution A involves the following steps:

Use Amazon SageMaker Data Wrangler preconfigured transformations to explore feature transformations. Amazon SageMaker Data Wrangler provides a visual interface that allows data scientists to apply various transformations to their tabular data, such as encoding categorical features, scaling numerical features, imputing missing values, and more. Amazon SageMaker Data Wrangler also supports custom transformations using Python code or SQL queries1.

Use SageMaker Data Wrangler templates for visualization. Amazon SageMaker Data Wrangler also provides a set of templates that can generate visualizations of the data, such as histograms, scatter plots, box plots, and more. These visualizations can help data scientists to understand the distribution and characteristics of the data, and to compare the effects of different feature transformations1.

Export the feature processing workflow to a SageMaker pipeline for automation. Amazon SageMaker Data Wrangler can export the feature processing workflow as a SageMaker pipeline, which is a service that orchestrates and automates machine learning workflows. A SageMaker pipeline can run the feature processing steps as a preprocessing step, and then feed the output to a training step or an inference step. This can reduce the operational overhead of managing the feature processing workflow and ensure its consistency and reproducibility2.

The other options are not suitable because:

Option B: Using an Amazon SageMaker notebook instance to experiment with different feature transformations, saving the transformations to Amazon S3, using Amazon QuickSight for visualization, and packaging the feature processing steps into an AWS Lambda function for automation will incur more operational overhead than using Amazon SageMaker Data Wrangler. The data scientist will have to write the code for the feature transformations, the data storage, the data visualization, and the Lambda function. Moreover, AWS Lambda has limitations on the execution time, memory size, and package size, which may not be sufficient for complex feature processing tasks3.

Option C: Using AWS Glue Studio with custom code to experiment with different feature transformations, saving the transformations to Amazon S3, using Amazon QuickSight for visualization, and packaging the feature processing steps into an AWS Lambda function for automation will incur more operational overhead than using Amazon SageMaker Data Wrangler. AWS Glue Studio is a visual interface that allows data engineers to create and run extract, transform, and load (ETL) jobs on AWS Glue. However, AWS Glue Studio does not provide preconfigured transformations or templates for feature engineering or data visualization. The data scientist will have to write custom code for these tasks, as well as for the Lambda function. Moreover, AWS Glue Studio is not integrated with SageMaker pipelines, and it may not be optimized for machine learning workflows4.

Option D: Using Amazon SageMaker Data Wrangler preconfigured transformations to experiment with different feature transformations, saving the transformations to Amazon S3, using Amazon QuickSight for visualization, packaging each feature transformation step into a separate AWS Lambda function, and using AWS Step Functions for workflow automation will incur more operational overhead than using Amazon SageMaker Data Wrangler. The data scientist will have to create and manage multiple AWS Lambda functions and AWS Step Functions, which can increase the complexity and cost of the solution. Moreover, AWS Lambda and AWS Step Functions may not be compatible with SageMaker pipelines, and they may not be optimized for machine learning workflows5.

References:

1: Amazon SageMaker Data Wrangler

2: Amazon SageMaker Pipelines

3: AWS Lambda

4: AWS Glue Studio

5: AWS Step Functions

Questions 25

A company is converting a large number of unstructured paper receipts into images. The company wants to create a model based on natural language processing (NLP) to find relevant entities such as date, location, and notes, as well as some custom entities such as receipt numbers.

The company is using optical character recognition (OCR) to extract text for data labeling. However, documents are in different structures and formats, and the company is facing challenges with setting up the manual workflows for each document type. Additionally, the company trained a named entity recognition (NER) model for custom entity detection using a small sample size. This model has a very low confidence score and will require retraining with a large dataset.

Which solution for text extraction and entity detection will require the LEAST amount of effort?

Options:

Extract text from receipt images by using Amazon Textract. Use the Amazon SageMaker BlazingText algorithm to train on the text for entities and custom entities.

Extract text from receipt images by using a deep learning OCR model from the AWS Marketplace. Use the NER deep learning model to extract entities.

Extract text from receipt images by using Amazon Textract. Use Amazon Comprehend for entity detection, and use Amazon Comprehend custom entity recognition for custom entity detection.

Extract text from receipt images by using a deep learning OCR model from the AWS Marketplace. Use Amazon Comprehend for entity detection, and use Amazon Comprehend custom entity recognition for custom entity detection.

Buy Now

Answer:

Explanation:

The best solution for text extraction and entity detection with the least amount of effort is to use Amazon Textract and Amazon Comprehend. These services are:

Amazon Textract for text extraction from receipt images. Amazon Textract is a machine learning service that can automatically extract text and data from scanned documents. It can handle different structures and formats of documents, such as PDF, TIFF, PNG, and JPEG, without any preprocessing steps. It can also extract key-value pairs and tables from documents1

Amazon Comprehend for entity detection and custom entity detection. Amazon Comprehend is a natural language processing service that can identify entities, such as dates, locations, and notes, from unstructured text. It can also detect custom entities, such as receipt numbers, by using a custom entity recognizer that can be trained with a small amount of labeled data2

The other options are not suitable because they either require more effort for text extraction, entity detection, or custom entity detection. For example:

Option A uses the Amazon SageMaker BlazingText algorithm to train on the text for entities and custom entities. BlazingText is a supervised learning algorithm that can perform text classification and word2vec. It requires users to provide a large amount of labeled data, preprocess the data into a specific format, and tune the hyperparameters of the model3

Option B uses a deep learning OCR model from the AWS Marketplace and a NER deep learning model for text extraction and entity detection. These models are pre-trained and may not be suitable for the specific use case of receipt processing. They also require users to deploy and manage the models on Amazon SageMaker or Amazon EC2 instances4

Option D uses a deep learning OCR model from the AWS Marketplace for text extraction. This model has the same drawbacks as option B. It also requires users to integrate the model output with Amazon Comprehend for entity detection and custom entity detection.

References:

1: Amazon Textract – Extract text and data from documents

2: Amazon Comprehend – Natural Language Processing (NLP) and Machine Learning (ML)

3: BlazingText - Amazon SageMaker

4: AWS Marketplace: OCR

Questions 26

A machine learning specialist works for a fruit processing company and needs to build a system that

categorizes apples into three types. The specialist has collected a dataset that contains 150 images for each type of apple and applied transfer learning on a neural network that was pretrained on ImageNet with this dataset.

The company requires at least 85% accuracy to make use of the model.

After an exhaustive grid search, the optimal hyperparameters produced the following:

68% accuracy on the training set

67% accuracy on the validation set

What can the machine learning specialist do to improve the system’s accuracy?

Options:

Upload the model to an Amazon SageMaker notebook instance and use the Amazon SageMaker HPO feature to optimize the model’s hyperparameters.

Add more data to the training set and retrain the model using transfer learning to reduce the bias.

Use a neural network model with more layers that are pretrained on ImageNet and apply transfer learning to increase the variance.

Train a new model using the current neural network architecture.

Buy Now

Questions 27

A retail company is selling products through a global online marketplace. The company wants to use machine learning (ML) to analyze customer feedback and identify specific areas for improvement. A developer has built a tool that collects customer reviews from the online marketplace and stores them in an Amazon S3 bucket. This process yields a dataset of 40 reviews. A data scientist building the ML models must identify additional sources of data to increase the size of the dataset.

Which data sources should the data scientist use to augment the dataset of reviews? (Choose three.)

Options:

Emails exchanged by customers and the company’s customer service agents

Social media posts containing the name of the company or its products

A publicly available collection of news articles

A publicly available collection of customer reviews

Product sales revenue figures for the company

Instruction manuals for the company’s products

Buy Now

Questions 28

A company's Machine Learning Specialist needs to improve the training speed of a time-series forecasting model using TensorFlow. The training is currently implemented on a single-GPU machine and takes approximately 23 hours to complete. The training needs to be run daily.

The model accuracy js acceptable, but the company anticipates a continuous increase in the size of the training data and a need to update the model on an hourly, rather than a daily, basis. The company also wants to minimize coding effort and infrastructure changes

What should the Machine Learning Specialist do to the training solution to allow it to scale for future demand?

Options:

Do not change the TensorFlow code. Change the machine to one with a more powerful GPU to speed up the training.

Change the TensorFlow code to implement a Horovod distributed framework supported by Amazon SageMaker. Parallelize the training to as many machines as needed to achieve the business goals.

Switch to using a built-in AWS SageMaker DeepAR model. Parallelize the training to as many machines as needed to achieve the business goals.

Move the training to Amazon EMR and distribute the workload to as many machines as needed to achieve the business goals.

Buy Now

Questions 29

A company needs to quickly make sense of a large amount of data and gain insight from it. The data is in different formats, the schemas change frequently, and new data sources are added regularly. The company wants to use AWS services to explore multiple data sources, suggest schemas, and enrich and transform the data. The solution should require the least possible coding effort for the data flows and the least possible infrastructure management.

Which combination of AWS services will meet these requirements?

Options:

Amazon EMR for data discovery, enrichment, and transformation

Amazon Athena for querying and analyzing the results in Amazon S3 using standard SQL

Amazon QuickSight for reporting and getting insights

Amazon Kinesis Data Analytics for data ingestion

Amazon EMR for data discovery, enrichment, and transformation

Amazon Redshift for querying and analyzing the results in Amazon S3

AWS Glue for data discovery, enrichment, and transformation

Amazon Athena for querying and analyzing the results in Amazon S3 using standard SQL

Amazon QuickSight for reporting and getting insights

AWS Data Pipeline for data transfer

AWS Step Functions for orchestrating AWS Lambda jobs for data discovery, enrichment, and transformation

Amazon Athena for querying and analyzing the results in Amazon S3 using standard SQL

Amazon QuickSight for reporting and getting insights

Buy Now

Answer:

Explanation:

The best combination of AWS services to meet the requirements of data discovery, enrichment, transformation, querying, analysis, and reporting with the least coding and infrastructure management is AWS Glue, Amazon Athena, and Amazon QuickSight. These services are:

AWS Glue for data discovery, enrichment, and transformation. AWS Glue is a serverless data integration service that automatically crawls, catalogs, and prepares data from various sources and formats. It also provides a visual interface called AWS Glue DataBrew that allows users to apply over 250 transformations to clean, normalize, and enrich data without writing code1

Amazon Athena for querying and analyzing the results in Amazon S3 using standard SQL. Amazon Athena is a serverless interactive query service that allows users to analyze data in Amazon S3 using standard SQL. It supports a variety of data formats, such as CSV, JSON, ORC, Parquet, and Avro. It also integrates with AWS Glue Data Catalog to provide a unified view of the data sources and schemas2

Amazon QuickSight for reporting and getting insights. Amazon QuickSight is a serverless business intelligence service that allows users to create and share interactive dashboards and reports. It also provides ML-powered features, such as anomaly detection, forecasting, and natural language queries, to help users discover hidden insights from their data3

The other options are not suitable because they either require more coding effort, more infrastructure management, or do not support the desired use cases. For example:

Option A uses Amazon EMR for data discovery, enrichment, and transformation. Amazon EMR is a managed cluster platform that runs Apache Spark, Apache Hive, and other open-source frameworks for big data processing. It requires users to write code in languages such as Python, Scala, or SQL to perform data integration tasks. It also requires users to provision, configure, and scale the clusters according to their needs4

Option B uses Amazon Kinesis Data Analytics for data ingestion. Amazon Kinesis Data Analytics is a service that allows users to process streaming data in real time using SQL or Apache Flink. It is not suitable for data discovery, enrichment, and transformation, which are typically batch-oriented tasks. It also requires users to write code to define the data processing logic and the output destination5

Option D uses AWS Data Pipeline for data transfer and AWS Step Functions for orchestrating AWS Lambda jobs for data discovery, enrichment, and transformation. AWS Data Pipeline is a service that helps users move data between AWS services and on-premises data sources. AWS Step Functions is a service that helps users coordinate multiple AWS services into workflows. AWS Lambda is a service that lets users run code without provisioning or managing servers. These services require users to write code to define the data sources, destinations, transformations, and workflows. They also require users to manage the scalability, performance, and reliability of the data pipelines.

References:

1: AWS Glue - Data Integration Service - Amazon Web Services

2: Amazon Athena – Interactive SQL Query Service - AWS

3: Amazon QuickSight - Business Intelligence Service - AWS

4: Amazon EMR - Amazon Web Services

5: Amazon Kinesis Data Analytics - Amazon Web Services

: AWS Data Pipeline - Amazon Web Services

: AWS Step Functions - Amazon Web Services

: AWS Lambda - Amazon Web Services

Questions 30

A machine learning (ML) specialist at a retail company must build a system to forecast the daily sales for one of the company's stores. The company provided the ML specialist with sales data for this store from the past 10 years. The historical dataset includes the total amount of sales on each day for the store. Approximately 10% of the days in the historical dataset are missing sales data.

The ML specialist builds a forecasting model based on the historical dataset. The specialist discovers that the model does not meet the performance standards that the company requires.

Which action will MOST likely improve the performance for the forecasting model?

Options:

Aggregate sales from stores in the same geographic area.

Apply smoothing to correct for seasonal variation.

Change the forecast frequency from daily to weekly.

Replace missing values in the dataset by using linear interpolation.

Buy Now

Questions 31

An automotive company uses computer vision in its autonomous cars. The company trained its object detection models successfully by using transfer learning from a convolutional neural network (CNN). The company trained the models by using PyTorch through the Amazon SageMaker SDK.

The vehicles have limited hardware and compute power. The company wants to optimize the model to reduce memory, battery, and hardware consumption without a significant sacrifice in accuracy.

Which solution will improve the computational efficiency of the models?

Options:

Use Amazon CloudWatch metrics to gain visibility into the SageMaker training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set new weights based on the pruned set of filters. Run a new training job with the pruned model.

Use Amazon SageMaker Ground Truth to build and run data labeling workflows. Collect a larger labeled dataset with the labelling workflows. Run a new training job that uses the new labeled data with previous training data.

Use Amazon SageMaker Debugger to gain visibility into the training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set the new weights based on the pruned set of filters. Run a new training job with the pruned model.

Use Amazon SageMaker Model Monitor to gain visibility into the ModelLatency metric and OverheadLatency metric of the model after the company deploys the model. Increase the model learning rate. Run a new training job.

Buy Now

Answer:

Explanation:

The solution C will improve the computational efficiency of the models because it uses Amazon SageMaker Debugger and pruning, which are techniques that can reduce the size and complexity of the convolutional neural network (CNN) models. The solution C involves the following steps:

Use Amazon SageMaker Debugger to gain visibility into the training weights, gradients, biases, and activation outputs. Amazon SageMaker Debugger is a service that can capture and analyze the tensors that are emitted during the training process of machine learning models. Amazon SageMaker Debugger can provide insights into the model performance, quality, and convergence. Amazon SageMaker Debugger can also help to identify and diagnose issues such as overfitting, underfitting, vanishing gradients, and exploding gradients1.

Compute the filter ranks based on the training information. Filter ranking is a technique that can measure the importance of each filter in a convolutional layer based on some criterion, such as the average percentage of zero activations or the L1-norm of the filter weights. Filter ranking can help to identify the filters that have little or no contribution to the model output, and thus can be removed without affecting the model accuracy2.

Apply pruning to remove the low-ranking filters. Pruning is a technique that can reduce the size and complexity of a neural network by removing the redundant or irrelevant parts of the network, such as neurons, connections, or filters. Pruning can help to improve the computational efficiency, memory usage, and inference speed of the model, as well as to prevent overfitting and improve generalization3.

Set the new weights based on the pruned set of filters. After pruning, the model will have a smaller and simpler architecture, with fewer filters in each convolutional layer. The new weights of the model can be set based on the pruned set of filters, either by initializing them randomly or by fine-tuning them from the original weights4.

Run a new training job with the pruned model. The pruned model can be trained again with the same or a different dataset, using the same or a different framework or algorithm. The new training job can use the same or a different configuration of Amazon SageMaker, such as the instance type, the hyperparameters, or the data ingestion mode. The new training job can also use Amazon SageMaker Debugger to monitor and analyze the training process and the model quality5.

The other options are not suitable because:

Option A: Using Amazon CloudWatch metrics to gain visibility into the SageMaker training weights, gradients, biases, and activation outputs will not be as effective as using Amazon SageMaker Debugger. Amazon CloudWatch is a service that can monitor and observe the operational health and performance of AWS resources and applications. Amazon CloudWatch can provide metrics, alarms, dashboards, and logs for various AWS services, including Amazon SageMaker. However, Amazon CloudWatch does not provide the same level of granularity and detail as Amazon SageMaker Debugger for the tensors that are emitted during the training process of machine learning models. Amazon CloudWatch metrics are mainly focused on the resource utilization and the training progress, not on the model performance, quality, and convergence6.

Option B: Using Amazon SageMaker Ground Truth to build and run data labeling workflows and collecting a larger labeled dataset with the labeling workflows will not improve the computational efficiency of the models. Amazon SageMaker Ground Truth is a service that can create high-quality training datasets for machine learning by using human labelers. A larger labeled dataset can help to improve the model accuracy and generalization, but it will not reduce the memory, battery, and hardware consumption of the model. Moreover, a larger labeled dataset may increase the training time and cost of the model7.

Option D: Using Amazon SageMaker Model Monitor to gain visibility into the ModelLatency metric and OverheadLatency metric of the model after the company deploys the model and increasing the model learning rate will not improve the computational efficiency of the models. Amazon SageMaker Model Monitor is a service that can monitor and analyze the quality and performance of machine learning models that are deployed on Amazon SageMaker endpoints. The ModelLatency metric and the OverheadLatency metric can measure the inference latency of the model and the endpoint, respectively. However, these metrics do not provide any information about the training weights, gradients, biases, and activation outputs of the model, which are needed for pruning. Moreover, increasing the model learning rate will not reduce the size and complexity of the model, but it may affect the model convergence and accuracy.

References:

1: Amazon SageMaker Debugger

2: Pruning Convolutional Neural Networks for Resource Efficient Inference

3: Pruning Neural Networks: A Survey

4: Learning both Weights and Connections for Efficient Neural Networks

5: Amazon SageMaker Training Jobs

6: Amazon CloudWatch Metrics for Amazon SageMaker

7: Amazon SageMaker Ground Truth

: Amazon SageMaker Model Monitor

Questions 32

A finance company needs to forecast the price of a commodity. The company has compiled a dataset of historical daily prices. A data scientist must train various forecasting models on 80% of the dataset and must validate the efficacy of those models on the remaining 20% of the dataset.

What should the data scientist split the dataset into a training dataset and a validation dataset to compare model performance?

Options:

Pick a date so that 80% to the data points precede the date Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset.

Pick a date so that 80% of the data points occur after the date. Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset.

Starting from the earliest date in the dataset. pick eight data points for the training dataset and two data points for the validation dataset. Repeat this stratified sampling until no data points remain.

Sample data points randomly without replacement so that 80% of the data points are in the training dataset. Assign all the remaining data points to the validation dataset.

Buy Now

Questions 33

A large JSON dataset for a project has been uploaded to a private Amazon S3 bucket The Machine Learning Specialist wants to securely access and explore the data from an Amazon SageMaker notebook instance A new VPC was created and assigned to the Specialist

How can the privacy and integrity of the data stored in Amazon S3 be maintained while granting access to the Specialist for analysis?

Options:

Launch the SageMaker notebook instance within the VPC with SageMaker-provided internet access enabled Use an S3 ACL to open read privileges to the everyone group

Launch the SageMaker notebook instance within the VPC and create an S3 VPC endpoint for the notebook to access the data Copy the JSON dataset from Amazon S3 into the ML storage volume on the SageMaker notebook instance and work against the local dataset

Launch the SageMaker notebook instance within the VPC and create an S3 VPC endpoint for the notebook to access the data Define a custom S3 bucket policy to only allow requests from your VPC to access the S3 bucket

Launch the SageMaker notebook instance within the VPC with SageMaker-provided internet access enabled. Generate an S3 pre-signed URL for access to data in the bucket

Buy Now

Questions 34

An interactive online dictionary wants to add a widget that displays words used in similar contexts. A Machine Learning Specialist is asked to provide word features for the downstream nearest neighbor model powering the widget.

What should the Specialist do to meet these requirements?

Options:

Create one-hot word encoding vectors.

Produce a set of synonyms for every word using Amazon Mechanical Turk.

Create word embedding factors that store edit distance with every other word.

Download word embedding’s pre-trained on a large corpus.

Buy Now

Questions 35

A law firm handles thousands of contracts every day. Every contract must be signed. Currently, a lawyer manually checks all contracts for signatures.

The law firm is developing a machine learning (ML) solution to automate signature detection for each contract. The ML solution must also provide a confidence score for each contract page.

Which Amazon Textract API action can the law firm use to generate a confidence score for each page of each contract?

Options:

Use the AnalyzeDocument API action. Set the FeatureTypes parameter to SIGNATURES. Return the confidence scores for each page.

Use the Prediction API call on the documents. Return the signatures and confidence scores for each page.

Use the StartDocumentAnalysis API action to detect the signatures. Return the confidence scores for each page.

Use the GetDocumentAnalysis API action to detect the signatures. Return the confidence scores for each page

Buy Now

Answer:

Explanation:

The AnalyzeDocument API action is the best option to generate a confidence score for each page of each contract. This API action analyzes an input document for relationships between detected items. The input document can be an image file in JPEG or PNG format, or a PDF file. The output is a JSON structure that contains the extracted data from the document. The FeatureTypes parameter specifies the types of analysis to perform on the document. The available feature types are TABLES, FORMS, and SIGNATURES. By setting the FeatureTypes parameter to SIGNATURES, the API action will detect and extract information about signatures from the document. The output will include a list of SignatureDetection objects, each containing information about a detected signature, such as its location and confidence score. The confidence score is a value between 0 and 100 that indicates the probability that the detected signature is correct. The output will also include a list of Block objects, each representing a document page. Each Block object will have a Page attribute that contains the page number and a Confidence attribute that contains the confidence score for the page. The confidence score for the page is the average of the confidence scores of the blocks that are detected on the page. The law firm can use the AnalyzeDocument API action to generate a confidence score for each page of each contract by using the SIGNATURES feature type and returning the confidence scores from the SignatureDetection and Block objects.

The other options are not suitable for generating a confidence score for each page of each contract. The Prediction API call is not an Amazon Textract API action, but a generic term for making inference requests to a machine learning model. The StartDocumentAnalysis API action is used to start an asynchronous job to analyze a document. The output is a job identifier (JobId) that is used to get the results of the analysis with the GetDocumentAnalysis API action. The GetDocumentAnalysis API action is used to get the results of a document analysis started by the StartDocumentAnalysis API action. The output is a JSON structure that contains the extracted data from the document. However, both the StartDocumentAnalysis and the GetDocumentAnalysis API actions do not support the SIGNATURES feature type, and therefore cannot detect signatures or provide confidence scores for them.

References:

•AnalyzeDocument

•SignatureDetection

•Block

•Amazon Textract launches the ability to detect signatures on any document

Questions 36

A company wants to predict the sale prices of houses based on available historical sales data. The target

variable in the company’s dataset is the sale price. The features include parameters such as the lot size, living

area measurements, non-living area measurements, number of bedrooms, number of bathrooms, year built,

and postal code. The company wants to use multi-variable linear regression to predict house sale prices.

Which step should a machine learning specialist take to remove features that are irrelevant for the analysis

and reduce the model’s complexity?

Options:

Plot a histogram of the features and compute their standard deviation. Remove features with high variance.

Plot a histogram of the features and compute their standard deviation. Remove features with low variance.

Build a heatmap showing the correlation of the dataset against itself. Remove features with low mutual correlation scores.

Run a correlation check of all features against the target variable. Remove features with low target variable correlation scores.

Buy Now

Questions 37

A Machine Learning Specialist is working with a large company to leverage machine learning within its products. The company wants to group its customers into categories based on which customers will and will not churn within the next 6 months. The company has labeled the data available to the Specialist.

Which machine learning model type should the Specialist use to accomplish this task?

Options:

Linear regression

Classification

Clustering

Reinforcement learning

Buy Now

Questions 38

Each morning, a data scientist at a rental car company creates insights about the previous day’s rental car reservation demands. The company needs to automate this process by streaming the data to Amazon S3 in near real time. The solution must detect high-demand rental cars at each of the company’s locations. The solution also must create a visualization dashboard that automatically refreshes with the most recent data.

Which solution will meet these requirements with the LEAST development time?

Options:

Use Amazon Kinesis Data Firehose to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using Amazon QuickSight ML Insights. Visualize the data in QuickSight.

Use Amazon Kinesis Data Streams to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using the Random Cut Forest (RCF) trained model in Amazon SageMaker. Visualize the data in Amazon QuickSight.

Use Amazon Kinesis Data Firehose to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using the Random Cut Forest (RCF) trained model in Amazon SageMaker. Visualize the data in Amazon QuickSight.

Use Amazon Kinesis Data Streams to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using Amazon QuickSight ML Insights. Visualize the data in QuickSight.

Buy Now

Questions 39

A company distributes an online multiple-choice survey to several thousand people. Respondents to the survey can select multiple options for each question.

A machine learning (ML) engineer needs to comprehensively represent every response from all respondents in a dataset. The ML engineer will use the dataset to train a logistic regression model.

Which solution will meet these requirements?

Options:

Perform one-hot encoding on every possible option for each question of the survey.

Perform binning on all the answers each respondent selected for each question.

Use Amazon Mechanical Turk to create categorical labels for each set of possible responses.

Use Amazon Textract to create numeric features for each set of possible responses.

Buy Now

Questions 40

A car company is developing a machine learning solution to detect whether a car is present in an image. The image dataset consists of one million images. Each image in the dataset is 200 pixels in height by 200 pixels in width. Each image is labeled as either having a car or not having a car.

Which architecture is MOST likely to produce a model that detects whether a car is present in an image with the highest accuracy?

Options:

Use a deep convolutional neural network (CNN) classifier with the images as input. Include a linear output layer that outputs the probability that an image contains a car.

Use a deep convolutional neural network (CNN) classifier with the images as input. Include a softmax output layer that outputs the probability that an image contains a car.

Use a deep multilayer perceptron (MLP) classifier with the images as input. Include a linear output layer that outputs the probability that an image contains a car.

Use a deep multilayer perceptron (MLP) classifier with the images as input. Include a softmax output layer that outputs the probability that an image contains a car.

Buy Now

Questions 41

A machine learning (ML) specialist is training a linear regression model. The specialist notices that the model is overfitting. The specialist applies an L1 regularization parameter and runs the model again. This change results in all features having zero weights.

What should the ML specialist do to improve the model results?

Options:

Increase the L1 regularization parameter. Do not change any other training parameters.

Decrease the L1 regularization parameter. Do not change any other training parameters.

Introduce a large L2 regularization parameter. Do not change the current L1 regularization value.

Introduce a small L2 regularization parameter. Do not change the current L1 regularization value.

Buy Now

Questions 42

A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age.

Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population.

How should the Data Scientist correct this issue?

Options:

Drop all records from the dataset where age has been set to 0.

Replace the age field value for records with a value of 0 with the mean or median value from the dataset.

Drop the age feature from the dataset and train the model using the rest of the features.

Use k-means clustering to handle missing features.

Buy Now

Questions 43

A data scientist needs to identify fraudulent user accounts for a company's ecommerce platform. The company wants the ability to determine if a newly created account is associated with a previously known fraudulent user. The data scientist is using AWS Glue to cleanse the company's application logs during ingestion.

Which strategy will allow the data scientist to identify fraudulent accounts?

Options:

Execute the built-in FindDuplicates Amazon Athena query.

Create a FindMatches machine learning transform in AWS Glue.

Create an AWS Glue crawler to infer duplicate accounts in the source data.

Search for duplicate accounts in the AWS Glue Data Catalog.

Buy Now

Answer:

Explanation:

The best strategy to identify fraudulent accounts is to create a FindMatches machine learning transform in AWS Glue. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly. This can help you improve fraud detection by finding accounts that are associated with a previously known fraudulent user. You can teach the FindMatches transform your definition of a “duplicate” or a “match” through examples, and it will use machine learning to identify other potential duplicates or matches in your dataset. You can then use the FindMatches transform in your AWS Glue ETL jobs to cleanse your data.

Option A is incorrect because there is no built-in FindDuplicates Amazon Athena query. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. However, Amazon Athena does not provide a predefined query to find duplicate records in a dataset. You would have to write your own SQL query to perform this task, which might not be as effective or accurate as using the FindMatches transform.

Option C is incorrect because creating an AWS Glue crawler to infer duplicate accounts in the source data is not a valid strategy. An AWS Glue crawler is a program that connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the AWS Glue Data Catalog. A crawler does not perform any data cleansing or record matching tasks.

Option D is incorrect because searching for duplicate accounts in the AWS Glue Data Catalog is not a feasible strategy. The AWS Glue Data Catalog is a central repository to store structural and operational metadata for your data assets. The Data Catalog does not store the actual data, but rather the metadata that describes where the data is located, how it is formatted, and what it contains. Therefore, you cannot search for duplicate records in the Data Catalog.

References:

Record matching with AWS Lake Formation FindMatches - AWS Glue

Amazon Athena – Interactive SQL Queries for Data in Amazon S3

AWS Glue Crawlers - AWS Glue

AWS Glue Data Catalog - AWS Glue

Questions 44

A company is planning a marketing campaign to promote a new product to existing customers. The company has data (or past promotions that are similar. The company decides to try an experiment to send a more expensive marketing package to a smaller number of customers. The company wants to target the marketing campaign to customers who are most likely to buy the new product. The experiment requires that at least 90% of the customers who are likely to purchase the new product receive the marketing materials.

...company trains a model by using the linear learner algorithm in Amazon SageMaker. The model has a recall score of 80% and a precision of 75%.

...should the company retrain the model to meet these requirements?

Options:

Set the target_recall hyperparameter to 90% Set the binaryclassrfier model_selection_critena hyperparameter to recall_at_target_precision.

Set the targetprecision hyperparameter to 90%. Set the binary classifier model selection criteria hyperparameter to precision at_jarget recall.

Use 90% of the historical data for training Set the number of epochs to 20.

Set the normalize_jabel hyperparameter to true. Set the number of classes to 2.

Buy Now

Answer:

Explanation:

The best way to retrain the model to meet the requirements is to set the target_recall hyperparameter to 90% and set the binary_classifier_model_selection_criteria hyperparameter to recall_at_target_precision. This will instruct the linear learner algorithm to optimize the model for a high recall score, while maintaining a reasonable precision score. Recall is the proportion of actual positives that were identified correctly, which is important for the company’s goal of reaching at least 90% of the customers who are likely to buy the new product1. Precision is the proportion of positive identifications that were actually correct, which is also relevant for the company’s budget and efficiency2. By setting the target_recall to 90%, the algorithm will try to achieve a recall score of at least 90%, and by setting the binary_classifier_model_selection_criteria to recall_at_target_precision, the algorithm will select the model that has the highest recall score among those that have a precision score equal to or higher than the target precision3. The target precision is automatically set to the median of the precision scores of all the models trained in parallel4.

The other options are not correct or optimal, because they have the following drawbacks:

B: Setting the target_precision hyperparameter to 90% and setting the binary_classifier_model_selection_criteria hyperparameter to precision_at_target_recall will optimize the model for a high precision score, while maintaining a reasonable recall score. However, this is not aligned with the company’s goal of reaching at least 90% of the customers who are likely to buy the new product, as precision does not reflect how well the model identifies the actual positives1. Moreover, setting the target_precision to 90% might be too high and unrealistic for the dataset, as the current precision score is only 75%4.

C: Using 90% of the historical data for training and setting the number of epochs to 20 will not necessarily improve the recall score of the model, as it does not change the optimization objective or the model selection criteria. Moreover, using more data for training might reduce the amount of data available for validation, which is needed for selecting the best model among the ones trained in parallel3. The number of epochs is also not a decisive factor for the recall score, as it depends on the learning rate, the optimizer, and the convergence of the algorithm5.

D: Setting the normalize_label hyperparameter to true and setting the number of classes to 2 will not affect the recall score of the model, as these are irrelevant hyperparameters for binary classification problems. The normalize_label hyperparameter is only applicable for regression problems, as it controls whether the label is normalized to have zero mean and unit variance3. The number of classes hyperparameter is only applicable for multiclass classification problems, as it specifies the number of output classes3.

References:

1: Classification: Precision and Recall | Machine Learning | Google for Developers

2: Precision and recall - Wikipedia

3: Linear Learner Algorithm - Amazon SageMaker

4: How linear learner works - Amazon SageMaker

5: Getting hands-on with Amazon SageMaker Linear Learner - Pluralsight

Questions 45

A financial services company is building a robust serverless data lake on Amazon S3. The data lake should be flexible and meet the following requirements:

* Support querying old and new data on Amazon S3 through Amazon Athena and Amazon Redshift Spectrum.

* Support event-driven ETL pipelines.

* Provide a quick and easy way to understand metadata.

Which approach meets trfese requirements?

Options:

Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Glue ETL job, and an AWS Glue Data catalog to search and discover metadata.

Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Batch job, and an external Apache Hive metastore to search and discover metadata.

Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Batch job, and an AWS Glue Data Catalog to search and discover metadata.

Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Glue ETL job, and an external Apache Hive metastore to search and discover metadata.

Buy Now

Answer:

Explanation:

To build a robust serverless data lake on Amazon S3 that meets the requirements, the financial services company should use the following AWS services:

AWS Glue crawler: This is a service that connects to a data store, progresses through a prioritized list of classifiers to determine the schema for the data, and then creates metadata tables in the AWS Glue Data Catalog1. The company can use an AWS Glue crawler to crawl the S3 data and infer the schema, format, and partition structure of the data. The crawler can also detect schema changes and update the metadata tables accordingly. This enables the company to support querying old and new data on Amazon S3 through Amazon Athena and Amazon Redshift Spectrum, which are serverless interactive query services that use the AWS Glue Data Catalog as a central location for storing and retrieving table metadata23.

AWS Lambda function: This is a service that lets you run code without provisioning or managing servers. You pay only for the compute time you consume - there is no charge when your code is not running. You can also use AWS Lambda to create event-driven ETL pipelines, by triggering other AWS services based on events such as object creation or deletion in S3 buckets4. The company can use an AWS Lambda function to trigger an AWS Glue ETL job, which is a serverless way to extract, transform, and load data for analytics. The AWS Glue ETL job can perform various data processing tasks, such as converting data formats, filtering, aggregating, joining, and more.

AWS Glue Data Catalog: This is a managed service that acts as a central metadata repository for data assets across AWS and on-premises data sources. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and use that metadata to query and transform the data. The company can use the AWS Glue Data Catalog to search and discover metadata, such as table definitions, schemas, and partitions. The AWS Glue Data Catalog also integrates with Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL jobs, providing a consistent view of the data across different query and analysis services.

References:

1: What Is a Crawler? - AWS Glue

2: What Is Amazon Athena? - Amazon Athena

3: Amazon Redshift Spectrum - Amazon Redshift

4: What is AWS Lambda? - AWS Lambda

: AWS Glue ETL Jobs - AWS Glue

: What Is the AWS Glue Data Catalog? - AWS Glue

Questions 46

A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket A Machine Learning Specialist wants to use SQL to run queries on this data. Which solution requires the LEAST effort to be able to query this data?

Options:

Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.

Use AWS Glue to catalogue the data and Amazon Athena to run queries

Use AWS Batch to run ETL on the data and Amazon Aurora to run the quenes

Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries

Buy Now

Questions 47

A bank has collected customer data for 10 years in CSV format. The bank stores the data in an on-premises server. A data science team wants to use Amazon SageMaker to build and train a machine learning (ML) model to predict churn probability. The team will use the historical data. The data scientists want to perform data transformations quickly and to generate data insights before the team builds a model for production.

Which solution will meet these requirements with the LEAST development effort?

Options:

Upload the data into the SageMaker Data Wrangler console directly. Perform data transformations and generate insights within Data Wrangler.

Upload the data into an Amazon S3 bucket. Allow SageMaker to access the data that is in the bucket. Import the data from the S3 bucket into SageMaker Data Wrangler. Perform data transformations and generate insights within Data Wrangler.

Upload the data into the SageMaker Data Wrangler console directly. Allow SageMaker and Amazon QuickSight to access the data that is in an Amazon S3 bucket. Perform data transformations in Data Wrangler and save the transformed data into a second S3 bucket. Use QuickSight to generate data insights.

Upload the data into an Amazon S3 bucket. Allow SageMaker to access the data that is in the bucket. Import the data from the bucket into SageMaker Data Wrangler. Perform data transformations in Data Wrangler. Save the data into a second S3 bucket. Use a SageMaker Studio notebook to generate data insights.

Buy Now

Questions 48

A large mobile network operating company is building a machine learning model to predict customers who are likely to unsubscribe from the service. The company plans to offer an incentive for these customers as the cost of churn is far greater than the cost of the incentive.

The model produces the following confusion matrix after evaluating on a test dataset of 100 customers:

Based on the model evaluation results, why is this a viable model for production?

Options:

The model is 86% accurate and the cost incurred by the company as a result of false negatives is less than the false positives.

The precision of the model is 86%, which is less than the accuracy of the model.

The model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives.

The precision of the model is 86%, which is greater than the accuracy of the model.

Buy Now

Questions 49

A credit card company wants to identify fraudulent transactions in real time. A data scientist builds a machine learning model for this purpose. The transactional data is captured and stored in Amazon S3. The historic data is already labeled with two classes: fraud (positive) and fair transactions (negative). The data scientist removes all the missing data and builds a classifier by using the XGBoost algorithm in Amazon SageMaker. The model produces the following results:

• True positive rate (TPR): 0.700

• False negative rate (FNR): 0.300

• True negative rate (TNR): 0.977

• False positive rate (FPR): 0.023

• Overall accuracy: 0.949

Which solution should the data scientist use to improve the performance of the model?

Options:

Apply the Synthetic Minority Oversampling Technique (SMOTE) on the minority class in the training dataset. Retrain the model with the updated training data.

Apply the Synthetic Minority Oversampling Technique (SMOTE) on the majority class in the training dataset. Retrain the model with the updated training data.

Undersample the minority class.

Oversample the majority class.

Buy Now

Answer:

Explanation:

The solution that the data scientist should use to improve the performance of the model is to apply the Synthetic Minority Oversampling Technique (SMOTE) on the minority class in the training dataset, and retrain the model with the updated training data. This solution can address the problem of class imbalance in the dataset, which can affect the model’s ability to learn from the rare but important positive class (fraud).

Class imbalance is a common issue in machine learning, especially for classification tasks. It occurs when one class (usually the positive or target class) is significantly underrepresented in the dataset compared to the other class (usually the negative or non-target class). For example, in the credit card fraud detection problem, the positive class (fraud) is much less frequent than the negative class (fair transactions). This can cause the model to be biased towards the majority class, and fail to capture the characteristics and patterns of the minority class. As a result, the model may have a high overall accuracy, but a low recall or true positive rate for the minority class, which means it misses many fraudulent transactions.

SMOTE is a technique that can help mitigate the class imbalance problem by generating synthetic samples for the minority class. SMOTE works by finding the k-nearest neighbors of each minority class instance, and randomly creating new instances along the line segments connecting them. This way, SMOTE can increase the number and diversity of the minority class instances, without duplicating or losing any information. By applying SMOTE on the minority class in the training dataset, the data scientist can balance the classes and improve the model’s performance on the positive class1.

The other options are either ineffective or counterproductive. Applying SMOTE on the majority class would not balance the classes, but increase the imbalance and the size of the dataset. Undersampling the minority class would reduce the number of instances available for the model to learn from, and potentially lose some important information. Oversampling the majority class would also increase the imbalance and the size of the dataset, and introduce redundancy and overfitting.

References:

1: SMOTE for Imbalanced Classification with Python - Machine Learning Mastery

Questions 50

A wildlife research company has a set of images of lions and cheetahs. The company created a dataset of the images. The company labeled each image with a binary label that indicates whether an image contains a lion or cheetah. The company wants to train a model to identify whether new images contain a lion or cheetah.

.... Dh Amazon SageMaker algorithm will meet this requirement?

Options:

XGBoost

Image Classification - TensorFlow

Object Detection - TensorFlow

Semantic segmentation - MXNet

Buy Now

Questions 51

A company plans to build a custom natural language processing (NLP) model to classify and prioritize user feedback. The company hosts the data and all machine learning (ML) infrastructure in the AWS Cloud. The ML team works from the company's office, which has an IPsec VPN connection to one VPC in the AWS Cloud.

The company has set both the enableDnsHostnames attribute and the enableDnsSupport attribute of the VPC to true. The company's DNS resolvers point to the VPC DNS. The company does not allow the ML team to access Amazon SageMaker notebooks through connections that use the public internet. The connection must stay within a private network and within the AWS internal network.

Which solution will meet these requirements with the LEAST development effort?

Options:

Create a VPC interface endpoint for the SageMaker notebook in the VPC. Access the notebook through a VPN connection and the VPC endpoint.

Create a bastion host by using Amazon EC2 in a public subnet within the VPC. Log in to the bastion host through a VPN connection. Access the SageMaker notebook from the bastion host.

Create a bastion host by using Amazon EC2 in a private subnet within the VPC with a NAT gateway. Log in to the bastion host through a VPN connection. Access the SageMaker notebook from the bastion host.

Create a NAT gateway in the VPC. Access the SageMaker notebook HTTPS endpoint through a VPN connection and the NAT gateway.

Buy Now

Questions 52

A manufacturing company asks its Machine Learning Specialist to develop a model that classifies defective parts into one of eight defect types. The company has provided roughly 100000 images per defect type for training During the injial training of the image classification model the Specialist notices that the validation accuracy is 80%, while the training accuracy is 90% It is known that human-level performance for this type of image classification is around 90%

What should the Specialist consider to fix this issue1?

Options:

A longer training time

Making the network larger

Using a different optimizer

Using some form of regularization

Buy Now

Answer:

Explanation:

Regularization is a technique that can be used to prevent overfitting and improve model performance on unseen data. Overfitting occurs when the model learns the training data too well and fails to generalize to new and unseen data. This can be seen in the question, where the validation accuracy is lower than the training accuracy, and both are lower than the human-level performance. Regularization is a way of adding some constraints or penalties to the model to reduce its complexity and prevent it from memorizing the training data. Some common forms of regularization for image classification are:

Weight decay: Adding a term to the loss function that penalizes large weights in the model. This can help reduce the variance and noise in the model and make it more robust to small changes in the input.

Dropout: Randomly dropping out some units or connections in the model during training. This can help reduce the co-dependency among the units and make the model more resilient to missing or corrupted features.

Data augmentation: Artificially increasing the size and diversity of the training data by applying random transformations, such as cropping, flipping, rotating, scaling, etc. This can help the model learn more invariant and generalizable features and reduce the risk of overfitting to specific patterns in the training data.

The other options are not likely to fix the issue of overfitting, and may even worsen it:

A longer training time: This can lead to more overfitting, as the model will have more chances to fit the noise and details in the training data that are not relevant for the validation data.

Making the network larger: This can increase the model capacity and complexity, which can also lead to more overfitting, as the model will have more parameters to learn and adjust to the training data.

Using a different optimizer: This can affect the speed and stability of the training process, but not necessarily the generalization ability of the model. The choice of optimizer depends on the characteristics of the data and the model, and there is no guarantee that a different optimizer will prevent overfitting.

References:

Regularization (machine learning)

Image Classification: Regularization

How to Reduce Overfitting With Dropout Regularization in Keras

Questions 53

A company processes millions of orders every day. The company uses Amazon DynamoDB tables to store order information. When customers submit new orders, the new orders are immediately added to the DynamoDB tables. New orders arrive in the DynamoDB tables continuously.

A data scientist must build a peak-time prediction solution. The data scientist must also create an Amazon OuickSight dashboard to display near real-lime order insights. The data scientist needs to build a solution that will give QuickSight access to the data as soon as new order information arrives.

Which solution will meet these requirements with the LEAST delay between when a new order is processed and when QuickSight can access the new order information?

Options:

Use AWS Glue to export the data from Amazon DynamoDB to Amazon S3. Configure OuickSight to access the data in Amazon S3.

Use Amazon Kinesis Data Streams to export the data from Amazon DynamoDB to Amazon S3. Configure OuickSight to access the data in Amazon S3.

Use an API call from OuickSight to access the data that is in Amazon DynamoDB directly

Use Amazon Kinesis Data Firehose to export the data from Amazon DynamoDB to Amazon S3. Configure OuickSight to access the data in Amazon S3.

Buy Now

Questions 54

Which of the following metrics should a Machine Learning Specialist generally use to compare/evaluate machine learning classification models against each other?

Options:

Recall

Misclassification rate

Mean absolute percentage error (MAPE)

Area Under the ROC Curve (AUC)

Buy Now

Questions 55

A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena The dataset contains more than 800.000 records stored as plaintext CSV files Each record contains 200 columns and is approximately 1 5 MB in size Most queries will span 5 to 10 columns only

How should the Machine Learning Specialist transform the dataset to minimize query runtime?

Options:

Convert the records to Apache Parquet format

Convert the records to JSON format

Convert the records to GZIP CSV format

Convert the records to XML format

Buy Now

Questions 56

An obtain relator collects the following data on customer orders: demographics, behaviors, location, shipment progress, and delivery time. A data scientist joins all the collected datasets. The result is a single dataset that includes 980 variables.

The data scientist must develop a machine learning (ML) model to identify groups of customers who are likely to respond to a marketing campaign.

Which combination of algorithms should the data scientist use to meet this requirement? (Select TWO.)

Options:

Latent Dirichlet Allocation (LDA)

K-means

Se mantic feg mentation

Principal component analysis (PCA)

Factorization machines (FM)

Buy Now

Questions 57

A Machine Learning Specialist is configuring automatic model tuning in Amazon SageMaker

When using the hyperparameter optimization feature, which of the following guidelines should be followed to improve optimization?

Choose the maximum number of hyperparameters supported by

Options:

Amazon SageMaker to search the largest number of combinations possible

Specify a very large hyperparameter range to allow Amazon SageMaker to cover every possible value.

Use log-scaled hyperparameters to allow the hyperparameter space to be searched as quickly as possible

Execute only one hyperparameter tuning job at a time and improve tuning through successive rounds of experiments

Buy Now

Answer:

Explanation:

Using log-scaled hyperparameters is a guideline that can improve the automatic model tuning in Amazon SageMaker. Log-scaled hyperparameters are hyperparameters that have values that span several orders of magnitude, such as learning rate, regularization parameter, or number of hidden units. Log-scaled hyperparameters can be specified by using a log-uniform distribution, which assigns equal probability to each order of magnitude within a range. For example, a log-uniform distribution between 0.001 and 1000 can sample values such as 0.001, 0.01, 0.1, 1, 10, 100, or 1000 with equal probability. Using log-scaled hyperparameters can allow the hyperparameter optimization feature to search the hyperparameter space more efficiently and effectively, as it can explore different scales of values and avoid sampling values that are too small or too large. Using log-scaled hyperparameters can also help avoid numerical issues, such as underflow or overflow, that may occur when using linear-scaled hyperparameters. Using log-scaled hyperparameters can be done by setting the ScalingType parameter to Logarithmic when defining the hyperparameter ranges in Amazon SageMaker12

The other options are not valid or relevant guidelines for improving the automatic model tuning in Amazon SageMaker. Choosing the maximum number of hyperparameters supported by Amazon SageMaker to search the largest number of combinations possible is not a good practice, as it can increase the time and cost of the tuning job and make it harder to find the optimal values. Amazon SageMaker supports up to 20 hyperparameters for tuning, but it is recommended to choose only the most important and influential hyperparameters for the model and algorithm, and use default or fixed values for the rest3 Specifying a very large hyperparameter range to allow Amazon SageMaker to cover every possible value is not a good practice, as it can result in sampling values that are irrelevant or impractical for the model and algorithm, and waste the tuning budget. It is recommended to specify a reasonable and realistic hyperparameter range based on the prior knowledge and experience of the model and algorithm, and use the results of the tuning job to refine the range if needed4 Executing only one hyperparameter tuning job at a time and improving tuning through successive rounds of experiments is not a good practice, as it can limit the exploration and exploitation of the hyperparameter space and make the tuning process slower and less efficient. It is recommended to use parallelism and concurrency to run multiple training jobs simultaneously and leverage the Bayesian optimization algorithm that Amazon SageMaker uses to guide the search for the best hyperparameter values5

Questions 58

An engraving company wants to automate its quality control process for plaques. The company performs the process before mailing each customized plaque to a customer. The company has created an Amazon S3 bucket that contains images of defects that should cause a plaque to be rejected. Low-confidence predictions must be sent to an internal team of reviewers who are using Amazon Augmented Al (Amazon A2I).

Which solution will meet these requirements?

Options:

Use Amazon Textract for automatic processing. Use Amazon A2I with Amazon Mechanical Turk for manual review.

Use Amazon Rekognition for automatic processing. Use Amazon A2I with a private workforce option for manual review.

Use Amazon Transcribe for automatic processing. Use Amazon A2I with a private workforce option for manual review.

Use AWS Panorama for automatic processing Use Amazon A2I with Amazon Mechanical Turk for manual review

Buy Now

Questions 59

A machine learning (ML) specialist needs to extract embedding vectors from a text series. The goal is to provide a ready-to-ingest feature space for a data scientist to develop downstream ML predictive models. The text consists of curated sentences in English. Many sentences use similar words but in different contexts. There are questions and answers among the sentences, and the embedding space must differentiate between them.

Which options can produce the required embedding vectors that capture word context and sequential QA information? (Choose two.)

Options:

Amazon SageMaker seq2seq algorithm

Amazon SageMaker BlazingText algorithm in Skip-gram mode

Amazon SageMaker Object2Vec algorithm

Amazon SageMaker BlazingText algorithm in continuous bag-of-words (CBOW) mode

Combination of the Amazon SageMaker BlazingText algorithm in Batch Skip-gram mode with a custom recurrent neural network (RNN)

Buy Now

Answer:

B, E

Explanation:

To capture word context and sequential QA information, the embedding vectors need to consider both the order and the meaning of the words in the text.

Option B, Amazon SageMaker BlazingText algorithm in Skip-gram mode, is a valid option because it can learn word embeddings that capture the semantic similarity and syntactic relations between words based on their co-occurrence in a window of words. Skip-gram mode can also handle rare words better than continuous bag-of-words (CBOW) mode1.

Option E, combination of the Amazon SageMaker BlazingText algorithm in Batch Skip-gram mode with a custom recurrent neural network (RNN), is another valid option because it can leverage the advantages of Skip-gram mode and also use an RNN to model the sequential nature of the text. An RNN can capture the temporal dependencies and long-term dependencies between words, which are important for QA tasks2.

Option A, Amazon SageMaker seq2seq algorithm, is not a valid option because it is designed for sequence-to-sequence tasks such as machine translation, summarization, or chatbots. It does not produce embedding vectors for text series, but rather generates an output sequence given an input sequence3.

Option C, Amazon SageMaker Object2Vec algorithm, is not a valid option because it is designed for learning embeddings for pairs of objects, such as text-image, text-text, or image-image. It does not produce embedding vectors for text series, but rather learns a similarity function between pairs of objects4.

Option D, Amazon SageMaker BlazingText algorithm in continuous bag-of-words (CBOW) mode, is not a valid option because it does not capture word context as well as Skip-gram mode. CBOW mode predicts a word given its surrounding words, while Skip-gram mode predicts the surrounding words given a word. CBOW mode is faster and more suitable for frequent words, but Skip-gram mode can learn more meaningful embeddings for rare words1.

References:

1: Amazon SageMaker BlazingText

2: Recurrent Neural Networks (RNNs)

3: Amazon SageMaker Seq2Seq

4: Amazon SageMaker Object2Vec

Questions 60

A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data.

Which solution requires the LEAST effort to be able to query this data?

Options:

Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.

Use AWS Glue to catalogue the data and Amazon Athena to run queries.

Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.

Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.

Buy Now

Answer:

Explanation:

Using AWS Glue to catalogue the data and Amazon Athena to run queries is the solution that requires the least effort to be able to query the data stored in an Amazon S3 bucket using SQL. AWS Glue is a service that provides a serverless data integration platform for data preparation and transformation. AWS Glue can automatically discover, crawl, and catalogue the data stored in various sources, such as Amazon S3, Amazon RDS, Amazon Redshift, etc. AWS Glue can also use AWS KMS to encrypt the data at rest on the Glue Data Catalog and Glue ETL jobs. AWS Glue can handle both structured and unstructured data, and support various data formats, such as CSV, JSON, Parquet, etc. AWS Glue can also use built-in or custom classifiers to identify and parse the data schema and format1 Amazon Athena is a service that provides an interactive query engine that can run SQL queries directly on data stored in Amazon S3. Amazon Athena can integrate with AWS Glue to use the Glue Data Catalog as a central metadata repository for the data sources and tables. Amazon Athena can also use AWS KMS to encrypt the data at rest on Amazon S3 and the query results. Amazon Athena can query both structured and unstructured data, and support various data formats, such as CSV, JSON, Parquet, etc. Amazon Athena can also use partitions and compression to optimize the query performance and reduce the query cost23

The other options are not valid or require more effort to query the data stored in an Amazon S3 bucket using SQL. Using AWS Data Pipeline to transform the data and Amazon RDS to run queries is not a good option, as it involves moving the data from Amazon S3 to Amazon RDS, which can incur additional time and cost. AWS Data Pipeline is a service that can orchestrate and automate data movement and transformation across various AWS services and on-premises data sources. AWS Data Pipeline can be integrated with Amazon EMR to run ETL jobs on the data stored in Amazon S3. Amazon RDS is a service that provides a managed relational database service that can run various database engines, such as MySQL, PostgreSQL, Oracle, etc. Amazon RDS can use AWS KMS to encrypt the data at rest and in transit. Amazon RDS can run SQL queries on the data stored in the database tables45 Using AWS Batch to run ETL on the data and Amazon Aurora to run the queries is not a good option, as it also involves moving the data from Amazon S3 to Amazon Aurora, which can incur additional time and cost. AWS Batch is a service that can run batch computing workloads on AWS. AWS Batch can be integrated with AWS Lambda to trigger ETL jobs on the data stored in Amazon S3. Amazon Aurora is a service that provides a compatible and scalable relational database engine that can run MySQL or PostgreSQL. Amazon Aurora can use AWS KMS to encrypt the data at rest and in transit. Amazon Aurora can run SQL queries on the data stored in the database tables. Using AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries is not a good option, as it is not suitable for querying data stored in Amazon S3 using SQL. AWS Lambda is a service that can run serverless functions on AWS. AWS Lambda can be integrated with Amazon S3 to trigger data transformation functions on the data stored in Amazon S3. Amazon Kinesis Data Analytics is a service that can analyze streaming data using SQL or Apache Flink. Amazon Kinesis Data Analytics can be integrated with Amazon Kinesis Data Streams or Amazon Kinesis Data Firehose to ingest streaming data sources, such as web logs, social media, IoT devices, etc. Amazon Kinesis Data Analytics is not designed for querying data stored in Amazon S3 using SQL.

Questions 61

A machine learning specialist is running an Amazon SageMaker endpoint using the built-in object detection algorithm on a P3 instance for real-time predictions in a company's production application. When evaluating the model's resource utilization, the specialist notices that the model is using only a fraction of the GPU.

Which architecture changes would ensure that provisioned resources are being utilized effectively?

Options:

Redeploy the model as a batch transform job on an M5 instance.

Redeploy the model on an M5 instance. Attach Amazon Elastic Inference to the instance.

Redeploy the model on a P3dn instance.

Deploy the model onto an Amazon Elastic Container Service (Amazon ECS) cluster using a P3 instance.

Buy Now

Answer:

Explanation:

The best way to ensure that provisioned resources are being utilized effectively is to redeploy the model on an M5 instance and attach Amazon Elastic Inference to the instance. Amazon Elastic Inference allows you to attach low-cost GPU-powered acceleration to Amazon EC2 and Amazon SageMaker instances to reduce the cost of running deep learning inference by up to 75%. By using Amazon Elastic Inference, you can choose the instance type that is best suited to the overall CPU and memory needs of your application, and then separately configure the amount of inference acceleration that you need with no code changes. This way, you can avoid wasting GPU resources and pay only for what you use.

Option A is incorrect because a batch transform job is not suitable for real-time predictions. Batch transform is a high-performance and cost-effective feature for generating inferences using your trained models. Batch transform manages all of the compute resources required to get inferences. Batch transform is ideal for scenarios where you’re working with large batches of data, don’t need sub-second latency, or need to process data that is stored in Amazon S3.

Option C is incorrect because redeploying the model on a P3dn instance would not improve the resource utilization. P3dn instances are designed for distributed machine learning and high performance computing applications that need high network throughput and packet rate performance. They are not optimized for inference workloads.

Option D is incorrect because deploying the model onto an Amazon ECS cluster using a P3 instance would not ensure that provisioned resources are being utilized effectively. Amazon ECS is a fully managed container orchestration service that allows you to run and scale containerized applications on AWS. However, using Amazon ECS would not address the issue of underutilized GPU resources. In fact, it might introduce additional overhead and complexity in managing the cluster.

References:

Amazon Elastic Inference - Amazon SageMaker

Batch Transform - Amazon SageMaker

Amazon EC2 P3 Instances

Amazon EC2 P3dn Instances

Amazon Elastic Container Service

Questions 62

A data scientist wants to improve the fit of a machine learning (ML) model that predicts house prices. The data scientist makes a first attempt to fit the model, but the fitted model has poor accuracy on both the training dataset and the test dataset.

Which steps must the data scientist take to improve model accuracy? (Select THREE.)

Options:

Increase the amount of regularization that the model uses.

Decrease the amount of regularization that the model uses.

Increase the number of training examples that that model uses.

Increase the number of test examples that the model uses.

Increase the number of model features that the model uses.

Decrease the number of model features that the model uses.

Buy Now

Questions 63

A Machine Learning Specialist built an image classification deep learning model. However the Specialist ran into an overfitting problem in which the training and testing accuracies were 99% and 75%r respectively.

How should the Specialist address this issue and what is the reason behind it?

Options:

The learning rate should be increased because the optimization process was trapped at a local minimum.

The dropout rate at the flatten layer should be increased because the model is not generalized enough.

The dimensionality of dense layer next to the flatten layer should be increased because the model is not complex enough.

The epoch number should be increased because the optimization process was terminated before it reached the global minimum.

Buy Now

Answer:

Explanation:

The best way to address the overfitting problem in image classification is to increase the dropout rate at the flatten layer because the model is not generalized enough. Dropout is a regularization technique that randomly drops out some units from the neural network during training, reducing the co-adaptation of features and preventing overfitting. The flatten layer is the layer that converts the output of the convolutional layers into a one-dimensional vector that can be fed into the dense layers. Increasing the dropout rate at the flatten layer means that more features from the convolutional layers will be ignored, forcing the model to learn more robust and generalizable representations from the remaining features.

The other options are not correct for this scenario because:

Increasing the learning rate would not help with the overfitting problem, as it would make the optimization process more unstable and prone to overshooting the global minimum. A high learning rate can also cause the model to diverge or oscillate around the optimal solution, resulting in poor performance and accuracy.

Increasing the dimensionality of the dense layer next to the flatten layer would not help with the overfitting problem, as it would make the model more complex and increase the number of parameters to be learned. A more complex model can fit the training data better, but it can also memorize the noise and irrelevant details in the data, leading to overfitting and poor generalization.

Increasing the epoch number would not help with the overfitting problem, as it would make the model train longer and more likely to overfit the training data. A high epoch number can cause the model to converge to the global minimum, but it can also cause the model to over-optimize the training data and lose the ability to generalize to new data.

References:

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

How to Reduce Overfitting With Dropout Regularization in Keras

How to Control the Stability of Training Neural Networks With the Learning Rate

How to Choose the Number of Hidden Layers and Nodes in a Feedforward Neural Network?

How to decide the optimal number of epochs to train a neural network?

Questions 64

A Machine Learning Specialist kicks off a hyperparameter tuning job for a tree-based ensemble model using Amazon SageMaker with Area Under the ROC Curve (AUC) as the objective metric This workflow will eventually be deployed in a pipeline that retrains and tunes hyperparameters each night to model click-through on data that goes stale every 24 hours

With the goal of decreasing the amount of time it takes to train these models, and ultimately to decrease costs, the Specialist wants to reconfigure the input hyperparameter range(s)

Which visualization will accomplish this?

Options:

A histogram showing whether the most important input feature is Gaussian.

A scatter plot with points colored by target variable that uses (-Distributed Stochastic Neighbor Embedding (I-SNE) to visualize the large number of input variables in an easier-to-read dimension.

A scatter plot showing (he performance of the objective metric over each training iteration

A scatter plot showing the correlation between maximum tree depth and the objective metric.

Buy Now

Answer:

Explanation:

A scatter plot showing the correlation between maximum tree depth and the objective metric is a visualization that can help the Machine Learning Specialist reconfigure the input hyperparameter range(s) for the tree-based ensemble model. A scatter plot is a type of graph that displays the relationship between two variables using dots, where each dot represents one observation. A scatter plot can show the direction, strength, and shape of the correlation between the variables, as well as any outliers or clusters. In this case, the scatter plot can show how the maximum tree depth, which is a hyperparameter that controls the complexity and depth of the decision trees in the ensemble model, affects the AUC, which is the objective metric that measures the performance of the model in terms of the trade-off between true positive rate and false positive rate. By looking at the scatter plot, the Machine Learning Specialist can see if there is a positive, negative, or no correlation between the maximum tree depth and the AUC, and how strong or weak the correlation is. The Machine Learning Specialist can also see if there is an optimal value or range of values for the maximum tree depth that maximizes the AUC, or if there is a point of diminishing returns or overfitting where increasing the maximum tree depth does not improve or even worsens the AUC. Based on the scatter plot, the Machine Learning Specialist can reconfigure the input hyperparameter range(s) for the maximum tree depth to focus on the values that yield the best AUC, and avoid the values that result in poor AUC. This can decrease the amount of time and cost it takes to train the model, as the hyperparameter tuning job can explore fewer and more promising combinations of values. A scatter plot can be created using various tools and libraries, such as Matplotlib, Seaborn, Plotly, etc12

The other options are not valid or relevant for reconfiguring the input hyperparameter range(s) for the tree-based ensemble model. A histogram showing whether the most important input feature is Gaussian is a visualization that can help the Machine Learning Specialist understand the distribution and shape of the input data, but not the hyperparameters. A histogram is a type of graph that displays the frequency or count of values in a single variable using bars, where each bar represents a bin or interval of values. A histogram can show if the variable is symmetric, skewed, or multimodal, and if it follows a normal or Gaussian distribution, which is a bell-shaped curve that is often assumed by many machine learning algorithms. In this case, the histogram can show if the most important input feature, which is a variable that has the most influence or predictive power on the output variable, is Gaussian or not. However, this does not help the Machine Learning Specialist reconfigure the input hyperparameter range(s) for the tree-based ensemble model, as the input feature is not a hyperparameter that can be tuned or optimized. A histogram can be created using various tools and libraries, such as Matplotlib, Seaborn, Plotly, etc34

A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-to-read dimension is a visualization that can help the Machine Learning Specialist understand the structure and clustering of the input data, but not the hyperparameters. t-SNE is a technique that can reduce the dimensionality of high-dimensional data, such as images, text, or gene expression, and project it onto a lower-dimensional space, such as two or three dimensions, while preserving the local similarities and distances between the data points. t-SNE can help visualize and explore the patterns and relationships in the data, such as the clusters, outliers, or separability of the classes. In this case, the scatter plot can show how the input variables, which are the features or predictors of the output variable, are mapped onto a two-dimensional space using t-SNE, and how the points are colored by the target variable, which is the output or response variable that the model tries to predict. However, this does not help the Machine Learning Specialist reconfigure the input hyperparameter range(s) for the tree-based ensemble model, as the input variables and the target variable are not hyperparameters that can be tuned or optimized. A scatter plot with t-SNE can be created using various tools and libraries, such as Scikit-learn, TensorFlow, PyTorch, etc5

A scatter plot showing the performance of the objective metric over each training iteration is a visualization that can help the Machine Learning Specialist understand the learning curve and convergence of the model, but not the hyperparameters. A scatter plot is a type of graph that displays the relationship between two variables using dots, where each dot represents one observation. A scatter plot can show the direction, strength, and shape of the correlation between the variables, as well as any outliers or clusters. In this case, the scatter plot can show how the objective metric, which is the performance measure that the model tries to optimize, changes over each training iteration, which is the number of times that the model updates its parameters using a batch of data. A scatter plot can show if the objective metric improves, worsens, or stagnates over time, and if the model converges to a stable value or oscillates or diverges. However, this does not help the Machine Learning Specialist reconfigure the input hyperparameter range(s) for the tree-based ensemble model, as the objective metric and the training iteration are not hyperparameters that can be tuned or optimized. A scatter plot can be created using various tools and libraries, such as Matplotlib, Seaborn, Plotly, etc.

Questions 65

An office security agency conducted a successful pilot using 100 cameras installed at key locations within the main office. Images from the cameras were uploaded to Amazon S3 and tagged using Amazon Rekognition, and the results were stored in Amazon ES. The agency is now looking to expand the pilot into a full production system using thousands of video cameras in its office locations globally. The goal is to identify activities performed by non-employees in real time.

Which solution should the agency consider?

Options:

Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique

Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Video and create

a stream processor to detect faces from a collection of known employees, and alert when non-employees

are detected.

Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique

Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Image to detect

faces from a collection of known employees and alert when non-employees are detected.

Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to

Amazon Kinesis Video Streams for each camera. On each stream, use Amazon Rekognition Video and

create a stream processor to detect faces from a collection on each stream, and alert when nonemployees

are detected.

Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to

Amazon Kinesis Video Streams for each camera. On each stream, run an AWS Lambda function to

capture image fragments and then call Amazon Rekognition Image to detect faces from a collection of

known employees, and alert when non-employees are detected.

Buy Now

Questions 66

A manufacturing company stores production volume data in a PostgreSQL database.

The company needs an end-to-end solution that will give business analysts the ability to prepare data for processing and to predict future production volume based the previous year's production volume. The solution must not require the company to have coding knowledge.

Which solution will meet these requirements with the LEAST effort?

Options:

Use AWS Database Migration Service (AWS DMS) to transfer the data from the PostgreSQL database to an Amazon S3 bucket. Create an Amazon EMR cluster to read the S3 bucket and perform the data preparation. Use Amazon SageMaker Studio for the prediction modeling.

Use AWS Glue DataBrew to read the data that is in the PostgreSQL database and to perform the data preparation. Use Amazon SageMaker Canvas for the prediction modeling.

Use AWS Database Migration Service (AWS DMS) to transfer the data from the PostgreSQL database to an Amazon S3 bucket. Use AWS Glue to read the data in the S3 bucket and to perform the data preparation. Use Amazon SageMaker Canvas for the prediction modeling.

Use AWS Glue DataBrew to read the data that is in the PostgreSQL database and to perform the data preparation. Use Amazon SageMaker Studio for the prediction modeling.

Buy Now

Questions 67

A Machine Learning Specialist needs to create a data repository to hold a large amount of time-based training data for a new model. In the source system, new files are added every hour Throughout a single 24-hour period, the volume of hourly updates will change significantly. The Specialist always wants to train on the last 24 hours of the data

Which type of data repository is the MOST cost-effective solution?

Options:

An Amazon EBS-backed Amazon EC2 instance with hourly directories

An Amazon RDS database with hourly table partitions

An Amazon S3 data lake with hourly object prefixes

An Amazon EMR cluster with hourly hive partitions on Amazon EBS volumes

Buy Now

Questions 68

A Marketing Manager at a pet insurance company plans to launch a targeted marketing campaign on social media to acquire new customers Currently, the company has the following data in Amazon Aurora

• Profiles for all past and existing customers

• Profiles for all past and existing insured pets

• Policy-level information

• Premiums received

• Claims paid

What steps should be taken to implement a machine learning model to identify potential new customers on social media?

Options:

Use regression on customer profile data to understand key characteristics of consumer segments Find similar profiles on social media.

Use clustering on customer profile data to understand key characteristics of consumer segments Find similar profiles on social media.

Use a recommendation engine on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media

Use a decision tree classifier engine on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media

Buy Now

Questions 69

A company will use Amazon SageMaker to train and host a machine learning (ML) model for a marketing campaign. The majority of data is sensitive customer data. The data must be encrypted at rest. The company wants AWS to maintain the root of trust for the master keys and wants encryption key usage to be logged.

Which implementation will meet these requirements?

Options:

Use encryption keys that are stored in AWS Cloud HSM to encrypt the ML data volumes, and to encrypt the model artifacts and data in Amazon S3.

Use SageMaker built-in transient keys to encrypt the ML data volumes. Enable default encryption for new Amazon Elastic Block Store (Amazon EBS) volumes.

Use customer managed keys in AWS Key Management Service (AWS KMS) to encrypt the ML data volumes, and to encrypt the model artifacts and data in Amazon S3.

Use AWS Security Token Service (AWS STS) to create temporary tokens to encrypt the ML storage volumes, and to encrypt the model artifacts and data in Amazon S3.

Buy Now

Answer:

Explanation:

Amazon SageMaker supports encryption at rest for the ML storage volumes, the model artifacts, and the data in Amazon S3 using AWS Key Management Service (AWS KMS). AWS KMS is a service that allows customers to create and manage encryption keys that can be used to encrypt data. AWS KMS also provides an audit trail of key usage by logging key events to AWS CloudTrail. Customers can use either AWS managed keys or customer managed keys to encrypt their data. AWS managed keys are created and managed by AWS on behalf of the customer, while customer managed keys are created and managed by the customer. Customer managed keys offer more control and flexibility over the key policies, permissions, and rotation. Therefore, to meet the requirements of the company, the best option is to use customer managed keys in AWS KMS to encrypt the ML data volumes, and to encrypt the model artifacts and data in Amazon S3.

The other options are not correct because:

Option A: AWS Cloud HSM is a service that provides hardware security modules (HSMs) to store and use encryption keys. AWS Cloud HSM is not integrated with Amazon SageMaker, and cannot be used to encrypt the ML data volumes, the model artifacts, or the data in Amazon S3. AWS Cloud HSM is more suitable for customers who need to meet strict compliance requirements or who need direct control over the HSMs.

Option B: SageMaker built-in transient keys are temporary keys that are used to encrypt the ML data volumes and are discarded immediately after encryption. These keys do not provide persistent encryption or logging of key usage. Enabling default encryption for new Amazon Elastic Block Store (Amazon EBS) volumes does not affect the ML data volumes, which are encrypted separately by SageMaker. Moreover, this option does not address the encryption of the model artifacts and data in Amazon S3.

Option D: AWS Security Token Service (AWS STS) is a service that provides temporary credentials to access AWS resources. AWS STS does not provide encryption keys or encryption services. AWS STS cannot be used to encrypt the ML storage volumes, the model artifacts, or the data in Amazon S3.

References:

Protect Data at Rest Using Encryption - Amazon SageMaker

What is AWS Key Management Service? - AWS Key Management Service

What is AWS CloudHSM? - AWS CloudHSM

What is AWS Security Token Service? - AWS Security Token Service

Questions 70

The Chief Editor for a product catalog wants the Research and Development team to build a machine learning system that can be used to detect whether or not individuals in a collection of images are wearing the company's retail brand The team has a set of training data

Which machine learning algorithm should the researchers use that BEST meets their requirements?

Options:

Latent Dirichlet Allocation (LDA)

Recurrent neural network (RNN)

K-means

Convolutional neural network (CNN)

Buy Now

Answer:

Explanation:

A convolutional neural network (CNN) is a type of machine learning algorithm that is suitable for image classification tasks. A CNN consists of multiple layers that can extract features from images and learn to recognize patterns and objects. A CNN can also use transfer learning to leverage pre-trained models that have been trained on large-scale image datasets, such as ImageNet, and fine-tune them for specific tasks, such as detecting the company’s retail brand. A CNN can achieve high accuracy and performance for image classification problems, as it can handle complex and diverse images and reduce the dimensionality and noise of the input data. A CNN can be implemented using various frameworks and libraries, such as TensorFlow, PyTorch, Keras, MXNet, etc12

The other options are not valid or relevant for the image classification task. Latent Dirichlet Allocation (LDA) is a type of machine learning algorithm that is suitable for topic modeling tasks. LDA can discover the hidden topics and their proportions in a collection of text documents, such as news articles, tweets, reviews, etc. LDA is not applicable for image data, as it requires textual input and output. LDA can be implemented using various frameworks and libraries, such as Gensim, Scikit-learn, Mallet, etc34

Recurrent neural network (RNN) is a type of machine learning algorithm that is suitable for sequential data tasks. RNN can process and generate data that has temporal or sequential dependencies, such as natural language, speech, audio, video, etc. RNN is not optimal for image data, as it does not capture the spatial features and relationships of the pixels. RNN can be implemented using various frameworks and libraries, such as TensorFlow, PyTorch, Keras, MXNet, etc.

K-means is a type of machine learning algorithm that is suitable for clustering tasks. K-means can partition a set of data points into a predefined number of clusters, based on the similarity and distance between the data points. K-means is not suitable for image classification tasks, as it does not learn to label the images or detect the objects of interest. K-means can be implemented using various frameworks and libraries, such as Scikit-learn, TensorFlow, PyTorch, etc.

Questions 71

A machine learning specialist is developing a proof of concept for government users whose primary concern is security. The specialist is using Amazon SageMaker to train a convolutional neural network (CNN) model for a photo classifier application. The specialist wants to protect the data so that it cannot be accessed and transferred to a remote host by malicious code accidentally installed on the training container.

Which action will provide the MOST secure protection?

Options:

Remove Amazon S3 access permissions from the SageMaker execution role.

Encrypt the weights of the CNN model.

Encrypt the training and validation dataset.

Enable network isolation for training jobs.

Buy Now

Questions 72

A large consumer goods manufacturer has the following products on sale

• 34 different toothpaste variants

• 48 different toothbrush variants

• 43 different mouthwash variants

The entire sales history of all these products is available in Amazon S3 Currently, the company is using custom-built autoregressive integrated moving average (ARIMA) models to forecast demand for these products The company wants to predict the demand for a new product that will soon be launched

Which solution should a Machine Learning Specialist apply?

Options:

Train a custom ARIMA model to forecast demand for the new product.

Train an Amazon SageMaker DeepAR algorithm to forecast demand for the new product

Train an Amazon SageMaker k-means clustering algorithm to forecast demand for the new product.

Train a custom XGBoost model to forecast demand for the new product

Buy Now

Questions 73

A city wants to monitor its air quality to address the consequences of air pollution A Machine Learning Specialist needs to forecast the air quality in parts per million of contaminates for the next 2 days in the city as this is a prototype, only daily data from the last year is available

Which model is MOST likely to provide the best results in Amazon SageMaker?

Options:

Use the Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm on the single time series consisting of

the full year of data with a predictor_type of regressor.

Use Amazon SageMaker Random Cut Forest (RCF) on the single time series consisting of the full year of

data.

Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year

of data with a predictor_type of regressor.

Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year

of data with a predictor_type of classifier.

Buy Now

Questions 74

A Data Scientist is developing a machine learning model to classify whether a financial transaction is fraudulent. The labeled data available for training consists of 100,000 non-fraudulent observations and 1,000 fraudulent observations.

The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset. The accuracy of the model is 99.1%, but the Data Scientist has been asked to reduce the number of false negatives.

Which combination of steps should the Data Scientist take to reduce the number of false positive predictions by the model? (Select TWO.)

Options:

Change the XGBoost eval_metric parameter to optimize based on rmse instead of error.

Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights.

Increase the XGBoost max_depth parameter because the model is currently underfitting the data.

Change the XGBoost evaljnetric parameter to optimize based on AUC instead of error.

Decrease the XGBoost max_depth parameter because the model is currently overfitting the data.

Buy Now

Questions 75

A pharmaceutical company performs periodic audits of clinical trial sites to quickly resolve critical findings. The company stores audit documents in text format. Auditors have requested help from a data science team to quickly analyze the documents. The auditors need to discover the 10 main topics within the documents to prioritize and distribute the review work among the auditing team members. Documents that describe adverse events must receive the highest priority.

A data scientist will use statistical modeling to discover abstract topics and to provide a list of the top words for each category to help the auditors assess the relevance of the topic.

Which algorithms are best suited to this scenario? (Choose two.)

Options:

Latent Dirichlet allocation (LDA)

Random Forest classifier

Neural topic modeling (NTM)

Linear support vector machine

Linear regression

Buy Now

Questions 76

A company that runs an online library is implementing a chatbot using Amazon Lex to provide book recommendations based on category. This intent is fulfilled by an AWS Lambda function that queries an Amazon DynamoDB table for a list of book titles, given a particular category. For testing, there are only three categories implemented as the custom slot types: "comedy," "adventure,” and "documentary.”

A machine learning (ML) specialist notices that sometimes the request cannot be fulfilled because Amazon Lex cannot understand the category spoken by users with utterances such as "funny," "fun," and "humor." The ML specialist needs to fix the problem without changing the Lambda code or data in DynamoDB.

How should the ML specialist fix the problem?

Options:

Add the unrecognized words in the enumeration values list as new values in the slot type.

Create a new custom slot type, add the unrecognized words to this slot type as enumeration values, and use this slot type for the slot.

Use the AMAZON.SearchQuery built-in slot types for custom searches in the database.

Add the unrecognized words as synonyms in the custom slot type.

Buy Now

Answer:

Explanation:

The best way to fix the problem without changing the Lambda code or data in DynamoDB is to add the unrecognized words as synonyms in the custom slot type. This way, Amazon Lex can resolve the synonyms to the corresponding slot values and pass them to the Lambda function. For example, if the slot type has a value “comedy” with synonyms “funny”, “fun”, and “humor”, then any of these words entered by the user will be resolved to “comedy” and the Lambda function can query the DynamoDB table for the book titles in that category. Adding synonyms to the custom slot type can be done easily using the Amazon Lex console or API, and does not require any code changes.

The other options are not correct because:

Option A: Adding the unrecognized words in the enumeration values list as new values in the slot type would not fix the problem, because the Lambda function and the DynamoDB table are not aware of these new values. The Lambda function would not be able to query the DynamoDB table for the book titles in the new categories, and the request would still fail. Moreover, adding new values to the slot type would increase the complexity and maintenance of the chatbot, as the Lambda function and the DynamoDB table would have to be updated accordingly.

Option B: Creating a new custom slot type, adding the unrecognized words to this slot type as enumeration values, and using this slot type for the slot would also not fix the problem, for the same reasons as option A. The Lambda function and the DynamoDB table would not be able to handle the new slot type and its values, and the request would still fail. Furthermore, creating a new slot type would require more effort and time than adding synonyms to the existing slot type.

Option C: Using the AMAZON.SearchQuery built-in slot types for custom searches in the database is not a suitable approach for this use case. The AMAZON.SearchQuery slot type is used to capture free-form user input that corresponds to a search query. However, this slot type does not perform any validation or resolution of the user input, and passes the raw input to the Lambda function. This means that the Lambda function would have to handle the logic of parsing and matching the user input to the DynamoDB table, which would require changing the Lambda code and adding more complexity to the solution.

References:

Custom slot type - Amazon Lex

Using Synonyms - Amazon Lex

Built-in Slot Types - Amazon Lex

Questions 77

A Data Engineer needs to build a model using a dataset containing customer credit card information.

How can the Data Engineer ensure the data remains encrypted and the credit card information is secure?

Options:

Use a custom encryption algorithm to encrypt the data and store the data on an Amazon SageMaker

instance in a VPC. Use the SageMaker DeepAR algorithm to randomize the credit card numbers.

Use an IAM policy to encrypt the data on the Amazon S3 bucket and Amazon Kinesis to automatically

discard credit card numbers and insert fake credit card numbers.

Use an Amazon SageMaker launch configuration to encrypt the data once it is copied to the SageMaker

instance in a VPC. Use the SageMaker principal component analysis (PCA) algorithm to reduce the length

of the credit card numbers.

Use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card numbers from the customer data with AWS Glue.

Buy Now

Answer:

Explanation:

AWS KMS is a service that provides encryption and key management for data stored in AWS services and applications. AWS KMS can generate and manage encryption keys that are used to encrypt and decrypt data at rest and in transit. AWS KMS can also integrate with other AWS services, such as Amazon S3 and Amazon SageMaker, to enable encryption of data using the keys stored in AWS KMS. Amazon S3 is a service that provides object storage for data in the cloud. Amazon S3 can use AWS KMS to encrypt data at rest using server-side encryption with AWS KMS-managed keys (SSE-KMS). Amazon SageMaker is a service that provides a platform for building, training, and deploying machine learning models. Amazon SageMaker can use AWS KMS to encrypt data at rest on the SageMaker instances and volumes, as well as data in transit between SageMaker and other AWS services. AWS Glue is a service that provides a serverless data integration platform for data preparation and transformation. AWS Glue can use AWS KMS to encrypt data at rest on the Glue Data Catalog and Glue ETL jobs. AWS Glue can also use built-in or custom classifiers to identify and redact sensitive data, such as credit card numbers, from the customer data1234

The other options are not valid or secure ways to encrypt the data and protect the credit card information. Using a custom encryption algorithm to encrypt the data and store the data on an Amazon SageMaker instance in a VPC is not a good practice, as custom encryption algorithms are not recommended for security and may have flaws or vulnerabilities. Using the SageMaker DeepAR algorithm to randomize the credit card numbers is not a good practice, as DeepAR is a forecasting algorithm that is not designed for data anonymization or encryption. Using an IAM policy to encrypt the data on the Amazon S3 bucket and Amazon Kinesis to automatically discard credit card numbers and insert fake credit card numbers is not a good practice, as IAM policies are not meant for data encryption, but for access control and authorization. Amazon Kinesis is a service that provides real-time data streaming and processing, but it does not have the capability to automatically discard or insert data values. Using an Amazon SageMaker launch configuration to encrypt the data once it is copied to the SageMaker instance in a VPC is not a good practice, as launch configurations are not meant for data encryption, but for specifying the instance type, security group, and user data for the SageMaker instance. Using the SageMaker principal component analysis (PCA) algorithm to reduce the length of the credit card numbers is not a good practice, as PCA is a dimensionality reduction algorithm that is not designed for data anonymization or encryption.

Questions 78

A machine learning (ML) developer for an online retailer recently uploaded a sales dataset into Amazon SageMaker Studio. The ML developer wants to obtain importance scores for each feature of the dataset. The ML developer will use the importance scores to feature engineer the dataset.

Which solution will meet this requirement with the LEAST development effort?

Options:

Use SageMaker Data Wrangler to perform a Gini importance score analysis.

Use a SageMaker notebook instance to perform principal component analysis (PCA).

Use a SageMaker notebook instance to perform a singular value decomposition analysis.

Use the multicollinearity feature to perform a lasso feature selection to perform an importance scores analysis.

Buy Now

Questions 79

An e commerce company wants to launch a new cloud-based product recommendation feature for its web application. Due to data localization regulations, any sensitive data must not leave its on-premises data center, and the product recommendation model must be trained and tested using nonsensitive data only. Data transfer to the cloud must use IPsec. The web application is hosted on premises with a PostgreSQL database that contains all the data. The company wants the data to be uploaded securely to Amazon S3 each day for model retraining.

How should a machine learning specialist meet these requirements?

Options:

Create an AWS Glue job to connect to the PostgreSQL DB instance. Ingest tables without sensitive data through an AWS Site-to-Site VPN connection directly into Amazon S3.

Create an AWS Glue job to connect to the PostgreSQL DB instance. Ingest all data through an AWS Site- to-Site VPN connection into Amazon S3 while removing sensitive data using a PySpark job.

Use AWS Database Migration Service (AWS DMS) with table mapping to select PostgreSQL tables with no sensitive data through an SSL connection. Replicate data directly into Amazon S3.

Use PostgreSQL logical replication to replicate all data to PostgreSQL in Amazon EC2 through AWS Direct Connect with a VPN connection. Use AWS Glue to move data from Amazon EC2 to Amazon S3.

Buy Now

Questions 80

A large company has developed a B1 application that generates reports and dashboards using data collected from various operational metrics The company wants to provide executives with an enhanced experience so they can use natural language to get data from the reports The company wants the executives to be able ask questions using written and spoken interlaces

Which combination of services can be used to build this conversational interface? (Select THREE)

Options:

Alexa for Business

Amazon Connect

Amazon Lex

Amazon Poly

Amazon Comprehend

Amazon Transcribe

Buy Now

Answer:

C, E, F

Explanation:

To build a conversational interface that can use natural language to get data from the reports, the company can use a combination of services that can handle both written and spoken inputs, understand the user’s intent and query, and extract the relevant information from the reports. The services that can be used for this purpose are:

Amazon Lex: A service for building conversational interfaces into any application using voice and text. Amazon Lex can create chatbots that can interact with users using natural language, and integrate with other AWS services such as Amazon Connect, Amazon Comprehend, and Amazon Transcribe. Amazon Lex can also use lambda functions to implement the business logic and fulfill the user’s requests.

Amazon Comprehend: A service for natural language processing and text analytics. Amazon Comprehend can analyze text and speech inputs and extract insights such as entities, key phrases, sentiment, syntax, and topics. Amazon Comprehend can also use custom classifiers and entity recognizers to identify specific terms and concepts that are relevant to the domain of the reports.

Amazon Transcribe: A service for speech-to-text conversion. Amazon Transcribe can transcribe audio inputs into text outputs, and add punctuation and formatting. Amazon Transcribe can also use custom vocabularies and language models to improve the accuracy and quality of the transcription for the specific domain of the reports.

Therefore, the company can use the following architecture to build the conversational interface:

Use Amazon Lex to create a chatbot that can accept both written and spoken inputs from the executives. The chatbot can use intents, utterances, and slots to capture the user’s query and parameters, such as the report name, date, metric, or filter.

Use Amazon Transcribe to convert the spoken inputs into text outputs, and pass them to Amazon Lex. Amazon Transcribe can use a custom vocabulary and language model to recognize the terms and concepts related to the reports.

Use Amazon Comprehend to analyze the text inputs and outputs, and extract the relevant information from the reports. Amazon Comprehend can use a custom classifier and entity recognizer to identify the report name, date, metric, or filter from the user’s query, and the corresponding data from the reports.

Use a lambda function to implement the business logic and fulfillment of the user’s query, such as retrieving the data from the reports, performing calculations or aggregations, and formatting the response. The lambda function can also handle errors and validations, and provide feedback to the user.

Use Amazon Lex to return the response to the user, either in text or speech format, depending on the user’s preference.

References:

What Is Amazon Lex?

What Is Amazon Comprehend?

What Is Amazon Transcribe?

Questions 81

A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors While exploring the data, the Specialist notices that the magnitude of the input features vary greatly The Specialist does not want variables with a larger magnitude to dominate the model

What should the Specialist do to prepare the data for model training'?

Options:

Apply quantile binning to group the data into categorical bins to keep any relationships in the data by replacing the magnitude with distribution

Apply the Cartesian product transformation to create new combinations of fields that are independent of the magnitude

Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant magnitude

Apply the orthogonal sparse Diagram (OSB) transformation to apply a fixed-size sliding window to generate new features of a similar magnitude.

Buy Now

Questions 82

A machine learning (ML) specialist wants to secure calls to the Amazon SageMaker Service API. The specialist has configured Amazon VPC with a VPC interface endpoint for the Amazon SageMaker Service API and is attempting to secure traffic from specific sets of instances and IAM users. The VPC is configured with a single public subnet.

Which combination of steps should the ML specialist take to secure the traffic? (Choose two.)

Options:

Add a VPC endpoint policy to allow access to the IAM users.

Modify the users' IAM policy to allow access to Amazon SageMaker Service API calls only.

Modify the security group on the endpoint network interface to restrict access to the instances.

Modify the ACL on the endpoint network interface to restrict access to the instances.

Add a SageMaker Runtime VPC endpoint interface to the VPC.

Buy Now

Questions 83

A gaming company has launched an online game where people can start playing for free but they need to pay if they choose to use certain features The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year The company has gathered a labeled dataset from 1 million users

The training dataset consists of 1.000 positive samples (from users who ended up paying within 1 year) and 999.000 negative samples (from users who did not use any paid features) Each data sample consists of 200 features including user age, device, location, and play patterns

Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set However, the prediction results on a test dataset were not satisfactory.

Which of the following approaches should the Data Science team take to mitigate this issue? (Select TWO.)

Options:

Add more deep trees to the random forest to enable the model to learn more features.

indicate a copy of the samples in the test database in the training dataset

Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data.

Change the cost function so that false negatives have a higher impact on the cost value than false positives

Change the cost function so that false positives have a higher impact on the cost value than false negatives

Buy Now

Questions 84

A power company wants to forecast future energy consumption for its customers in residential properties and commercial business properties. Historical power consumption data for the last 10 years is available. A team of data scientists who performed the initial data analysis and feature selection will include the historical power consumption data and data such as weather, number of individuals on the property, and public holidays.

The data scientists are using Amazon Forecast to generate the forecasts.

Which algorithm in Forecast should the data scientists use to meet these requirements?

Options:

Autoregressive Integrated Moving Average (AIRMA)

Exponential Smoothing (ETS)

Convolutional Neural Network - Quantile Regression (CNN-QR)

Prophet

Buy Now

Questions 85

A real estate company wants to create a machine learning model for predicting housing prices based on a

historical dataset. The dataset contains 32 features.

Which model will meet the business requirement?

Options:

Logistic regression

Linear regression

K-means

Principal component analysis (PCA)

Buy Now

Questions 86

A Machine Learning Specialist is working with a large cybersecurily company that manages security events in real time for companies around the world The cybersecurity company wants to design a solution that will allow it to use machine learning to score malicious events as anomalies on the data as it is being ingested The company also wants be able to save the results in its data lake for later processing and analysis

What is the MOST efficient way to accomplish these tasks'?

Options:

Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection Then use Kinesis Data Firehose to stream the results to Amazon S3

Ingest the data into Apache Spark Streaming using Amazon EMR. and use Spark MLlib with k-means to perform anomaly detection Then store the results in an Apache Hadoop Distributed File System (HDFS) using Amazon EMR with a replication factor of three as the data lake

Ingest the data and store it in Amazon S3 Use AWS Batch along with the AWS Deep Learning AMIs to train a k-means model using TensorFlow on the data in Amazon S3.

Ingest the data and store it in Amazon S3. Have an AWS Glue job that is triggered on demand transform the new data Then use the built-in Random Cut Forest (RCF) model within Amazon SageMaker to detect anomalies in the data

Buy Now

Questions 87

A data scientist receives a collection of insurance claim records. Each record includes a claim ID. the final outcome of the insurance claim, and the date of the final outcome.

The final outcome of each claim is a selection from among 200 outcome categories. Some claim records include only partial information. However, incomplete claim records include only 3 or 4 outcome ...gones from among the 200 available outcome categories. The collection includes hundreds of records for each outcome category. The records are from the previous 3 years.

The data scientist must create a solution to predict the number of claims that will be in each outcome category every month, several months in advance.

Which solution will meet these requirements?

Options:

Perform classification every month by using supervised learning of the 20X3 outcome categories based on claim contents.

Perform reinforcement learning by using claim IDs and dates Instruct the insurance agents who submit the claim records to estimate the expected number of claims in each outcome category every month

Perform forecasting by using claim IDs and dates to identify the expected number ot claims in each outcome category every month.

Perform classification by using supervised learning of the outcome categories for which partial information on claim contents is provided. Perform forecasting by using claim IDs and dates for all other outcome categories.

Buy Now

Answer:

Explanation:

The best solution for this scenario is to perform forecasting by using claim IDs and dates to identify the expected number of claims in each outcome category every month. This solution has the following advantages:

It leverages the historical data of claim outcomes and dates to capture the temporal patterns and trends of the claims in each category1.

It does not require the claim contents or any other features to make predictions, which simplifies the data preparation and reduces the impact of missing or incomplete data2.

It can handle the high cardinality of the outcome categories, as forecasting models can output multiple values for each time point3.

It can provide predictions for several months in advance, which is useful for planning and budgeting purposes4.

The other solutions have the following drawbacks:

A: Performing classification every month by using supervised learning of the 200 outcome categories based on claim contents is not suitable, because it assumes that the claim contents are available and complete for all the records, which is not the case in this scenario2. Moreover, classification models usually output a single label for each input, which is not adequate for predicting the number of claims in each category3. Additionally, classification models do not account for the temporal aspect of the data, which is important for forecasting1.

B: Performing reinforcement learning by using claim IDs and dates and instructing the insurance agents who submit the claim records to estimate the expected number of claims in each outcome category every month is not feasible, because it requires a feedback loop between the model and the agents, which might not be available or reliable in this scenario5. Furthermore, reinforcement learning is more suitable for sequential decision making problems, where the model learns from its actions and rewards, rather than forecasting problems, where the model learns from historical data and outputs future values6.

D: Performing classification by using supervised learning of the outcome categories for which partial information on claim contents is provided and performing forecasting by using claim IDs and dates for all other outcome categories is not optimal, because it combines two different methods that might not be consistent or compatible with each other7. Also, this solution suffers from the same limitations as solution A, such as the dependency on claim contents, the inability to handle multiple outputs, and the ignorance of temporal patterns123.

References:

1: Time Series Forecasting - Amazon SageMaker

2: Handling Missing Data for Machine Learning | AWS Machine Learning Blog

3: Forecasting vs Classification: What’s the Difference? | DataRobot

4: Amazon Forecast – Time Series Forecasting Made Easy | AWS News Blog

5: Reinforcement Learning - Amazon SageMaker

6: What is Reinforcement Learning? The Complete Guide | Edureka

7: Combining Machine Learning Models | by Will Koehrsen | Towards Data Science

Questions 88

A data scientist is developing a pipeline to ingest streaming web traffic data. The data scientist needs to implement a process to identify unusual web traffic patterns as part of the pipeline. The patterns will be used downstream for alerting and incident response. The data scientist has access to unlabeled historic data to use, if needed.

The solution needs to do the following:

Calculate an anomaly score for each web traffic entry.

Adapt unusual event identification to changing web patterns over time.

Which approach should the data scientist implement to meet these requirements?

Options:

Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker Random Cut Forest (RCF) built-in model. Use an Amazon Kinesis Data Stream to process the incoming web traffic data. Attach a preprocessing AWS Lambda function to perform data enrichment by calling the RCF model to calculate the anomaly score for each record.

Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker built-in XGBoost model. Use an Amazon Kinesis Data Stream to process the incoming web traffic data. Attach a preprocessing AWS Lambda function to perform data enrichment by calling the XGBoost model to calculate the anomaly score for each record.

Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input source for Amazon Kinesis Data Analytics. Write a SQL query to run in real time against the streaming data with the k-Nearest Neighbors (kNN) SQL extension to calculate anomaly scores for each record using a tumbling window.

Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input source for Amazon Kinesis Data Analytics. Write a SQL query to run in real time against the streaming data with the Amazon Random Cut Forest (RCF) SQL extension to calculate anomaly scores for each record using a sliding window.

Buy Now

Answer:

Explanation:

Amazon Kinesis Data Analytics is a service that allows users to analyze streaming data in real time using SQL queries. Amazon Random Cut Forest (RCF) is a SQL extension that enables anomaly detection on streaming data. RCF is an unsupervised machine learning algorithm that assigns an anomaly score to each data point based on how different it is from the rest of the data. A sliding window is a type of window that moves along with the data stream, so that the anomaly detection model can adapt to changing patterns over time. A tumbling window is a type of window that has a fixed size and does not overlap with other windows, so that the anomaly detection model is based on a fixed period of time. Therefore, option D is the best approach to meet the requirements of the question, as it uses RCF to calculate anomaly scores for each web traffic entry and uses a sliding window to adapt to changing web patterns over time.

Option A is incorrect because Amazon SageMaker Random Cut Forest (RCF) is a built-in model that can be used to train and deploy anomaly detection models on batch or streaming data, but it requires more steps and resources than using the RCF SQL extension in Amazon Kinesis Data Analytics. Option B is incorrect because Amazon SageMaker XGBoost is a built-in model that can be used for supervised learning tasks such as classification and regression, but not for unsupervised learning tasks such as anomaly detection. Option C is incorrect because k-Nearest Neighbors (kNN) is a SQL extension that can be used for classification and regression tasks on streaming data, but not for anomaly detection. Moreover, using a tumbling window would not allow the anomaly detection model to adapt to changing web patterns over time.

References:

Using CloudWatch anomaly detection

Anomaly Detection With CloudWatch

Performing Real-time Anomaly Detection using AWS

What Is AWS Anomaly Detection? (And Is There A Better Option?)

Questions 89

A university wants to develop a targeted recruitment strategy to increase new student enrollment. A data scientist gathers information about the academic performance history of students. The data scientist wants to use the data to build student profiles. The university will use the profiles to direct resources to recruit students who are likely to enroll in the university.

Which combination of steps should the data scientist take to predict whether a particular student applicant is likely to enroll in the university? (Select TWO)

Options:

Use Amazon SageMaker Ground Truth to sort the data into two groups named "enrolled" or "not enrolled."

Use a forecasting algorithm to run predictions.

Use a regression algorithm to run predictions.

Use a classification algorithm to run predictions

Use the built-in Amazon SageMaker k-means algorithm to cluster the data into two groups named "enrolled" or "not enrolled."

Buy Now

Questions 90

An insurance company developed a new experimental machine learning (ML) model to replace an existing model that is in production. The company must validate the quality of predictions from the new experimental model in a production environment before the company uses the new experimental model to serve general user requests.

Which one model can serve user requests at a time. The company must measure the performance of the new experimental model without affecting the current live traffic

Which solution will meet these requirements?

Options:

A/B testing

Canary release

Shadow deployment

Blue/green deployment

Buy Now

Answer:

Explanation:

The best solution for this scenario is to use shadow deployment, which is a technique that allows the company to run the new experimental model in parallel with the existing model, without exposing it to the end users. In shadow deployment, the company can route the same user requests to both models, but only return the responses from the existing model to the users. The responses from the new experimental model are logged and analyzed for quality and performance metrics, such as accuracy, latency, and resource consumption12. This way, the company can validate the new experimental model in a production environment, without affecting the current live traffic or user experience.

The other solutions are not suitable, because they have the following drawbacks:

A: A/B testing is a technique that involves splitting the user traffic between two or more models, and comparing their outcomes based on predefined metrics. However, this technique exposes the new experimental model to a portion of the end users, which might affect their experience if the model is not reliable or consistent with the existing model3.

B: Canary release is a technique that involves gradually rolling out the new experimental model to a small subset of users, and monitoring its performance and feedback. However, this technique also exposes the new experimental model to some end users, and requires careful selection and segmentation of the user groups4.

D: Blue/green deployment is a technique that involves switching the user traffic from the existing model (blue) to the new experimental model (green) at once, after testing and verifying the new model in a separate environment. However, this technique does not allow the company to validate the new experimental model in a production environment, and might cause service disruption or inconsistency if the new model is not compatible or stable5.

References:

1: Shadow Deployment: A Safe Way to Test in Production | LaunchDarkly Blog

2: Shadow Deployment: A Safe Way to Test in Production | LaunchDarkly Blog

3: A/B Testing for Machine Learning Models | AWS Machine Learning Blog

4: Canary Releases for Machine Learning Models | AWS Machine Learning Blog

5: Blue-Green Deployments for Machine Learning Models | AWS Machine Learning Blog

Questions 91

A company sells thousands of products on a public website and wants to automatically identify products with potential durability problems. The company has 1.000 reviews with date, star rating, review text, review summary, and customer email fields, but many reviews are incomplete and have empty fields. Each review has already been labeled with the correct durability result.

A machine learning specialist must train a model to identify reviews expressing concerns over product durability. The first model needs to be trained and ready to review in 2 days.

What is the MOST direct approach to solve this problem within 2 days?

Options:

Train a custom classifier by using Amazon Comprehend.

Build a recurrent neural network (RNN) in Amazon SageMaker by using Gluon and Apache MXNet.

Train a built-in BlazingText model using Word2Vec mode in Amazon SageMaker.

Use a built-in seq2seq model in Amazon SageMaker.

Buy Now

Answer:

Explanation:

The most direct approach to solve this problem within 2 days is to train a custom classifier by using Amazon Comprehend. Amazon Comprehend is a natural language processing (NLP) service that can analyze text and extract insights such as sentiment, entities, topics, and syntax. Amazon Comprehend also provides a custom classification feature that allows users to create and train a custom text classifier using their own labeled data. The custom classifier can then be used to categorize any text document into one or more custom classes. For this use case, the custom classifier can be trained to identify reviews that express concerns over product durability as a class, and use the star rating, review text, and review summary fields as input features. The custom classifier can be created and trained using the Amazon Comprehend console or API, and does not require any coding or machine learning expertise. The training process is fully managed and scalable, and can handle large and complex datasets. The custom classifier can be trained and ready to review in 2 days or less, depending on the size and quality of the dataset.

The other options are not the most direct approaches because:

Option B: Building a recurrent neural network (RNN) in Amazon SageMaker by using Gluon and Apache MXNet is a more complex and time-consuming approach that requires coding and machine learning skills. RNNs are a type of deep learning models that can process sequential data, such as text, and learn long-term dependencies between tokens. Gluon is a high-level API for MXNet that simplifies the development of deep learning models. Amazon SageMaker is a fully managed service that provides tools and frameworks for building, training, and deploying machine learning models. However, to use this approach, the machine learning specialist would have to write custom code to preprocess the data, define the RNN architecture, train the model, and evaluate the results. This would likely take more than 2 days and involve more administrative overhead.

Option C: Training a built-in BlazingText model using Word2Vec mode in Amazon SageMaker is not a suitable approach for text classification. BlazingText is a built-in algorithm in Amazon SageMaker that provides highly optimized implementations of the Word2Vec and text classification algorithms. The Word2Vec algorithm is useful for generating word embeddings, which are dense vector representations of words that capture their semantic and syntactic similarities. However, word embeddings alone are not sufficient for text classification, as they do not account for the context and structure of the text documents. To use this approach, the machine learning specialist would have to combine the word embeddings with another classifier model, such as a logistic regression or a neural network, which would add more complexity and time to the solution.

Option D: Using a built-in seq2seq model in Amazon SageMaker is not a relevant approach for text classification. Seq2seq is a built-in algorithm in Amazon SageMaker that provides a sequence-to-sequence framework for neural machine translation based on MXNet. Seq2seq is a supervised learning algorithm that can generate an output sequence of tokens given an input sequence of tokens, such as translating a sentence from one language to another. However, seq2seq is not designed for text classification, which requires assigning a label or a category to a text document, not generating another text sequence. To use this approach, the machine learning specialist would have to modify the seq2seq algorithm to fit the text classification task, which would be challenging and inefficient.

References:

Custom Classification - Amazon Comprehend

Build a Text Classification Model with Amazon Comprehend - AWS Machine Learning Blog

Recurrent Neural Networks - Gluon API

BlazingText Algorithm - Amazon SageMaker

Sequence-to-Sequence Algorithm - Amazon SageMaker

Exam Code: MLS-C01

Exam Name: AWS Certified Machine Learning - Specialty

Last Update: Mar 30, 2025

Questions: 307

PDF + Testing Engine

$134.99

Testing Engine

$99.99

PDF (Q&A)

$84.99

Special Summer Sale 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: best70

Dumpsbuddy logo

MLS-C01 AWS Certified Machine Learning - Specialty Questions and Answers

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer: