Professional-Data-Engineer Google Professional Data Engineer Exam Questions and Answers

Questions 4

You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.

Which two actions should you take? (Choose two.)

Options:

Ensure all the tables are included in a global dataset.

Ensure each table is included in a dataset for a region.

Adjust the settings for each table to allow a related region-based security group view access.

Adjust the settings for each view to allow a related region-based security group view access.

Adjust the settings for each dataset to allow a related region-based security group view access.

Questions 5

Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day’s events. They also want to use streaming ingestion. What should you do?

Options:

Create a table called tracking_table and include a DATE column.

Create a partitioned table called tracking_table and include a TIMESTAMP column.

Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.

Create a table called tracking_table with a TIMESTAMP column to represent the day.

Questions 6

You need to compose visualizations for operations teams with the following requirements:

Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute)

The report must not be more than 3 hours delayed from live data.

The actionable report should only show suboptimal links.

Most suboptimal links should be sorted to the top.

Suboptimal links can be grouped and filtered by regional geography.

User response time to load the report must be <5 seconds.

You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month. What should you do?

Options:

Look through the current data and compose a series of charts and tables, one for each possible combination of criteria.

Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.

Export the data to a spreadsheet, compose a series of charts and tables, one for each possible combination of criteria, and spread them across multiple tabs.

Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.
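
To give a sense of the data volume implied by the telemetry requirement, the following sketch simply multiplies out the figures stated in the question. It assumes exactly one row per installation per minute and no missing samples; the constant names are illustrative.

```python
# Figures taken from the question; one row per sample is an assumption.
INSTALLATIONS = 50_000
WEEKS_RETAINED = 6
SAMPLES_PER_MINUTE = 1

minutes_retained = WEEKS_RETAINED * 7 * 24 * 60
total_rows = INSTALLATIONS * minutes_retained * SAMPLES_PER_MINUTE

print(f"minutes in the retention window: {minutes_retained:,}")
print(f"rows in the data source:         {total_rows:,}")
```

A data source of this size is the backdrop for the requirement that the report load in under 5 seconds.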

Questions 7

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

Options:

Store the common data in BigQuery as partitioned tables.

Store the common data in BigQuery and expose authorized views.

Store the common data encoded as Avro in Google Cloud Storage.

Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.

Questions 8

Flowlogistic’s CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they’ve purchased a visualization tool to simplify the creation of BigQuery reports. However, they’ve been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?

Options:

Export the data into a Google Sheet for visualization.

Create an additional table with only the necessary columns.

Create a view on the table to present to the visualization tool.

Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.

Questions 9

MJTelco is building a custom interface to share data. They have these requirements:

They need to do aggregations over their petabyte-scale datasets.

They need to scan specific time range rows with a very fast response time (milliseconds).

Which combination of Google Cloud Platform products should you recommend?

Options:

Cloud Datastore and Cloud Bigtable

Cloud Bigtable and Cloud SQL

BigQuery and Cloud Bigtable

BigQuery and Cloud Storage

Questions 10

Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.

Which approach should you take?

Options:

Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.

Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.

Use the NOW() function in BigQuery to record the event’s time.

Use the automatically generated timestamp from Cloud Pub/Sub to order the data.

Questions 11

You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit. Which solution should you choose?

Options:

Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.

Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.

Use the Cloud Vision API to detect for damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.

Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.

Questions 12

You are selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should you do?

Options:

Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.

Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.

Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.

Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.

Questions 13

A live TV show asks viewers to cast votes using their mobile phones. The event generates a large volume of data during a 3-minute period. You are in charge of the voting infrastructure and must ensure that the platform can handle the load and that all votes are processed. You must display partial results while voting is open. After voting closes, you need to count the votes exactly once while optimizing cost. What should you do?

Options:

Create a Memorystore instance with a high availability (HA) configuration.

Write votes to a Pub/Sub topic and have Cloud Functions subscribe to it and write votes to BigQuery.

Write votes to a Pub/Sub topic and load them into both Bigtable and BigQuery via a Dataflow pipeline. Query Bigtable for real-time results and BigQuery for later analysis. Shut down the Bigtable instance when voting concludes.

Create a Cloud SQL for PostgreSQL database with a high availability (HA) configuration and multiple read replicas.

Questions 14

Flowlogistic’s management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

Options:

Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

Cloud Pub/Sub, Cloud Dataflow, and Local SSD

Cloud Pub/Sub, Cloud SQL, and Cloud Storage

Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

Questions 15

The Development and External teams have the project viewer Identity and Access Management (IAM) role in a folder named Visualization. You want the Development Team to be able to read data from both Cloud Storage and BigQuery, but the External Team should only be able to read data from BigQuery. What should you do?

Options:

Remove Cloud Storage IAM permissions for the External Team on the acme-raw-data project.

Create Virtual Private Cloud (VPC) firewall rules on the acme-raw-data project that deny all ingress traffic from the External Team CIDR range.

Create a VPC Service Controls perimeter containing both projects and BigQuery as a restricted API. Add the External Team users to the perimeter's Access Level.

Create a VPC Service Controls perimeter containing both projects and Cloud Storage as a restricted API. Add the Development Team users to the perimeter's Access Level.

Questions 16

You are troubleshooting your Dataflow pipeline that processes data from Cloud Storage to BigQuery. You have discovered that the Dataflow worker nodes cannot communicate with one another. Your networking team relies on Google Cloud network tags to define firewall rules. You need to identify the issue while following Google-recommended networking security practices. What should you do?

Options:

Determine whether your Dataflow pipeline has a custom network tag set.

Determine whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 for the Dataflow network tag.

Determine whether your Dataflow pipeline is deployed with the external IP address option enabled.

Determine whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 on the subnet used by Dataflow workers.

Answer: B

Explanation:

Dataflow worker VMs need to communicate with one another on TCP ports 12345 and 12346; Google's documentation assigns port 12345 to streaming jobs and port 12346 to batch jobs. Dataflow applies the network tag dataflow to its worker VMs, so a firewall rule that targets this tag can allow the traffic between workers. If no rule allows these ports for the tag the workers carry, for example because the rule targets a different or custom network tag, the worker nodes cannot communicate with each other and the pipeline fails.

Therefore, the best way to identify the issue is to determine whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 for the Dataflow network tag. If there is no such firewall rule, or if the firewall rule does not match the network tag used by your Dataflow pipeline, you need to create or update the firewall rule accordingly.

Option A is not a good solution, as determining whether your Dataflow pipeline has a custom network tag set does not tell you whether there is a firewall rule that allows traffic on the required ports for that network tag. You need to check the firewall rule as well.

Option C is not a good solution, as determining whether your Dataflow pipeline is deployed with the external IP address option enabled does not tell you whether there is a firewall rule that allows traffic on the required ports for the Dataflow network tag. The external IP address option determines whether the worker nodes can access resources on the public internet, but it does not affect the internal communication between the worker nodes and the Dataflow service.

Option D is not a good solution, as determining whether there is a firewall rule set to allow traffic on TCP ports 12345 and 12346 on the subnet used by Dataflow workers does not tell you whether the firewall rule applies to the Dataflow network tag. The firewall rule should be based on the network tag, not the subnet, as the network tag is more specific and secure.

References: Dataflow network tags | Cloud Dataflow | Google Cloud, Dataflow firewall rules | Cloud Dataflow | Google Cloud, Dataflow network configuration | Cloud Dataflow | Google Cloud, Dataflow Streaming Engine | Cloud Dataflow | Google Cloud.

Questions 17

You set up a streaming data insert into a Redis cluster via a Kafka cluster. Both clusters are running on Compute Engine instances. You need to encrypt data at rest with encryption keys that you can create, rotate, and destroy as needed. What should you do?

Options:

Create a dedicated service account, and use encryption at rest to reference your data stored in your Compute Engine cluster instances as part of your API service calls.

Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.

Create encryption keys locally. Upload your encryption keys to Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.

Create encryption keys in Cloud Key Management Service. Reference those keys in your API service calls when accessing the data in your Compute Engine cluster instances.

Questions 18

You need to create a SQL pipeline. The pipeline runs an aggregate SQL transformation on a BigQuery table every two hours and appends the result to another existing BigQuery table. You need to configure the pipeline to retry if errors occur. You want the pipeline to send an email notification after three consecutive failures. What should you do?

Options:

Create a BigQuery scheduled query to run the SQL transformation with schedule options that repeat every two hours, and enable email notifications.

Use the BigQueryUpsertTableOperator in Cloud Composer, set the retry parameter to three, and set the email_on_failure parameter to true.

Use the BigQueryInsertJobOperator in Cloud Composer, set the retry parameter to three, and set the email_on_failure parameter to true.

Create a BigQuery scheduled query to run the SQL transformation with schedule options that repeat every two hours, and enable notification to a Pub/Sub topic. Use Pub/Sub and Cloud Functions to send an email after three failed executions.

Questions 19

You are implementing security best practices on your data pipeline. Currently, you are manually executing jobs as the Project Owner. You want to automate these jobs by taking nightly batch files containing non-public information from Google Cloud Storage, processing them with a Spark Scala job on a Google Cloud Dataproc cluster, and depositing the results into Google BigQuery.

How should you securely run this workload?

Options:

Restrict the Google Cloud Storage bucket so only you can see the files

Grant the Project Owner role to a service account, and run the job with it

Use a service account with the ability to read the batch files and to write to BigQuery

Use a user account with the Project Viewer role on the Cloud Dataproc cluster to read the batch files and write to BigQuery

Questions 20

You have important legal hold documents in a Cloud Storage bucket. You need to ensure that these documents are not deleted or modified. What should you do?

Options:

Set a retention policy. Lock the retention policy.

Set a retention policy. Set the default storage class to Archive for long-term digital preservation.

Enable the Object Versioning feature. Add a lifecycle rule.

Enable the Object Versioning feature. Create a copy in a bucket in a different region.

Questions 21

You store historic data in Cloud Storage. You need to perform analytics on the historic data. You want to use a solution to detect invalid data entries and perform data transformations that will not require programming or knowledge of SQL.

What should you do?

Options:

Use Cloud Dataflow with Beam to detect errors and perform transformations.

Use Cloud Dataprep with recipes to detect errors and perform transformations.

Use Cloud Dataproc with a Hadoop job to detect errors and perform transformations.

Use federated tables in BigQuery with queries to detect errors and perform transformations.

Questions 22

You are running a streaming pipeline with Dataflow and are using hopping windows to group the data as the data arrives. You noticed that some data is arriving late but is not being marked as late data, which is resulting in inaccurate aggregations downstream. You need to find a solution that allows you to capture the late data in the appropriate window. What should you do?

Options:

Change your windowing function to session windows to define your windows based on certain activity.

Change your windowing function to tumbling windows to avoid overlapping window periods.

Expand your hopping window so that the late data has more time to arrive within the grouping.

Use watermarks to define the expected data arrival window. Allow late data as it arrives.
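
To make the term "hopping windows" concrete, here is a minimal sketch of how one event is assigned to several overlapping windows. It is plain Python rather than the Apache Beam API; the function name hopping_windows and the window size and period values are illustrative assumptions.

```python
def hopping_windows(event_time, size, period):
    """Return the [start, end) windows of length `size`, starting every
    `period` seconds, that contain `event_time` (whole seconds)."""
    windows = []
    # Latest window start that is at or before the event.
    start = event_time - (event_time % period)
    # Walk backwards while the window still covers the event.
    while start > event_time - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


# A 60-second window that starts every 30 seconds: each event lands in two windows.
print(hopping_windows(event_time=125, size=60, period=30))
```

Window membership here is computed from the event timestamp alone, not from the time at which the record arrives.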

Questions 23

You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company's mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer's stated intention for contacting customer service. About 70% of customer requests are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer, more complicated requests. Which intents should you automate first?

Options:

Automate the 10 intents that cover 70% of the requests so that live agents can handle more complicated requests

Automate the more complicated requests first because those require more of the agents' time

Automate a blend of the shortest and longest intents to be representative of all intents

Automate intents in places where common words such as "payment" appear only once so the software isn't confused

Questions 24

You are part of a healthcare organization where data is organized and managed by respective data owners in various storage services. As a result of this decentralized ecosystem, discovering and managing data has become difficult. You need to quickly identify and implement a cost-optimized solution to assist your organization with the following:

• Data management and discovery

• Data lineage tracking

• Data quality validation

How should you build the solution?

Options:

Use BigLake to convert the current solution into a data lake architecture.

Build a new data discovery tool on Google Kubernetes Engine that helps with new source onboarding and data lineage tracking.

Use BigQuery to track data lineage, and use Dataprep to manage data and perform data quality validation.

Use Dataplex to manage data, track data lineage, and perform data quality validation.

Questions 25

You work for a large ecommerce company. You store your customers' order data in Bigtable. You have a garbage collection policy set to delete the data after 30 days and the number of versions is set to 1. When the data analysts run a query to report total customer spending, the analysts sometimes see customer data that is older than 30 days. You need to ensure that the analysts do not see customer data older than 30 days while minimizing cost and overhead. What should you do?

Options:

Set the expiring values of the column families to 30 days and set the number of versions to 2.

Use a timestamp range filter in the query to fetch the customer's data for a specific range.

Set the expiring values of the column families to 29 days and keep the number of versions to 1.

Schedule a job daily to scan the data in the table and delete data older than 30 days.

Questions 26

You are designing a Dataflow pipeline for a batch processing job. You want to mitigate multiple zonal failures at job submission time. What should you do?

Options:

Specify a worker region by using the --region flag.

Set the pipeline staging location as a regional Cloud Storage bucket.

Submit duplicate pipelines in two different zones by using the --zone flag.

Create an Eventarc trigger to resubmit the job in case of zonal failure when submitting the job.

Questions 27

You work for a shipping company that uses handheld scanners to read shipping labels. Your company has strict data privacy standards, but the scanners currently transmit recipients’ personally identifiable information (PII) to analytics systems, which violates user privacy rules. You want to quickly build a scalable solution using cloud-native managed services to prevent exposure of PII to the analytics systems. What should you do?

Options:

Create an authorized view in BigQuery to restrict access to tables with sensitive data.

Install a third-party data validation tool on Compute Engine virtual machines to check the incoming data for sensitive information.

Use Stackdriver logging to analyze the data passed through the total pipeline to identify transactions that may contain sensitive information.

Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.

Questions 28

You have a BigQuery table that contains customer data, including sensitive information such as names and addresses. You need to share the customer data with your data analytics and consumer support teams securely. The data analytics team needs to access the data of all the customers, but must not be able to access the sensitive data. The consumer support team needs access to all data columns, but must not be able to access customers that no longer have active contracts. You enforced these requirements by using an authorized dataset and policy tags. After implementing these steps, the data analytics team reports that they still have access to the sensitive columns. You need to ensure that the data analytics team does not have access to restricted data. What should you do?

Choose 2 answers

Options:

Create two separate authorized datasets; one for the data analytics team and another for the consumer support team.

Ensure that the data analytics team members do not have the Data Catalog Fine-Grained Reader role for the policy tags.

Enforce access control in the policy tag taxonomy.

Remove the bigquery.dataViewer role from the data analytics team on the authorized datasets.

Replace the authorized dataset with an authorized view. Use row-level security and apply filter_expression to limit data access.

Questions 29

You have a streaming pipeline that ingests data from Pub/Sub in production. You need to update this streaming pipeline with improved business logic. You need to ensure that the updated pipeline reprocesses the previous two days of delivered Pub/Sub messages. What should you do?

Choose 2 answers

Options:

Use Pub/Sub Seek with a timestamp.

Use the Pub/Sub subscription clear-retry-policy flag.

Create a new Pub/Sub subscription two days before the deployment.

Use the Pub/Sub subscription retain-acked-messages flag.

Use a Pub/Sub snapshot captured two days before the deployment.

Questions 30

The Dataflow SDKs have been recently transitioned into which Apache service?

Options:

Apache Spark

Apache Hadoop

Apache Kafka

Apache Beam

Questions 31

By default, which of the following windowing behaviors does Dataflow apply to unbounded data sets?

Options:

Windows at every 100 MB of data

Single, Global Window

Windows at every 1 minute

Windows at every 10 minutes

Questions 32

Your team is building a data lake platform on Google Cloud. As a part of the data foundation design, you are planning to store all the raw data in Cloud Storage. You are expecting to ingest approximately 25 GB of data a day and your billing department is worried about the increasing cost of storing old data. The current business requirements are:

• The old data can be deleted anytime

• You plan to use the visualization layer for current and historical reporting

• The old data should be available instantly when accessed

• There should not be any charges for data retrieval.

What should you do to optimize for cost?

Options:

Create the bucket with the Autoclass storage class feature.

Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 90 days to coldline, and 365 days to archive storage class. Delete old data as needed.

Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to coldline, 90 days to nearline, and 365 days to archive storage class. Delete old data as needed.

Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 45 days to coldline, and 60 days to archive storage class. Delete old data as needed.

Questions 33

If a dataset contains rows with individual people and columns for year of birth, country, and income, how many of the columns are continuous and how many are categorical?

Options:

1 continuous and 2 categorical

3 categorical

3 continuous

2 continuous and 1 categorical

Questions 34

Suppose you have a dataset of images that are each labeled as to whether or not they contain a human face. To create a neural network that recognizes human faces in images using this labeled dataset, what approach would likely be the most effective?

Options:

Use K-means Clustering to detect faces in the pixels.

Use feature engineering to add features for eyes, noses, and mouths to the input data.

Use deep learning by creating a neural network with multiple hidden layers to automatically detect features of faces.

Build a neural network with an input layer of pixels, a hidden layer, and an output layer with two categories.

Questions 35

Which of these operations can you perform from the BigQuery Web UI?

Options:

Upload a file in SQL format.

Load data with nested and repeated fields.

Upload a 20 MB file.

Upload multiple files using a wildcard.

Questions 36

Which of these is NOT a way to customize the software on Dataproc cluster instances?

Options:

Set initialization actions

Modify configuration files using cluster properties

Configure the cluster using Cloud Deployment Manager

Log into the master node and make changes from there

Questions 37

What is the recommended action to switch between SSD and HDD storage for your Google Cloud Bigtable instance?

Options:

create a third instance and sync the data from the two storage types via batch jobs

export the data from the existing instance and import the data into a new instance

run parallel instances where one is HDD and the other is SSD

the selection is final and you must resume using the same storage type

Questions 38

Which of the following is NOT a valid use case to select HDD (hard disk drives) as the storage for Google Cloud Bigtable?

Options:

You expect to store at least 10 TB of data.

You will mostly run batch workloads with scans and writes, rather than frequently executing random reads of a small number of rows.

You need to integrate with Google BigQuery.

You will not use the data to back a user-facing or latency-sensitive application.

Questions 39

Your company produces 20,000 files every hour. Each data file is formatted as a comma-separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.

You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)

Options:

Introduce data compression for each file to increase the rate of file transfer.

Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.

Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.

Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.

Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
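
The figures in this scenario can be worked through with a short calculation. The sketch below compares the share of the link that the payload needs with the time spent waiting on latency if files are sent one after another. It assumes every file is the full 4 KB (an upper bound) and that each file costs one 200 ms round trip; real SFTP transfers typically need more than one round trip per file, so this is an optimistic simplification. All names are illustrative.

```python
FILES_PER_HOUR = 20_000
MAX_FILE_BYTES = 4 * 1000           # "less than 4 KB", used as an upper bound
LINK_BITS_PER_SECOND = 50_000_000   # 50 Mbps
ROUND_TRIP_MS = 200                 # assumed to be paid once per file


def bandwidth_utilization(files_per_hour):
    """Fraction of the link needed to carry one hour of files within an hour."""
    bits_per_hour = files_per_hour * MAX_FILE_BYTES * 8
    return bits_per_hour / 3600 / LINK_BITS_PER_SECOND


def sequential_latency_seconds(files_per_hour):
    """Seconds spent on round trips when files are sent strictly one by one."""
    return files_per_hour * ROUND_TRIP_MS / 1000


for label, volume in (("current", FILES_PER_HOUR), ("doubled", 2 * FILES_PER_HOUR)):
    print(
        f"{label}: link utilization {bandwidth_utilization(volume):.2%}, "
        f"round-trip time per hour of files {sequential_latency_seconds(volume):.0f} s"
    )
```

Under these assumptions the payload uses well under 1% of the link, while the per-file round trips alone take more than the 3,600 seconds available in an hour, which is consistent with the statement that the design barely keeps up despite low bandwidth utilization.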

Questions 40

Which of these numbers are adjusted by a neural network as it learns from a training dataset (select 2 answers)?

Options:

Weights

Biases

Continuous features

Input values

Questions 41

The CUSTOM tier for Cloud Machine Learning Engine allows you to specify the number of which types of cluster nodes?

Options:

Workers

Masters, workers, and parameter servers

Workers and parameter servers

Parameter servers

Questions 42

You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible. What should you do?

Options:

Change the processing job to use Google Cloud Dataproc instead.

Manually start the Cloud Dataflow job each morning when you get into the office.

Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.

Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.

Questions 43

Your company is loading comma-separated values (CSV) files into Google BigQuery. The data is fully imported successfully; however, the imported data is not matching byte-to-byte to the source file. What is the most likely cause of this problem?

Options:

The CSV data loaded in BigQuery is not flagged as CSV.

The CSV data has invalid rows that were skipped on import.

The CSV data loaded in BigQuery is not using BigQuery’s default encoding.

The CSV data has not gone through an ETL phase before loading into BigQuery.

Questions 44

You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee. How can you make that data available while minimizing cost?

Options:

Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.

Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.

Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.

Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.
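
To illustrate what the view-based option involves, here is a minimal sketch that uses Python's built-in SQLite as a stand-in for BigQuery. The view name `UsersWithFullName` and the sample rows are hypothetical. SQLite concatenates strings with `||`, whereas BigQuery standard SQL would typically use `CONCAT()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO Users VALUES (?, ?)",
    [("Ada", "Lovelace"), ("Alan", "Turing")],  # invented sample rows
)

# A view stores only the query text; FullName is computed whenever the view is read,
# so the underlying Users table and its data are left unchanged.
conn.execute(
    "CREATE VIEW UsersWithFullName AS "
    "SELECT FirstName, LastName, FirstName || ' ' || LastName AS FullName "
    "FROM Users"
)

full_names = [
    row[0]
    for row in conn.execute(
        "SELECT FullName FROM UsersWithFullName ORDER BY LastName"
    )
]
```

The application can then query `FullName` from the view as if it were an ordinary column.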

Questions 45

You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?

Options:

Load the data every 30 minutes into a new partitioned table in BigQuery.

Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery.

Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore.

Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.

Questions 46

You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:

The user profile: What the user likes and doesn’t like to eat

The user account information: Name, address, preferred meal times

The order information: When orders are made, from where, to whom

The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?

Options:

BigQuery

Cloud SQL

Cloud Bigtable

Cloud Datastore

Questions 47

You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required.

You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)

Options:

Redis

HBase

MySQL

MongoDB

Cassandra

HDFS with Hive

Questions 48

You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?

Options:

Eliminate features that are highly correlated to the output labels.

Combine highly co-dependent features into one representative feature.

Instead of feeding in each feature individually, average their values in batches of 3.

Remove the features that have null values for more than 50% of the training records.

Questions 49

Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than it was previously. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You were asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

Options:

Rewrite the job in Pig.

Rewrite the job in Apache Spark.

Increase the size of the Hadoop cluster.

Decrease the size of the Hadoop cluster but also rewrite the job in Hive.

Questions 50

You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity ‘Movie’ the property ‘actors’ and the property ‘tags’ have multiple values but the property ‘date released’ does not. A typical query would ask for all movies with actor= ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

Options:

Option A

Option B

Option C

Option D

Questions 51

Your software uses a simple JSON format for all messages. These messages are published to Google Cloud Pub/Sub, then processed with Google Cloud Dataflow to create a real-time dashboard for the CFO. During testing, you notice that some messages are missing in the dashboard. You check the logs, and all messages are being published to Cloud Pub/Sub successfully. What should you do next?

Options:

Check the dashboard application to see if it is not displaying correctly.

Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.

Use Google Stackdriver Monitoring on Cloud Pub/Sub to find the missing messages.

Switch Cloud Dataflow to pull messages from Cloud Pub/Sub instead of Cloud Pub/Sub pushing messages to Cloud Dataflow.

Questions 52

You are deploying 10,000 new Internet of Things devices to collect temperature data in your warehouses globally. You need to process, store and analyze these very large datasets in real time. What should you do?

Options:

Send the data to Google Cloud Datastore and then export to BigQuery.

Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.

Send the data to Cloud Storage and then spin up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.

Export logs in batch to Google Cloud Storage and then spin up a Google Cloud SQL instance, import the data from Cloud Storage, and run an analysis as needed.

Questions 53

Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.

The data scientists have written the following code to read the data for new key features in the logs.

BigQueryIO.Read
    .named("ReadLogData")
    .from("clouddataflow-readonly:samples.log_data")

You want to improve the performance of this data read. What should you do?

Options:

Specify the TableReference object in the code.

Use .fromQuery operation to read specific fields from the table.

Use both the Google BigQuery TableSchema and TableFieldSchema classes.

Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.

Questions 54

Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?

Options:

Use a row key of the form .

Use a row key of the form #.

Use a row key of the form >##.

Questions 55

You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old. What should you do?

Options:

Disable caching by editing the report settings.

Disable caching in BigQuery by editing table details.

Refresh your browser tab showing the visualizations.

Clear your browser history for the past hour then reload the tab showing the visualizations.

Questions 56

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

Options:

Add capacity (memory and disk space) to the database server by the order of 200.

Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.

Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.

Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Questions 57

You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

Options:

Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.

Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.

Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.

Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.

Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
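
Several of these options hinge on casting the string values in DT to timestamps. As a rough illustration of that conversion outside BigQuery, here is a minimal Python sketch. It assumes DT holds epoch seconds (the question says only "epoch time"), and the helper name and sample values are hypothetical.

```python
from datetime import datetime, timezone


def epoch_string_to_timestamp(value: str) -> datetime:
    """Parse an epoch-seconds string such as '1500000000' into a UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


# Hypothetical DT values for one user's clicks, stored as strings.
clicks = ["1500000000", "1500000900"]
timestamps = [epoch_string_to_timestamp(v) for v in clicks]

# Once the values are real timestamps, a session duration is a simple subtraction.
session_seconds = (max(timestamps) - min(timestamps)).total_seconds()
```

Doing this conversion once and storing the result avoids repeating the cast in every later query, which is the trade-off the options ask you to weigh.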

Questions 58

You are creating a model to predict housing prices. Due to budget constraints, you must run it on a single resource-constrained virtual machine. Which learning algorithm should you use?

Options:

Linear regression

Logistic classification

Recurrent neural network

Feedforward neural network

Questions 59

Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?

Options:

Create a Google Cloud Dataflow job to process the data.

Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.

Create a Hadoop cluster on Google Compute Engine that uses persistent disks.

Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.

Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.

Questions 60

Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost. What should they do?

Options:

Redefine the schema by evenly distributing reads and writes across the row space of the table.

The performance issue should be resolved over time as the size of the Bigtable cluster is increased.

Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.

Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.

Questions 61

Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered by Google App Engine and serves millions of users. How should you design the frontend to respond to a database failure?

Options:

Issue a command to restart the database servers.

Retry the query with exponential backoff, up to a cap of 15 minutes.

Retry the query every second until it comes back online to minimize staleness of data.

Reduce the query frequency to once every hour until the database comes back online.
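
For readers unfamiliar with the term, capped exponential backoff can be sketched as follows. The function name, the 1-second base delay, and the number of attempts are assumptions made for illustration; the 900-second cap corresponds to the 15 minutes mentioned in the option.

```python
def backoff_delays(base_seconds=1.0, cap_seconds=900.0, attempts=12):
    """Return the wait before each retry: doubling from base_seconds, never above cap_seconds."""
    return [min(cap_seconds, base_seconds * (2 ** n)) for n in range(attempts)]


delays = backoff_delays()
# The waits grow 1, 2, 4, ... seconds and then stay at the cap once doubling would exceed it.
```

Real clients usually add random jitter to each delay so that many frontends do not retry at the same instant; that is omitted here to keep the sketch short.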

Questions 62

Your company is using wildcard tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:

# Syntax error: Expected end of statement but got "-" at [4:11]
SELECT age
FROM
  bigquery-public-data.noaa_gsod.gsod
WHERE
  age != 99
  AND _TABLE_SUFFIX = '1929'
ORDER BY
  age DESC

Which table name will make the SQL statement work correctly?

Options:

'bigquery-public-data.noaa_gsod.gsod'

bigquery-public-data.noaa_gsod.gsod*

'bigquery-public-data.noaa_gsod.gsod'*

`bigquery-public-data.noaa_gsod.gsod*`
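
As background, a wildcard table matches every table whose name begins with the given prefix, and `_TABLE_SUFFIX` exposes the part of each name that the `*` matched. The following Python sketch emulates only that matching logic; the function name and the table list are hypothetical, and it says nothing about how the table name must be quoted in SQL.

```python
def match_wildcard(table_names, prefix, suffix=None):
    """Emulate a wildcard table: keep names starting with prefix, optionally filtered by suffix."""
    matches = []
    for name in table_names:
        if name.startswith(prefix):
            table_suffix = name[len(prefix):]  # plays the role of _TABLE_SUFFIX
            if suffix is None or table_suffix == suffix:
                matches.append(name)
    return matches


tables = ["gsod1929", "gsod1930", "stations"]  # hypothetical table list
all_gsod = match_wildcard(tables, "gsod")
selected = match_wildcard(tables, "gsod", suffix="1929")
```

Here `selected` contains only the 1929 table, mirroring the `_TABLE_SUFFIX = '1929'` filter in the statement.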

Questions 63

You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need an instant notification to be sent to your monitoring tool when new data is appended to a certain table using an insert job, but you do not want to receive notifications for other tables. What should you do?

Options:

Make a call to the Stackdriver API to list all logs, and apply an advanced filter.

In the Stackdriver logging admin interface, enable a log sink export to BigQuery.

In the Stackdriver logging admin interface, enable a log sink export to Google Cloud Pub/Sub, and subscribe to the topic from your monitoring tool.

Using the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.

Questions 64

You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:

No interaction by the user on the site for 1 hour

Has added more than $30 worth of products to the basket

Has not completed a transaction

You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?

Options:

Use a fixed-time window with a duration of 60 minutes.

Use a sliding time window with a duration of 60 minutes.

Use a session window with a gap time duration of 60 minutes.

Use a global window with a time based trigger with a delay of 60 minutes.
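
To make the windowing vocabulary concrete, here is a minimal sketch of how session windowing with a gap duration groups events. It is plain Python rather than the Dataflow SDK, and the function name and event times (in seconds) are hypothetical.

```python
def session_windows(event_times, gap_seconds=3600):
    """Group event times (in seconds) into sessions that close after gap_seconds of inactivity."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] < gap_seconds:
            sessions[-1].append(t)  # still within the gap: extend the current session
        else:
            sessions.append([t])    # gap reached (or first event): start a new session
    return sessions


events = [0, 600, 1200, 6000, 6300]
sessions = session_windows(events)
```

Unlike fixed or sliding windows, the boundaries here are driven by each user's own activity: the first session ends only because no event arrived for a full hour after the event at 1200.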

Questions 65

You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action on these anomalous events as they occur. Your custom HTTPS endpoint keeps getting an inordinate amount of duplicate messages. What is the most likely cause of these duplicate messages?

Options:

The message body for the sensor event is too large.

Your custom endpoint has an out-of-date SSL certificate.

The Cloud Pub/Sub topic has too many messages published to it.

Your custom endpoint is not acknowledging messages within the acknowledgement deadline.

Questions 66

You are building a new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?

Options:

Include ORDER BY DESC on timestamp column and LIMIT to 1.

Use GROUP BY on the unique ID column and timestamp column and SUM on the values.

Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.

Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.
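
As an illustration of the ROW_NUMBER pattern named in the last option, the sketch below emulates it in plain Python: rows are partitioned by unique ID and only the first row of each partition is kept. Ordering each partition by timestamp descending is an assumption (the option does not specify an ordering), and the function name and sample rows are hypothetical.

```python
def first_row_per_id(rows):
    """Emulate ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) filtered to row number 1."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        # Keep the row with the greatest timestamp in each partition.
        if current is None or row["ts"] > current["ts"]:
            latest[row["id"]] = row
    return sorted(latest.values(), key=lambda r: r["id"])


rows = [
    {"id": "a", "ts": 1, "value": 10},
    {"id": "a", "ts": 2, "value": 10},  # duplicate delivery of id "a"
    {"id": "b", "ts": 1, "value": 7},
]
deduped = first_row_per_id(rows)
```

Each unique ID appears exactly once in `deduped`, regardless of how many times the row was streamed in.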

Questions 67

Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery. Which three approaches can you take? (Choose three.)

Options:

Disable writes to certain tables.

Restrict access to tables by role.

Ensure that the data is encrypted at all times.

Restrict BigQuery API access to approved users.

Segregate data across multiple tables or databases.

Use Google Stackdriver Audit Logging to determine policy violations.

Questions 68

An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?

Options:

Use federated data sources, and check data in the SQL query.

Enable BigQuery monitoring in Google Stackdriver and create an alert.

Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.

Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
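
The dead-letter idea mentioned in the last option can be sketched in a few lines of plain Python. This is not Dataflow code; the function name, the two-column schema (name, integer amount), and the sample data are all hypothetical.

```python
import csv
import io


def split_rows(csv_text, expected_columns):
    """Separate well-formed rows from malformed ones instead of failing the whole load."""
    good, dead_letter = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        try:
            if len(row) != expected_columns:
                raise ValueError("wrong column count")
            good.append((row[0], int(row[1])))  # hypothetical schema: name, integer amount
        except ValueError as err:
            # Keep the raw row and the reason so it can be analyzed later.
            dead_letter.append({"row": row, "error": str(err)})
    return good, dead_letter


sample = "alice,10\nbob,not_a_number\ncarol\ndave,5\n"
good, dead_letter = split_rows(sample, expected_columns=2)
```

In a pipeline, `good` would be written to the main table and `dead_letter` to a separate table, so bad rows neither block the load nor disappear silently.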

Exam Code: Professional-Data-Engineer

Exam Name: Google Professional Data Engineer Exam

Last Update: Apr 18, 2025

Questions: 374
